A trio of scientists from the University of North Carolina, Chapel Hill recently published preprint artificial intelligence (AI) research showcasing how difficult it is to remove sensitive data from large language models (LLMs) such as OpenAI’s ChatGPT and Google’s Bard.
According to the researchers' paper, "deleting" information from LLMs is possible, but verifying that the information has actually been removed is just as difficult as removing it in the first place.
The reason for this has to do with how LLMs are engineered and trained. The models are pretrained on massive datasets and then fine-tuned to generate coherent outputs (GPT stands for "generative pretrained transformer").
Once a model is trained, its creators cannot, for example, go back into the training data and delete specific files in order to prevent the model from outputting related results. Essentially, all the information a model is trained on is distributed across its weights and parameters, where it cannot be isolated without actually generating outputs. This is the "black box" of AI.
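A toy numeric analogy (not from the paper) illustrates why deleting source files after training changes nothing: once data has been absorbed into a learned parameter, the parameter retains the information on its own.

```python
# Toy illustration: a "model" with a single learned parameter (the mean).
# Once training folds the records into the parameter, deleting the
# records does not alter what the model has learned.
data = [3.0, 5.0, 10.0]          # stand-in for training records
weight = sum(data) / len(data)   # "training": the parameter absorbs the data
data.clear()                     # "deleting the files" after training
print(weight)                    # the learned value persists: 6.0
```

Real LLMs have billions of such parameters, and no single one corresponds to a specific training document, which is what makes targeted deletion so hard.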
A problem arises when LLMs trained on massive datasets output sensitive information such as personally identifiable information, financial records, or other potentially harmful and unwanted outputs.
In a hypothetical situation where an LLM was trained on sensitive banking information, for example, there's typically no way for the AI's creator to find those files and delete them. Instead, AI developers use guardrails, such as hard-coded prompts that inhibit specific behaviors, or reinforcement learning from human feedback (RLHF).
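A minimal sketch of the guardrail approach the article describes: rather than removing data from the model's weights, developers filter what the model is allowed to emit. The patterns and function names below are hypothetical, purely for illustration.

```python
import re

# Hypothetical output-side guardrail: scan generated text for patterns that
# look like sensitive data and substitute a refusal. This does not delete
# anything from the model; it only blocks the output from reaching the user.
SENSITIVE_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),  # US SSN-like pattern
    re.compile(r"\b\d{16}\b"),             # 16-digit card-like number
]

REFUSAL = "I can't share that information."

def guarded_output(model_output: str) -> str:
    """Return the model output, or a refusal if it matches a sensitive pattern."""
    for pattern in SENSITIVE_PATTERNS:
        if pattern.search(model_output):
            return REFUSAL
    return model_output

print(guarded_output("The SSN on file is 123-45-6789"))  # blocked
print(guarded_output("The weather is sunny today"))      # passes through
```

The limitation is the same one the researchers point to: the sensitive information still lives inside the model, and any prompt that slips past the filter can surface it.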
In an RLHF paradigm, human assessors engage with models for the purpose of eliciting both wanted and unwanted behaviors; their feedback is then used to fine-tune the model toward the desired outputs.
Read more on cointelegraph.com