There's a lot of research and work going into getting rid of hallucination.
For those working programmatically, I would recommend looking at the
dspy package (https://github.com/stanfordnlp/dspy), which, under the hood,
does a lot of output coercion and chain-of-thought prompting to get to your
intended results.
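For instance, a minimal dspy sketch looks roughly like this (the model name
and question are placeholders, and the configuration API differs slightly
between dspy versions):

    import dspy

    # point dspy at whichever language model you have access to;
    # the model name here is only an example
    lm = dspy.OpenAI(model="gpt-3.5-turbo")
    dspy.settings.configure(lm=lm)

    # dspy builds the chain-of-thought prompt behind the scenes
    qa = dspy.ChainOfThought("question -> answer")
    print(qa(question="What is retrieval-augmented generation?").answer)
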
Generally speaking, for RAG, there are three major factors you want to
consider around hallucination:
1. Whether the contexts retrieved are factual and do not conflict with each
other. [1][2]
2. Whether your prompt engineering covers the edge cases you care about.
3. Whether you are asking the LLM to reason against the retrieved
information or simply to output factual information from the contexts given
(see the sketch below the footnotes).
[1] There's an interesting paper on whether a RAG system responds with the
information used to train the underlying model or with the information
retrieved at query time: https://arxiv.org/abs/2404.10198
[2] The contexts and information retrieved are quite important, in my
opinion; otherwise, it's going to be garbage in, garbage out. This is why,
when my company builds RAG systems for customers, a lot of experimentation
and compute goes into retrieval to find the best configuration for the data.
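
To make points 2 and 3 concrete, here is the kind of prompt wording I have
in mind; this is only a sketch, and the template text and variable names
are mine rather than from any particular framework:

    # a hypothetical template for extractive, grounded answers (point 3);
    # the explicit "I don't know" instruction is the sort of edge case
    # that prompt engineering should cover (point 2)
    GROUNDED_PROMPT = (
        "Answer using only the context below. If the context does not "
        "contain the answer, say \"I don't know.\" Do not use outside "
        "knowledge.\n"
        "\n"
        "Context:\n{context}\n"
        "\n"
        "Question: {question}\n"
        "Answer:"
    )

    retrieved_text = "..."  # whatever your retriever returned
    user_question = "How should musical scores be cataloged?"
    print(GROUNDED_PROMPT.format(context=retrieved_text, question=user_question))
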
Disclaimer: I currently work at pointable, a startup working on customized
RAG and RAG-like systems.
On Wed, May 8, 2024 at 12:43 PM Lee, Seong Heon <
[log in to unmask]> wrote:
> Hi,
>
> I agree. Hallucination seems like a big deal in adopting LLMs for research.
> I don’t see a perfect answer for this issue yet, although AI engineers are
> working hard to resolve it. I know that they use ‘temperature’ to control
> the degree of the AI’s creativity, which affects how closely the responses
> stay grounded in the user-provided documents.
>
> However, even with this limitation, LLMs are widely accepted as research
> assistants. This is legitimate even though they do not guarantee 100%
> fact-checking. In my opinion, it is like faculty getting help from research
> assistants: faculty are still responsible for verifying all the content of
> their own writing, but employing research assistants will certainly boost
> their work, especially in the beginning stage.
>
> Seong Heon Lee
>
> From: Code for Libraries <[log in to unmask]> on behalf of Lena G.
> Bohman <[log in to unmask]>
> Date: Wednesday, May 8, 2024 at 10:10 AM
> To: [log in to unmask] <[log in to unmask]>
> Subject: Re: [CODE4LIB] rag - retrieval-augmented generation
>
> Hi all,
> I think this thread is highlighting that the main issue with using LLMs in
> library work is hallucinations. My impression is that at this point no one
> really knows how to correct that flaw, and since our work requires a high
> level of accuracy/truth, it really is a fatal flaw in our field.
>
> I am constantly telling researchers that they cannot use LLMs for research
> where they cannot independently fact check the results. This makes them far
> less attractive to my researchers, since they really want LLMs to be able
> to do things they can't already do themselves...
>
> Lena
>
> Lena Bohman
> Senior Data Management and Research Impact Librarian
> Long Island Jewish - Forest Hills Liaison
> Donald and Barbara Zucker School of Medicine at Hofstra/Northwell
> ________________________________
> From: Code for Libraries <[log in to unmask]> on behalf of
> Parthasarathi Mukhopadhyay <[log in to unmask]>
> Sent: Wednesday, May 8, 2024 12:57 PM
> To: [log in to unmask] <[log in to unmask]>
> Subject: Re: [CODE4LIB] rag - retrieval-augmented generation
>
> Dear Eric
>
> Thanks for bringing the RAG pipeline to the attention of the community. I
> actually came to know about it from your earlier post on RAG dated March 1,
> 2024, and have been trying to play with a RAG pipeline using all open-source
> tools: LlamaIndex-based PrivateGPT, Qdrant as a vector database, and
> open-source LLMs like mistral-7b-instruct-v0.2.Q4_K_M.gguf (quantized
> GGUF-formatted models are friendlier for a CPU-based system like my
> laptop), Orca, etc.
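>
> A roughly equivalent pipeline built directly with LlamaIndex might look
> like the sketch below (this is not the actual PrivateGPT configuration;
> the package paths assume a recent llama-index release with the llama-cpp,
> Qdrant, and HuggingFace-embedding extras installed, and they do change
> between versions):
>
>     import qdrant_client
>     from llama_index.core import (Settings, SimpleDirectoryReader,
>                                   StorageContext, VectorStoreIndex)
>     from llama_index.embeddings.huggingface import HuggingFaceEmbedding
>     from llama_index.llms.llama_cpp import LlamaCPP
>     from llama_index.vector_stores.qdrant import QdrantVectorStore
>
>     # keep everything local: a quantized GGUF model on CPU plus a small
>     # local embedding model
>     Settings.llm = LlamaCPP(model_path="mistral-7b-instruct-v0.2.Q4_K_M.gguf")
>     Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
>
>     # Qdrant running in embedded (on-disk) mode as the vector database
>     client = qdrant_client.QdrantClient(path="./qdrant_data")
>     vector_store = QdrantVectorStore(client=client, collection_name="articles")
>     storage_context = StorageContext.from_defaults(vector_store=vector_store)
>
>     documents = SimpleDirectoryReader("./articles").load_data()
>     index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)
>     print(index.as_query_engine().query("What are the major issues in cataloging?"))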
>
> Today I tried the journal articles you referred to in your earlier post
> and are using in your current system (around 135 articles, mainly from CRL
> and ITAL), uploading, indexing, and retrieving them in my local RAG
> pipeline. And then came a very thought-provoking post from Simon critically
> examining this new RAG approach, which came into existence precisely to
> reduce two big issues with LLMs: hallucinations and out-of-date,
> non-contextual responses. It seems hallucination is an inherent feature of
> LLMs, even when they are contextualized through a RAG pipeline.
>
> However, one interesting point worth mentioning here is the effect of
> prompt engineering on a RAG pipeline. When I asked the same questions Simon
> did, on the same set of documents, in a similar kind of pipeline but with
> prompt engineering, the results show some differences (see the additional
> system prompt in the snapshot):
>
> [image: image.png]
>
> Regards
>
> Parthasarathi
>
> Parthasarathi Mukhopadhyay
>
> Professor, Department of Library and Information Science,
>
> University of Kalyani, Kalyani - 741 235 (WB), India
>
>
> On Wed, May 8, 2024 at 10:07 PM Eric Lease Morgan <
> [log in to unmask]> wrote:
>
> > On May 8, 2024, at 11:20 AM, Simon Hunt <[log in to unmask]> wrote:
> >
> > > I thought you might be interested in a few tests I tried out; they
> > > reveal some interesting hallucinations and misalignment of
> > > expectations. Of course, I don't know the content of the 136 articles
> > > you used, so this might also demonstrate how the chatbot attempts to
> > > answer questions that fall outside of scope.
> > >
> > > My input:
> > >
> > >> Please recommend three recent articles that discuss how to catalog
> > >> musical scores.
> > >
> > > It confidently gave me three articles that don't exist (that is, based
> > > on searching my own library catalog and Google Scholar), from three
> > > authors that don't exist (as far as I could tell), then provided four
> > > references that have nothing to do with cataloging musical scores.
> > >
> > > In a new session, I tried a more controversial topic:
> > >
> > >> List the ways that current classification systems reflect a culture of
> > >> white supremacy
> > >
> > > The answer suggests that it self-censored due to the sensitive topic (I
> > > assume there are guardrails behind the scenes). The titles and
> > > publication dates of the references, while real, suggest to me that
> > > they aren't likely to contain much information on the topic of white
> > > supremacy in classification systems (though again, without knowing the
> > > sources you used, they might represent the closest matches).
> > >
> > > Finally, as a follow-up in the same session, I asked:
> > >
> > >> What are the most recent articles on the topic of classification and
> > >> white supremacy?
> > >
> > > Like the first answer, the reply is decent, but if the articles
> > > referenced below it actually discuss what the answer claims, the titles
> > > sure don't suggest it. The bot also loves the article *Cataloging
> > > Theory in Search of Graph Theory and Other Ivory Towers* -- it also
> > > referenced that in a colleague's question about subject headings.
> > >
> > > In short, it seems like the effect RAG is having is to provide real
> > > articles as references, but it isn't clear how/if those articles have
> > > any content that lines up with the chatbot's output.
> > >
> > > --
> > > Simon Hunt
> > > Director, Automation, Indexing & Metadata
> >
> >
> > Simon, thank you for the feedback, and my short reply is, "Yes!"
> >
> > There are many factors that go into the process of indexing
> > ("vectorizing") a collection and then providing a generative-AI interface
> > against the index. Some of them include:
> >
> > * creating a collection - What set of content is to be queried? In this
> > case, I created a collection of 136 articles on cataloging.
> >
> > * curating the collection - This means providing some context, and I
> > provided authors, titles, dates, and file names. Curating the collection
> > really helps when it comes to addressing questions and supporting
> > information literacy issues.
> >
> > * indexing - This is the process of vectorizing each document and
> > caching the result. This process can be accomplished through the use of
> > a model or through the use of a traditional database. The process is not
> > trivial.
> >
> > * prompt engineering - On the surface, these chatbots seem to take
> > anything as input, but under the hood the inputs are reformulated to
> > create "prompts". Different models use different prompts. Many of the
> > mis-steps outlined above could be avoided by better prompt engineering
> > on my part.
> >
> > * generation - My demonstrations use a model called Llama2 to formulate
> > the response. Other models are better at generating structured data like
> > JSON, CSV, etc. Other models are better at outputting software -- Python
> > scripts. I believe the results of my demonstration would be better if I
> > were to use ChatGPT, but I'm unwilling to spend the money; I like open
> > source software and making sure everything is computed locally, not
> > remotely.
> >
> > Alignment? RAG works like this:
> >
> > 1. vectorize ("index") the content
> > 2. get the query and vectorize it too
> > 3. identify content having a vector similar to the query's
> > 4. give the generating model (ex: Llama2) both the query
> > and the similar content to create the response; the
> > response works similarly to autosuggest on your
> > telephone, but on steroids
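> >
> > In code, those four steps look roughly like the following toy sketch
> > (an illustration only, using sentence-transformers for the vectors; the
> > final generate() call is a placeholder for whatever model you use):
> >
> >     from sentence_transformers import SentenceTransformer, util
> >
> >     model = SentenceTransformer("all-MiniLM-L6-v2")
> >
> >     docs = ["First article, about cataloging musical scores ...",
> >             "Second article, about subject headings ..."]
> >     doc_vectors = model.encode(docs)        # 1. vectorize ("index") the content
> >
> >     query = "How are musical scores cataloged?"
> >     query_vector = model.encode(query)      # 2. vectorize the query too
> >
> >     similarities = util.cos_sim(query_vector, doc_vectors)[0]
> >     context = docs[int(similarities.argmax())]   # 3. most similar content
> >
> >     # 4. hand both the query and the similar content to the generator
> >     prompt = (f"Context:\n{context}\n\nQuestion: {query}\n"
> >               "Answer using only the context above.")
> >     # answer = generate(prompt)   # placeholder for a Llama2 (or other) call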
> >
> > Simon, many of the things you outline can be improved, and my hope is
> > that they will be. "Software is never done, and if it were, then it would
> > be called 'hardware'." Again, thank you.
> >
> > P.S. This morning I created a different chatbot, and this time it is
> > rooted in the works of Jane Austen:
> >
> > https://e7053a831a40f92a86.gradio.live/
> >
> > --
> > Eric Morgan
> > University of Notre Dame
> >
>
--
Brian Wu
Email: [log in to unmask]