Document summarization and question answering with LLM + LangChain
Large language model (LLM) based chat can answer many queries with satisfactory accuracy. But when asked about facts, in particular recent facts not covered by the model's training data, the answer is wrong with high probability. One remedy is retrieval augmented generation (RAG): provide the latest documents to the LLM as context. Combining the context documents with the query produces satisfactory results. The following figure shows the overall flow of RAG-based question answering.
- PDF to text: extract text from PDF files with a LangChain PDF loader (see the loading sketch below)
- Web document:
  - Use WebBaseLoader in LangChain to download and extract text from the URL. (In the future, this may be replaced with a custom crawler if many web pages need to be downloaded; see the loading sketch below.)
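A minimal sketch of both loading paths is below. Only WebBaseLoader is named above; PyPDFLoader is an assumed choice for the PDF path, and the file name and URL are placeholders.

```python
# Minimal loading sketch. WebBaseLoader is the loader named above; PyPDFLoader
# is an assumed choice for the PDF path (requires `pip install pypdf`).
from langchain_community.document_loaders import PyPDFLoader, WebBaseLoader

pdf_docs = PyPDFLoader("report.pdf").load()                     # one Document per page
web_docs = WebBaseLoader("https://example.com/article").load()  # placeholder URL
raw_text = "\n\n".join(d.page_content for d in web_docs)        # plain text for chunking
```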
- Document chunking and indexing
  - Chunks are built by first extracting paragraphs and then grouping paragraphs until the specified chunk size is reached.
  - Index engine: LangChain provides many index engines, from commercial to open source. Vector databases such as FAISS and Chroma were tried first, with document embeddings extracted by an LLM (Google Gemma here). Unfortunately, after a few tries, recall accuracy was very poor, so I switched to traditional indexing such as BM25 or TF-IDF. At least when the query words appear in the document, accuracy looks acceptable (just OK; semantic vector matching will be investigated in the future to address the semantic gap between the query question and the indexed documents). A chunking and indexing sketch follows this item.
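A minimal sketch of the paragraph-grouping chunker described above plus a BM25 index, assuming the `raw_text` from the loading sketch; the 1000-character chunk size is an assumption, not a value from the post.

```python
# Sketch of the chunking described above: split into paragraphs, then group
# paragraphs greedily until the chunk size is reached. CHUNK_SIZE is assumed.
from langchain_core.documents import Document
from langchain_community.retrievers import BM25Retriever  # needs `pip install rank_bm25`

CHUNK_SIZE = 1000  # characters; assumed value

def chunk_by_paragraphs(text: str, chunk_size: int = CHUNK_SIZE) -> list[Document]:
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        if current and len(current) + len(para) > chunk_size:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return [Document(page_content=c) for c in chunks]

# BM25 keyword index over the chunks; k is the number of chunks to retrieve.
retriever = BM25Retriever.from_documents(chunk_by_paragraphs(raw_text), k=4)
```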
- Prompt template for the question & answer task:
from langchain_core.prompts import PromptTemplate

qa_prompt_template_cfg = """Answer the question as precisely as possible using the provided context. If the answer is
not contained in the context, say "answer not available in context".

Context:
{context}

Question:
{question}

Answer:
"""
qa_prompt_template = PromptTemplate(
    template=qa_prompt_template_cfg,
    input_variables=["context", "question"],
)
- LLM model: Google Gemma (wired into the QA chain in the sketch below)
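A minimal sketch of wiring the pieces together, assuming Gemma runs locally through the Hugging Face transformers pipeline (`pip install langchain-huggingface`); the model name and generation settings are assumptions.

```python
# Assumed local setup: Gemma via the `transformers` pipeline, wrapped for
# LangChain. Model name and max_new_tokens are assumptions, not from the post.
from transformers import pipeline
from langchain_huggingface import HuggingFacePipeline

llm = HuggingFacePipeline(pipeline=pipeline(
    "text-generation", model="google/gemma-7b-it",
    max_new_tokens=256, device_map="auto",
))

def answer(question: str) -> str:
    # Retrieve chunks, fill the prompt template, and call the LLM.
    docs = retriever.invoke(question)
    context = "\n\n".join(d.page_content for d in docs)
    return llm.invoke(qa_prompt_template.format(context=context, question=question))
```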
- The UI and evaluation sample are shown in the following
  - The test sample is a news article, given by URL: taylor-swift-show-over-but-singapore-will-keep-looking-at-you
  - First, a summary is generated for the news article
  - Then a question is asked: "how many fans?" (a sketch of both steps follows this list)
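A minimal sketch of the evaluation steps, assuming the `llm`, `web_docs`, and `answer` from the sketches above; the map_reduce chain type is my choice, not stated in the post.

```python
# map_reduce summarizes chunks independently and then merges the partial
# summaries, which sidesteps the context-length limit noted below.
from langchain.chains.summarize import load_summarize_chain

summary_chain = load_summarize_chain(llm, chain_type="map_reduce")
summary = summary_chain.invoke({"input_documents": web_docs})["output_text"]
print(summary)

# Then ask the question against the RAG pipeline:
print(answer("how many fans?"))
```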
Currently, processing a large PDF, e.g. a 30-page financial report, is very slow on my RTX 4070, and the document length far exceeds the context token limit (8192 tokens). Testing on the financial report looks good as long as the query words do not suffer from the semantic gap. The quality of the retrieved documents strongly affects answer quality; for RAG-based chat, building a high-recall retrieval system is critical.
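One mitigation for the 8192-token limit, not described in the post itself, is to cap the retrieved context with a token budget before filling the prompt; a sketch under assumed tokenizer and headroom values:

```python
# Keep the RAG context under the model's 8192-token limit by adding retrieved
# chunks until a token budget is exhausted. Tokenizer and headroom are assumed.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b-it")  # assumed model
TOKEN_BUDGET = 8192 - 1024  # reserve headroom for the question and the answer

def build_context(question: str) -> str:
    parts, used = [], 0
    for doc in retriever.invoke(question):
        n_tokens = len(tokenizer.encode(doc.page_content))
        if used + n_tokens > TOKEN_BUDGET:
            break  # drop lower-ranked chunks that no longer fit
        parts.append(doc.page_content)
        used += n_tokens
    return "\n\n".join(parts)
```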