LLM | AI, Tech & Life

Tag Archives: LLM

Collection of prompt engineering

Posted on April 19, 2024 by sheng gao

Prompt engineering is a critical part of developing an AI agent. To make an AI act in a particular role, e.g. insurance salesman, insurance buyer, or math teacher, it is necessary to write a task-specific prompt.

[GitHub] A collection of prompt examples to be used with the ChatGPT model.

[GitHub] A curated list of the best system prompts for OpenAI’s ChatGPT, enabling developers and users to customize their AI’s behavior and interaction style.

Document summarization and question answering on documents: LLM + LangChain

Posted on March 10, 2024 by sheng gao

It is known that chat based on a large language model (LLM) can answer many queries with satisfactory accuracy. But when the question is about facts, in particular recently reported facts that are not covered by the training data, the answer is wrong with high probability. One remedy is retrieval-augmented generation (RAG), which provides the latest documents to the LLM as context. Combining the context documents with the query gives a satisfactory result. The following figure shows the overall flow of RAG-based question answering.

PDF to text:
- PyPDF is used to extract text from PDFs. I use it rather than the PDF reader in LangChain because I want more control over text post-processing.
Web document:
- Use WebBaseLoader in LangChain to download and extract text from the URL (in future, I may switch to my own crawler if many web pages need to be downloaded).
Document chunking and indexing:
- Documents are chunked by first extracting paragraphs and then grouping paragraphs until the specified chunk size is reached.
- Index engine: LangChain provides many index engines, from commercial to open source. First I tried vector databases such as FAISS and Chroma, with document embeddings extracted using an LLM (here, Google Gemma). Unfortunately, after a few tries, the recall accuracy was very bad. I then switched to traditional indexing such as BM25 or TF-IDF. At least when the query words exist in the document, the accuracy looks good (just OK; semantic vector matching will be investigated in future to address the semantic gap between the query and the indexed documents).
Prompt template for the question-answering task:
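The chunking and keyword-indexing steps above can be sketched in plain Python. This is only a minimal illustration, not the code used in the post: the function names `chunk_paragraphs` and `bm25_rank`, the character-based chunk size, and the `k1`/`b` defaults are all assumptions, and the scoring is one common BM25 variant rather than LangChain's retriever.

```python
import math
import re
from collections import Counter


def chunk_paragraphs(text, max_chars=1000):
    """Group consecutive paragraphs into chunks of at most about max_chars.

    A single paragraph longer than max_chars becomes its own oversized chunk.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            chunks.append(current)  # current chunk is full; start a new one
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def bm25_rank(query, chunks, k1=1.5, b=0.75):
    """Return chunk indices sorted by BM25 score, best first."""
    docs = [tokenize(c) for c in chunks]
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n if n else 0.0
    doc_freq = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(query))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue  # only exact word matches contribute
            idf = math.log(1 + (n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return sorted(range(n), key=lambda i: scores[i], reverse=True)
```

Because only exact word matches score, a query phrased with synonyms retrieves nothing useful, which is the semantic gap mentioned above.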

from langchain.prompts import PromptTemplate

# Instruct the model to answer only from the supplied context.
qa_prompt_template_cfg = """Answer the question as precisely as possible using the provided context. If the answer is
                    not contained in the context, say "answer not available in context" \n\n
                    Context: \n {context}\n
                    Question: \n {question} \n
                    Answer:
                  """
qa_prompt_template = PromptTemplate(
    template=qa_prompt_template_cfg,
    input_variables=["context", "question"]
)
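To show how the template is filled at query time, here is a dependency-free sketch that joins the retrieved chunks into one context string and substitutes it together with the question. It uses plain `str.format` as a stand-in for the LangChain template, and the names `QA_TEMPLATE` and `build_qa_prompt` are assumptions made for illustration.

```python
# Same instruction as the template above, written as a plain Python string.
QA_TEMPLATE = (
    "Answer the question as precisely as possible using the provided context. "
    'If the answer is not contained in the context, say "answer not available in context".\n\n'
    "Context:\n{context}\n\nQuestion:\n{question}\n\nAnswer:"
)


def build_qa_prompt(chunks, question, separator="\n\n"):
    """Join the non-empty retrieved chunks and fill the QA template."""
    context = separator.join(c.strip() for c in chunks if c.strip())
    return QA_TEMPLATE.format(context=context, question=question)


if __name__ == "__main__":
    print(build_qa_prompt(["First retrieved chunk.", "Second retrieved chunk."], "how many fans?"))
```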

LLM model: Google Gemma
The UI and an evaluation sample are shown in the following:
- The test sample is a news article, supplied as a URL: taylor-swift-show-over-but-singapore-will-keep-looking-at-you
- First, generate a summary of the news article
- Then ask a question: how many fans?

Currently, processing a large PDF, e.g. a 30-page financial report, is very slow on my RTX 4070, and the document is much longer than the context token limit (8192). Tests on the financial report look good as long as the query words do not run into the semantic gap. The quality of the documents retrieved by RAG strongly affects answer quality. For RAG-based chat, building a high-recall retrieval system is critical.
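Since the document is longer than the context limit, only some of the retrieved chunks can go into the prompt. A rough sketch of that selection step follows. The function `pack_context`, the `reserved_tokens` allowance for the instructions and the answer, and the whitespace word count used as a token estimate are all assumptions; a real tokenizer will usually report more tokens than this estimate.

```python
def pack_context(ranked_chunks, max_tokens=8192, reserved_tokens=1024,
                 count_tokens=lambda s: len(s.split())):
    """Keep the best-ranked chunks that fit in the remaining token budget.

    ranked_chunks must already be sorted from most to least relevant.
    """
    budget = max_tokens - reserved_tokens
    selected, used = [], 0
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if used + cost > budget:
            continue  # does not fit; a smaller chunk further down may still fit
        selected.append(chunk)
        used += cost
    return selected
```

Passing the model's own tokenizer as `count_tokens` would make the budget exact instead of approximate.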


Human-like communication with LLM chat agent

Posted on February 29, 2024 by sheng gao

Imagine talking naturally with ChatGPT rather than manually typing a prompt and reading the text response. You speak your information need, and the agent speaks the response. The solution is to integrate multiple AI modules, a frontend, and a backend, and to solve the streaming issue for a good user experience. The overall processing flow from user input to system response is shown in the following:

For each module above, there are many open-source tools available. You can watch prompt-to-talking-avatar on my YouTube channel to see what the system looks like (currently the prompt is typed on the keyboard; speech recognition is not integrated yet).
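One way to approach the streaming issue is to forward the LLM output sentence by sentence, so that speech synthesis can start before the whole response has been generated. The sketch below is an assumption about how this could be done, not the system shown in the video. `stream_sentences` is a hypothetical helper, and the punctuation-based split is deliberately naive (it will also split after abbreviations such as "e.g.").

```python
import re

# Split on whitespace that follows sentence-ending punctuation.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def stream_sentences(token_stream):
    """Yield complete sentences as soon as they appear in a stream of text pieces."""
    buffer = ""
    for piece in token_stream:
        buffer += piece
        parts = SENTENCE_END.split(buffer)
        # Everything except the last part is a finished sentence.
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        buffer = parts[-1]
    if buffer.strip():
        yield buffer.strip()  # flush whatever is left when the stream ends


if __name__ == "__main__":
    pieces = ["Hel", "lo there. ", "How are", " you? I am", " fine"]
    for sentence in stream_sentences(pieces):
        print(sentence)  # each sentence could be handed to a TTS module here
```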

If you want to learn more, feel free to drop me an email.

Llama.cpp: running LLMs smoothly on low-VRAM GPUs

Posted on January 22, 2024 by sheng gao

It is normally impractical to run a large language model such as LLaMA with 7B parameters on a consumer GPU with 10GB of VRAM or even less. Llama.cpp rewrites inference in C/C++ and makes LLM inference feasible on consumer low-VRAM GPUs, and even on CPU. How do you install llama.cpp?
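A quick back-of-the-envelope calculation shows why a 7B model is a tight fit. The sketch below counts the memory for the weights only and ignores the KV cache, activations, and the small per-block overhead of quantized formats. The helper name `weight_memory_gb` and the chosen bit widths are assumptions for illustration.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    for name, bits in [("fp16", 16), ("8-bit", 8), ("4-bit", 4)]:
        print(f"7B @ {name}: {weight_memory_gb(7e9, bits):.1f} GB")
```

At 16 bits per weight, 7B parameters already need about 14 GB, more than a 10GB card offers, while a 4-bit quantized copy needs about 3.5 GB.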

Clone the source code locally from llama.cpp.git and follow the instructions to install it. Llama.cpp runs on Windows, macOS, and Linux. After a successful install, you need to convert the native LLM models into the llama.cpp format. It currently supports the following LLMs and multimodal models:
- LLaMA
- LLaMA 2
- Falcon
- Alpaca
- GPT4All
- Chinese LLaMA / Alpaca and Chinese LLaMA-2 / Alpaca-2
- Vigogne (French)
- Vicuna
- Koala
- OpenBuddy 🐶 (Multilingual)
- Pygmalion/Metharme
- WizardLM
- Baichuan 1 & 2 + derivations
- Aquila 1 & 2
- Starcoder models
- Mistral AI v0.1
- Refact
- Persimmon 8B
- MPT
- Bloom
- Yi models
- StableLM-3b-4e1t
- Deepseek models
- Qwen models
- Mixtral MoE
- PLaMo-13B
- GPT-2
- Multimodal models:

Llama.cpp is written in C/C++, and its built-in tools can run an LLM from the command line or as a service. If you want to call llama.cpp functions from Python, install llama-cpp-python.

# Install (run this in a shell, not in Python):
#   pip install llama-cpp-python

# Test the llama-cpp-python API
from llama_cpp import Llama

llm = Llama(model_path="./models/llama2-7b/ggml-model-f16.gguf")
output = llm(
    "Q: What is the capital of China? A: ",  # Prompt
    max_tokens=32,  # Generate up to 32 tokens
    stop=["Q:", "\n"],  # Stop just before the model would start a new question
    echo=True,  # Echo the prompt back in the output
)  # Generate a completion; create_completion can also be called
print(output)
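The call returns an OpenAI-style completion dictionary rather than a plain string, with the generated text under `choices[0]["text"]`. Here is a small sketch for pulling out just the answer. The helper `extract_answer` is hypothetical, and the sample dictionary is hand-written for illustration, not real model output.

```python
def extract_answer(completion, prompt=""):
    """Return the generated text from an OpenAI-style completion dict."""
    text = completion["choices"][0]["text"]
    # With echo=True the prompt is prepended to the generated text, so strip it.
    if prompt and text.startswith(prompt):
        text = text[len(prompt):]
    return text.strip()


if __name__ == "__main__":
    prompt = "Q: What is the capital of China? A: "
    # Hand-written stand-in for the dict returned by llm(...).
    fake_output = {"choices": [{"text": prompt + "Beijing", "finish_reason": "stop"}]}
    print(extract_answer(fake_output, prompt))
```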

I tested the llama2-7B model on an RTX 4070 with 10GB VRAM, and text generation is fast.

Note: I also tried air_llm, which can load a 7B LLM. However, inference is still very slow.

Resource summary

https://github.com/ggerganov/llama.cpp
https://github.com/ggerganov/ggml
https://github.com/oobabooga/text-generation-webui, a webui for LLM related applications
https://python.langchain.com/docs/integrations/llms/llamacpp


Trickling streams converge into great rivers