Llama.cpp: run LLMs smoothly on a low-VRAM GPU

Posted on January 22, 2024


A large language model such as Llama with 7B parameters does not fit in 16-bit precision on a consumer GPU with 10GB of VRAM or less. Llama.cpp reimplements inference in C/C++ and makes LLM inference practical on low-VRAM consumer GPUs, and even on the CPU alone. So how do you install llama.cpp?
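One common way is to build llama.cpp from source. The commands below are a minimal sketch of the Makefile-based build as it looked around early 2024 (binary names and flags may differ in newer releases, and the model path here is just an example):

# get the source and build (prefix with LLAMA_CUBLAS=1 to enable the CUDA backend)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make

# run a model from the command line
./main -m ./models/llama2-7b/ggml-model-f16.gguf -p "Q: what is the capital of China? A: " -n 32

# or start it as a local HTTP service
./server -m ./models/llama2-7b/ggml-model-f16.gguf --port 8080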

Llama.cpp is written in C/C++, and the build produces built-in tools that run an LLM either as a command-line program or as a local service, as sketched above. If you want to call llama.cpp from Python instead, install llama-cpp-python:

# install the Python bindings
pip install llama-cpp-python
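The default wheel runs on the CPU backend. To let the Python bindings offload work to the GPU, the package generally had to be built with the cuBLAS backend enabled, roughly as shown below (this assumes the CUDA toolkit is installed; the exact flags may change between releases):

# reinstall llama-cpp-python with the cuBLAS (CUDA) backend enabled
CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python --force-reinstall --no-cache-dir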

# test the llama-cpp-python API
from llama_cpp import Llama

llm = Llama(model_path="./models/llama2-7b/ggml-model-f16.gguf")
output = llm(
    "Q: what is the capital of China? A: ",  # prompt
    max_tokens=32,       # generate up to 32 tokens
    stop=["Q:", "\n"],   # stop just before the model would generate a new question
    echo=True,           # echo the prompt back in the output
)  # generate a completion; create_completion can be called instead
print(output)
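To actually save VRAM on a small GPU, you can load a quantized GGUF file and offload only part of the transformer layers with the n_gpu_layers argument. The snippet below is a sketch: the quantized model path is an assumption, and the right number of layers to offload depends on your card.

from llama_cpp import Llama

# hypothetical path to a 4-bit quantized GGUF file; adjust to your own model
llm = Llama(
    model_path="./models/llama2-7b/ggml-model-q4_0.gguf",
    n_ctx=2048,        # context window size
    n_gpu_layers=20,   # offload 20 layers to the GPU, keep the rest on the CPU
)
output = llm("Q: what is the capital of China? A: ", max_tokens=32, stop=["Q:", "\n"])
print(output["choices"][0]["text"])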

I tested the llama2-7B model on an RTX 4070 with 10GB of VRAM, and text generation is fast.

Note: I also tried air_llm, which can load a 7B model, but its inference is still very slow.

Resource summary