It is impossible to run a large language model such as LLaMA with 7B parameters on a consumer GPU with 10GB of vRAM or less. Llama.cpp rewrites inference in C/C++ and makes LLM inference feasible on consumer GPUs with low vRAM, and even on CPUs. How to install llama.cpp?
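A rough back-of-the-envelope calculation shows why quantization matters here (the byte counts are approximations for the weights alone; activations and the KV cache add more):

```python
def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Approximate size of the model weights in gigabytes."""
    return n_params * bytes_per_param / 1e9

# Native fp16 weights: 2 bytes per parameter -> far above 10 GB of vRAM
fp16_gb = weight_memory_gb(7e9, 2.0)   # 14.0 GB

# 4-bit quantized weights (e.g. a GGUF Q4 variant): ~0.5 bytes per parameter
q4_gb = weight_memory_gb(7e9, 0.5)     # 3.5 GB

print(f"fp16: {fp16_gb:.1f} GB, 4-bit: {q4_gb:.1f} GB")
```

This is why a quantized 7B model fits comfortably on a 10GB card while the fp16 original does not.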
- Clone the source code locally from llama.cpp.git and follow the instructions to build it. Llama.cpp runs on Windows, macOS, and Linux. After a successful build, you need to convert native LLM models into llama.cpp's format. Currently it supports the following LLMs and multimodal models:
- LLaMA
- LLaMA 2
- Falcon
- Alpaca
- GPT4All
- Chinese LLaMA / Alpaca and Chinese LLaMA-2 / Alpaca-2
- Vigogne (French)
- Vicuna
- Koala
- OpenBuddy 🐶 (Multilingual)
- Pygmalion/Metharme
- WizardLM
- Baichuan 1 & 2 + derivations
- Aquila 1 & 2
- Starcoder models
- Mistral AI v0.1
- Refact
- Persimmon 8B
- MPT
- Bloom
- Yi models
- StableLM-3b-4e1t
- Deepseek models
- Qwen models
- Mixtral MoE
- PLaMo-13B
- GPT-2
Multimodal models:
- LLaVA
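The clone, build, and convert steps above can be sketched as follows (the convert script name, quantize tool, and model paths are illustrative assumptions; check the repository's README for the current workflow):

```shell
# Clone and build llama.cpp (plain make; CMake also works)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make

# Convert a downloaded model to llama.cpp's GGUF format, then quantize it
# (script name and paths are illustrative; see the repo README)
python3 convert.py ../models/llama2-7b --outfile ../models/llama2-7b/ggml-model-f16.gguf
./quantize ../models/llama2-7b/ggml-model-f16.gguf ../models/llama2-7b/ggml-model-q4_0.gguf q4_0
```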
Llama.cpp is written in C/C++, and you can use its built-in tools to run an LLM as a command-line program or as a service. If you want to call these functions from Python, install llama-cpp-python:
```shell
# install
pip install llama-cpp-python
```
```python
# test the llama-cpp-python API
from llama_cpp import Llama

llm = Llama(model_path="./models/llama2-7b/ggml-model-f16.gguf")
output = llm(
    "Q: what is the capital of China? A: ",  # prompt
    max_tokens=32,      # generate up to 32 tokens
    stop=["Q:", "\n"],  # stop just before the model generates a new question
    echo=True,          # echo the prompt back in the output
)  # generate a completion; can also call create_completion
print(output)
```
I tested the llama2-7B model on an RTX 4070 with 10GB of vRAM, and text generation is fast.
Note: I also tried air_llm, which can load a 7B LLM model; however, inference is still very slow.
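To make use of the GPU, llama-cpp-python can offload transformer layers to it via the `n_gpu_layers` constructor argument (a real parameter; the layer count, context size, and model path below are illustrative, and GPU offload requires a build with CUDA support):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama2-7b/ggml-model-f16.gguf",
    n_gpu_layers=35,  # offload up to 35 layers to the GPU; -1 offloads all
    n_ctx=2048,       # context window size
)

output = llm("Q: what is the capital of China? A: ", max_tokens=32, stop=["Q:", "\n"])
print(output["choices"][0]["text"])
```

Tune `n_gpu_layers` downward if you run out of vRAM; the remaining layers stay on the CPU.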
Resource summary
- https://github.com/ggerganov/llama.cpp
- https://github.com/ggerganov/ggml
- https://github.com/oobabooga/text-generation-webui, a web UI for LLM-related applications
- https://python.langchain.com/docs/integrations/llms/llamacpp