The llama2 family of LLMs is typically trained and fine-tuned in PyTorch, so the models are typically distributed as PyTorch projects on Huggingface. When it comes to inference, however, we are much more interested in the GGUF model format, for three reasons.
- Python is not a great stack for AI inference. We would like to get rid of PyTorch and Python dependency in production systems. GGUF can support very efficient zero-Python inference using tools like llama.cpp and WasmEdge.
- The llama2 models are trained with 16-bit floating point numbers as weights. It has been demonstrated that we can scale them down to 4-bit integers for inference without losing much accuracy, while saving a large amount of computing resources (expensive GPU RAM in particular). This process is called quantization.
- The GGUF format is specifically designed for LLM inference. It supports LLM tasks like language encoding and decoding, making it faster and easier to use than PyTorch.
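To make the quantization idea concrete, here is a toy sketch of block-wise 4-bit quantization: weights are split into blocks, and each block stores small integers plus one floating-point scale. The function names are our own, and the real GGUF quant formats (Q4_0, Q4_K, and so on) pack bits and store scales differently; this only illustrates the principle.

```python
import numpy as np

def quantize_q4(weights, block_size=32):
    """Quantize float weights to 4-bit integers, one scale per block
    (a simplified version of the block-wise scheme GGUF formats use)."""
    blocks = weights.reshape(-1, block_size)
    # One scale per block maps integers in [-8, 7] back to the float range.
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0
    scales[scales == 0] = 1.0  # avoid division by zero for all-zero blocks
    q = np.clip(np.round(blocks / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize_q4(q, scales):
    """Map the 4-bit integers back to approximate float weights."""
    return (q * scales).reshape(-1)

weights = np.random.randn(64).astype(np.float32)
q, scales = quantize_q4(weights)
restored = dequantize_q4(q, scales)
# The round trip loses some precision but stays close to the original.
max_err = np.abs(weights - restored).max()
```

Each weight now takes 4 bits instead of 16 (plus a small per-block overhead for the scale), which is where the memory savings come from.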
Download pre-made artifacts
Many Huggingface repos provide access to llama2 family models that are already quantized to the GGUF format. You can simply download those GGUF files. Here are reliable download links for standard llama2 models in GGUF.
| GGUF model file | 7B | 13B | 70B |
|-----------------|----|-----|-----|
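If the model you want is not in the table, a GGUF file hosted in a Huggingface repo can also be fetched directly: hosted files resolve at `https://huggingface.co/<repo>/resolve/main/<file>`. A minimal sketch of building such a URL (the repo and file names below are illustrative examples, not guaranteed to exist):

```python
def gguf_download_url(repo_id: str, filename: str) -> str:
    """Build the direct download URL for a file in a Huggingface repo."""
    return f"https://huggingface.co/{repo_id}/resolve/main/{filename}"

# Hypothetical repo and file names, for illustration only.
url = gguf_download_url("meta-llama/Llama-2-7b-chat", "ggml-model-q5_k_m.gguf")
# Fetch the URL with any HTTP client, e.g.: curl -L -O <url>
```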
Roll your own
Or, if you have a llama2 model you fine-tuned yourself, you can use llama.cpp to convert and quantize it to GGUF. First, check out the llama.cpp source code on Linux.
```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
```
Use the convert.py utility to convert a PyTorch model to GGUF. You simply give it the directory containing your PyTorch files. The resulting GGUF file contains full 16-bit floating point weights; it is not yet quantized.
```bash
# Make sure that you have a llama2 PyTorch model in the models/Llama-2-7b-chat/ directory

# Convert the PyTorch model to GGUF with FP16 weights
python convert.py models/Llama-2-7b-chat/

# The resulting GGUF file
ls -al models/Llama-2-7b-chat/ggml-model-f16.gguf
```
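A quick way to sanity-check the output is to look at the file header: every GGUF file starts with the 4-byte magic `GGUF`. A minimal check (the helper name is our own):

```python
def looks_like_gguf(path: str) -> bool:
    """Check the 4-byte magic at the start of a GGUF file."""
    with open(path, "rb") as f:
        return f.read(4) == b"GGUF"

# Usage: looks_like_gguf("models/Llama-2-7b-chat/ggml-model-f16.gguf")
```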
Next, build the llama.cpp application.
```bash
mkdir build
cd build
cmake ..
cmake --build . --config Release
```
Quantize the FP16 GGUF file using the quantize command line tool you just built. The command below uses 5-bit k-quantization (the Q5_K_M scheme) to create a new GGUF model file.
```bash
bin/quantize ../models/Llama-2-7b-chat/ggml-model-f16.gguf ../models/Llama-2-7b-chat/ggml-model-q5_k_m.gguf Q5_K_M
```
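The space savings are substantial. A rough back-of-the-envelope estimate, assuming 16 bits per weight for FP16 and roughly 5.5 effective bits per weight for Q5_K_M (the exact figure varies by tensor and model):

```python
def model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate on-disk size of a model in gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9

fp16_gb = model_size_gb(7e9, 16)     # FP16 7B model: ~14 GB
q5_k_m_gb = model_size_gb(7e9, 5.5)  # Q5_K_M 7B model: ~4.8 GB
```

The quantized 7B model fits comfortably in memory on machines where the FP16 original would not.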
That's it. Now you can use the GGUF model file in your own applications, or share it with the world on Huggingface!