How do I create a GGUF model file?

Sep 22, 2023 • 2 minutes to read

The llama2 family of LLMs is typically trained and fine-tuned in PyTorch, so the models are usually distributed as PyTorch projects on Hugging Face. For inference, however, we are much more interested in the GGUF model format, for three reasons.

  • Python is not a great stack for AI inference. We would like to remove the PyTorch and Python dependencies from production systems. GGUF supports very efficient zero-Python inference using tools like llama.cpp and WasmEdge.
  • The llama2 models are trained with 16-bit floating point numbers as weights. It has been demonstrated that the weights can be scaled down to 4-bit integers for inference without losing much accuracy, while saving a large amount of computing resources (expensive GPU RAM in particular). This process is called quantization.
  • The GGUF format is specifically designed for LLM inference. It supports LLM tasks like language encoding and decoding, making it faster and easier to use than PyTorch.
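
As a rough sketch of the savings described in the second point, the arithmetic below estimates weight storage as parameters × bits per weight ÷ 8. The function name weight_mb is hypothetical, 7 billion is a round parameter count used for illustration, and the estimate ignores quantization metadata and runtime memory.

```shell
# Approximate weight storage in megabytes: parameters * bits-per-weight / 8.
# weight_mb is a hypothetical helper, not part of llama.cpp.
weight_mb() {
  params=$1   # number of parameters, e.g. 7000000000
  bits=$2     # nominal bits per weight, e.g. 16 or 4
  echo $(( params * bits / 8 / 1000000 ))
}

weight_mb 7000000000 16   # FP16 weights  -> 14000 (about 14 GB)
weight_mb 7000000000 4    # 4-bit weights -> 3500  (about 3.5 GB)
```

Real quantized files are somewhat larger than the nominal figure, because formats such as Q5_K_M store scaling data alongside the weights.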

Download pre-made artifacts

Many Hugging Face repos provide llama2 family models that are already quantized and converted to the GGUF format. You can simply download those GGUF files. Here are the GGUF files for the standard llama2 models.

GGUF model file  7B                            13B                            70B
Base             llama-2-7b.Q5_K_M.gguf        llama-2-13b.Q5_K_M.gguf        llama-2-70b.Q5_K_M.gguf
Chat             llama-2-7b-chat.Q5_K_M.gguf   llama-2-13b-chat.Q5_K_M.gguf   llama-2-70b-chat.Q5_K_M.gguf
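
After downloading, it can be worth confirming that the file really is a GGUF file before loading it. GGUF files begin with the four ASCII bytes "GGUF", so a minimal check might look like the sketch below. The function name is_gguf is hypothetical, and the check assumes a `head` that supports the `-c` option (GNU and BSD versions do).

```shell
# Succeed if the file exists and begins with the 4-byte GGUF magic ("GGUF").
# is_gguf is a hypothetical helper; it checks only the magic, not the contents.
is_gguf() {
  [ -f "$1" ] && [ "$(head -c 4 "$1")" = "GGUF" ]
}

# Usage, assuming the file was saved to the current directory:
# is_gguf llama-2-7b-chat.Q5_K_M.gguf && echo "looks like a GGUF file"
```

This catches common download problems, such as an HTML error page saved under a .gguf name, but it does not validate the rest of the file.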

Roll your own

Alternatively, if you have a llama2 model that you fine-tuned yourself, you can use llama.cpp to convert and quantize it to GGUF. First, check out the llama.cpp source code on Linux.

git clone
cd llama.cpp

Use the conversion utility to convert a PyTorch model to GGUF. You simply give it the directory containing your PyTorch files. The resulting GGUF model file is a full 16-bit floating point model; it is not yet quantized.

# Make sure that you have a llama2 PyTorch model in the models/Llama-2-7b-chat/ directory

# Convert the PyTorch model to GGUF with FP16 weights
python models/Llama-2-7b-chat/

# The resulting GGUF file
ls -al models/Llama-2-7b-chat/ggml-model-f16.gguf

Next, build the llama.cpp application.

mkdir build
cd build
cmake ..
cmake --build . --config Release

Quantize the FP16 GGUF file using the quantize command line tool you just built. The command below uses 5-bit k-quantization to create a new GGUF model file.

bin/quantize ../models/Llama-2-7b-chat/ggml-model-f16.gguf ../models/Llama-2-7b-chat/ggml-model-q5_k_m.gguf Q5_K_M
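
If you quantize to several types, you may want to derive the output file name instead of typing it each time. The sketch below follows the naming used in the command above (ggml-model-f16.gguf becomes ggml-model-q5_k_m.gguf). The helper quantized_name and the variables src and dst are hypothetical, and the script assumes it is run from the build directory.

```shell
# Derive the quantized file name from the FP16 file name and the quantization type.
# quantized_name is a hypothetical helper that mirrors the naming used in this article.
quantized_name() {
  fp16=$1    # e.g. ../models/Llama-2-7b-chat/ggml-model-f16.gguf
  qtype=$2   # e.g. Q5_K_M
  suffix=$(printf '%s' "$qtype" | tr '[:upper:]' '[:lower:]')
  # Strip the trailing "f16.gguf" and append the lower-cased type instead.
  printf '%s\n' "${fp16%f16.gguf}${suffix}.gguf"
}

src=../models/Llama-2-7b-chat/ggml-model-f16.gguf
dst=$(quantized_name "$src" Q5_K_M)
echo "$dst"   # ../models/Llama-2-7b-chat/ggml-model-q5_k_m.gguf

# Run the quantizer only if the tool and the FP16 file are both present.
if [ -x bin/quantize ] && [ -f "$src" ]; then
  bin/quantize "$src" "$dst" Q5_K_M
fi
```

The helper assumes that the input name ends in f16.gguf; for any other name it simply appends the type and extension to the full input name.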

That's it. Now you can use the GGUF model file in your own applications, or share it with the world on Hugging Face!
