Fast and Portable Llama2 Inference on the Heterogeneous Edge

Nov 09, 2023 • 12 minutes to read

The Rust+Wasm stack provides a strong alternative to Python in AI inference.

Compared with Python, Rust+Wasm apps could be 1/100 of the size and 100x the speed, and, most importantly, they can run securely everywhere with full hardware acceleration, without any change to the binary code. Rust is the language of AGI.

We created a very simple Rust program to run inference on llama2 models at native speed. When compiled to Wasm, the binary application (only 2MB) is completely portable across devices with heterogeneous hardware accelerators. The Wasm runtime (WasmEdge) also provides a safe and secure execution environment for cloud deployments. In fact, the WasmEdge runtime works seamlessly with container tools to orchestrate and execute the portable application across many different devices.

Chatting with llama2 models on my MacBook

This work is based on the llama.cpp project created by Georgi Gerganov. We adapted the original C++ program to run on Wasm. It works with GGUF-formatted model files.

Step 1. Install WasmEdge with the GGML plugin

Use the following command on Linux or Mac devices to install everything. See here for more details.

curl -sSf https://raw.githubusercontent.com/WasmEdge/WasmEdge/master/utils/install.sh | bash -s -- --plugins wasmedge_rustls wasi_nn-ggml
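
The installer places WasmEdge under $HOME/.wasmedge by default and prints a command to activate it in your current shell. On most setups it is the following (check the installer's own output if your setup differs):

source $HOME/.wasmedge/env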

Step 2. Download a pre-built Wasm app and the model

curl -LO https://github.com/LlamaEdge/LlamaEdge/releases/latest/download/llama-chat.wasm

You should also download a llama2 model in GGUF format. The example below downloads the llama2 7B chat-tuned model quantized to 5-bit weights (see more here).

curl -LO https://huggingface.co/wasmedge/llama2/resolve/main/llama-2-7b-chat-q5_k_m.gguf

Step 3. Run it!

Use WasmEdge to run the Wasm inference application and pass the GGUF model file to it. You can then enter a question to chat with the model.

wasmedge --dir .:. --nn-preload default:GGML:AUTO:llama-2-7b-chat-q5_k_m.gguf llama-chat.wasm

[USER]:

Here is a complete example of a conversation.

wasmedge --dir .:. --nn-preload default:GGML:AUTO:llama-2-7b-chat-q5_k_m.gguf llama-chat.wasm

[USER]:
Who is the "father of the atomic bomb"?
[ASSISTANT]:
The "father of the atomic bomb" is a term commonly associated with physicist J. Robert Oppenheimer. Oppenheimer was a leading figure in the development of the atomic bomb during World War II, serving as the director of the Manhattan Project, which was responsible for the development and deployment of the first nuclear weapons. He is often referred to as the "father of the atomic bomb" due to his significant contributions to the field of nuclear physics and his leadership in the development of the bomb.
[USER]:
Was he a communist?
[ASSISTANT]:
J. Robert Oppenheimer was not a communist. While he was associated with the Manhattan Project, which was a secret government project, and was involved in the development of the atomic bomb, he was not a member of the Communist Party or had any known political affiliations with communism. Oppenheimer was a physicist and a scientist, and his primary focus was on the development of nuclear weapons during World War II.

Configure the model behavior

There are command line options you can use to configure how you interact with the model; see the help output and the examples that follow.

~/workspace/llama-utils/chat$ wasmedge llama-chat.wasm -h
Usage: llama-chat.wasm [OPTIONS]

Options:
  -a, --model-alias <ALIAS>
          Model alias [default: default]
  -c, --ctx-size <CTX_SIZE>
          Size of the prompt context [default: 512]
  -n, --n-predict <N_PRDICT>
          Number of tokens to predict [default: 1024]
  -g, --n-gpu-layers <N_GPU_LAYERS>
          Number of layers to run on the GPU [default: 100]
  -b, --batch-size <BATCH_SIZE>
          Batch size for prompt processing [default: 512]
      --temp <TEMP>
          Temperature for sampling [default: 0.8]
      --repeat-penalty <REPEAT_PENALTY>
          Penalize repeat sequence of tokens [default: 1.1]
  -r, --reverse-prompt <REVERSE_PROMPT>
          Halt generation at PROMPT, return control.
  -s, --system-prompt <SYSTEM_PROMPT>
          System prompt message string [default: "[Default system message for the prompt template]"]
  -p, --prompt-template <TEMPLATE>
          Prompt template. [default: llama-2-chat] [possible values: llama-2-chat, codellama-instruct, mistral-instruct-v0.1, mistral-instruct, mistrallite, openchat, belle-llama-2-chat, vicuna-chat, vicuna-1.1-chat, chatml, baichuan-2, wizard-coder, zephyr, intel-neural, deepseek-chat, deepseek-coder]
      --log-prompts
          Print prompt strings to stdout
      --log-stat
          Print statistics to stdout
      --log-all
          Print all log information to stdout
  -h, --help
          Print help
  -V, --version
          Print version
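
As an illustration (the flag values below are just examples), you can combine the prompt-related options from the help output above, such as overriding the default system prompt and lowering the sampling temperature:

wasmedge --dir .:. --nn-preload default:GGML:AUTO:llama-2-7b-chat-q5_k_m.gguf llama-chat.wasm -p llama-2-chat -s "You are a concise and helpful assistant." --temp 0.5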

For example, the following command specifies a context length of 2048 tokens and limits each response to a maximum of 512 tokens. It also tells WasmEdge to print runtime statistics. The LLM response is streamed to the output by default. The program generates about 25 tokens per second on a low-end M2 MacBook.

wasmedge --dir .:. --nn-preload default:GGML:AUTO:llama-2-7b-chat-q5_k_m.gguf llama-chat.wasm -c 2048 -n 512 --log-stat

[USER]:
Who is the "father of the atomic bomb"?

---------------- [LOG: STATISTICS] -----------------

llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size  = 1024.00 MB
llama_new_context_with_model: compute buffer total size = 630.14 MB
llama_new_context_with_model: max tensor size =   102.54 MB
[2023-11-10 17:52:12.768] [info] [WASI-NN] GGML backend: llama_system_info: AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |
 The "father of the atomic bomb" is a term commonly associated with physicist J. Robert Oppenheimer. Oppenheimer was the director of the Manhattan Project, the secret research and development project that produced the atomic bomb during World War II. He is widely recognized as the leading figure in the development of the atomic bomb and is often referred to as the "father of the atomic bomb."
llama_print_timings:        load time =   15643.70 ms
llama_print_timings:      sample time =       2.60 ms /    83 runs   (    0.03 ms per token, 31886.29 tokens per second)
llama_print_timings: prompt eval time =    7836.72 ms /    54 tokens (  145.12 ms per token,     6.89 tokens per second)
llama_print_timings:        eval time =    3198.24 ms /    82 runs   (   39.00 ms per token,    25.64 tokens per second)
llama_print_timings:       total time =   18852.93 ms

----------------------------------------------------

The next example shows the same application running on an Nvidia A10G machine at about 50 tokens per second.

wasmedge --dir .:. --nn-preload default:GGML:AUTO:llama-2-7b-chat-q5_k_m.gguf llama-chat.wasm -c 2048 -n 512 --log-stat

[USER]:
Who is the "father of the atomic bomb"?

---------------- [LOG: STATISTICS] -----------------
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required  =   86.04 MB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 35/35 layers to GPU
llm_load_tensors: VRAM used: 4474.93 MB
..................................................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 1024.00 MB
llama_new_context_with_model: kv self size  = 1024.00 MB
llama_new_context_with_model: compute buffer total size = 630.14 MB
llama_new_context_with_model: VRAM scratch buffer: 624.02 MB
llama_new_context_with_model: total VRAM used: 6122.95 MB (model: 4474.93 MB, context: 1648.02 MB)
[2023-11-11 00:02:22.402] [info] [WASI-NN] GGML backend: llama_system_info: AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |

llama_print_timings:        load time =    2601.44 ms
llama_print_timings:      sample time =       2.63 ms /    84 runs   (    0.03 ms per token, 31987.81 tokens per second)
llama_print_timings: prompt eval time =     203.90 ms /    54 tokens (    3.78 ms per token,   264.84 tokens per second)
llama_print_timings:        eval time =    1641.84 ms /    83 runs   (   19.78 ms per token,    50.55 tokens per second)
llama_print_timings:       total time =    4254.95 ms

----------------------------------------------------

[ASSISTANT]:
The "father of the atomic bomb" is a term commonly associated with physicist J. Robert Oppenheimer. Oppenheimer was the director of the Manhattan Project, the secret research and development project that produced the first atomic bomb during World War II. He is widely recognized as the leading figure in the development of the atomic bomb and is often referred to as the "father of the atomic bomb."

LLM agents and apps

We have also created an OpenAI-compatible API server using Rust and WasmEdge. It allows you to use any OpenAI-compatible developer tool, such as flows.network, to create LLM agents and apps. Learn more here.
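
As a rough sketch (the exact startup command, model name, and port are documented in the LlamaEdge project; here we simply assume the API server is already listening on localhost:8080), an OpenAI-style chat completion request could look like this:

curl -X POST http://localhost:8080/v1/chat/completions -H 'Content-Type: application/json' -d '{"model": "llama-2-7b-chat", "messages": [{"role": "user", "content": "Who is the \"father of the atomic bomb\"?"}]}'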

(Llama on the edge. Image generated by Midjourney.)

Why not Python?

LLMs like llama2 are typically trained in Python (e.g., PyTorch, TensorFlow, and JAX). But using Python for inference applications, which account for about 95% of the computing in AI, would be a bad mistake.

  • Python packages have complex dependencies. They are difficult to set up and use.
  • Python dependencies are huge. A Docker image for Python or PyTorch is typically several GBs or even tens of GBs. That is especially problematic for AI inference on edge servers or on devices.
  • Python is a very slow language. Up to 35,000x slower than compiled languages such as C, C++, and Rust.
  • Because Python is slow, most of the actual workloads must be delegated to native shared libraries beneath the Python wrapper. That makes Python inference apps great for demos, but very hard to modify under the hood for business-specific needs.
  • The heavy dependency on native libraries, combined with complex dependency management, makes it very hard to port Python AI programs across devices while taking advantage of the device’s unique hardware features.

Commonly used Python packages in the LLM toolchain directly conflict with each other.

Chris Lattner, of LLVM, TensorFlow, and Swift language fame, gave a great interview on the This Week in Startups podcast. He discussed why Python is great for model training but the wrong choice for inference applications.

The advantages of Rust+Wasm

The Rust+Wasm stack provides a unified cloud computing infrastructure that spans devices, edge cloud, on-prem servers, and the public cloud. It is a strong alternative to the Python stack for AI inference applications. No wonder Elon Musk said that Rust is the language of AGI.

  • Ultra lightweight. The inference application is just 2MB with all dependencies. It is less than 1% of the size of a typical PyTorch container.
  • Very fast. Native C/Rust speed in all parts of the inference application: pre-processing, tensor computation, and post-processing.
  • Portable. The same Wasm bytecode application can run on all major computing platforms with support for heterogeneous hardware acceleration.
  • Easy to set up, develop, and deploy. There are no complex dependencies. Build a single Wasm file using standard tools on your laptop and deploy it everywhere!
  • Safe and cloud-ready. The Wasm runtime is designed to isolate untrusted user code. The Wasm runtime can be managed by container tools and easily deployed on cloud-native platforms.

The Rust inference program

Our demo inference program is written in Rust and compiled into Wasm. The core Rust source code is very simple, at around 40 lines of code. The program manages the user input, tracks the conversation history, transforms the text into llama2's chat prompt template, and runs the inference operations using the WASI NN API. (The read_input() helper in the listing below is a minimal sketch of the stdin-reading function the program relies on; the imports are included so the listing compiles on its own.)

use std::env;
use std::io;

fn main() {
    let args: Vec<String> = env::args().collect();
    let model_name: &str = &args[1];

    // Load the GGUF model pre-loaded by the WasmEdge runtime (see the
    // --nn-preload flag) and create an execution context.
    let graph =
        wasi_nn::GraphBuilder::new(wasi_nn::GraphEncoding::Ggml, wasi_nn::ExecutionTarget::AUTO)
            .build_from_cache(model_name)
            .unwrap();
    let mut context = graph.init_execution_context().unwrap();

    let system_prompt = String::from("<<SYS>>You are a helpful, respectful and honest assistant. Always answer as short as possible, while being safe. <</SYS>>");
    let mut saved_prompt = String::new();

    loop {
        println!("Question:");
        let input = read_input();
        // Build the llama2 chat prompt, keeping the conversation history.
        if saved_prompt == "" {
            saved_prompt = format!("[INST] {} {} [/INST]", system_prompt, input.trim());
        } else {
            saved_prompt = format!("{} [INST] {} [/INST]", saved_prompt, input.trim());
        }

        // Set the prompt as the input tensor.
        let tensor_data = saved_prompt.as_bytes().to_vec();
        context
            .set_input(0, wasi_nn::TensorType::U8, &[1], &tensor_data)
            .unwrap();

        // Execute the inference.
        context.compute().unwrap();

        // Retrieve the output.
        let mut output_buffer = vec![0u8; 1000];
        let output_size = context.get_output(0, &mut output_buffer).unwrap();
        let output = String::from_utf8_lossy(&output_buffer[..output_size]).to_string();
        println!("Answer:\n{}", output.trim());

        // Append the answer to the history for the next turn.
        saved_prompt = format!("{} {} ", saved_prompt, output.trim());
    }
}

// Read one non-empty line from stdin. This is a minimal sketch of the helper;
// the version in the source project may differ in detail.
fn read_input() -> String {
    loop {
        let mut buf = String::new();
        io::stdin()
            .read_line(&mut buf)
            .expect("failed to read from stdin");
        if !buf.trim().is_empty() {
            return buf;
        }
    }
}

To build the application yourself, just install the Rust compiler and its wasm32-wasi compiler target.

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
rustup target add wasm32-wasi

Then, check out the source project, and run the cargo command to build the Wasm file from the Rust source project.

# Clone the source project
git clone https://github.com/second-state/llama-utils
cd llama-utils/chat/

# Build
cargo build --target wasm32-wasi --release

# The result wasm file
cp target/wasm32-wasi/release/llama-chat.wasm .
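
The resulting llama-chat.wasm file can be run exactly like the pre-built binary from Step 2 above:

wasmedge --dir .:. --nn-preload default:GGML:AUTO:llama-2-7b-chat-q5_k_m.gguf llama-chat.wasm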

Running in the cloud or on the edge

Once you have the Wasm bytecode file, you can deploy it on any device that supports the WasmEdge runtime. You just need to install WasmEdge with the GGML plugin. We currently provide GGML plugin builds for generic Linux and Ubuntu Linux, on both x86 and ARM CPUs, for Nvidia GPUs, and for Apple M1/M2/M3 devices.

Based on llama.cpp, the WasmEdge GGML plugin automatically takes advantage of any hardware acceleration available on the device to run your llama2 models. For example, if your device has an Nvidia GPU, the installer automatically installs a CUDA-optimized version of the GGML plugin. For Mac devices, the macOS build of the GGML plugin uses the Metal API to run the inference workload on the GPU built into the M1/M2/M3 chips. The Linux CPU build of the GGML plugin uses the OpenBLAS library and automatically detects and utilizes advanced computational features, such as AVX and SIMD, on modern CPUs.
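
On GPU machines, the -g / --n-gpu-layers option shown in the help output above controls how many layers are offloaded to the GPU. The value 35 below matches the "offloaded 35/35 layers" line in the A10G log; you can lower it if VRAM is limited:

wasmedge --dir .:. --nn-preload default:GGML:AUTO:llama-2-7b-chat-q5_k_m.gguf llama-chat.wasm -g 35 --log-stat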

That’s how we achieve portability across heterogeneous AI hardware and platforms without sacrificing performance.

What’s next

While the WasmEdge GGML tooling is usable today (and indeed used by our cloud-native customers), it is still in its early stages. If you are interested in contributing to the open source projects and shaping the direction of future LLM inference infrastructure, here is some low-hanging fruit you could pick up!

  • Add GGML plugins for more hardware and OS platforms. We are also interested in TPUs, ARM NPUs, and other specialized AI chips on Linux and Windows.
  • Support more llama.cpp configurations. We currently support passing some config options from Wasm to the GGML plugin. But we would like to support all the options GGML provides!
  • Support WASI NN APIs in other Wasm-compatible languages. We are specifically interested in Go, Zig, Kotlin, JavaScript, C and C++.

Other AI models

As a lightweight, fast, portable, and secure Python alternative, WasmEdge and WASI NN can be used to build inference applications around popular AI models beyond LLMs. For example:

  • The mediapipe-rs project provides Rust+Wasm APIs for Google's MediaPipe suite of TensorFlow models.
  • The WasmEdge YOLO project provides Rust+Wasm APIs to work with YOLOv8 PyTorch models.
  • The WasmEdge ADAS demo shows how to perform road segmentation in self-driving cars using an Intel OpenVINO model.
  • The WasmEdge Document AI project will provide Rust+Wasm APIs for a suite of popular OCR and document processing models.

Lightweight AI inference on the edge has just started!

Join the conversation and contribute on the WasmEdge Discord. Discuss, learn, and share your insights.

Tags: LLM, AI inference, Rust, WebAssembly