Apr 18, 2024 • 3 minutes to read

Meta has just released its next generation of open-source LLMs, Meta Llama 3. According to Meta, it delivers state-of-the-art performance among openly available models and is competitive with the most capable closed-source LLMs. Currently, the Llama 3 8B and 70B models are available, and a massive 400B model is expected in the next several months. The Llama 3 models were trained on a significantly larger dataset than their predecessor, Llama 2, resulting in improved capabilities such as reasoning and code generation. Learn more about Meta Llama 3 release here.

In this article, taking Llama-3-8B as an example, we will cover:

  • How to run Llama-3-8B on your own device
  • How to create an OpenAI-compatible API service for Llama-3-8B

We will use LlamaEdge (the Rust + Wasm stack) to develop and deploy applications for this model. There are no complex Python packages or C++ toolchains to install! See why we choose this tech stack.

To start quickly, you can use the following command to run Llama-3-8B on your device. It downloads the required software, including the LLM runtime, the model, and the LLM inference app.

bash <(curl -sSfL '') --model llama-3-8b-instruct

Run Llama-3-8B on your own device

Step 1: Install WasmEdge via the following command line.

curl -sSf | bash -s -- --plugin wasi_nn-ggml

Step 2: Download the Llama-3-8B model GGUF file. Since the size of the model is 5.73 GB, it could take a while to download.

curl -LO
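
Because a download of this size can be interrupted, it may be worth checking the file before moving on. The sketch below is not part of the LlamaEdge tooling; it only relies on the fact that GGUF files start with the four ASCII bytes `GGUF`. The helper names `looks_like_gguf` and `size_gb` are made up for this example, and the size is reported in decimal gigabytes, which may differ slightly from the unit behind the 5.73 GB figure above.

```python
from pathlib import Path

# GGUF files begin with this 4-byte magic value.
GGUF_MAGIC = b"GGUF"


def looks_like_gguf(path):
    """Return True if the file exists and starts with the GGUF magic bytes."""
    p = Path(path)
    if not p.is_file():
        return False
    with p.open("rb") as f:
        return f.read(4) == GGUF_MAGIC


def size_gb(path):
    """File size in decimal gigabytes (10**9 bytes); hypothetical helper."""
    return Path(path).stat().st_size / 1e9
```

For example, `looks_like_gguf("Meta-Llama-3-8B-Instruct-Q5_K_M.gguf")` returning `False` would suggest the download is missing or is not a GGUF file (for instance, an HTML error page saved under that name). A passing check does not prove the file is complete, so comparing `size_gb(...)` with the expected size is a useful second look.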

Step 3: Download a cross-platform portable Wasm file for the chat app. The application allows you to chat with the model on the command line. The Rust source code for the app is here.

curl -LO

That's it. You can chat with the model in the terminal by entering the following command.

wasmedge --dir .:. --nn-preload default:GGML:AUTO:Meta-Llama-3-8B-Instruct-Q5_K_M.gguf llama-chat.wasm -p llama-3-chat 

The portable Wasm app automatically takes advantage of the hardware accelerators (e.g., GPUs) available on the device.
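
To make the structure of the chat command easier to see, here is a small Python sketch that assembles the same argument list. It assumes that the `--nn-preload` value is four colon-separated fields (an alias, a backend, a target device, and the model path), which is how the command above reads; the function names `nn_preload` and `chat_command` are hypothetical and exist only for this illustration.

```python
import shlex


def nn_preload(model_file, alias="default", backend="GGML", target="AUTO"):
    """Build the --nn-preload value.

    Assumed field order: alias, backend, target device, model path.
    "AUTO" is the value used in this article's command.
    """
    return f"{alias}:{backend}:{target}:{model_file}"


def chat_command(model_file, wasm_app="llama-chat.wasm",
                 prompt_template="llama-3-chat"):
    """Return the chat command from this article as an argument list."""
    return [
        "wasmedge",
        "--dir", ".:.",
        "--nn-preload", nn_preload(model_file),
        wasm_app,
        "-p", prompt_template,
    ]


def as_shell(args):
    """Render an argument list as a single shell-quoted command line."""
    return shlex.join(args)
```

Passing the list from `chat_command("Meta-Llama-3-8B-Instruct-Q5_K_M.gguf")` to `subprocess.run` would start the same chat session as the command above, and swapping in a different GGUF file only changes the last field of the preload string.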

Create an OpenAI-compatible API service for Llama-3-8B

An OpenAI-compatible web API allows the model to work with a large ecosystem of LLM tools and agent frameworks, such as LangChain and LlamaIndex.

Download an API server app. It is also a cross-platform portable Wasm app that can run on many CPU and GPU devices.

curl -LO

Then, download the chatbot web UI so that you can interact with the model from your browser.

curl -LO
tar xzf chatbot-ui.tar.gz
rm chatbot-ui.tar.gz

Next, use the following command to start an API server for the model. Then, open your browser to http://localhost:8080 to start the chat!

wasmedge --dir .:. --nn-preload default:GGML:AUTO:Meta-Llama-3-8B-Instruct-Q5_K_M.gguf \
  llama-api-server.wasm \
  --prompt-template llama-3-chat \
  --ctx-size 4096 \
  --model-name Llama-3-8B
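
If you start the server from a script, it can help to wait until port 8080 accepts connections before opening the browser or sending requests. The following is a minimal sketch using only the Python standard library; `wait_for_port` is a hypothetical helper, and an open port only shows that the process is listening, not necessarily that the model has finished loading.

```python
import socket
import time


def wait_for_port(host="localhost", port=8080, timeout=60.0, interval=0.5):
    """Poll until a TCP connection to host:port succeeds or timeout expires.

    Returns True once a connection is accepted, False if the deadline passes.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

A script could call `wait_for_port()` right after launching `wasmedge` in the background and continue only when it returns `True`.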

From another terminal, you can interact with the API server using curl.

curl -X POST http://localhost:8080/v1/chat/completions \
  -H 'accept:application/json' \
  -H 'Content-Type: application/json' \
  -d '{"messages":[{"role":"user", "content": "write a hello world in Rust"}], "model":"Llama-3-8B"}'
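
The same request can be sent from a program. Below is a minimal Python sketch, using only the standard library, that mirrors the curl call above: same URL, headers, and JSON body. The function names `build_request` and `chat` are made up for this example, and reading the reply from `choices[0].message.content` assumes the server returns the usual OpenAI chat completion response shape.

```python
import json
import urllib.request

API_URL = "http://localhost:8080/v1/chat/completions"


def build_request(prompt, model="Llama-3-8B", url=API_URL):
    """Build the same POST request as the curl example."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "model": model,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "accept": "application/json",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def chat(prompt, model="Llama-3-8B", url=API_URL, timeout=120):
    """Send the prompt and return the assistant's reply text.

    Assumes an OpenAI-style response body: choices[0].message.content.
    """
    req = build_request(prompt, model, url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the API server running, `print(chat("write a hello world in Rust"))` should print the model's answer. Tools that accept a custom OpenAI base URL can usually be pointed at `http://localhost:8080/v1` in the same spirit.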

That’s all. WasmEdge is the easiest, fastest, and safest way to run LLM applications. Give it a try!

