Getting Started with TinyLlama-1.1B-Chat-v0.3

TinyLlama is an open-source effort to train a “small” LLM with only 1.1B parameters on a large corpus of data (3T tokens). It is meant to push the scaling-law envelope by compressing as much knowledge as possible into a small model file. The small size also translates to fast inference. If it succeeds, it will be a great fit for edge devices and real-time applications. That is right in the sweet spot of WasmEdge!

The TinyLlama-1.1B-Chat-v0.3 model is TinyLlama fine-tuned on the OpenAssistant dataset to follow conversations. At this checkpoint (v0.3), the model has been trained on 530B tokens (about 1/6 complete). It is already quite impressive in our tests!

In this article, we will cover:

How to run TinyLlama-1.1B-Chat-v0.3 on your own device
How to create an OpenAI-compatible API service for TinyLlama-1.1B-Chat-v0.3

We will use the Rust + Wasm stack to develop and deploy applications for this model. There are no complex Python packages or C++ toolchains to install! See why we choose this tech stack.

Run the model on your own device

Step 1: Install WasmEdge via the following command line.

curl -sSf https://raw.githubusercontent.com/WasmEdge/WasmEdge/master/utils/install.sh | bash -s -- --plugins wasmedge_rustls wasi_nn-ggml

Step 2: Download the model GGUF file. It may take a while, depending on your connection speed.

curl -LO https://huggingface.co/second-state/TinyLlama-1.1B-Chat-v0.3-GGUF/resolve/main/tinyllama-1.1b-chat-v0.3.Q5_K_M.gguf
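
Before moving on, it can be worth checking that the download is really a model file. GGUF files begin with the four ASCII bytes `GGUF`, so a download that saved a web page or an error message instead will fail the test below. This is only a sketch: `is_gguf` is our own helper name, not part of WasmEdge or LlamaEdge, and it checks the file header only, not the integrity of the whole file.

```shell
# Hypothetical helper (not shipped with WasmEdge or LlamaEdge):
# succeeds only if the file exists and starts with the GGUF magic bytes.
is_gguf() {
  local file="$1"
  # A missing file cannot be a valid model.
  [ -f "$file" ] || return 1
  # Read the first 4 bytes and compare them with the GGUF magic.
  local magic
  magic="$(head -c 4 "$file" 2>/dev/null)"
  [ "$magic" = "GGUF" ]
}

# Example usage:
#   if is_gguf tinyllama-1.1b-chat-v0.3.Q5_K_M.gguf; then
#     echo "Looks like a GGUF model file"
#   else
#     echo "Not a GGUF file; check the download URL"
#   fi
```

If the check fails, opening the file in a text editor usually shows right away whether an HTML page was saved in place of the model.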

Step 3: Download a cross-platform portable Wasm file for the chat app. The application allows you to chat with the model on the command line. The Rust source code for the app is here.

curl -LO https://github.com/LlamaEdge/LlamaEdge/releases/latest/download/llama-chat.wasm

That's it. You can chat with the model in the terminal by entering the following command.

wasmedge --dir .:. --nn-preload default:GGML:AUTO:tinyllama-1.1b-chat-v0.3.Q5_K_M.gguf llama-chat.wasm -p chatml

The portable Wasm app automatically takes advantage of the hardware accelerators (e.g., GPUs) available on the device.

On my M1 Mac with 32 GB of memory, it clocks in at about 77.44 tokens per second.

[USER]: 
What is boomer humor?

[ASSISTANT]:
Boomer humor refers to the type of humor that was popular among the baby boomer generation, which includes people born between 1946 and 1964. It often involves references to cultural events, historical figures, and pop culture from the 1950s and 1960s. Boomer humor can be seen in TV shows, movies, and stand-up comedy from that era, and it continues to be a source of nostalgia and entertainment for many people today.

[USER]:
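
The `-p chatml` flag in the command above selects the ChatML prompt template, which wraps each turn in `<|im_start|>` and `<|im_end|>` markers before the text reaches the model. The sketch below shows roughly what a single-turn ChatML prompt looks like. The function name `chatml_prompt` is hypothetical, and the exact whitespace and system prompt used inside llama-chat.wasm may differ.

```shell
# Hypothetical illustration of the ChatML layout; the chat app builds
# its prompt internally, so this helper is for understanding only.
chatml_prompt() {
  local system="$1" user="$2"
  # System turn: sets the assistant's behavior.
  printf '<|im_start|>system\n%s<|im_end|>\n' "$system"
  # User turn: the question typed at the [USER] prompt.
  printf '<|im_start|>user\n%s<|im_end|>\n' "$user"
  # Open assistant turn: the model continues from here.
  printf '<|im_start|>assistant\n'
}

# Example usage:
#   chatml_prompt 'You are a helpful AI assistant' 'What is boomer humor?'
```

Picking a template that does not match the one used for fine-tuning tends to give rambling or truncated answers, so the `-p` value should follow the model.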

Create an OpenAI-compatible API service

An OpenAI-compatible web API allows the model to work with a large ecosystem of LLM tools and agent frameworks such as flows.network, LangChain and LlamaIndex.

Download an API server app. It is also a cross-platform portable Wasm app that can run on many CPU and GPU devices.

curl -LO https://github.com/LlamaEdge/LlamaEdge/releases/latest/download/llama-api-server.wasm

Then, use the following command to start an API server for the model.

wasmedge --dir .:. --nn-preload default:GGML:AUTO:tinyllama-1.1b-chat-v0.3.Q5_K_M.gguf llama-api-server.wasm -p chatml
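
The `--nn-preload` value in the command above packs four colon-separated fields: an alias the app uses to look up the model (`default`), the inference backend (`GGML`), the execution target (`AUTO`), and the path to the model file. As a small illustration, the helper below assembles that value so a different GGUF file can be swapped in. `nn_preload_arg` is our own name for this sketch, not a WasmEdge command.

```shell
# Hypothetical helper: build the value for wasmedge's --nn-preload flag.
#   $1 = path to the GGUF model file
#   $2 = model alias (optional, defaults to "default")
nn_preload_arg() {
  local model_file="$1"
  local name="${2:-default}"
  # Fields: alias : backend : execution target : model path
  printf '%s:GGML:AUTO:%s\n' "$name" "$model_file"
}

# Example usage:
#   wasmedge --dir .:. \
#     --nn-preload "$(nn_preload_arg tinyllama-1.1b-chat-v0.3.Q5_K_M.gguf)" \
#     llama-api-server.wasm -p chatml
```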

From another terminal, you can interact with the API server using curl.

curl -X POST http://0.0.0.0:8080/v1/chat/completions -H 'accept:application/json' -H 'Content-Type: application/json' -d '{"messages":[{"role":"system", "content":"You are a helpful AI assistant"}, {"role":"user", "content":"What is the capital of France?"}], "model":"TinyLlama-1.1B-Chat-v0.3"}'
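
The request body follows the OpenAI chat completions shape: a `messages` array of role and content pairs, plus a `model` name. Here is a minimal sketch that builds the same body from shell variables. `chat_request_body` is a hypothetical helper, and it assumes the inputs contain no double quotes, backslashes, or newlines, since it does no JSON escaping.

```shell
# Hypothetical helper: print a chat completions request body.
#   $1 = system prompt, $2 = user message, $3 = model name (optional)
# Assumption: arguments need no JSON escaping (no quotes or backslashes).
chat_request_body() {
  local system="$1" user="$2" model="${3:-TinyLlama-1.1B-Chat-v0.3}"
  printf '{"messages":[{"role":"system","content":"%s"},{"role":"user","content":"%s"}],"model":"%s"}\n' \
    "$system" "$user" "$model"
}

# Example usage, with the API server from the previous step running:
#   curl -X POST http://0.0.0.0:8080/v1/chat/completions \
#     -H 'Content-Type: application/json' \
#     -d "$(chat_request_body 'You are a helpful AI assistant' 'What is the capital of France?')"
```

Since the API is OpenAI-compatible, the reply text should appear under `choices[0].message.content` in the returned JSON.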

That’s all. WasmEdge is the easiest, fastest, and safest way to run LLM applications. Give it a try!

Join the WasmEdge Discord to ask questions or share insights.

No time to DIY? Book a Demo with us to enjoy your own LLMs across devices!