Getting Started with Wizard-Vicuna-13B

Wizard-Vicuna-13B is an impressive creation based on the Llama 2 platform and developed by MelodysDreamj. This model represents a significant advancement in the field of large language models (LLMs). It effectively combines the principles of WizardLM and VicunaLM, which includes the dataset from WizardLM and the conversation extension from ChatGPT, along with Vicuna's unique tuning method. This innovative combination results in a robust model capable of a wide range of applications.

In this article, we will cover

How to run Wizard-Vicuna-13B on your own device
How to create an OpenAI-compatible API service for Wizard-Vicuna-13B

We will use the Rust + Wasm stack to develop and deploy applications for this model. There is no complex Python packages or C++ toolchains to install! See why we choose this tech stack.

Run the model on your own device

Step 1: Install WasmEdge via the following command line.

curl -sSf https://raw.githubusercontent.com/WasmEdge/WasmEdge/master/utils/install.sh | bash -s -- --plugins wasmedge_rustls wasi_nn-ggml

Step 2: Download the model GGUF file. It may take a long time, since the size of the model is several GBs.

curl -LO https://huggingface.co/second-state/wizard-vicuna-13B-GGUF/resolve/main/wizard-vicuna-13b-ggml-model-q8_0.gguf

Step 3: Download a cross-platform portable Wasm file for the chat app. The application allows you to chat with the model on the command line. The Rust source code for the app is here.

curl -LO https://github.com/LlamaEdge/LlamaEdge/releases/latest/download/llama-chat.wasm

That's it. You can chat with the model in the terminal by entering the following command.

wasmedge --dir .:. --nn-preload default:GGML:AUTO:wizard-vicuna-13b-ggml-model-q8_0.gguf llama-chat.wasm -p vicuna-chat

The portable Wasm app automatically takes advantage of the hardware accelerators (eg GPUs) I have on the device.

[USER]:
What's the capital of France?
[ASSISTANT]:
The capital of France is Paris.
[USER]:
what about Norway?
[ASSISTANT]:
The capital of Norway is Oslo.
[USER]:
I have two apples, each costing 5 dollars. What is the total cost of these apples?
[ASSISTANT]:
The total cost of the two apples is 10 dollars.
[USER]:
What if I have 3 apples?
[ASSISTANT]:
If you have 3 apples, each costing 5 dollars, the total cost of the apples is 15 dollars.

Create an OpenAI-compatible API service

An OpenAI-compatible web API allows the model to work with a large ecosystem of LLM tools and agent frameworks such as flows.network, LangChain and LlamaIndex.

Download an API server app. It is also a cross-platform portable Wasm app that can run on many CPU and GPU devices.

curl -LO https://github.com/LlamaEdge/LlamaEdge/releases/latest/download/llama-api-server.wasm

Then, use the following command lines to start an API server for the model.

wasmedge --dir .:. --nn-preload default:GGML:AUTO:Wizard-Vicuna-13B.Q5_K_M.gguf llama-api-server.wasm -p vicuna-chat

From another terminal, you can interact with the API server using curl.

curl -X POST http://0.0.0.0:8080/v1/chat/completions -H 'accept:application/json' -H 'Content-Type: application/json' -d '{"messages":[{"role":"system", "content":"You are a helpful AI assistant"}, {"role":"user", "content":"What is the capital of France?"}], "model":"Wizard-Vicuna-13B-7B"}'

That’s all. WasmEdge is easiest, fastest, and safest way to run LLM applications. Give it a try!

Join the WasmEdge discord. Discuss, learn, and share your insights.