Getting Started with Qwen1.5-72B-Chat

Qwen1.5-72B-Chat, developed by Alibaba Cloud, offers the following improvements over the previously released Qwen models, according to its Hugging Face page: significant performance improvement in human preference for chat models; multilingual support of both base and chat models; and stable support of 32K context length for models of all sizes. According to a figure on Qwen's GitHub page, it surpasses GPT-4 in 4 out of 10 benchmarks.

In this article, we use Qwen1.5-72B-Chat as an example and cover:

How to run Qwen1.5-72B-Chat on your own device
How to create an OpenAI-compatible API service for Qwen1.5-72B-Chat

The Qwen team released six model sizes at the same time: 0.5B, 1.8B, 4B, 7B, 14B, and 72B. You can follow this article to run the other sizes by changing the model name on the command line.

We will use LlamaEdge (Rust + Wasm stack) to develop and deploy applications for this model. There are no complex Python packages or C++ toolchains to install! See why we chose this tech stack.

Run Qwen1.5-72B-Chat on your own device

Step 1: Install WasmEdge via the following command line.

curl -sSf https://raw.githubusercontent.com/WasmEdge/WasmEdge/master/utils/install.sh | bash -s -- --plugins wasmedge_rustls wasi_nn-ggml
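Before moving on, it can help to confirm that the wasmedge binary is visible in the current shell. The following is a minimal sketch: require_cmd is a hypothetical helper written for this article, not part of WasmEdge, and the env file path in its hint assumes the installer's default location under your home directory.

```shell
# Hypothetical helper: report whether a command is available on PATH.
# Returns 0 and prints "found: NAME" if it is, otherwise prints a hint and returns 1.
require_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $1"
  else
    # Assumption: the installer wrote its environment file to $HOME/.wasmedge/env
    echo "missing: $1 (try a new terminal, or: source \$HOME/.wasmedge/env)" >&2
    return 1
  fi
}
```

Running require_cmd wasmedge after the installation should print "found: wasmedge" if the binary is on your PATH.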

Step 2: Download the Qwen1.5-72B-Chat model GGUF file. Because the model is very large, it is split into two files that must be downloaded and then combined into one. The download will take a long time.

curl -LO https://huggingface.co/Qwen/Qwen1.5-72B-Chat-GGUF/resolve/main/qwen1_5-72b-chat-q5_k_m.gguf.a
curl -LO https://huggingface.co/Qwen/Qwen1.5-72B-Chat-GGUF/resolve/main/qwen1_5-72b-chat-q5_k_m.gguf.b
cat qwen1_5-72b-chat-q5_k_m.gguf.a qwen1_5-72b-chat-q5_k_m.gguf.b > qwen1_5-72b-chat-q5_k_m.gguf
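With downloads this large, an interrupted transfer is easy to miss. As a rough sanity check, the sketch below compares the size of the merged file with the combined size of its parts. verify_merge is a hypothetical helper written for this article; it only checks byte counts, so it does not detect corrupted content.

```shell
# Hypothetical helper: check that a merged file is as large as its parts combined.
# Usage: verify_merge MERGED PART [PART...]
# Returns 0 on a match, 1 on a size mismatch, 2 if a file cannot be read.
verify_merge() {
  merged=$1
  shift
  expected=0
  for part in "$@"; do
    [ -r "$part" ] || { echo "cannot read: $part" >&2; return 2; }
    size=$(wc -c < "$part")
    expected=$((expected + $size))
  done
  [ -r "$merged" ] || { echo "cannot read: $merged" >&2; return 2; }
  actual=$(wc -c < "$merged")
  actual=$(($actual + 0))   # strip the padding some wc versions add
  if [ "$actual" -eq "$expected" ]; then
    echo "OK: $merged is $actual bytes"
  else
    echo "MISMATCH: expected $expected bytes, got $actual" >&2
    return 1
  fi
}
```

For the files above, the call would be: verify_merge qwen1_5-72b-chat-q5_k_m.gguf qwen1_5-72b-chat-q5_k_m.gguf.a qwen1_5-72b-chat-q5_k_m.gguf.b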

Step 3: Download a cross-platform portable Wasm file for the chat app. The application allows you to chat with the model on the command line. The Rust source code for the app is here.

curl -LO https://github.com/second-state/LlamaEdge/releases/latest/download/llama-chat.wasm

That's it. You can chat with the model in the terminal by entering the following command.

wasmedge --dir .:. --nn-preload default:GGML:AUTO:qwen1_5-72b-chat-q5_k_m.gguf llama-chat.wasm -p chatml

The portable Wasm app automatically takes advantage of the hardware accelerators (e.g., GPUs) available on the device.

[You]:
What is Sora?

[Bot]:
Sora is a fictional character from the anime and manga series "One Punch Man." Sora is known for his unique fighting style, which involves using his hands to manipulate objects and create powerful attacks. Sora has appeared in various forms throughout the series, including as a main character, a supporting character, or a side character.
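As noted in the introduction, the same steps apply to the other Qwen1.5 sizes once the model name is changed. The sketch below prints the chat command for a given size. Both helpers are hypothetical, and the file name pattern is an assumption: it presumes the other sizes follow the same naming as the 72B file used in this article, so check the model's Hugging Face page for the actual file names before downloading.

```shell
# Hypothetical helper: print the GGUF file name for a size such as 72B, 7B or 0.5B.
# ASSUMPTION: other sizes use the same pattern as qwen1_5-72b-chat-q5_k_m.gguf,
# with the size in lower case and any dot replaced by an underscore.
qwen_gguf_name() {
  size=$(printf '%s' "$1" | tr 'A-Z' 'a-z' | tr '.' '_')
  printf 'qwen1_5-%s-chat-q5_k_m.gguf\n' "$size"
}

# Hypothetical helper: print (not run) the chat command for that size,
# so it can be reviewed before use.
qwen_chat_cmd() {
  printf 'wasmedge --dir .:. --nn-preload default:GGML:AUTO:%s llama-chat.wasm -p chatml\n' \
    "$(qwen_gguf_name "$1")"
}
```

For example, qwen_chat_cmd 72B prints the same command used earlier in this section.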

Create an OpenAI-compatible API service for Qwen1.5-72B-Chat

An OpenAI-compatible web API allows the model to work with a large ecosystem of LLM tools and agent frameworks such as flows.network, LangChain and LlamaIndex.

Download an API server app. It is also a cross-platform portable Wasm app that can run on many CPU and GPU devices.

curl -LO https://github.com/second-state/LlamaEdge/releases/latest/download/llama-api-server.wasm

Then, download the chatbot web UI to interact with the model from your browser.

curl -LO https://github.com/second-state/chatbot-ui/releases/latest/download/chatbot-ui.tar.gz
tar xzf chatbot-ui.tar.gz
rm chatbot-ui.tar.gz

Next, use the following command to start an API server for the model. Then, open http://localhost:8080 in your browser to start chatting!

wasmedge --dir .:. --nn-preload default:GGML:AUTO:qwen1_5-72b-chat-q5_k_m.gguf llama-api-server.wasm -p chatml

From another terminal, you can interact with the API server using curl.

curl -X POST http://localhost:8080/v1/chat/completions \
  -H 'accept:application/json' \
  -H 'Content-Type: application/json' \
  -d '{"messages":[{"role":"system", "content": "You are a sentient, superintelligent artificial general intelligence, here to teach and assist me."}, {"role":"user", "content": "Write a short story about Goku discovering kirby has teamed up with Majin Buu to destroy the world."}], "model":"Qwen1.5-72B-Chat"}'
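To send different prompts without editing the JSON by hand, the request above can be wrapped in a few shell functions. This is a minimal sketch: json_escape, chat_body and ask are hypothetical helpers written for this article, the endpoint and model name are copied from the command above, and the escaping only handles backslashes and double quotes (not newlines or other control characters).

```shell
# Hypothetical helper: escape backslashes and double quotes for use in a JSON string.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Hypothetical helper: build the request body from a system prompt and a user prompt.
chat_body() {
  printf '{"messages":[{"role":"system","content":"%s"},{"role":"user","content":"%s"}],"model":"Qwen1.5-72B-Chat"}\n' \
    "$(json_escape "$1")" "$(json_escape "$2")"
}

# Hypothetical helper: send one user prompt to the local API server started earlier.
ask() {
  curl -sS -X POST http://localhost:8080/v1/chat/completions \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d "$(chat_body "You are a helpful assistant." "$1")"
}
```

With the server running, ask 'What is Sora?' prints the raw JSON response. In the OpenAI chat completion format, the reply text is expected under choices[0].message.content.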

That’s all. WasmEdge is the easiest, fastest, and safest way to run LLM applications. Give it a try!

Talk to us!

Join the WasmEdge Discord to ask questions and share insights.

Have questions about getting this model running? Please go to second-state/LlamaEdge to raise an issue, or book a demo with us to enjoy your own LLMs across devices!