Getting Started with StableLM-2-Zephyr-1.6B

Stability AI’s StableLM-2-Zephyr-1.6B is a 1.6 billion parameter instruction-tuned language model inspired by HuggingFaceH4's Zephyr 7B training pipeline. The model is trained on a mix of publicly available and synthetic datasets using Direct Preference Optimization (DPO), a method that fine-tunes large language models directly on human preference data, without a separate reward model or a reinforcement learning loop.

In this article, we will cover:

How to run StableLM-2-Zephyr-1.6B on your own device
How to create an OpenAI-compatible API service for StableLM-2-Zephyr-1.6B

We will use LlamaEdge (the Rust + Wasm stack) to develop and deploy applications for this model. There are no complex Python packages or C++ toolchains to install! See why we chose this tech stack.

Run the model on your own device

Step 1: Install WasmEdge with the following command.

curl -sSf https://raw.githubusercontent.com/WasmEdge/WasmEdge/master/utils/install.sh | bash -s -- --plugins wasmedge_rustls wasi_nn-ggml
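
If the `wasmedge` command is not found after the installer finishes, the current shell may not have picked up the new PATH yet. The following is a minimal sketch of a check, assuming the installer used its default location (`$HOME/.wasmedge`) and wrote an `env` file there. `have_cmd` is a helper name made up for this example.

```bash
# Load the environment file written by the installer, if present.
# Assumption: default install location of $HOME/.wasmedge.
if [ -f "$HOME/.wasmedge/env" ]; then
  . "$HOME/.wasmedge/env"
fi

# have_cmd NAME: succeed if NAME resolves to a command on PATH.
# (Hypothetical helper, not part of WasmEdge.)
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

if have_cmd wasmedge; then
  wasmedge --version
else
  echo "wasmedge not found on PATH; open a new shell or re-run the installer" >&2
fi
```

If the check fails even in a new terminal, the installer output usually says where the files were placed.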

Step 2: Download the StableLM-2-Zephyr-1.6B model GGUF file. The download may take a while, since the model file is large.

curl -LO https://huggingface.co/second-state/stablelm-2-zephyr-1.6b-GGUF/resolve/main/stablelm-2-zephyr-1_6b-Q5_K_M.gguf
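
An interrupted or redirected download can leave an error page on disk instead of the model. GGUF files begin with the four ASCII bytes `GGUF`, so a quick look at the file header catches that kind of failure. This is only a rough sanity check (it will not detect a truncated file), and `is_gguf` is a helper name invented for this sketch.

```bash
# is_gguf FILE: succeed if FILE exists and starts with the ASCII magic "GGUF".
# (Hypothetical helper; checks the header only, not the full file.)
is_gguf() {
  [ -f "$1" ] && [ "$(head -c 4 "$1" 2>/dev/null)" = "GGUF" ]
}

MODEL=stablelm-2-zephyr-1_6b-Q5_K_M.gguf
if is_gguf "$MODEL"; then
  echo "$MODEL looks like a GGUF file"
else
  echo "$MODEL is missing or is not a GGUF file; try downloading it again" >&2
fi
```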

Step 3: Download a cross-platform portable Wasm file for the chat app. The application allows you to chat with the model on the command line. The Rust source code for the app is here.

curl -LO https://github.com/LlamaEdge/LlamaEdge/releases/latest/download/llama-chat.wasm

That's it. You can chat with the model in the terminal by entering the following command.

wasmedge --dir .:. --nn-preload default:GGML:AUTO:stablelm-2-zephyr-1_6b-Q5_K_M.gguf llama-chat.wasm -p stablelm-zephyr -r '<|endoftext|>' --temp 0.5 -c 1024
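
The command packs several settings into one line. As we understand the options: `--dir .:.` gives the Wasm app access to the current directory, `--nn-preload` takes `alias:backend:device:model-file`, `-p` selects the prompt template, `-r` sets the string that ends a reply, `--temp` is the sampling temperature, and `-c` is the context size in tokens. The sketch below spells these out as variables so they are easier to change. The variable names, `run_chat`, and `DRY_RUN` are our own and not part of LlamaEdge.

```bash
# Settings from the command above, named for readability.
# All names here are illustrative, not LlamaEdge options.
MODEL_FILE=stablelm-2-zephyr-1_6b-Q5_K_M.gguf
PROMPT_TEMPLATE=stablelm-zephyr
STOP_STRING='<|endoftext|>'
TEMPERATURE=0.5
CTX_SIZE=1024

# run_chat: start the chat app, or just print the command when DRY_RUN=1.
run_chat() {
  set -- wasmedge --dir .:. \
    --nn-preload "default:GGML:AUTO:$MODEL_FILE" \
    llama-chat.wasm \
    -p "$PROMPT_TEMPLATE" -r "$STOP_STRING" \
    --temp "$TEMPERATURE" -c "$CTX_SIZE"
  if [ "${DRY_RUN:-0}" = 1 ]; then
    printf '%s\n' "$*"   # show the command without running it
  else
    "$@"                 # run the command with the arguments intact
  fi
}

# Print the assembled command; call run_chat without DRY_RUN to start chatting.
DRY_RUN=1 run_chat
```

Lowering `TEMPERATURE` should make replies more deterministic, and raising `CTX_SIZE` allows longer conversations at the cost of more memory.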

The portable Wasm app automatically takes advantage of the hardware accelerators (e.g., GPUs) available on the device.

[You]:
Can you explain me what is AMD?

[Bot]:
AMD, or Advanced Micro Devices, Inc., is an American multinational semiconductor manufacturing company that designs, develops, and sells a wide range of advanced digital technology products, including processors, computer systems, graphics solutions, and data center hardware. The company's primary headquarters are in El Segundo, California.

AMD was formed through the merger of two major semiconductor companies: AMD (Advanced Micro Devices) and ATI Technologies Inc., on November 26, 2013, by combining these entities to form a leading technology and innovation company.

The main goal of AMD is to provide advanced computing solutions for various industries such as gaming, professional graphics, data center, and cloud infrastructure. They produce both CPUs (Central Processing Units) and GPUs (Graphics Processors), which are crucial components in modern personal computers, workstations, and gaming consoles.

AMD's products include the Ryzen series of CPUs that feature Zen cores architecture, the Navi GPU architecture for high-end gaming systems, and various data center solutions like EPYC processors and Radeon graphics cards. The company has a wide range of product lines focused on different segments in the technology market, making it one of the major players in the semiconductor industry.


Create an OpenAI-compatible API service

An OpenAI-compatible web API allows the model to work with a large ecosystem of LLM tools and agent frameworks such as flows.network, LangChain and LlamaIndex.

Download an API server app. It is also a cross-platform portable Wasm app that can run on many CPU and GPU devices.

curl -LO https://github.com/LlamaEdge/LlamaEdge/releases/latest/download/llama-api-server.wasm

Then, download the chatbot web UI so you can interact with the model from your browser.

curl -LO https://github.com/LlamaEdge/chatbot-ui/releases/latest/download/chatbot-ui.tar.gz
tar xzf chatbot-ui.tar.gz
rm chatbot-ui.tar.gz

Next, use the following command to start an API server for the model. Then, open your browser to http://localhost:8080 to start the chat!

wasmedge --dir .:. --nn-preload default:GGML:AUTO:stablelm-2-zephyr-1_6b-Q5_K_M.gguf llama-api-server.wasm -p stablelm-zephyr -r '<|endoftext|>' -c 1024
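
The server needs a moment to load the model before it accepts connections. If you start it from a script, a small polling loop avoids sending requests too early. This is a sketch that assumes the server is listening on the default address mentioned above. `wait_for_server` is a hypothetical helper, not something LlamaEdge ships.

```bash
# wait_for_server URL [TRIES] [DELAY]: poll URL until it answers with a
# successful HTTP status, or give up after TRIES attempts.
# (Hypothetical helper for scripting around the API server.)
wait_for_server() {
  url=$1
  tries=${2:-30}
  delay=${3:-1}
  i=0
  while [ "$i" -lt "$tries" ]; do
    # -s: quiet, -f: treat HTTP errors as failure, --max-time: do not hang
    if curl -sf --max-time 2 -o /dev/null "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Usage (in another terminal or a script):
#   wait_for_server http://localhost:8080/ && echo "server is ready"
```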

From another terminal, you can interact with the API server using curl.

curl -X POST http://localhost:8080/v1/chat/completions \
  -H 'accept:application/json' \
  -H 'Content-Type: application/json' \
  -d '{"messages":[{"role":"system", "content": "You are a sentient, superintelligent artificial general intelligence, here to teach and assist me."}, {"role":"user", "content": "Write a short story about Goku discovering kirby has teamed up with Majin Buu to destroy the world."}], "model":"StableLM-2-Zephyr-1.6B"}'
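
Long JSON bodies are awkward to quote on the command line. One option is to keep the request in a file and pass it to curl with `-d @file`. The sketch below does that with the same endpoint and model name as above. The prompt text, the file name `request.json`, and the helper names `write_request` and `send_request` are placeholders chosen for this example.

```bash
# write_request FILE: save a chat completion request body to FILE.
# (Hypothetical helper; edit the messages to suit your use case.)
write_request() {
  cat > "$1" <<'EOF'
{
  "model": "StableLM-2-Zephyr-1.6B",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain what a GGUF file is in two sentences."}
  ]
}
EOF
}

# send_request FILE: POST the saved body to the local API server.
send_request() {
  curl -s -X POST http://localhost:8080/v1/chat/completions \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d @"$1"
}

write_request request.json
# With the server running:
#   send_request request.json
```

In the OpenAI-compatible response format, the reply text is expected under `choices[0].message.content`. If `jq` is installed, `send_request request.json | jq -r '.choices[0].message.content'` should print just the text.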

That’s all. WasmEdge is the easiest, fastest, and safest way to run LLM applications. Give it a try!

Talk to us!

Join the WasmEdge Discord to ask questions and share insights. Have questions about getting this model running? Please go to second-state/LlamaEdge to raise an issue, or book a demo with us to enjoy your own LLMs across devices!