Getting Started with Phi-3-mini-128k

May 27, 2024 • 4 minutes to read

Phi-3-Mini-128K-Instruct is a 3.8-billion-parameter model designed for lightweight yet capable natural language processing. It was trained on the Phi-3 datasets, which include synthetic data and filtered publicly available website data, with an emphasis on high-quality, reasoning-dense content. It belongs to the Phi-3 family, and the Mini version comes in two variants, 4K and 128K, named for the context length (in tokens) each can handle.

Following its initial training, the model underwent a rigorous post-training process involving supervised fine-tuning and direct preference optimization. This process aimed to enhance its ability to follow instructions and adhere to safety measures, ensuring reliable and secure interactions.

When evaluated against benchmarks covering common sense, language understanding, mathematics, coding, long context, and logical reasoning, Phi-3-Mini-128K-Instruct demonstrated robust, state-of-the-art performance among models with fewer than 13 billion parameters.

In this article, taking Phi-3-mini-128k as an example, we will cover:

  • How to run Phi-3-mini-128k on your own device
  • How to create an OpenAI-compatible API service for Phi-3-mini-128k

You can also try out Phi-3-medium-128k following these steps, just by changing the model name from “Phi-3-mini-128k” to “Phi-3-medium-128k”.

We will use LlamaEdge (the Rust + Wasm stack) to develop and deploy applications for this model. There are no complex Python packages or C++ toolchains to install! See why we choose this tech stack.

Run Phi-3-mini-128k on your own device

Step 1: Install WasmEdge via the following command line.

curl -sSf | bash -s -- --plugin wasi_nn-ggml wasmedge_rustls

Step 2: Download the Phi-3-mini-128k model GGUF file. Since the model is 2.82 GB, it could take a while to download.

curl -LO

Step 3: Download a cross-platform portable Wasm file for the chat app. The application allows you to chat with the model on the command line. The Rust source code for the app is here.

curl -LO

That's it. You can chat with the model in the terminal by entering the following command.

wasmedge --dir .:. --nn-preload default:GGML:AUTO:Phi-3-mini-128k-instruct-Q5_K_M.gguf \
  llama-chat.wasm \
  --prompt-template phi-3-chat \
  --ctx-size 128000

The portable Wasm app automatically takes advantage of the hardware accelerators (e.g., GPUs) I have on the device. Here is a trick question I asked it.

I have 5 apples today. I ate 3 apples last week. How many apples do I have now?

If you had 5 apples today and ate 3 apples last week, then according to the information provided, you still have 5 apples now. The action of eating apples last week doesn't affect the number of apples you currently have today.

The Phi-3-mini-128k model handles this kind of logical reasoning well: it recognizes that apples eaten last week do not change today's count.
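If the chat command fails to load the model, one thing worth checking is whether the downloads actually completed. The sketch below is not part of LlamaEdge; it is a small helper that assumes the file names used in the command above and relies on the fact that GGUF files begin with the four ASCII bytes "GGUF". The function names `is_gguf` and `preflight` are made up for illustration.

```shell
# Succeed only if the file exists and starts with the 4-byte GGUF magic "GGUF".
# An interrupted download or a saved HTML error page will fail this check.
is_gguf() {
  [ -f "$1" ] && [ "$(head -c 4 "$1" 2>/dev/null)" = "GGUF" ]
}

# Report which pieces needed by the chat command are missing.
# Default file names are taken from the command in this article.
# Note: variables are global because plain POSIX sh has no `local`.
preflight() {
  pf_model=${1:-Phi-3-mini-128k-instruct-Q5_K_M.gguf}
  pf_app=${2:-llama-chat.wasm}
  pf_status=0
  command -v wasmedge >/dev/null 2>&1 || { echo "wasmedge not found on PATH"; pf_status=1; }
  is_gguf "$pf_model" || { echo "$pf_model is missing or not a GGUF file"; pf_status=1; }
  [ -s "$pf_app" ] || { echo "$pf_app is missing or empty"; pf_status=1; }
  return "$pf_status"
}
```

Running `preflight` in the download directory prints one line per problem it finds and returns a non-zero status if anything is missing. It only checks the first four bytes of the model file, so it will not catch a download that was cut off partway through.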

Create an OpenAI-compatible API service for Phi-3-mini-128k

An OpenAI-compatible web API allows the model to work with a large ecosystem of LLM tools and agent frameworks, such as LangChain and LlamaIndex.

Download an API server app. It is also a cross-platform portable Wasm app that can run on many CPU and GPU devices.

curl -LO

Then, download the chatbot web UI so you can interact with the model in the browser.

curl -LO
tar xzf chatbot-ui.tar.gz
rm chatbot-ui.tar.gz

Next, use the following command to start an API server for the model. Then, open http://localhost:8080 in your browser to start chatting!

wasmedge --dir .:. --nn-preload default:GGML:AUTO:Phi-3-mini-128k-instruct-Q5_K_M.gguf \
  llama-api-server.wasm \
  --prompt-template phi-3-chat \
  --ctx-size 128000 \
  --model-name phi-3-mini-128k
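When the server is started from a script, requests sent right away may fail because the model is still loading. A minimal sketch of a retry helper follows; `wait_until` is a hypothetical name, and the probe shown in the comment simply requests the http://localhost:8080 address mentioned above. How long loading takes on a given machine is not something this article measures, so the attempt count is an arbitrary choice.

```shell
# Run a probe command up to ATTEMPTS times, sleeping DELAY seconds between tries.
# Usage: wait_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Returns 0 as soon as the probe succeeds, 1 if every attempt fails.
wait_until() {
  wu_attempts=$1
  wu_delay=$2
  shift 2
  wu_i=0
  while [ "$wu_i" -lt "$wu_attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    wu_i=$((wu_i + 1))
    sleep "$wu_delay"
  done
  return 1
}

# Example (left as a comment so that sourcing this file makes no network calls):
# wait_until 60 1 curl -sf -o /dev/null http://localhost:8080 && echo "server is up"
```

Because the probe is passed in as an ordinary command, the same helper can wrap any other check without changes.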

From another terminal, you can interact with the API server using curl.

curl -X POST http://localhost:8080/v1/chat/completions \
  -H 'accept:application/json' \
  -H 'Content-Type: application/json' \
  -d '{"messages":[{"role":"user", "content": "write a hello world in Rust"}], "model":"phi-3-mini-128k"}'
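Hand-writing the JSON body works for a fixed prompt, but it breaks as soon as the prompt contains a double quote or a backslash. As a rough sketch, the request body can be built by a small helper instead. The names `json_escape` and `chat_payload` are hypothetical, and the escaping covers only backslashes and double quotes, not newlines or control characters, so a real JSON tool is the safer choice for arbitrary input.

```shell
# Escape backslashes and double quotes so the text can sit inside a JSON string.
# Limitation: newlines and other control characters are NOT handled.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Build a single-message chat request body.
# Usage: chat_payload MODEL_NAME PROMPT
chat_payload() {
  printf '{"messages":[{"role":"user","content":"%s"}],"model":"%s"}' \
    "$(json_escape "$2")" "$(json_escape "$1")"
}

# Example (left as a comment so that sourcing this file makes no network calls):
# curl -X POST http://localhost:8080/v1/chat/completions \
#   -H 'Content-Type: application/json' \
#   -d "$(chat_payload phi-3-mini-128k 'write a "hello world" in Rust')"
```

The model name in the example is the one passed to `--model-name` when the server was started.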

That’s all. WasmEdge is the easiest, fastest, and safest way to run LLM applications. Give it a try!

Talk to us!

Join the WasmEdge discord to ask questions and share insights.

Any questions getting this model running? Please go to second-state/LlamaEdge to raise an issue or book a demo with us to enjoy your own LLMs across devices!
