Getting Started with Neural-Chat-7B-v3-1

Dec 18, 2023 • 3 minutes to read

Neural-Chat-7B-v3-1 is a model from Intel, fine-tuned from Mistral-7B-v0.1 on the open-source Open-Orca/SlimOrca dataset. It was trained between September and October 2023 and then aligned using Direct Preference Optimization (DPO).

In this article, we will cover:

  • How to run Neural-Chat-7B-v3-1 on your own device
  • How to create an OpenAI-compatible API service for Neural-Chat-7B-v3-1

We will use the Rust + Wasm stack to develop and deploy applications for this model. There are no complex Python packages or C++ toolchains to install! See why we choose this tech stack.

Run the Neural-Chat-7B-v3-1 model on your own device

Step 1: Install WasmEdge via the following command line.

curl -sSf https://raw.githubusercontent.com/WasmEdge/WasmEdge/master/utils/install.sh | bash -s -- --plugins wasmedge_rustls wasi_nn-ggml

Step 2: Download the Neural-Chat-7B-v3-1 model GGUF file. The download may take a long time, since the model file is several GB in size.

curl -LO https://huggingface.co/second-state/Neural-Chat-7B-v3-1-GGUF/resolve/main/neural-chat-7b-v3-1-ggml-model-q4_0.gguf
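A download this large can be interrupted, or the server may return an error page instead of the model. GGUF files begin with the four ASCII bytes GGUF, so a quick sanity check is possible. The following is a minimal Python sketch, not part of WasmEdge or LlamaEdge; the helper name looks_like_gguf is hypothetical.

```python
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # GGUF files start with these four bytes


def looks_like_gguf(path: str) -> bool:
    """Return True if the file exists and begins with the GGUF magic bytes.

    Hypothetical helper for illustration. It inspects only the header,
    so it cannot tell whether the download is complete.
    """
    file = Path(path)
    if not file.is_file():
        return False
    with file.open("rb") as handle:
        return handle.read(len(GGUF_MAGIC)) == GGUF_MAGIC
```

Calling looks_like_gguf("neural-chat-7b-v3-1-ggml-model-q4_0.gguf") returns False if the file is missing or does not start with the GGUF header. A truncated download can still pass this check, so compare the file size with the one listed on the download page if in doubt.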

Step 3: Download a cross-platform portable Wasm file for the chat app. The application allows you to chat with the model on the command line. The Rust source code for the app is here.

curl -LO https://github.com/LlamaEdge/LlamaEdge/releases/latest/download/llama-chat.wasm

That's it. You can chat with the model in the terminal by entering the following command.

wasmedge --dir .:. --nn-preload default:GGML:AUTO:neural-chat-7b-v3-1-ggml-model-q4_0.gguf llama-chat.wasm -p intel-neural

The portable Wasm app automatically takes advantage of the hardware accelerators (e.g., GPUs) available on the device.

On my M2 Mac with 16 GB of memory, it runs at about 9 tokens per second.

[You]: 
What is Intel?

[Bot]:
Intel is an American multinational corporation and technology company that focuses on the design and manufacture of semiconductor chips, known as integrated circuits. It is also involved in various other technologies such as computer hardware, software, and services. Founded in 1968, Intel has played a significant role in the development of modern computing and has contributed to numerous innovations in the field.

[You]: 

Create an OpenAI-compatible API service for the Neural-Chat-7B-v3-1 model

An OpenAI-compatible web API allows the model to work with a large ecosystem of LLM tools and agent frameworks such as flows.network, LangChain and LlamaIndex.

Download an API server app. It is also a cross-platform portable Wasm app that can run on many CPU and GPU devices.

curl -LO https://github.com/LlamaEdge/LlamaEdge/releases/latest/download/llama-api-server.wasm

Then, use the following command to start an API server for the model.

wasmedge --dir .:. --nn-preload default:GGML:AUTO:neural-chat-7b-v3-1-ggml-model-q4_0.gguf llama-api-server.wasm -p intel-neural
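The server may need a moment before it accepts requests. The Python sketch below polls until a TCP connection succeeds. It assumes the server listens on port 8080, the port used by the curl request in this article; the helper name wait_for_port is hypothetical.

```python
import socket
import time


def wait_for_port(host: str = "localhost", port: int = 8080,
                  timeout: float = 60.0, interval: float = 0.5) -> bool:
    """Poll until a TCP connection to host:port succeeds or timeout elapses.

    Hypothetical helper; port 8080 is assumed from the curl example.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

A True result only shows that something is listening on the port. It does not confirm that the model has loaded or answers correctly.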

From another terminal, you can interact with the API server using curl.

curl -X POST http://0.0.0.0:8080/v1/chat/completions -H 'accept:application/json' -H 'Content-Type: application/json' -d '{"messages":[{"role":"system", "content":"You are a helpful AI assistant"}, {"role":"user", "content":"What is the capital of France?"}], "model":"Neural-Chat-7B"}'
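The same request can be sent from a program. The sketch below uses only the Python standard library and mirrors the curl command above. It assumes the server is reachable at localhost:8080 and that the response follows the OpenAI chat completion shape, with the reply under choices[0].message.content. The function names are hypothetical.

```python
import json
import urllib.request

# Assumed endpoint: same port and path as the curl example.
API_URL = "http://localhost:8080/v1/chat/completions"


def build_payload(question: str,
                  system_prompt: str = "You are a helpful AI assistant",
                  model: str = "Neural-Chat-7B") -> dict:
    """Build the same JSON body that the curl example sends."""
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
        "model": model,
    }


def extract_reply(response: dict) -> str:
    """Pull the assistant text out of an OpenAI-style chat completion."""
    return response["choices"][0]["message"]["content"]


def ask(question: str, url: str = API_URL) -> str:
    """POST a question to the API server and return the model's reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(build_payload(question)).encode("utf-8"),
        headers={"accept": "application/json",
                 "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return extract_reply(json.load(response))
```

With the API server running, print(ask("What is the capital of France?")) prints the model's answer.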

That’s all. WasmEdge is the easiest, fastest, and safest way to run LLM applications. Give it a try!

Join the WasmEdge Discord to ask questions and share insights. Have questions about getting this model running? Please go to second-state/LlamaEdge to raise an issue, or book a demo with us to enjoy your own LLMs across devices!
