Getting Started with Mixtral-8x7B


Published on 1st Jan.

To get started quickly, you can run Mixtral-8x7B on your own device with a single command. The command-line tool automatically downloads and installs the WasmEdge runtime, the model files, and the portable Wasm apps for inference.

When GPT-4 first came out, the community speculated about “how many billions of parameters” it needed to achieve its amazing performance. But as it turned out, the innovation in GPT-4 is reportedly not just “more parameters”. According to widely circulated but unconfirmed reports, it is essentially 8 GPT-3.5-class models working together, each tuned for different tasks (i.e., an “expert”). This is called a “Mixture of Experts” (MoE).

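In an MoE model, a router (also called a gating network) dispatches the input to the experts best suited to it, so only a fraction of the model's parameters is used for any given input. In Mixtral 8x7B, the router selects 2 of the 8 experts for each token at every layer and combines their outputs as a weighted sum.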

Mixtral 8x7B from Mistral AI is an open-source MoE LLM based on 8 Mistral-7B models. With WasmEdge, you can create and run cross-platform applications for this LLM on any device, including your own laptop, edge devices, and servers.

We will cover:

  • Run Mixtral-8x7B on your own device
  • Create an OpenAI-compatible API service for Mixtral-8x7B

Run Mixtral-8x7B on your own device

Step 1: Install WasmEdge via the following command line.

curl -sSf | bash -s -- --plugin wasi_nn-ggml

Step 2: Download the Mixtral-8x7B-Instruct-v0.1 GGUF file. This may take a long time, since the model file is 32.2 GB.

curl -LO

Step 3: Download a cross-platform portable Wasm file for the chat app. The application allows you to chat with the model on the command line. The Rust source code for the app is here.

curl -LO

That's it. You can chat with the model in the terminal by entering the following command.

wasmedge --dir .:. --nn-preload default:GGML:AUTO:mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf llama-chat.wasm -p mistral-instruct

The portable Wasm app automatically takes advantage of the hardware accelerators (e.g., GPUs) available on the device.

What is the best place to watch the new year ball drop in New York City?

The most famous place to watch the New Year Ball Drop is in Times Square, New York City. However, it's important to note that this area is extremely crowded, so if you prefer a less chaotic environment, there are other options. You can watch the ball drop from nearby hotels like the Marriott Marquis or the Embassy Suites, which have rooms and restaurants with views of Times Square. You can also watch it from surrounding bars and restaurants or from special viewing parties. If you're not in New York City, the event is broadcasted live on television and online platforms.

Create an OpenAI-compatible API service for Mixtral-8x7B

An OpenAI-compatible web API allows the model to work with a large ecosystem of LLM tools and agent frameworks, such as LangChain and LlamaIndex.

Download an API server app. It is also a cross-platform portable Wasm app that can run on many CPU and GPU devices. The Rust source code for the app is here.

curl -LO

Then, download the chatbot web UI so you can interact with the model from a browser.

curl -LO
tar xzf chatbot-ui.tar.gz
rm chatbot-ui.tar.gz

Next, use the following command to start an API server for the model. Then, open http://localhost:8080 in your browser to start chatting!

wasmedge --dir .:. --nn-preload default:GGML:AUTO:mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf llama-api-server.wasm -p mistral-instruct

You can also interact with the API server using curl from another terminal.

curl -X POST http://localhost:8080/v1/chat/completions \
  -H 'accept:application/json' \
  -H 'Content-Type: application/json' \
  -d '{"messages":[{"role":"system", "content": "You are an AI programming assistant."}, {"role":"user", "content": "What is the capital of France?"}], "model":"mixtral-8x7b-instruct-v0.1"}'
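
The same request can be sent from a script. The following is a minimal sketch using only the Python standard library. It assumes the API server started above is listening on localhost:8080 and that the response follows the usual OpenAI chat completions shape (choices[0].message.content). The helper names build_request and chat are made up for this illustration and are not part of any WasmEdge tooling.

```python
import json
import urllib.request

# Same endpoint as the curl example; assumes the server runs locally on port 8080.
API_URL = "http://localhost:8080/v1/chat/completions"


def build_request(user_message,
                  system_message="You are an AI programming assistant.",
                  model="mixtral-8x7b-instruct-v0.1"):
    """Build (but do not send) a POST request mirroring the curl command."""
    payload = {
        "messages": [
            {"role": "system", "content": system_message},
            {"role": "user", "content": user_message},
        ],
        "model": model,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "accept": "application/json",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def chat(user_message):
    """Send the request and return the assistant's reply text.

    Assumption: the server answers with an OpenAI-style JSON body.
    """
    with urllib.request.urlopen(build_request(user_message)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, calling chat("What is the capital of France?") should return the model's answer as a string. Keeping build_request separate from chat makes it easy to inspect the payload without a running server.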

That’s all. WasmEdge is the easiest, fastest, and safest way to run LLM applications. Give it a try!

Join the WasmEdge Discord to ask questions or share insights.

Reference: What is “Mixture of Experts” (MoE)?

“Mixture of Experts” (MoE) is a concept in machine learning and artificial intelligence where multiple specialized models or components (referred to as “experts”) are combined to improve overall performance. Each expert is designed to handle a specific subset of data or a particular type of task. A gating network assesses each input and determines which expert is most suitable for it. The outputs of the experts are then combined, often additively. This approach allows for specialized handling of diverse data or tasks within a single model framework, enhancing efficiency and effectiveness.
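
The gating idea described above can be sketched in a few lines of Python. This is a toy illustration, not Mixtral's actual implementation: the experts are simple scaling functions, the gate is a single hand-written linear layer, and the names moe_forward and gate_weights are invented for the example. The only detail borrowed from Mixtral is top_k=2, the number of experts it routes each token to.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x, gate_weights, experts, top_k=2):
    """Route input vector x to the top_k experts and mix their outputs."""
    # Gating network: one linear score per expert.
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    # Keep only the top_k highest-scoring experts.
    chosen = sorted(range(len(experts)),
                    key=lambda i: logits[i], reverse=True)[:top_k]
    # Normalize the scores of the chosen experts into mixing weights.
    weights = softmax([logits[i] for i in chosen])
    # Combine the selected experts' outputs additively, weighted by the gate.
    out = [0.0] * len(x)
    for w, i in zip(weights, chosen):
        y = experts[i](x)
        out = [o + w * yi for o, yi in zip(out, y)]
    return out, chosen


# Toy setup: four "experts" that scale the input by 1, 2, 3, and 4.
experts = [lambda v, k=k: [k * vi for vi in v] for k in range(1, 5)]
# Hypothetical gate weights: one row per expert, one column per input feature.
gate_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]

output, chosen = moe_forward([2.0, 1.0], gate_weights, experts)
print(chosen)  # indices of the two experts selected for this input
print(output)  # weighted sum of those two experts' outputs
```

Only the selected experts are evaluated, which is why an MoE model can hold many parameters in total while spending compute on just a subset of them for each input.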
