Getting Started with Nous-Hermes-2-Mixtral-8x7B SFT

Jan 16, 2024 • 4 minutes to read

To get started quickly, you can run Nous-Hermes-2-Mixtral-8x7B-SFT on your own device with a single command. The command-line tool automatically downloads and installs the WasmEdge runtime, the model files, and the portable Wasm apps for inference.

Nous Hermes 2 Mixtral 8x7B SFT is the supervised fine-tune only (SFT) version of the Nous Research model trained over the Mixtral 8x7B MoE LLM. It was trained on over 1,000,000 entries of primarily GPT-4 generated data, as well as other high-quality data from open datasets across the AI landscape, and achieves state-of-the-art performance on a variety of tasks. The model excels at coding and uses ChatML as its prompt format, which gives a more structured way to engage the LLM in multi-turn chat dialogue.
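To make the ChatML format concrete, here is a minimal sketch of how a multi-turn conversation is laid out as a single prompt string. Each message is wrapped in `<|im_start|>` and `<|im_end|>` markers, and the prompt ends with an open assistant turn for the model to complete. The `Message` type and `build_chatml_prompt` function are hypothetical names used only for illustration. The LlamaEdge apps apply the template for you when you pass `-p chatml`, so you do not need this code to follow the steps in this article.

```rust
/// One chat turn: a role ("system", "user", or "assistant") and its text.
/// Hypothetical helper type, used only for this illustration.
struct Message<'a> {
    role: &'a str,
    content: &'a str,
}

/// Render a conversation in ChatML layout and leave the prompt open
/// so that the model writes the assistant's next reply.
fn build_chatml_prompt(messages: &[Message<'_>]) -> String {
    let mut prompt = String::new();
    for m in messages {
        prompt.push_str("<|im_start|>");
        prompt.push_str(m.role);
        prompt.push('\n');
        prompt.push_str(m.content);
        prompt.push_str("<|im_end|>\n");
    }
    // The open assistant turn tells the model where to continue.
    prompt.push_str("<|im_start|>assistant\n");
    prompt
}

fn main() {
    let messages = [
        Message { role: "system", content: "You are a helpful assistant." },
        Message { role: "user", content: "Write a Rust program to check if a number is prime" },
    ];
    print!("{}", build_chatml_prompt(&messages));
}
```

Earlier assistant replies are appended as ordinary `assistant` messages, which is how the format carries multi-turn context.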

In this article, we will cover:

  • How to run Nous-Hermes-2-Mixtral-8x7B on your own device
  • How to create an OpenAI-compatible API service for Nous-Hermes-2-Mixtral-8x7B

We will use LlamaEdge (the Rust + Wasm stack) to develop and deploy applications for this model. There are no complex Python packages or C++ toolchains to install! See why we chose this tech stack.

Run the model on your own device

Step 1: Install WasmEdge via the following command line.

curl -sSf https://raw.githubusercontent.com/WasmEdge/WasmEdge/master/utils/install.sh | bash -s -- --plugins wasmedge_rustls wasi_nn-ggml

Step 2: Download the Nous-Hermes-2-Mixtral-8x7B-SFT model GGUF file. The download may take a long time, since the model file is tens of GB in size.

curl -LO https://huggingface.co/second-state/Nous-Hermes-2-Mixtral-8x7B-SFT-GGUF/resolve/main/Nous-Hermes-2-Mixtral-8x7B-SFT-Q5_K_M.gguf

Step 3: Download a cross-platform portable Wasm file for the chat app. The application allows you to chat with the model on the command line. The Rust source code for the app is here.

curl -LO https://github.com/LlamaEdge/LlamaEdge/releases/latest/download/llama-chat.wasm

That's it. You can chat with the model in the terminal by entering the following command.

wasmedge --dir .:. --nn-preload default:GGML:AUTO:Nous-Hermes-2-Mixtral-8x7B-SFT-Q5_K_M.gguf llama-chat.wasm -p chatml
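The command above packs several settings into a few arguments. As we understand them, `--dir .:.` makes the current directory visible to the sandboxed Wasm app, the `--nn-preload` value has the form alias:backend:device:model-file (here the alias `default`, the `GGML` backend, and `AUTO` device selection), and `-p chatml` selects the prompt template. The sketch below assembles the same arguments from Rust. The helper names `nn_preload` and `chat_args` are hypothetical, and the program only launches `wasmedge` when you pass `--run`.

```rust
use std::process::Command;

/// Build the value for `--nn-preload`: alias, backend, target device, model file.
fn nn_preload(alias: &str, model_file: &str) -> String {
    format!("{alias}:GGML:AUTO:{model_file}")
}

/// Assemble the arguments passed to the `wasmedge` binary for the chat app.
fn chat_args(model_file: &str, prompt_template: &str) -> Vec<String> {
    vec![
        "--dir".to_string(),
        ".:.".to_string(),
        "--nn-preload".to_string(),
        nn_preload("default", model_file),
        "llama-chat.wasm".to_string(),
        "-p".to_string(),
        prompt_template.to_string(),
    ]
}

fn main() {
    let args = chat_args("Nous-Hermes-2-Mixtral-8x7B-SFT-Q5_K_M.gguf", "chatml");
    // Always show the command line that would be executed.
    println!("wasmedge {}", args.join(" "));

    // Only start the (interactive) chat app when explicitly asked to.
    if std::env::args().any(|a| a == "--run") {
        match Command::new("wasmedge").args(&args).status() {
            Ok(status) => println!("wasmedge exited with {status}"),
            Err(err) => eprintln!("could not start wasmedge: {err}"),
        }
    }
}
```

Swapping in a different GGUF file or prompt template only changes the two arguments passed to `chat_args`.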

The portable Wasm app automatically takes advantage of the hardware accelerators (e.g., GPUs) available on the device.


[You]:
Write a Rust program to check if the output is prime


[Bot]:
Here's a simple Rust program that takes an input from the user, checks whether it's divisible by any number other than 1 and itself (i.e., not a prime), and prints "prime" or "not prime":


use std::io;

fn main() {
    println!("Enter a positive integer: ");

    let mut input = String::new();

    io::stdin().read_line(&mut input)
        .expect("Failed to read line");

    let input: u32 = match input.trim().parse() {
        Ok(num) => num,
        Err(_) => {
            println!("Invalid input!");
            return;
        }
    };

    if is_prime(&input){
        println!("{} is prime!", input);
    } else{
        println!("{} is not prime.", input);
    }
}

fn is_prime(n: &u32) -> bool {
    if *n < 2 { return false; }
    for i in 2..=(*n as f64).sqrt() as u32 {
        if n % &i == 0 { return false; }
    }
    true
}


This program uses a helper function `is_prime`, which checks whether the number is prime or not using trial division. It iterates from 2 to the square root of the input, checking if it's divisible by any of these numbers. If none divide evenly, then it returns true (the input is prime).
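If you want to sanity-check the model's answer, one option is to compare it against an independent version of the same algorithm. The sketch below is our own variant of trial division, not part of the model's output. It takes the number by value and uses only integer arithmetic, widening to `u64` so that the `i * i` comparison cannot overflow for inputs close to `u32::MAX`.

```rust
/// Trial division using only integer arithmetic.
fn is_prime(n: u32) -> bool {
    if n < 2 {
        return false;
    }
    // Widen to u64 so that i * i cannot overflow near u32::MAX.
    let n = u64::from(n);
    let mut i: u64 = 2;
    // A composite n always has a divisor no larger than its square root.
    while i * i <= n {
        if n % i == 0 {
            return false;
        }
        i += 1;
    }
    true
}

fn main() {
    for n in [0u32, 1, 2, 9, 97, 4_294_967_295] {
        println!("{n}: {}", is_prime(n));
    }
}
```

Running both versions over a range of inputs and comparing the results is a quick way to gain confidence in generated code before using it.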

Create an OpenAI-compatible API service

An OpenAI-compatible web API allows the model to work with a large ecosystem of LLM tools and agent frameworks such as flows.network, LangChain and LlamaIndex.

Download an API server app. It is also a cross-platform portable Wasm app that can run on many CPU and GPU devices.

curl -LO https://github.com/LlamaEdge/LlamaEdge/releases/latest/download/llama-api-server.wasm

Then, download the chatbot web UI so that you can interact with the model from your browser.

curl -LO https://github.com/LlamaEdge/chatbot-ui/releases/latest/download/chatbot-ui.tar.gz
tar xzf chatbot-ui.tar.gz
rm chatbot-ui.tar.gz

Next, use the following command to start an API server for the model. Then, open http://localhost:8080 in your browser to start the chat!

wasmedge --dir .:. --nn-preload default:GGML:AUTO:Nous-Hermes-2-Mixtral-8x7B-SFT-Q5_K_M.gguf llama-api-server.wasm -p chatml

From another terminal, you can interact with the API server using curl.

curl -X POST http://localhost:8080/v1/chat/completions \
  -H 'accept:application/json' \
  -H 'Content-Type: application/json' \
  -d '{"messages":[{"role":"system", "content": "You are a sentient, superintelligent artificial general intelligence, here to teach and assist me."}, {"role":"user", "content": "Write a short story about Goku discovering kirby has teamed up with Majin Buu to destroy the world."}], "model":"Nous-Hermes-2-Mixtral-8x7B-SFT"}'
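The same request can be sent from a program. The following is a minimal sketch that uses only the Rust standard library: it builds the JSON body by hand and posts it over a plain TCP connection to the server started above. The function names (`json_escape`, `chat_request_body`, `send_chat_request`) are hypothetical, and the sketch prints the raw HTTP response, headers included, without parsing it. In a real application you would more likely use an HTTP client and a JSON library.

```rust
use std::io::{Read, Write};
use std::net::TcpStream;

/// Escape a string for embedding inside a JSON string literal.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Build the JSON body for POST /v1/chat/completions with one system
/// message and one user message, mirroring the curl example.
fn chat_request_body(model: &str, system: &str, user: &str) -> String {
    format!(
        r#"{{"messages":[{{"role":"system","content":"{}"}},{{"role":"user","content":"{}"}}],"model":"{}"}}"#,
        json_escape(system),
        json_escape(user),
        json_escape(model)
    )
}

/// Send the body to the API server and return the raw HTTP response text.
fn send_chat_request(host: &str, port: u16, body: &str) -> std::io::Result<String> {
    let mut stream = TcpStream::connect((host, port))?;
    let request = format!(
        "POST /v1/chat/completions HTTP/1.1\r\nHost: {host}:{port}\r\nAccept: application/json\r\nContent-Type: application/json\r\nContent-Length: {}\r\nConnection: close\r\n\r\n{body}",
        body.len()
    );
    stream.write_all(request.as_bytes())?;
    // "Connection: close" asks the server to close the socket when done,
    // so reading to the end of the stream yields the whole response.
    let mut response = String::new();
    stream.read_to_string(&mut response)?;
    Ok(response)
}

fn main() {
    let body = chat_request_body(
        "Nous-Hermes-2-Mixtral-8x7B-SFT",
        "You are a helpful assistant.",
        "Write a Rust program to check if a number is prime",
    );
    match send_chat_request("localhost", 8080, &body) {
        Ok(response) => println!("{response}"),
        Err(err) => eprintln!("request failed (is the API server running?): {err}"),
    }
}
```

Because the endpoint follows the OpenAI chat completions shape, existing OpenAI client libraries can usually be pointed at http://localhost:8080/v1 instead of hand-writing requests like this.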

That’s all. WasmEdge is the easiest, fastest, and safest way to run LLM applications. Give it a try!

Talk to us!

Join the WasmEdge Discord to ask questions and share insights. Have questions about getting this model running? Please go to second-state/LlamaEdge to raise an issue, or book a demo with us to enjoy your own LLMs across devices!
