Getting Started with Samantha-1.11-CodeLlama-34b

The Samantha-1.11-CodeLlama-34b model is built on CodeLlama-34b. This version of Samantha stands out for its coding capabilities and its ability to assist with homework, in addition to acting as a personal companion. The model has been trained in areas such as philosophy, psychology, and personal relationships, which distinguishes it from typical assistant models: it also aspires to be a friend and companion to its users.

In this article, we will cover:

  • How to run Samantha-1.11-CodeLlama-34b on your own device
  • How to create an OpenAI-compatible API service for Samantha-1.11-CodeLlama-34b

We will use the Rust + Wasm stack to develop and deploy applications for this model. There are no complex Python packages or C++ toolchains to install! See why we choose the Rust + Wasm tech stack.

Run the model on your own device

Step 1: Install WasmEdge via the following command line.

curl -sSf https://raw.githubusercontent.com/WasmEdge/WasmEdge/master/utils/install.sh | bash -s -- --plugin wasi_nn-ggml

Step 2: Download the Samantha-1.11-CodeLlama-34b model GGUF file. It may take a long time, since the model file is several GB in size.

curl -LO https://huggingface.co/second-state/Samantha-1.11-CodeLlama-34B-GGUF/resolve/main/Samantha-1.11-CodeLlama-34b-ggml-model-q4_0.gguf
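
An interrupted or redirected download can leave a file on disk that is not actually a model. As a quick sanity check before moving on, the sketch below reads the first bytes of the file and confirms that they carry the GGUF magic value. The function name `read_gguf_header` is hypothetical (it is not part of WasmEdge or LlamaEdge), and the sketch assumes the usual little-endian GGUF layout.

```python
import struct

# GGUF files begin with the four ASCII bytes "GGUF",
# followed by a 32-bit format version number.
GGUF_MAGIC = b"GGUF"


def read_gguf_header(path):
    """Return the GGUF format version stored in the file header.

    Raises ValueError if the file does not start with the GGUF magic bytes.
    Hypothetical helper for illustration; assumes a little-endian file.
    """
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not look like a GGUF file")
    (version,) = struct.unpack("<I", header[4:8])
    return version
```

Calling `read_gguf_header("Samantha-1.11-CodeLlama-34b-ggml-model-q4_0.gguf")` should return a small integer version number. A `ValueError` suggests the download did not complete correctly and should be repeated.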

Step 3: Download a cross-platform portable Wasm file for the chat app. The application allows you to chat with the model on the command line. The Rust source code for the app is here.

curl -LO https://github.com/LlamaEdge/LlamaEdge/releases/latest/download/llama-chat.wasm

That's it. You can chat with the Samantha-1.11-CodeLlama-34b model in the terminal by entering the following command.

wasmedge --dir .:. --nn-preload default:GGML:AUTO:Samantha-1.11-CodeLlama-34b-ggml-model-q4_0.gguf llama-chat.wasm -p vicuna-chat -s 'You are a helpful AI assistant.'

The portable Wasm app automatically takes advantage of the hardware accelerators (e.g., GPUs) available on the device. The first response is slow because the app needs to load the large model file into memory. Subsequent responses are much faster.

On my M1 Mac with 32 GB of memory, it runs at about 6.81 tokens per second.
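
To put that number in perspective, the short sketch below converts a throughput figure into a rough wait time for answers of different lengths. It is back-of-the-envelope arithmetic only: the function name `estimated_seconds` is hypothetical, the answer lengths are illustrative, and the estimate ignores the one-time model loading delay.

```python
def estimated_seconds(num_tokens, tokens_per_second=6.81):
    """Rough generation time for num_tokens at a steady throughput.

    The default of 6.81 tokens per second is the figure measured above;
    other machines will differ.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


# Illustrative answer lengths, not measurements.
for n in (50, 200, 500):
    print(f"{n} tokens: about {estimated_seconds(n):.1f} s")
```

At this speed, a 200-token answer takes roughly half a minute to finish.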

[USER]: 

Tell me Python code for this nth fibonacci number.

[ASSISTANT]:

Here is the Python code for the nth Fibonacci number:

def fibonacci(n):
    if n <= 1:
        return n
    else:
        return fibonacci(n-1) + fibonacci(n-2)
        
n = int(input("Enter the value of n: "))
print("The", n, "th Fibonacci number is:", fibonacci(n))

Explanation:
- The function `fibonacci()` takes an integer `n` as input and returns the nth Fibonacci number.
- If `n` is less than or equal to 1, it returns `n` itself.
- Otherwise, it recursively calls itself with `n-1` and `n-2` as arguments and returns the sum of the two results.
- The function is called with the value of `n` entered by the user and the result is printed.

Create an OpenAI-compatible API service

An OpenAI-compatible web API allows the model to work with a large ecosystem of LLM tools and agent frameworks such as flows.network, LangChain and LlamaIndex.

Download an API server app. It is also a cross-platform portable Wasm app that can run on many CPU and GPU devices.

curl -LO https://github.com/LlamaEdge/LlamaEdge/releases/latest/download/llama-api-server.wasm

Then, use the following command line to start an API server for the Samantha-1.11-CodeLlama-34b model.

wasmedge --dir .:. --nn-preload default:GGML:AUTO:Samantha-1.11-CodeLlama-34b-ggml-model-q4_0.gguf llama-api-server.wasm -p vicuna-chat

From another terminal, you can interact with the API server using curl.

curl -X POST http://0.0.0.0:8080/v1/chat/completions -H 'accept:application/json' -H 'Content-Type: application/json' -d '{"messages":[{"role":"system", "content":"You are a helpful AI assistant"}, {"role":"user", "content":"Write a hello world program in Rust"}], "model":"Samantha-1.11-CodeLlama-34b"}'
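
The same request can be sent from a program. Here is a minimal Python sketch that uses only the standard library. It assumes the server is reachable at `localhost` on port 8080, as in the curl command above, and that the reply follows the usual OpenAI chat completion shape with the text under `choices[0].message.content`. The function names `build_chat_request` and `chat` are hypothetical helpers written for this example.

```python
import json
import urllib.request

# Assumption: the API server started above is listening on port 8080 locally.
API_URL = "http://localhost:8080/v1/chat/completions"


def build_chat_request(user_message,
                       system_prompt="You are a helpful AI assistant",
                       model="Samantha-1.11-CodeLlama-34b"):
    """Build the same JSON payload that the curl command sends."""
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message},
        ],
        "model": model,
    }


def chat(user_message, url=API_URL):
    """POST one chat request and return the assistant's reply text.

    Assumes an OpenAI-style response body; adjust the lookup below
    if the server returns a different shape.
    """
    body = json.dumps(build_chat_request(user_message)).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"accept": "application/json",
                 "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        reply = json.load(response)
    return reply["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Write a hello world program in Rust"))` should print the model's answer. Because the request format matches OpenAI's, tools that accept a custom base URL can generally be pointed at the same endpoint.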

That’s all. WasmEdge is the easiest, fastest, and safest way to run LLM applications. Give it a try!

Join the WasmEdge discord to ask questions and share insights.

No time to DIY? Book a Demo with us to enjoy your own LLMs across devices!
