Getting Started with CausalLM

Nov 15, 2023 • 3 minutes to read

The CausalLM 14B model is based on the popular Llama 2 architecture but uses Qwen 14B model weights. The Qwen models, developed by Alibaba, are English / Chinese bilingual LLMs. They perform very well on benchmarks compared with other models of similar size.

The CausalLM model is further fine-tuned (SFT) on an uncensored dataset of 1.3B tokens. As a result, it follows conversations well and provides a solid basis for further fine-tuning with domain-specific knowledge and styles.

In this article, we will cover

  • How to run CausalLM-14B on your own device
  • How to create an OpenAI-compatible API service for CausalLM-14B

We will use the Rust + Wasm stack to develop and deploy applications for this model. There are no complex Python packages or C++ toolchains to install! See why we choose this tech stack.

Run the model on your own device

Step 1: Install WasmEdge via the following command line.

curl -sSf https://raw.githubusercontent.com/WasmEdge/WasmEdge/master/utils/install.sh | bash -s -- --plugins wasmedge_rustls wasi_nn-ggml

Step 2: Download the model GGUF file. It may take a long time since the size of the model is several GBs.

curl -LO https://huggingface.co/second-state/CausalLM-14B-GGUF/resolve/main/causallm_14b.Q5_1.gguf

Step 3: Download a cross-platform portable Wasm file for the chat app. The application allows you to chat with the model on the command line. The Rust source code for the app is here.

curl -LO https://github.com/LlamaEdge/LlamaEdge/releases/latest/download/llama-chat.wasm

That's it. You can chat with the model in the terminal by entering the following command.

wasmedge --dir .:. --nn-preload default:GGML:AUTO:causallm_14b.Q5_1.gguf llama-chat.wasm -p chatml

The portable Wasm app automatically takes advantage of the hardware accelerators (e.g., GPUs) available on your device.

[USER]: 
Who is Robert Oppenheimer?

[ASSISTANT]:
Robert Oppenheimer was a prominent American physicist who led the team that developed the atomic bomb during World War II. He is also known for his involvement in the Manhattan Project and his later advocacy for nuclear disarmament.

Create an OpenAI-compatible API service

An OpenAI-compatible web API allows the model to work with a large ecosystem of LLM tools and agent frameworks, such as flows.network, LangChain, and LlamaIndex.

Download an API server app. It is also a cross-platform portable Wasm app that can run on many CPU and GPU devices.

curl -LO https://github.com/LlamaEdge/LlamaEdge/releases/latest/download/llama-api-server.wasm

Then, start an API server for the model with the following command.

wasmedge --dir .:. --nn-preload default:GGML:AUTO:causallm_14b.Q5_1.gguf llama-api-server.wasm -p chatml
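The -p chatml flag tells the server to wrap conversations in the ChatML prompt template that the model expects. To make the template concrete, here is a minimal Python sketch of how ChatML renders a message list (the function name format_chatml is our own, for illustration only):

```python
def format_chatml(messages):
    """Render a list of {role, content} messages in the ChatML template."""
    prompt = ""
    for m in messages:
        # Each turn is delimited by <|im_start|> and <|im_end|> tokens.
        prompt += f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
    # Leave the assistant turn open so the model generates its reply.
    prompt += "<|im_start|>assistant\n"
    return prompt

print(format_chatml([
    {"role": "system", "content": "You are a helpful AI assistant"},
    {"role": "user", "content": "What is the capital of France?"},
]))
```

The API server applies this template for you; you only send plain role/content messages.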

From another terminal, you can interact with the API server using curl.

curl -X POST http://0.0.0.0:8080/v1/chat/completions -H 'accept:application/json' -H 'Content-Type: application/json' -d '{"messages":[{"role":"system", "content":"You are a helpful AI assistant"}, {"role":"user", "content":"What is the capital of France?"}], "model":"CausalLM-14B-GGUF"}'
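Because the API follows the OpenAI chat completions schema, any HTTP client can talk to it. As a sketch, the request body from the curl call above can be built in Python and POSTed to http://localhost:8080/v1/chat/completions with the HTTP library of your choice (build_chat_request is our own helper name, not part of LlamaEdge):

```python
import json

def build_chat_request(user_message, model="CausalLM-14B-GGUF"):
    """Build an OpenAI-style chat completion request body."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful AI assistant"},
            {"role": "user", "content": user_message},
        ],
    }

body = build_chat_request("What is the capital of France?")
print(json.dumps(body, indent=2))
```

The same payload works with any OpenAI-compatible client SDK pointed at the local server's base URL.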

That’s all. WasmEdge is the easiest, fastest, and safest way to run LLM applications. Give it a try!

Join the WasmEdge discord to discuss and share your insights.

No time to DIY? Book a Demo with us to enjoy your own LLMs across devices!
