Introducing run-llm.sh, an all-in-one CLI app to run LLMs locally


The run-llm.sh script, developed by Second State, is a command-line tool that runs a chat interface and an OpenAI-compatible API server using open-source Large Language Models (LLMs) on your device. It automatically downloads and installs the WasmEdge runtime, the model files, and the portable Wasm apps for inference. Users simply follow the CLI prompts to select their desired options. You can access run-llm.sh here.

Get started with run-llm.sh

bash <(curl -sSfL 'https://code.flows.network/webhook/iwYN1SdN3AmPgR5ao5Gt/run-llm.sh')

Follow the prompts to install the WasmEdge Runtime and download your favorite open-source LLM. Then, you will be asked whether you want to chat with the model via the CLI or via a web interface.

CLI: Just stay in the terminal. You will see a [USER] prompt, and you can start asking questions right away!

Web UI: After the script installs a local web app and a local web server (written in Rust and running in WasmEdge), you will be asked to open http://127.0.0.1:8080 in your browser.

That’s it!

Behind the scenes

The run-llm.sh script uses portable Wasm applications to run LLMs in the WasmEdge runtime. Because the applications are portable, you can simply copy the Wasm binary file to another device with a different CPU or GPU, and it will still work. There are separate Wasm applications for the CLI and the web-based chat UI.

CLI

The llama-chat.wasm app provides a CLI-based chat interface for the LLM. It is written in simple Rust, and you can find its source code here. You can download the Wasm app as follows; the same binary works on all devices.

curl -LO https://github.com/LlamaEdge/LlamaEdge/releases/latest/download/llama-chat.wasm

The script uses the following command to run the Wasm application. The -p parameter indicates the chat template the model requires for formatting chat messages. You can find a list of models and their corresponding chat template names here.

wasmedge --dir .:. --nn-preload default:GGML:AUTO:llama-2-7b-chat.Q5_K_M.gguf llama-chat.wasm -p llama-2-chat
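The chat template matters because each model family expects its prompt in a specific layout. As an illustration only (the actual formatting is done inside llama-chat.wasm, and the system prompt and question below are placeholder text), the llama-2-chat template wraps messages roughly like this:

```shell
# Sketch of the prompt shape implied by `-p llama-2-chat`.
# The placeholders below are illustrative, not part of the script.
SYSTEM="You are a helpful assistant."
QUESTION="What is WasmEdge?"
printf '<s>[INST] <<SYS>>\n%s\n<</SYS>>\n\n%s [/INST]' "$SYSTEM" "$QUESTION"
```

Passing the wrong template name will not crash the app, but the model will see malformed prompts and produce low-quality answers, which is why the script matches the template to the model for you.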

Web UI

The llama-api-server.wasm app creates a web server that offers both an API-based and a web-based chat interface for the LLM. It is written in simple Rust, and you can find its source code here. You can download the Wasm app as follows; the same binary works on all devices.

curl -LO https://github.com/LlamaEdge/LlamaEdge/releases/latest/download/llama-api-server.wasm

As with the CLI app, the script uses the following command to run the Wasm application. The -p parameter indicates the chat template the model requires for formatting chat messages. You can find a list of models and their corresponding chat template names here.

wasmedge --dir .:. --nn-preload default:GGML:AUTO:llama-2-7b-chat.Q5_K_M.gguf llama-api-server.wasm -p llama-2-chat
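Because the server is OpenAI-compatible, any OpenAI-style client can talk to it. The sketch below builds a standard chat-completion request body; the model name and messages are illustrative placeholders, and the endpoint path /v1/chat/completions is the usual OpenAI-style path, assumed here along with the default port 8080 mentioned above.

```shell
# Build an OpenAI-style chat request body (placeholder model and messages).
cat > request.json <<'EOF'
{
  "model": "llama-2-7b-chat",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is WasmEdge?"}
  ]
}
EOF

# With the server running locally, send it (left commented out here so the
# snippet stands alone; uncomment once the server is up):
# curl -s http://127.0.0.1:8080/v1/chat/completions \
#   -H 'Content-Type: application/json' \
#   -d @request.json
```

Any tool that already speaks the OpenAI chat API should work the same way once pointed at the local address.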

The tech stack

As we have seen, the run-llm.sh applications are written in Rust and compiled to Wasm for cross-platform deployment. This stack provides a strong alternative to Python-based AI inference: there are no complex Python packages or C++ toolchains to install.

The Rust program manages the user input, tracks the conversation history, transforms the text into the model's required chat template, and runs the inference operations using the WASI-NN API. Rust is the language of the AGI. The Rust + WasmEdge stack provides a unified cloud computing infra that spans from IoT devices to the edge cloud, on-prem servers, and the public cloud. The key benefits are as follows.

  • Lightweight. The total runtime size is 30 MB, as opposed to 4 GB for Python and 350 MB for Ollama.
  • Fast. Full native speed on GPUs.
  • Portable. Single cross-platform binary on different CPUs, GPUs and OSes.
  • Secure. Sandboxed and isolated execution on untrusted devices.
  • Container-ready. Supported in Docker, containerd, Podman, and Kubernetes.
  • OpenAI compatible. Seamlessly integrates into the OpenAI tooling ecosystem, including LangChain, LlamaIndex, and flows.network.

Whether you're a developer, researcher, or just an AI enthusiast, run-llm.sh offers an efficient and accessible way to harness the power of cutting-edge language models right on your own device. Give it a try!
