WasmEdge Provides a Better Way to Run LLMs on the Edge

Jan 02, 2024 • 3 minutes to read


The Rust + Wasm tech stack provides a portable, lightweight, and high-performance alternative to Python for AI/LLM inference workloads. The WasmEdge runtime supports open-source LLMs through its GGML (i.e., llama.cpp) plugin. Rust developers only need to call the WASI-NN API in their applications to perform AI inference. Once compiled to Wasm, the application can run on any CPU, GPU, and OS that supports WasmEdge.
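For illustration, below is a minimal sketch of such a Rust program, based on the wasmedge-wasi-nn crate used by the examples in second-state/llama-utils. The model alias "default", the prompt, and the fixed 4 KB output buffer are illustrative assumptions; a real application would size the buffer and handle errors more carefully.

use wasmedge_wasi_nn::{ExecutionTarget, GraphBuilder, GraphEncoding, TensorType};

fn main() {
    // Load a GGUF model preloaded by the runtime under the alias "default",
    // e.g. via: wasmedge --nn-preload default:GGML:AUTO:model.gguf app.wasm
    let graph = GraphBuilder::new(GraphEncoding::Ggml, ExecutionTarget::AUTO)
        .build_from_cache("default")
        .expect("failed to load the model");
    let mut ctx = graph
        .init_execution_context()
        .expect("failed to create an execution context");

    // The prompt is passed to the plugin as a UTF-8 byte tensor at input index 0.
    let prompt = "Q: What is the capital of France?\nA:";
    ctx.set_input(0, TensorType::U8, &[1], prompt.as_bytes())
        .expect("failed to set the input tensor");

    // Run inference and read the generated text back from output index 0.
    ctx.compute().expect("inference failed");
    let mut output = vec![0u8; 4096];
    let n = ctx.get_output(0, &mut output).expect("failed to read the output");
    println!("{}", String::from_utf8_lossy(&output[..n]));
}

Compiled with cargo build --target wasm32-wasi, the resulting .wasm file runs unmodified on any machine where WasmEdge and the GGML plugin are installed.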

Recently, the WasmEdge team has updated its GGML plugin to llama.cpp version b1656.

This release supports new model architectures, such as the very popular Mixtral MoE models. It also brings significant performance and compatibility enhancements to the existing llama2 family of models.

Through a WASI-NN-like API, the GGML plugin now returns the number of input and output tokens in each inference call. That is crucial for supporting chatbot, RAG, and agent applications that must actively manage context lengths.
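As a rough sketch of how an application might read these counts, assuming they are returned as a small JSON object at output index 1 (this index and the field names follow the llama-utils examples and are assumptions here, not a documented guarantee):

fn print_token_usage(ctx: &wasmedge_wasi_nn::GraphExecutionContext) {
    // Assumed: after compute(), the plugin places metadata such as
    // {"input_tokens": 27, "output_tokens": 512} at output index 1.
    let mut meta = vec![0u8; 1024];
    let n = ctx.get_output(1, &mut meta).expect("failed to read metadata");
    println!("token usage: {}", String::from_utf8_lossy(&meta[..n]));
    // A real chatbot or RAG app would parse this JSON (e.g. with serde_json)
    // and trim its conversation history before the context window overflows.
}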

The plugin also provides a new API for returning one output word/token at a time. That allows the Wasm application to perform the inference task asynchronously. It enables important use cases such as streaming returns of LLM responses.
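A hedged sketch of that streaming loop, assuming the compute_single, get_output_single, and fini_single calls provided by the wasmedge-wasi-nn crate (buffer size and error handling here are illustrative):

use wasmedge_wasi_nn::{BackendError, Error, GraphExecutionContext};

fn stream_response(ctx: &mut GraphExecutionContext) {
    // Generate and print one token at a time until the model reports the end
    // of the sequence, then reset the per-request streaming state.
    loop {
        match ctx.compute_single() {
            Ok(()) => {
                let mut buf = vec![0u8; 128];
                let n = ctx
                    .get_output_single(0, &mut buf)
                    .expect("failed to read the streamed token");
                print!("{}", String::from_utf8_lossy(&buf[..n]));
            }
            Err(Error::BackendError(BackendError::EndOfSequence)) => break,
            Err(e) => panic!("streaming inference failed: {:?}", e),
        }
    }
    ctx.fini_single().expect("failed to reset the streaming state");
}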

How to install or update WasmEdge with the GGML plugin

Run the following command in your terminal. The WasmEdge installer will install the latest WasmEdge with the GGML plugin.

curl -sSf https://raw.githubusercontent.com/WasmEdge/WasmEdge/master/utils/install.sh | bash -s -- --plugins wasmedge_rustls wasi_nn-ggml

After a successful installation, you can use the applications in second-state/llama-utils to run a large number of open-source LLMs, including the recently released Mixtral 8x7B, on your own laptop or edge device. Specifically, the new GGML plugin supports features that were not available before.

  1. The LLM response can be streamed back to a JavaScript web app when you use the API server to interact with a model (see the example command after this list).
  2. The numbers of input and output tokens are now accurately returned after each inference.
  3. The LLM response is streamed by default when you interact with the LLM via the CLI.
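As a concrete illustration of the first point, a command along these lines (adapted from the second-state/llama-utils README; the exact model file name and the prompt-template flag value depend on the model you download) starts an OpenAI-compatible API server that streams responses to web clients:

wasmedge --dir .:. --nn-preload default:GGML:AUTO:llama-2-7b-chat.Q5_K_M.gguf llama-api-server.wasm -p llama-2-chat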

Below is a video showing how to run Mixtral 8x7B with WasmEdge locally. For detailed instructions on how to run Mixtral 8x7B on your own devices, refer to this article.

For a quick start, you can also run Mixtral 8x7B with a single command on your own device. The run-llm.sh script automatically downloads and installs the WasmEdge runtime, the model files, and the portable Wasm apps for inference. Learn more about run-llm.sh.

The latest version of the GGML plugin adds support for new LLM models and brings significant feature enhancements to WasmEdge.

What’s next

Given the rapid development of llama.cpp, we will regularly release updates for the WasmEdge GGML plugin. Stay tuned.

We have a public roadmap for supporting LLM inference on WasmEdge, and we're keen to incorporate community feedback. Please visit this issue to share your suggestions or feedback. Your input is invaluable to us, and we look forward to hearing from you.

If you have any questions, you are welcome to create a GitHub issue and join our Discord server.
