Getting Started with SmolLM3‑3B‑GGUF for Long‑Context Multilingual Reasoning

Jul 16, 2025 • 2 minutes to read

SmolLM3 is a compact 3 billion‑parameter transformer that delivers state‑of‑the‑art performance at the 3B–4B scale, supporting six major languages and extended contexts up to 128,000 tokens.

This powerful yet compact model offers capabilities comparable to 4B models while staying lightweight enough for edge devices. It excels at long-context reasoning, handling up to 128,000 tokens from documents, transcripts, or logs. Its multilingual instruction tuning across English, French, Spanish, German, Italian, and Portuguese also makes it a strong fit for global applications.

Using the Rust + WasmEdge stack and LlamaEdge’s OpenAI‑compatible API, you can self‑host a powerful multilingual agent or summarizer entirely on your own hardware, and even across devices. See why we chose this tech stack.

No extra Python or C++ toolchains are needed; WasmEdge handles the runtime.

Step 1: Install the WasmEdge runtime

Download and install WasmEdge with the following command. The runtime itself is only a few MBs.

curl -sSf https://raw.githubusercontent.com/WasmEdge/WasmEdge/master/utils/install_v2.sh | bash -s -- -v 0.14.1
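Once the installer finishes, it places WasmEdge under your home directory. Depending on your shell, you may need to load the environment file it creates before the wasmedge command is on your PATH. A quick sanity check (the env path below assumes the default install location):

source $HOME/.wasmedge/env
wasmedge --version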

Step 2: Download SmolLM3‑3B‑GGUF

Download the quantized SmolLM3‑3B‑GGUF, which is 2.1 GB.

curl -LO https://huggingface.co/second-state/SmolLM3-3B-GGUF/resolve/main/SmolLM3-3B-Q5_K_M.gguf
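The Q5_K_M build is a good balance of size and quality. If you want a smaller download, the same Hugging Face repo may list other quantization levels as well; the filename below is an assumption, so check the repo's file list before using it.

# Hypothetical example: a smaller Q4_K_M quantization, if the repo provides one
curl -LO https://huggingface.co/second-state/SmolLM3-3B-GGUF/resolve/main/SmolLM3-3B-Q4_K_M.gguf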

Step 3: Download LlamaEdge API Server

It is a tiny, cross-platform LLM inference and API server (also only a few MBs) that runs on many CPU and GPU devices.

curl -LO https://github.com/LlamaEdge/LlamaEdge/releases/latest/download/llama-api-server.wasm
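Before starting the server, it is worth confirming that both artifacts are in your current working directory, since the run command in the next step loads them by relative path:

ls -lh SmolLM3-3B-Q5_K_M.gguf llama-api-server.wasm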

Step 4: Run the SmolLM3‑3B model

Next, use the following command to start a LlamaEdge API server for the SmolLM3‑3B model. LlamaEdge provides an OpenAI‑compatible API, so you can connect any chatbot client or agent to it!

wasmedge --dir .:. --nn-preload default:GGML:AUTO:SmolLM3-3B-Q5_K_M.gguf \
  llama-api-server.wasm \
  --prompt-template chatml \
  --model-name SmolLM3-3B \
  --ctx-size 128000

--ctx-size 128000 unlocks the full 128K context that SmolLM3-3B supports. If your machine does not have that much memory, you can reduce the number here, as in the sketch below.
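For example, here is the same command with a 32K context window (an arbitrary smaller value; pick whatever fits your RAM):

wasmedge --dir .:. --nn-preload default:GGML:AUTO:SmolLM3-3B-Q5_K_M.gguf \
  llama-api-server.wasm \
  --prompt-template chatml \
  --model-name SmolLM3-3B \
  --ctx-size 32000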

Next, let's send an API request to see if it works.

curl -X POST http://localhost:8080/v1/chat/completions \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{"messages":[{"role":"system", "content": "You are a helpful assistant. Try to be as brief as possible."}, {"role":"user", "content": "Where is the capital of Texas?"}]}'

If everything works well, you will see the following message.

{"id":"chatcmpl-c204cbe5-4969-4f7f-a866-f9ccb4803b02","object":"chat.completion","created":1752846316,"model":"SmolLM3-3B","choices":[{"index":0,"message":{"content":"The capital of Texas is Austin.","role":"assistant"},"finish_reason":"stop","logprobs":null}],"usage":{"prompt_tokens":91,"completion_tokens":13,"total_tokens":104}}%

Enjoy building with SmolLM3!

Join the WasmEdge Discord to share insights. Questions about getting this model running? Raise an issue at second-state/LlamaEdge, or book a demo with us to run your own LLMs across devices!

See also: Run Qwen3 across your devices.
