Gemma 3n was built hand‑in‑glove with some of the biggest mobile‑chip makers out there. It shares the same clever architecture that’ll power the next‑gen Gemini Nano—so you get rock‑solid, on‑device smarts without ever pinging the cloud.
Gemma‑3n‑E2B‑it is Google DeepMind's newest edge‑first transformer model: a 4.46B‑parameter MatFormer that behaves like a 2B model in RAM, runs wholly offline on as little as 2 GB VRAM thanks to Per‑Layer Embeddings (PLE), and still delivers a 32,000‑token context and multimodal I/O: text + image + audio + video.
Quantized into GGUF, Gemma-3n slips seamlessly into the Rust + WasmEdge stack via LlamaEdge's OpenAI‑compatible server, letting agents, chatbots, or vision pipelines live entirely on laptops, Raspberry Pi boards, and, going forward, Android phones. The 2 GB VRAM footprint for E2B (3 GB for E4B) enables inference on mid‑tier GPUs and even modern phones. Google benchmarks it as "the world's best single‑accelerator model." See why we choose this tech stack.
Quick‑start (Rust + WasmEdge)
Step 1 Install WasmEdge, which is only a few MB in size.
Download and install WasmEdge with the following command.
curl -sSf https://raw.githubusercontent.com/WasmEdge/WasmEdge/master/utils/install_v2.sh | bash -s -- -v 0.14.1
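Once the installer finishes, open a new terminal (or source the env file the installer points you to) and confirm the runtime is on your PATH:

wasmedge --version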
Step 2 Download the quantized model
We’ll download the gemma-3n-E2B-it-Q5_K_M, which is about 3 GB.
curl -LO https://huggingface.co/second-state/gemma-3n-E2B-it-GGUF/resolve/main/gemma-3n-E2B-it-Q5_K_M.gguf
See the full quant table on the model card if you need smaller Q4 or faster Q2 variants.
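After the download completes, a quick file-size check confirms you got the whole file:

ls -lh gemma-3n-E2B-it-Q5_K_M.gguf   # expect roughly 3 GB for the Q5_K_M quant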
Step 3 Download the LlamaEdge API server
It is also a tiny, portable, cross-platform Wasm app that runs on many CPU and GPU devices.
curl -LO https://github.com/LlamaEdge/LlamaEdge/releases/latest/download/llama-api-server.wasm
LlamaEdge v0.23.1+ exposes an OpenAI‑compatible /v1/chat/completions endpoint.
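You can list the server's CLI options before starting it. This assumes the wasi_nn plugin installed in Step 1 is available, so the Wasm module can instantiate:

wasmedge llama-api-server.wasm --help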
Step 4 Run the Gemma‑3n‑E2B‑it model
Next, use the following command to start a LlamaEdge API server for the Gemma‑3n‑E2B‑it model. LlamaEdge provides an OpenAI‑compatible API, so you can connect any chatbot client or agent to it!
wasmedge --dir .:. \
--nn-preload default:GGML:AUTO:gemma-3n-E2B-it-Q5_K_M.gguf \
llama-api-server.wasm \
--prompt-template gemma-3 \
--ctx-size 32000 \
--model-name gemma-3n-E2B-it
The --ctx-size 32000 flag unlocks the full 32K context that Gemma‑3n ships with. If you don't have a machine with big memory, you can reduce the number here (for example, --ctx-size 8192).
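Once the server reports that it is listening on port 8080, you can sanity-check it from another terminal. The /v1/models endpoint is part of the OpenAI-style API surface LlamaEdge exposes (the exact endpoint list can vary by version):

curl http://localhost:8080/v1/models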
Test a multimodal chat (text prompt)
curl -X POST http://localhost:8080/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "gemma-3n-E2B-it",
"messages": [
{"role":"system","content":"You are a helpful assistant."},
{"role":"user","content":"Tell me a story about open source."}
]
}'
Below is the response from Gemma-3n-E2B.
{"id":"chatcmpl-f7d9a15f-1c1b-4bf1-95f8-653087851e0c","object":"chat.completion","created":1752676722,"model":"gemma-3n-E2B-it","choices":[{"index":0,"message":{"content":"Okay, here's a concise story about open source:\n\nAnya was tired of her company's clunky software. She couldn't find the feature she needed, and it constantly broke. She started looking online and discovered \"LibreLang,\" a powerful language processing tool that many others used. \n\nInstead of buying it, Anya decided to contribute back! She fixed a bug, wrote documentation, and shared her improvements. Others saw her work, added their own, and LibreLang grew stronger. Soon, Anya's company adopted LibreLang, realizing collaboration and community-driven development was far more powerful than proprietary solutions.","role":"assistant"},"finish_reason":"stop","logprobs":null}],"usage":{"prompt_tokens":28,"completion_tokens":132,"total_tokens":160}}%
Join the WasmEdge Discord to share insights. Questions about getting this model running? Please go to second-state/LlamaEdge to raise an issue, or book a demo with us to enjoy your own LLMs across devices!
See also: Run Hugging Face's SmolLM across devices.