Getting Started with Gemma-1.1-2b-it

Apr 11, 2024 • 3 minutes to read

The Gemma-1.1-2b-it update brings performance improvements and enhancements based on developer feedback. It also fixes bugs and updates the terms of use for greater flexibility. The release aims to outperform similarly sized open models. For a detailed overview of what changed between Gemma 1.0 and Gemma 1.1, refer to the two tables in the Gemma Model Card on Google AI.

In this article, we will cover:

  • How to run Gemma-1.1-2b-it on your own device
  • How to create an OpenAI-compatible API service for Gemma-1.1-2b-it

We will use LlamaEdge (the Rust + Wasm stack) to develop and deploy applications for this model. There are no complex Python packages or C++ toolchains to install! See why we choose this tech stack.

Run Gemma-1.1-2b-it on your own device

Step 1: Install WasmEdge via the following command line.

curl -sSf https://raw.githubusercontent.com/WasmEdge/WasmEdge/master/utils/install.sh | bash -s -- --plugin wasi_nn-ggml
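Before moving on, it can help to confirm that the `wasmedge` command is actually reachable from your shell. The sketch below is one way to check; `require_cmd` is a hypothetical helper name, and the hint about `$HOME/.wasmedge/env` assumes the installer used its default install location.

```shell
# Hypothetical helper: succeed only if the named command is on PATH.
require_cmd() {
  command -v "$1" >/dev/null 2>&1
}

if require_cmd wasmedge; then
  # Print the installed version as a quick sanity check.
  wasmedge --version
else
  # Assumption: the installer wrote its environment file to the default path.
  echo "wasmedge not found on PATH; try: source \$HOME/.wasmedge/env" >&2
fi
```

If the command is not found right after installation, opening a new terminal is usually enough to pick up the updated PATH.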

Step 2: Download the Gemma-1.1-2b-it model GGUF file. The model file is 5.88 GB, so it could take a while to download.

curl -LO https://huggingface.co/second-state/gemma-1.1-2b-it-GGUF/resolve/main/gemma-1.1-2b-it-Q5_K_M.gguf
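Large downloads sometimes get cut off, and a truncated GGUF file will fail to load later. As a rough sanity check, you could compare the file size against a lower bound. This is only a sketch: `has_min_size` is a hypothetical helper, and the 1 GB threshold is an arbitrary assumption chosen to catch obviously incomplete files, not the exact expected size.

```shell
# Hypothetical helper: succeed if FILE exists and is at least MIN_BYTES long.
has_min_size() {
  file=$1
  min_bytes=$2
  [ -f "$file" ] || return 1
  # Arithmetic expansion strips any padding that wc adds on some systems.
  size=$(( $(wc -c < "$file") ))
  [ "$size" -ge "$min_bytes" ]
}

model=gemma-1.1-2b-it-Q5_K_M.gguf
min_bytes=1000000000  # assumed lower bound (about 1 GB), not the real file size

if has_min_size "$model" "$min_bytes"; then
  echo "$model looks complete enough to try"
else
  echo "$model is missing or smaller than expected; consider downloading it again" >&2
fi
```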

Step 3: Download a cross-platform portable Wasm file for the chat app. The application allows you to chat with the model on the command line. The Rust source code for the app is here.

curl -LO https://github.com/LlamaEdge/LlamaEdge/releases/latest/download/llama-chat.wasm

That's it. You can chat with the model in the terminal by entering the following command.

wasmedge --dir .:. --nn-preload default:GGML:AUTO:gemma-1.1-2b-it-Q5_K_M.gguf llama-chat.wasm -p gemma-instruct -c 2048

The portable Wasm app automatically takes advantage of the hardware accelerators (e.g., GPUs) available on the device.
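If you want to try a different GGUF file or context size, the only parts of the chat command that change are the file name at the end of the `--nn-preload` value and the number after `-c`. The sketch below prints the command rather than running it, so you can inspect it first. `chat_cmd` is a hypothetical helper, and reading `-p` and `-c` as the prompt template and context size is an assumption based on the long options used for the API server later in this article.

```shell
# Hypothetical helper: print (do not run) the chat command for a GGUF file.
# $1 = model file name, $2 = context size (defaults to 2048 as in the article).
chat_cmd() {
  model_file=$1
  ctx_size=${2:-2048}
  printf 'wasmedge --dir .:. --nn-preload default:GGML:AUTO:%s llama-chat.wasm -p gemma-instruct -c %s\n' \
    "$model_file" "$ctx_size"
}

# Reproduces the command shown above.
chat_cmd gemma-1.1-2b-it-Q5_K_M.gguf
```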

Create an OpenAI-compatible API service for Gemma-1.1-2b-it

An OpenAI-compatible web API allows the model to work with a large ecosystem of LLM tools and agent frameworks such as flows.network, LangChain and LlamaIndex.

Download an API server app. It is also a cross-platform portable Wasm app that can run on many CPU and GPU devices.

curl -LO https://github.com/LlamaEdge/LlamaEdge/releases/latest/download/llama-api-server.wasm

Then, download the chatbot web UI so you can interact with the model from your browser.

curl -LO https://github.com/LlamaEdge/chatbot-ui/releases/latest/download/chatbot-ui.tar.gz
tar xzf chatbot-ui.tar.gz
rm chatbot-ui.tar.gz

Next, use the following command to start an API server for the model. Then, open http://localhost:8080 in your browser to start chatting!

wasmedge --dir .:. --nn-preload default:GGML:AUTO:gemma-1.1-2b-it-Q5_K_M.gguf \
  llama-api-server.wasm \
  --prompt-template gemma-instruct \
  --ctx-size 2048 \
  --model-name gemma-1.1-2b
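The server needs some time to load the model before it can answer requests. If you script against it, a small polling loop avoids sending requests too early. This is a minimal sketch: `wait_for_server` is a hypothetical helper, and probing `/v1/models` assumes the server exposes that OpenAI-style endpoint.

```shell
# Hypothetical helper: poll URL once per second until it answers or TRIES runs out.
# $1 = URL to probe, $2 = maximum number of attempts (defaults to 30).
wait_for_server() {
  url=$1
  tries=${2:-30}
  i=0
  while [ "$i" -lt "$tries" ]; do
    # -s: quiet, -f: treat HTTP errors as failures, -o: discard the body.
    if curl -sf -o /dev/null "$url"; then
      return 0
    fi
    i=$((i + 1))
    # Pause between attempts, but not after the last one.
    if [ "$i" -lt "$tries" ]; then
      sleep 1
    fi
  done
  return 1
}
```

From another terminal, `wait_for_server http://localhost:8080/v1/models && echo ready` would then block until the server responds or the attempts are used up.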

From another terminal, you can interact with the API server using curl.

curl -X POST http://localhost:8080/v1/chat/completions \
  -H 'accept:application/json' \
  -H 'Content-Type: application/json' \
  -d '{"messages":[ {"role":"user", "content": "Write a short story about Goku discovering kirby has teamed up with Majin Buu to destroy the world."}], "model":"gemma-1.1-2b"}'
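To send different prompts without editing the JSON by hand, you could generate the request body from shell variables. The sketch below is deliberately naive: `chat_payload` is a hypothetical helper that does no JSON escaping, so it assumes the model name and prompt contain no double quotes, backslashes, or newlines.

```shell
# Hypothetical helper: print a minimal chat request body.
# Assumes $1 (model) and $2 (prompt) need no JSON escaping.
chat_payload() {
  model=$1
  prompt=$2
  printf '{"messages":[{"role":"user","content":"%s"}],"model":"%s"}\n' \
    "$prompt" "$model"
}

# Print the body for a sample prompt.
chat_payload gemma-1.1-2b "What is WebAssembly?"

# To send it, with the API server from the previous step running:
# curl -X POST http://localhost:8080/v1/chat/completions \
#   -H 'Content-Type: application/json' \
#   -d "$(chat_payload gemma-1.1-2b 'What is WebAssembly?')"
```

If `jq` is installed and the response follows the OpenAI chat completion shape, piping the curl output through `jq -r '.choices[0].message.content'` should print just the reply text.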

That’s all. WasmEdge is the easiest, fastest, and safest way to run LLM applications. Give it a try!

Talk to us!

Join the WasmEdge Discord to ask questions and share insights.

Have questions about getting this model running? Raise an issue at second-state/LlamaEdge or book a demo with us to enjoy your own LLMs across devices!
