Building a Translation Agent on LlamaEdge

By MileyFu, CNCF Ambassador, DevRel and Founding Member of WasmEdge runtime.

Prof. Andrew Ng's agentic translation is a great demonstration on how to cooridnate multiple LLM “agents” to work on a single task. It allows multiple smaller LLMs (like Llama-3 or Gemma-2) to work gether and produce better results than a single large LLM (like ChatGPT).

The translation agent is a great fit for LlamaEdge, which provides a lightweight, embeddable, portable, and Docker-native AI runtime for many different types of models and hardware accelerators. With LlamaEdge, you can build and distribute translation apps with embedded LLMs and prompts that can run on edge devices.

Introduction to the LLM Translation Agent

This LLM Translation Agent is designed to facilitate accurate and efficient translation across multiple languages. It employs open source LLMs (Large Language Models) to provide high-quality translations. You can use your own fine-tuned models or any LLMs on Hugging Face like Meta's Llama 3.

For detailed commands on starting and running this agent, please visit GitHub - Second State/translation-agent.

To get started, clone the Translation Agent.

git clone https://github.com/second-state/translation-agent.git
    
cd translation-agent
git checkout use_llamaedge

Here, we run Llama-3-8B, Gemma-2-9B, and Phi-3-medium-128k locally and our Translation Agent on top of them respectively to showcase their translation quality. We test a simple translation task to see the results so as to compare their translation capabilities. You will need to install WasmEdge and the LlamaEdge API server to run those models across major GPU and CPU platforms.

curl -sSf https://raw.githubusercontent.com/WasmEdge/WasmEdge/master/utils/install_v2.sh | bash -s -- -v 0.13.5

curl -LO https://github.com/LlamaEdge/LlamaEdge/releases/latest/download/llama-api-server.wasm

You will also need the following configurations and prerequisites to run the agent app.

export OPENAI_BASE_URL="http://localhost:8080/v1"
export PYTHONPATH=${PWD}/src
export OPENAI_API_KEY="LLAMAEDGE"

pip install python-dotenv
pip install openai tiktoken icecream langchain_text_splitters

Demo 1: Running Translation Agents with Llama-3-8B

First, let's run the translation agent with Meta AI's popular Llama-3 model. We select the smallest Llama-3 model (the 8b model) for this demo. The translation task is from Chinese to English. Our source text is in Chinese, a brief intro to the ancient Chinese royal palace, the Forbidden City.

Step 1.1: Run Llama-3-8B on your own device

See detailed instructions here: Run Llama-3-8B on your own device

Download the Llama-3-8B model GGUF file. Since the size of the model is 5.73 GB. It can take a while to download.

curl -LO https://huggingface.co/second-state/Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct-Q5_K_M.gguf

Next, use the following command to start an API server for the model.

wasmedge --dir .:. --nn-preload default:GGML:AUTO:Meta-Llama-3-8B-Instruct-Q5_K_M.gguf \
  llama-api-server.wasm \
  --prompt-template llama-3-chat \
  --ctx-size 4096 \
  --model-name llama-3-8b

Step 1.2 Run the Translation Agent on top of Llama-3-8B

Find the examples/example_script.py file in your cloned agent repo and review its code. It tells the agent where to find your document and how to translate it. Change the model name to the one you are using, here we’re using llama-3-8b model; also change the source and target languages you want (here we put Chinese as the source language and English as the target language).

import os
import translation_agent as ta
        
if __name__ == "__main__":
    source_lang, target_lang, country = "Chinese", "English", "Britain"
    
    relative_path = "sample-texts/forbiddencity.txt"
    script_dir = os.path.dirname(os.path.abspath(__file__))
    
    full_path = os.path.join(script_dir, relative_path)
    
    with open(full_path, encoding="utf-8") as file:
        source_text = file.read()
    
    print(f"Source text:\n\n{source_text}\n------------\n")
    
    translation = ta.translate(
            source_lang=source_lang,
            target_lang=target_lang,
            source_text=source_text,
            country=country,
            model="llama-3-8b",
    )
    
    print(f"Translation:\n\n{translation}")

Then, you can find a examples/sample-texts folder in your cloned repo. Put your file you want to translate in this folder and get its path. Here because we named our source text forbiddencity.txt, the relative path to the document would be sample-texts/forbiddencity.txt.

Run the below commands to have your text file translated into English.

cd examples
python example_script.py

Wait for several minutes and you will have a fully translated version appear on your terminal screen.

Demo 2: Running Translation Agents with Gemma-2-9B

The benefit of running the Translation Agent with LlamaEdge is the ability for users to choose and embed different LLMs for different agentic tasks. To demonstrate this point, we will now change the translation agent LLM from Llama-3-8b to Google's Gemma-2-9b, which is of similar size but scores higher on many language-related benchmarks.

The translation task is the same as before. Our source text is in Chinese, a brief intro to the ancient Chinese royal palace, the Forbidden City. The translation target is English.

Step 2.1 Run Gemma-2-9B on your own device

See detailed instructions here: Run Gemma-2-9B on your own device

Download the Gemma-2-9B-it model GGUF file. Since the size of the model is 6.40G, it could take a while to download.

curl -LO https://huggingface.co/second-state/gemma-2-9b-it-GGUF/resolve/main/gemma-2-9b-it-Q5_K_M.gguf

Start an API server for the model.

wasmedge --dir .:. --nn-preload default:GGML:AUTO:gemma-2-9b-it-Q5_K_M.gguf \
  llama-api-server.wasm \
  --prompt-template gemma-instruct \
  --ctx-size 4096 \
  --model-name gemma-2-9b

Step 2.2 Run the Translation Agent to run on top of Gemma-2-9B

Find the examples/example_script.py file in your cloned agent repo and review its code. It tells the agent where to find your document and how to translate it. Change the model name to the one you are using, here we’re using gemma-2-9b model; also change the source and target languages you want (here we put Chinese as the source language and English as the target language).

import os  
import translation_agent as ta  
    
if __name__ == "__main__":
    source_lang, target_lang, country = "Chinese", "English", "Britain"
    
    relative_path = "sample-texts/forbiddencity.txt"
    script_dir = os.path.dirname(os.path.abspath(__file__))
    
    full_path = os.path.join(script_dir, relative_path)
    
    with open(full_path, encoding="utf-8") as file:
        source_text = file.read()
    
    print(f"Source text:\n\n{source_text}\n------------\n")
    
    translation = ta.translate(
            source_lang=source_lang,
            target_lang=target_lang,
            source_text=source_text,
            country=country,
            model="gemma-2-9b",
    )
    
    print(f"Translation:\n\n{translation}")

Run the below commands to have your text file translated into English.

cd examples    
python example_script.py

You can find the translated result in English here.

Demo 3: Running Translation Agents with Phi-3-Medium long context model

The Llama-3 and Gemma-2 models are great LLMs, but they have relatively small context windows. The agent requires all text to fit into the LLM context window, and that limits the size of articles they can translate. To fix this problem, we could select an open source LLM with a large context window. For this demo, we choose Microsoft's Phi-3-medium-128k model, which has a massive 128k (over 100 thousand words or the length of several books) context window.

We run a lengthy Chinese article on Forbidden City's collaboration with the Varsaille Palace through our Translation Agent powered by a Phi-3-medium-128k model we start locally.

Step 3.1: Run Phi-3-medium-128k on your own device

See detailed instructions here: Getting Started with Phi-3-mini-128k.

Download the Phi-3-Medium-128k model GGUF file.

curl -LO https://huggingface.co/second-state/Phi-3-medium-128k-instruct-GGUF/resolve/main/Phi-3-medium-128k-instruct-Q5_K_M.gguf

Run the following command to start an API server for the model with a 128k context window.

wasmedge --dir .:. --nn-preload default:GGML:AUTO:Phi-3-medium-128k-instruct-Q5_K_M.gguf \
  llama-api-server.wasm \
  --prompt-template phi-3-chat \
  --ctx-size 128000 \
  --model-name phi-3-medium-128k

Step 3.2 Clone and run the Translation Agent on top of Phi-3-medium-128k

Find the examples/example_script.py file in your cloned agent repo and review its code. It tells the agent where to find your document and how to translate it. Change the model name to the one you are using, here we’re using phi-3-medium-128k model; also change the source and target languages you want (here we put Chinese as the source language and English as the target language).

import os  
import translation_agent as ta  
    
if __name__ == "__main__":
    source_lang, target_lang, country = "Chinese", "English", "Britain"
    
    relative_path = "sample-texts/long_article.txt"
    script_dir = os.path.dirname(os.path.abspath(__file__))
    
    full_path = os.path.join(script_dir, relative_path)
    
    with open(full_path, encoding="utf-8") as file:
        source_text = file.read()
    
    print(f"Source text:\n\n{source_text}\n------------\n")
    
    translation = ta.translate(
            source_lang=source_lang,
            target_lang=target_lang,
            source_text=source_text,
            country=country,
            model="phi-3-medium-128k",
    )
    
    print(f"Translation:\n\n{translation}")

Then, you can find a examples/sample-texts folder in your cloned repo. Put your file you want to translate in this folder and get its path. Here because we named our source text long_article.txt, the relative path to the document would be sample-texts/long_article.txt.

cd examples
python example_script.py

The translated results were impressive, with the translation capturing the nuances and context of the original text with high fidelity.

Evaluation of Translation Quality

The three models, Llama-3-8B, Gemma-2-9B, and Phi-3-medium, have exhibited varying levels of performance in translating complex historical and cultural content from Chinese to English.

Llama-3-8B provides a translation that effectively captures the factual content but shows occasional stiffness in language, possibly indicating a direct translation approach that doesn't fully adapt idiomatic expressions. It does not keep section title and the format of the original text and left certain part untranslated.

In contrast, The translation by Gemma-2-9B is quite accurate and retains the original meaning of the short intro article of Forbidden city. Gemma-2-9B's translation exhibits a smooth and natural English flow, suggesting a sophisticated understanding of both the source language and the target language’s grammatical structures. The choice of words and sentence structures in Gemma-2-9B's output demonstrates a high degree of linguistic finesse, suggesting it might be well-suited for translating formal and historically nuanced texts.

The Phi-3-medium-128k model can translate book-length text from Chinese to English. It demonstrates robust capabilities in handling large volumes of complex content, suggesting advanced memory handling and contextual awareness. The quality of translation remains consistent even with increased text length, indicating Phi's utility in projects requiring extensive, detailed translations. But you can see it makes certain mistakes like mistaken “Wenhua Hall” as “also known as Forbidden City” in the first paragraph.

Overall, each model has its strengths, with Gemma-2-9B standing out for linguistic finesse and Phi-3-medium-128k for handling lengthy texts.

Conclusion

LlamaEdge provides an easy way to embed different open-source LLMs into your agentic applications to fully take advantage of their finetuned capabilities for specific tasks. The result application can be properly packaged and distributed as a single app that runs across major CPU and GPU devices.