58 Dispatches · 5 Desks · 68 Topics · 3 Series

Unofficial Azure Club

“IaaS, PaaS, Cloud Native, Kubernetes, Everything is possible in this website :)”



Casebook E06

Building an English Dictionary with an LLM

Part of AI Coding Systems

In my previous blog post, I mentioned the online dictionary feature in my English learning app. Today, I’ll share the implementation details and how I overcame several challenges using large language models (LLMs).

The Challenge

The requirement was straightforward: when a user selects a word in a passage and clicks “explain,” the app should display the word’s meaning. However, after exploring several dictionary solutions, I discovered that they had some drawbacks:

  • No official APIs from major providers like Google Dictionary
  • Paid services that would incur ongoing costs
  • Free services with rate limiting

I also explored open-source dictionaries like OPTED, but discovered it was based on the 1913 Webster’s Dictionary, making it outdated for modern vocabulary (words like “astronaut” weren’t included).

The LLM Approach

I decided to leverage LLMs to generate my English-to-English dictionary for the following reasons:

  1. LLMs are trained on massive text corpora containing word meanings
  2. I could extract these definitions efficiently in a one-time process
  3. Once generated, I’d have a permanent dictionary without recurring costs

Implementation Details

The overall process begins with a list of common English words. For each word, an LLM is used to generate its meanings.

Step 1: Finding a Common English Word List

I obtained a comprehensive list of common English words from the internet (Common-Words.txt).

Step 2: Creating the Dictionary Generator

With GitHub Copilot’s assistance, I developed code to:

  • Process each word in the list
  • Generate definitions categorized by parts of speech (noun, verb, adjective, etc.)
  • Format the output as JSON, with each word as a key and definitions grouped by part of speech

The prompt I used was very simple:

For each word in the provided list, generate definitions categorized by their respective parts of speech abbreviations (n., v., a., etc.). 
Format the output as JSON, with each word as a key and definitions grouped by part of speech.

WORDS: {words_str}

Step 3: Optimizing for Cost and Performance

To minimize costs while processing thousands of words, I implemented several optimizations:

  • Self-hosted LLM: The cost of using the OpenAI API to generate a dictionary varies depending on the model, ranging from $5 to $50 for creating a 20MB dictionary. However, this approach carries the risk of request rate limiting. Therefore, I opted for a self-hosted LLM, which offers greater flexibility in model selection. For instance, I used Phi-4 for English-to-English dictionaries and Qwen-2.5 for English-to-Chinese dictionaries.
  • Spot VMs: Leveraged discounted cloud instances for the generation task
  • vLLM: Used vLLM to accelerate batch processing
  • Concurrency: Set up concurrent LLM requests to efficiently utilize batch processing capabilities. From an A100 (80GB) VM (NC24ads), I was able to achieve ~500 tokens/s with a batch size of 16.
vLLM batch throughput
vLLM batch throughput

The Result

Here is the final result for my online dictionary:

Online dictionary result
Online dictionary result

This approach provided several advantages:

  1. Completeness: It now covers ~80,000 commonly used English words
  2. One-time cost: No ongoing API fees
  3. No rate limits: The dictionary is stored locally in my app
  4. Control: I could format the definitions exactly as needed for my UI

Post-processing

To maintain high quality standards for my English dictionary, I implemented a post-processing workflow for the dictionary file:

  1. Normalizing all Unicode characters throughout the file
  2. Converting all English words to lowercase for consistency
  3. Eliminating all words with empty definitions
  4. Removing problematic words containing whitespace, underscores, or hyphens
  5. Filtering out words that don’t appear in the Common-Words.txt reference list (using case-insensitive matching)
  6. Consolidating any duplicate words to ensure each entry is unique

This cleaning process ensures the dictionary remains error-free and optimally structured for reliable performance.

Conclusion

By leveraging LLMs to generate a dictionary, I created a cost-effective solution that provides high-quality definitions without the limitations of existing APIs. The initial investment in processing time and computing resources yielded a permanent asset for my English learning app.

This approach showcases how LLMs can serve as effective tools for creating a comprehensive and practical English dictionary.

More in AI Coding Systems

  1. My Vibe Coding Journey - Developing an AI-Powered English Learning Application
  2. ReAct Agents: Building From Scratch - Native Function Calling vs. Custom TAO Parsing
  3. Automated Prompt Optimization in DSPy: Mechanisms, Algorithms, and Observability
  4. One Month of Vibe Coding with GitHub Copilot
  5. GitHub Copilot CLI System Prompt