Casebook E06
Building an English Dictionary with an LLM
Part of AI Coding Systems
In my previous blog post, I mentioned the online dictionary feature in my English learning app. Today, I’ll share the implementation details and how I overcame several challenges using large language models (LLMs).
The Challenge
The requirement was straightforward: when a user selects a word in a passage and clicks “explain,” the app should display the word’s meaning. However, after exploring several dictionary solutions, I discovered that they had some drawbacks:
- No official APIs from major providers like Google Dictionary
- Paid services that would incur ongoing costs
- Free services with rate limiting
I also explored open-source dictionaries like OPTED, but discovered it was based on the 1913 Webster’s Dictionary, making it outdated for modern vocabulary (words like “astronaut” weren’t included).
The LLM Approach
I decided to leverage LLMs to generate my English-to-English dictionary for the following reasons:
- LLMs are trained on massive text corpora containing word meanings
- I could extract these definitions efficiently in a one-time process
- Once generated, I’d have a permanent dictionary without recurring costs
Implementation Details
The overall process begins with a list of common English words. For each word, an LLM is used to generate its meanings.
Step 1: Finding a Common English Word List
I obtained a comprehensive list of common English words from the internet (Common-Words.txt).
Step 2: Creating the Dictionary Generator
With GitHub Copilot’s assistance, I developed code to:
- Process each word in the list
- Generate definitions categorized by parts of speech (noun, verb, adjective, etc.)
- Format the output as JSON, with each word as a key and definitions grouped by part of speech
The prompt I used was very simple:
For each word in the provided list, generate definitions categorized by their respective parts of speech abbreviations (n., v., a., etc.).
Format the output as JSON, with each word as a key and definitions grouped by part of speech.
WORDS: {words_str}
Step 3: Optimizing for Cost and Performance
To minimize costs while processing thousands of words, I implemented several optimizations:
- Self-hosted LLM: The cost of using the OpenAI API to generate a dictionary varies depending on the model, ranging from $5 to $50 for creating a 20MB dictionary. However, this approach carries the risk of request rate limiting. Therefore, I opted for a self-hosted LLM, which offers greater flexibility in model selection. For instance, I used Phi-4 for English-to-English dictionaries and Qwen-2.5 for English-to-Chinese dictionaries.
- Spot VMs: Leveraged discounted cloud instances for the generation task
- vLLM: Used vLLM to accelerate batch processing
- Concurrency: Set up concurrent LLM requests to efficiently utilize batch processing capabilities. From an A100 (80GB) VM (NC24ads), I was able to achieve ~500 tokens/s with a batch size of 16.

The Result
Here is the final result for my online dictionary:

This approach provided several advantages:
- Completeness: It now covers ~80,000 commonly used English words
- One-time cost: No ongoing API fees
- No rate limits: The dictionary is stored locally in my app
- Control: I could format the definitions exactly as needed for my UI
Post-processing
To maintain high quality standards for my English dictionary, I implemented a post-processing workflow for the dictionary file:
- Normalizing all Unicode characters throughout the file
- Converting all English words to lowercase for consistency
- Eliminating all words with empty definitions
- Removing problematic words containing whitespace, underscores, or hyphens
- Filtering out words that don’t appear in the Common-Words.txt reference list (using case-insensitive matching)
- Consolidating any duplicate words to ensure each entry is unique
This cleaning process ensures the dictionary remains error-free and optimally structured for reliable performance.
Conclusion
By leveraging LLMs to generate a dictionary, I created a cost-effective solution that provides high-quality definitions without the limitations of existing APIs. The initial investment in processing time and computing resources yielded a permanent asset for my English learning app.
This approach showcases how LLMs can serve as effective tools for creating a comprehensive and practical English dictionary.
More in AI Coding Systems
- My Vibe Coding Journey - Developing an AI-Powered English Learning Application
- ReAct Agents: Building From Scratch - Native Function Calling vs. Custom TAO Parsing
- Automated Prompt Optimization in DSPy: Mechanisms, Algorithms, and Observability
- One Month of Vibe Coding with GitHub Copilot
- GitHub Copilot CLI System Prompt