Imagine this: three months until the Spanish driving theory exam. My Spanish vocabulary is close to zero. What does a data person do? Learn as efficiently as possible. I built a fast pipeline out of a custom ticket corpus, lemma frequency analysis, and Anki cards, and added AI only where it truly speeds things up.
Quick Overview (Executive Summary)
- Goal: confidently read question wording in three months (learn ~1,500 words).
- Pipeline: corpus (ticket set) → cleaning → spaCy (lemmas/POS) → frequencies → filter known words from Anki → .apkg.
- Outcome: 1,500 words learned; I can read and understand the tickets confidently.
Below is how to repeat the path without “magic” or code overload.
Introduction
This is a story about a fast way to learn only the part of a language you actually need. In this mode we don’t try to learn the whole language first — we focus on the words that serve our goal. For example, we take driving theory tickets in the target language, find the most frequent words, turn them into digestible cards, and in a few weeks we reach confident reading. No magic — just careful analytics and a bit of AI where people get bored (authoring cards for Anki).
I’ll show the pipeline I built (“texts → frequencies → word cards”), where statistics decide and AI helps. Repository link: https://github.com/SergMedin/spanish_analyser_pb
Moving to Spain and Finding the Plan
I recently moved to Spain and set a simple target: get a local driver’s license by passing the exam in Spanish. Before that, I knew virtually no Spanish. I chose a statistical path: first learn the most frequent words that appear in the tickets.
You might ask, “Why is a frequency‑first approach better than general language classes?” Schools often start with general topics like “professions” or “food and travel.” My approach takes the actual texts (in my case — Spanish driving theory tickets) and yields the exact lemmas that most often occur in the wording. This quickly boosts comprehension, reduces cognitive load, and makes progress measurable: the share of unknown words drops week after week. Grammar catches up as needed — without drowning in theory.
At first it seemed like success would split evenly between grammar and vocabulary. But I acted like an analyst: I ran a small experiment. I took about ten tickets and, for every stumbling point, marked whether it was grammar or a missing word. “Word” won almost every time. When lemmas are familiar, even unusual tenses are readable from context. Conclusion: if your goal is to read and answer, 95% of your effort should go into vocabulary.
Before building my tool, I checked existing resources:
- wordfreq (es)
- SUBTLEX-ESP
- CREA (RAE)
- Corpus del Español
- Anki Shared Decks: Spanish
- Subs2SRS / Migaku
They’re all solid, but they miss my domain (Spanish driving theory). SUBTLEX‑ESP and Corpus del Español cover films and books; the tickets feature niche terms like “ceder el paso,” “glorieta,” and “carril reversible.” Public Anki decks cover basic or tourist vocabulary; subtitle‑based tools are tuned for series and speech. So I decided to build my own tool: it should generate a deck of cards for words that aren’t in my Anki yet, and only those.
By the way, I study with Anki — a spaced‑repetition app: a card appears exactly when you’re about to forget it. New lemmas stick faster. Highly recommend.
Architecture
I started on paper: input is ticket texts, output is a deck of words that aren’t in my Anki yet. Then I kept asking, “What’s the shortest, safest route from input to output?” Anything that didn’t move the goal forward — cut.
First, what not to do. I chose not to complicate things with n‑grams: for reading tickets, they add little and slow you down. The final architecture emerged naturally: a simple pipeline where every block does its job and doesn’t interfere with its neighbors.
Architecture (short)
[ Web Scraper ]                         (auth, retries with pauses)
      |
      | (HTML)
      v
[ Text Cleaner ]
      |
      | (clean text)
      v
[ spaCy (context) ]                     (lemmas, POS, morphology)
      |
      v
[ Frequency Analyzer ]
      |
      |<--(filter)-- [ Known Words (Anki) ]
      v
[ Unknown List ]
      |
      v
[ Deck Generator (.apkg) ] --(requests)--> [ AI Generator ]
                                           (BackText-HTML by template,
                                            cache and backoff under load)

[ Cache ] <- html / anki / spacy / openai (shared by all stages)
How It Came Together
- Web Scraper — the loader. Function: ethically download tickets with auth, retry with pauses, and avoid re‑downloading what’s already stored.
- Text Cleaner — preprocessing. Function: turn HTML into clean Spanish text for analysis. Removes tags, normalizes whitespace and articles; a “Latin letters share” heuristic flags junk and odd inserts (see the cleaning sketch after this list).
- Analyzer (spaCy) — the NLP layer. Function: process full sentences, extract lemmas/POS/morphology. Without context you lose meaning and gender (e.g., “la capital” vs “el capital”), so I analyze full sentences.
- Frequency Analyzer — aggregator. Function: count lemma frequencies with gender/POS in mind so you don’t split your attention across variants. Outcome: a short, honest list of “what to learn first.”
- Known Words (Anki) — knowledge comparator. Function: remove already‑known lemmas/POS so you don’t relearn them. Uses the Anki plugin, checks availability up front, and leverages cache.
- AI card generator — builds the back side. Function: from a fixed template, assemble BackText‑HTML (three translations, a short definition, “do not confuse with…,” one example, synonyms). Requests are cached to save budget; pauses/backoff protect against overload.
- Deck Generator (.apkg) — packager for Anki. Function: assemble notes with correct fields and produce an importable package.
That’s a straight path from raw HTML to a ready‑to‑study deck.
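To make the cleaning step concrete, here is a minimal sketch of that kind of cleaner. It is illustrative only: the helper names and the 0.7 Latin-letter threshold are placeholders I picked for the example, not necessarily what the repository does.

```python
import re
from bs4 import BeautifulSoup  # pip install beautifulsoup4

LATIN_RE = re.compile(r"[a-záéíóúüñA-ZÁÉÍÓÚÜÑ]")

def html_to_text(html: str) -> str:
    """Strip tags and collapse runs of whitespace into single spaces."""
    text = BeautifulSoup(html, "html.parser").get_text(separator=" ")
    return re.sub(r"\s+", " ", text).strip()

def looks_like_spanish(text: str, min_latin_share: float = 0.7) -> bool:
    """Heuristic: judge the share of (Spanish) Latin letters among
    non-space characters; low share = junk or an odd insert.
    The 0.7 threshold is a placeholder for this example."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return False
    latin = sum(1 for c in chars if LATIN_RE.match(c))
    return latin / len(chars) >= min_latin_share

# Usage:
# clean = html_to_text(raw_html)
# if looks_like_spanish(clean):
#     texts_for_spacy.append(clean)
```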
Let’s see how the analyzer helps “see” meaning in context.
spaCy: the core is a single text analyzer that works over full documents, not bags of words. This solves cases where identical spellings change meaning by gender, e.g., “el/la capital.” spaCy looks at nearby words and correctly distinguishes “la capital” (capital city) from “el capital” (capital as assets). For each token it keeps the original form, lemma, POS, morphology, and offsets.
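A minimal sketch of this step plus the frequency counting described above, assuming spaCy 3.x and the small Spanish model es_core_news_sm (the repository may use a different model or settings). Counting by (lemma, POS, gender) is what keeps “el capital” and “la capital” apart:

```python
from collections import Counter
import spacy  # pip install spacy && python -m spacy download es_core_news_sm

nlp = spacy.load("es_core_news_sm")  # model choice is an assumption

def lemma_frequencies(texts):
    """Count lemmas keyed by (lemma, POS, gender), so 'el capital' (Masc)
    and 'la capital' (Fem) stay separate entries."""
    counts = Counter()
    for doc in nlp.pipe(texts):
        for tok in doc:
            if tok.is_alpha and not tok.is_stop:
                gender = tok.morph.get("Gender")  # e.g. ['Fem'] or []
                key = (tok.lemma_.lower(), tok.pos_, gender[0] if gender else "")
                counts[key] += 1
    return counts

freqs = lemma_frequencies([
    "Debe ceder el paso al entrar en la glorieta.",
    "La capital tiene más tráfico que el resto del país.",
])
for (lemma, pos, gender), n in freqs.most_common(10):
    print(f"{lemma:<12} {pos:<6} {gender:<5} {n}")
```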
To avoid relearning duplicates, we reconcile with Anki and prepare an import of only new words.
Anki integration: before touching decks, the system checks: is Anki running? Is the plugin available? If not — you get a clear prompt. The list of “already known” words comes from my decks so I don’t add what I already have in Anki.
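For illustration, here is how that check and fetch could look with the popular AnkiConnect add-on, which exposes an HTTP API on 127.0.0.1:8765. The exact plugin, deck name, and field name in my setup may differ; treat the ones below as placeholders.

```python
import requests  # pip install requests

ANKI_URL = "http://127.0.0.1:8765"  # default AnkiConnect endpoint

def anki(action, **params):
    """Call AnkiConnect and fail with a readable message if Anki is not running."""
    try:
        resp = requests.post(ANKI_URL, timeout=5,
                             json={"action": action, "version": 6, "params": params})
    except requests.ConnectionError:
        raise SystemExit("Anki is not running or the AnkiConnect plugin is missing.")
    data = resp.json()
    if data.get("error"):
        raise RuntimeError(data["error"])
    return data["result"]

def known_words(deck="Spanish", field="Front"):
    """Collect the front-field values of all notes in the given deck."""
    note_ids = anki("findNotes", query=f'deck:"{deck}"')
    notes = anki("notesInfo", notes=note_ids)
    return {n["fields"][field]["value"].strip().lower() for n in notes}
```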
A quick note on my Anki card format: Front — the lemma (or a short phrase), sometimes with an article. Back — a compact block: 3 translations, a short definition in your native language, a “do not confuse with…” line with 1–2 close Spanish words, one example, and 3+3 synonyms (Spanish + English). This format keeps both memory and attention disciplined.
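For the packaging side, here is a sketch using genanki, one common Python library for producing .apkg files. The model and deck IDs and the field names below are arbitrary placeholders; the repository may assemble the package differently.

```python
import genanki  # pip install genanki

# IDs only need to be stable random integers; these are placeholders.
MODEL = genanki.Model(
    1607392319, "Spanish ticket word",
    fields=[{"name": "Front"}, {"name": "BackText"}],
    templates=[{
        "name": "Card 1",
        "qfmt": "{{Front}}",
        "afmt": "{{FrontSide}}<hr id=answer>{{BackText}}",
    }],
)

def build_deck(cards, path="spanish_tickets.apkg"):
    """cards: iterable of (front, back_html) pairs -> importable .apkg file."""
    deck = genanki.Deck(2059400110, "Spanish driving theory")
    for front, back_html in cards:
        deck.add_note(genanki.Note(model=MODEL, fields=[front, back_html]))
    genanki.Package(deck).write_to_file(path)

build_deck([("la glorieta", "<b>roundabout</b> …")])
```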
AI in the pipeline: AI here doesn’t “do everything,” it does one useful job: writes tidy HTML for the back of the card. Inside — three translations, a short definition, a “do not confuse with…” line, an example in Spanish, and lists of synonyms. The format is consistent, which makes it easier on the brain.
To avoid the AI becoming a lottery, I put guardrails around it: a clear template, caching, retries with pauses under load, and understandable error messages. If the budget is exhausted, the tool doesn’t pretend to work — it asks you to top up and continues with cache. AI becomes a quiet helper, not a magic wand.
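For illustration, here is what those guardrails can look like in code: a fixed prompt template, a small on-disk cache keyed by the word, and retries with growing pauses. The model name, prompt wording, and cache path are placeholders, not necessarily what the repository uses.

```python
import json, time, hashlib
from pathlib import Path
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment
CACHE = Path("cache/openai")
CACHE.mkdir(parents=True, exist_ok=True)

PROMPT = ("Write the back of an Anki card for the Spanish word '{word}' as HTML: "
          "3 translations, a one-line definition, a 'do not confuse with' line, "
          "one example sentence, and 3 Spanish plus 3 English synonyms.")

def back_html(word: str, retries: int = 3) -> str:
    """Return cached HTML if this word was already generated; otherwise call the
    API with simple backoff so a flaky connection doesn't kill the whole run."""
    key = CACHE / (hashlib.sha1(word.encode()).hexdigest() + ".json")
    if key.exists():
        return json.loads(key.read_text())["html"]
    for attempt in range(retries):
        try:
            resp = client.chat.completions.create(
                model="gpt-4o-mini",  # placeholder model name
                messages=[{"role": "user", "content": PROMPT.format(word=word)}],
            )
            html = resp.choices[0].message.content
            key.write_text(json.dumps({"html": html}))
            return html
        except Exception:
            time.sleep(2 ** attempt)  # 1s, 2s, 4s pauses between attempts
    raise RuntimeError(f"Could not generate a card for '{word}'")
```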
Result: 1,500 Lemmas → 90–95% Understanding
When the frequency list was ready, I took the top ~1,500 lemmas and generated an .apkg package for Anki, knowing there were no duplicates. Then I studied daily, and now, after learning those 1,500 words, I can say I understand 90–95% of the wording in the tickets. This is not a trick or a hack — just a rational way to choose what to learn.
Conclusion
I built a reproducible pipeline: assembled my own corpus, ran frequency analysis, connected it to Anki, and added AI to generate the back side of the cards.
If you want to repeat this path for your domain, start small: pull the code from my repository and adapt it to your needs; assemble your corpus, compute frequencies, reconcile “known” words, and build an Anki deck.
If you’d like help adapting the pipeline or want to improve the tool, write to me; I’ll be glad to collaborate.