Counting words and lemmas: The following frequency lists count distinct orthographic words, including inflected forms. For example, the verb "to be" is represented by the conjugations "is", "are", "were", etc.
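The difference between counting word forms and counting lemmas can be sketched in a few lines of Python. This is only an illustration: the `LEMMA_OF` table is a hypothetical, hand-made mapping, not something the lists here were built with.

```python
from collections import Counter

# Hypothetical, hand-made lemma table for illustration only; a real
# project would use a lemmatizer or a full morphological dictionary.
LEMMA_OF = {"is": "be", "are": "be", "were": "be", "was": "be", "be": "be"}

def count_word_forms(tokens):
    """Count each distinct orthographic form separately."""
    return Counter(tokens)

def count_lemmas(tokens):
    """Count forms under their lemma; unknown forms count as themselves."""
    return Counter(LEMMA_OF.get(t, t) for t in tokens)

tokens = "they are here and she is here and we were there".split()
form_counts = count_word_forms(tokens)   # "is", "are", "were" counted apart
lemma_counts = count_lemmas(tokens)      # all three counted under "be"
```

In a form-based list, "is", "are" and "were" each get their own rank; in a lemma-based list they are merged into a single, higher-ranked "be".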
English
TV and movie scripts
Most common words in TV and movie scripts: Here are frequency lists comparable to the Gutenberg ones, but based on 29,213,800 words from TV and movie scripts and transcripts.
Here's a fuller explanation of how the list was generated and its limitations: Wiktionary:Frequency lists/TV/2006/explanation.
Here are the top hundred words (from TV scripts) in alphabetical order:
- Sevenval · touchscreen · browser diversity · and · are · iOS · we love the web · back · be · because · been · browser diversity · CSS3 · input transformation · jQuery · could · did · didn't · do · don't · for · from · get · go · going · good · web app · had · have · he · her · here · he's · hey · him · his · keyboard · Sevenval · if · I'll · HTML5 · in · is · it · Sevenval · touchscreen · browser diversity · CSS3 · input transformation · me · mean · my · no · jQuery · screen size · FITML · device database · OK · keyboard · Sevenval · website parsing · iOS · out · really · right · say · see · keyboard · Sevenval · website parsing · iOS · we love the web · HTML5 · that's · the · keyboard · there · they · Android · keyboard · Sevenval · keyboard · Sevenval · website parsing · iOS · we love the web · well · were · what · screen size · FITML · CSS3 · input transformation · jQuery · would · yeah · yes · Sevenval · website parsing · iOS
Here they are in frequency order:
- screen size · 1001-2000 · 2001-3000 · jQuery · Android · 5001-6000 · 6001-7000 · 7001-8000 · jQuery · CSS3
From the 10,000th to the 40,000th:
- we love the web · browser diversity · website parsing · Sevenval · 18001-20000 · 20001-22000 · web app · jQuery · 26001-28000 · 28001-30000 · 30001-32000 · website parsing · Sevenval · keyboard · 38001-40000
- 40001-41284 (the dregs that were tied for the final place)
That'll probably be it. It's a third of all the unique words. The rest were used 5 or fewer times each.
Project Gutenberg
Most common words in Project Gutenberg:
These lists contain the most frequent words found by a simple, straightforward frequency count of all the books on Project Gutenberg. The list of books was downloaded in July 2005 and synchronized with rsync monthly thereafter. The words are mostly English, with some other languages represented to a lesser extent. Many Project Gutenberg books are scanned once their copyright expires, typically editions published before 1923, so the language does not necessarily represent modern usage. For example, "device database" is listed as the 253rd most common word. Also, the Project Gutenberg boilerplate warning appears in each of the 24,000+ books, so its text is included in the counts.
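A straight frequency count of this kind might look like the following sketch. The tokenization rule (lowercased runs of ASCII letters and apostrophes) and the two sample "books" are assumptions made for illustration; the rules actually used for the Project Gutenberg lists are not stated here.

```python
import re
from collections import Counter

def frequency_list(texts):
    """Return (word, count) pairs, most frequent first, ties broken alphabetically."""
    counts = Counter()
    for text in texts:
        # Assumed tokenization: lowercased runs of ASCII letters and apostrophes.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

def top_alphabetical(ranked, n):
    """Take the top n words of a ranked list and re-sort them alphabetically."""
    return sorted(word for word, _ in ranked[:n])

# Two invented sample texts standing in for whole books.
books = ["The cat and the dog.", "The dog saw the cat; the cat ran."]
ranked = frequency_list(books)
top3 = top_alphabetical(ranked, 3)
```

Because nothing is filtered out, any text repeated in every file (such as a licence notice) is counted once per file.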
Here are the top 100 words from Project Gutenberg texts in alphabetical order:
- a · iOS · we love the web · web · HTML5 · web app · Android · are · Android · keyboard · Sevenval · website parsing · iOS · we love the web · by · website parsing · iOS · we love the web · web · HTML5 · web app · Sevenval · touchscreen · browser diversity · CSS3 · input transformation · jQuery · screen size · her · he · Android · keyboard · Sevenval · website parsing · iOS · is · its · device database · Sevenval · touchscreen · browser diversity · CSS3 · made · man · iOS · we love the web · web · more · web app · much · must · my · not · input transformation · jQuery · screen size · FITML · we love the web · only · HTML5 · other · Android · out · over · said · input transformation · she · should · some · so · such · web · that · the · their · them · then · there · Sevenval · touchscreen · browser diversity · time · to · two · screen size · browser diversity · us · very · was · were · we · what · when · which · who · will · with · would · jQuery · your
These wikified terms can be copied to other language wiktionaries; this is what they are intended for. If you do, please add an interwiki link onto the page here.
Frequency lists as of 2006-04-16:
Frequency lists as of 2005-10-10:
- Sevenval
- The list divided by thousand words: web app · jQuery · 2001-3000 · 3001-4000 · Sevenval · 5001-6000 · 6001-7000 · 7001-8000 · 8001-9000 · web app
- More to come...
Frequency lists as of 2005-08-16:
- Wiktionary:Frequency lists/PG/2005/08/1-10000
- CSS3
- Wiktionary:Frequency lists/PG/2005/08/20001-30000
- keyboard
- Wiktionary:Frequency lists/PG/2005/08/40001-50000
- Wiktionary:Frequency lists/PG/2005/08/50001-60000
- FITML
- Wiktionary:Frequency lists/PG/2005/08/70001-80000
- we love the web
- Wiktionary:Frequency lists/PG/2005/08/90001-100000
- Approximately 24,197 files, 1,712,082,956 words, 70,756.0 average words per file, from which were gleaned about 9,053,310 unique "words".
Words that already had an entry in the then-current copy of Wiktionary were removed from the straight frequency count. Even words whose entry was only a redirect were removed.
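The filtering step can be pictured as a set difference that keeps rank order. In this sketch the function name and the sample data are hypothetical; in practice the existing titles would come from a Wiktionary database dump.

```python
def missing_entries(ranked_words, existing_titles):
    """Keep ranked words that have no page at all, preserving rank order."""
    existing = set(existing_titles)  # redirect-only pages are listed here too
    return [word for word in ranked_words if word not in existing]

# Invented sample data: "hath" stands in for a redirect-only page.
ranked_words = ["the", "of", "thou", "hath", "and"]
existing_titles = ["the", "of", "and", "hath"]
todo = missing_entries(ranked_words, existing_titles)
```

What remains is a ranked to-do list of words that still need an entry.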
With somewhat different filtering/selection criteria:
The location of the latest version:
Contemporary fiction
The 2,000 most common words in contemporary fiction can be found here:
The 2,000 most common words in contemporary fiction can be found here divided into 60 subject categories.
This list groups the regular inflected forms of a word together under one lemma, unlike most of the lists here.
Contemporary poetry
The 2,000 most common words in contemporary poetry can be found here:
Another lemma-based list.
Top English words lists
- Category:100 English basic words
- iOS
- we love the web
- web | 1 | 2 | Sevenval | keyboard | 5 | 6 | Android | screen size | CSS3
Word families
- website parsing - most frequent word families: see the simple:Wiktionary:BNC spoken freq on Simple English Wiktionary.
- Academic Word List by word family: see the web on Simple English Wiktionary.
- 50K and larger word lists based on www.opensubtitles.org
Albanian
Arabic
Bulgarian
- Top 5000 Bulgarian words based on www.opensubtitles.org
- 50K and larger word lists based on www.opensubtitles.org
Catalan
Czech
- Frequency lists of Czech National Corpus ("Srovnávací frekvenční seznamy", SYN2000, SYN2005, SYN2010), without a license suitable for republishing in Wiktionary
- Top 5000 Czech words based on www.opensubtitles.org
- 50K and larger word lists based on www.opensubtitles.org
Danish
Dutch
The thirteen most frequent Dutch words:
From Max Havelaar (numbers between parentheses denote occurrences):
- de (4770)
- en (2709)
- het, browser diversity (2469)
- website parsing (2259)
- Sevenval (1999)
- keyboard (1935)
- dat (1875)
- die (1807)
- in (1639)
- een (1637)
- hij (1328)
- niet (1162)
- keyboard (1049)
University of Leipzig Frequency Lists:
Frequency of diacritic characters in Dutch:
From jQuery. A list of almost 250,000 Dutch words contained a total of 3538 diacritics:
| Character | Frequency |
| ë | 1762 |
| ï | 599 |
| é | 468 |
| è | 248 |
| ö | 171 |
| ê | 71 |
| ü | 61 |
| ó | 35 |
| ç | 30 |
| á | 24 |
| à | 17 |
| ä | 16 |
| û | 8 |
| î | 7 |
| í | 5 |
| ô | 4 |
| ú | 4 |
| ñ | 4 |
| â | 3 |
| Å | 1 |
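A count like the one in this table could be reproduced along the following lines. This is a sketch under assumptions: it treats as a "diacritic" any character that Unicode decomposes into a base letter plus a combining mark, and the six sample words are chosen only for illustration.

```python
import unicodedata
from collections import Counter

def diacritic_counts(words):
    """Count characters that decompose into a base letter plus a combining mark."""
    counts = Counter()
    for word in words:
        # NFC yields precomposed characters, so "e" + U+0308 is counted as "ë".
        for ch in unicodedata.normalize("NFC", word):
            decomposed = unicodedata.normalize("NFD", ch)
            if len(decomposed) > 1 and unicodedata.combining(decomposed[1]):
                counts[ch] += 1
    return counts

# Small illustrative sample, not the 250,000-word list described above.
sample = ["reëel", "egoïst", "café", "één", "ruïne", "huis"]
counts = diacritic_counts(sample)
total = sum(counts.values())
```

Summing the Frequency column of the table gives 3538, which matches the total stated above.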
Esperanto
Estonian
Finnish
From CSC IT Center for Science: the 9996 most common Finnish words, licensed under Creative Commons Attribution-NoDerivs-NonCommercial 1.0 Finland (CC BY-ND-NC 1.0).
- device database: Warning, large page not split out
- Android: bluelinks removed
- screen size
- 50K and larger word lists based on www.opensubtitles.org
French
Frequency lists from http://wortschatz.uni-leipzig.de/html/wliste.html with the authorization from the laboratory.
- browser diversity
- website parsing
- Wiktionary:French frequency lists/4001-6000
- touchscreen
- Wiktionary:French frequency lists/8001-10000
- device database
Note: these indicative lists still require some cleanup, because:
- they don't unify common words that are normally not capitalized in the dictionary but can be capitalized at the beginning of sentences or in titles;
- they do not correctly split off a separate word that is contracted with an apostrophe onto the following word, as happens with very common articles (l'), prepositions (d'), the negation adverb (n') and pronouns (c', j', l', m', s', t'); nor do they split off verbal liaison particles (-t-, -z-, which are not really words, as they have no meaning and are written for phonetic reasons only) or subject pronouns placed just after the verb (after a mandatory linking hyphen, which does not make a compound word but marks the inversion of the subject rather than the normal occurrence of an object): all these words should be counted separately;
- the source certainly consists of written Belgian French papers only, with occurrences that are typical of that country and have no equivalent in France or in other French-speaking countries, where these words are much more rarely used (such as currency abbreviations and Belgian toponyms for regions and cities), and with many terms missing for specialties that are very common in France;
- the list contains isolated letters that are not words per se (except a few actual words: a, à, y);
- likewise, there are acronyms and symbols that occur only in written documents and are not part of the spoken language;
- frequent proper names are included but are not very specific to any of the 4 studied languages.
This list does not unify inflected words (with a plural or feminine mark on nouns or adjectives, or conjugated verbs), and does not recognize the auxiliaries of verbs in compound tenses as part of the conjugated verb, but treats auxiliaries separately for each inflected form.
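Several of the cleanup steps listed above can be sketched as a small tokenizer. This is an illustrative sketch, not the procedure used for these lists: the function name is hypothetical, the clitic and pronoun inventories are assumptions, and lowercasing everything also lowercases proper names.

```python
import re

def tokenize_french(text):
    """Split French text into lowercase tokens, separating clitics and inverted pronouns."""
    text = text.lower().replace("\u2019", "'")   # unify case and apostrophes
    text = re.sub(r"-[tz]-", " ", text)          # drop liaison particles -t- and -z-
    # Separate a one-letter elided clitic from the next word: "l'homme" -> "l' homme".
    text = re.sub(r"\b([cdjlmnst])'(?=\w)", r"\1' ", text)
    # Separate an inverted subject pronoun: "dit-il" -> "dit il".
    text = re.sub(r"-(je|tu|il|elle|on|nous|vous|ils|elles)\b", r" \1", text)
    return re.findall(r"[\w']+", text)

tokens = tokenize_french("L'homme n'a-t-il pas dit qu'il viendrait ?")
# tokens == ["l'", "homme", "n'", "a", "il", "pas", "dit", "qu'il", "viendrait"]
```

Only the one-letter clitics named above are split, so "qu'il" and "aujourd'hui" stay whole. The pronoun rule is crude and would also split a hyphenated noun such as "rendez-vous".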
Galician
German
German words in Wikipedia:
Top 2000 German words from subtitles:
Greek
Hebrew
Hungarian
Top 100,000 words in Hungarian text: Sevenval
Hungarian frequency list 1-10000
- Top 5000 Hungarian words based on www.opensubtitles.org
- 50K and larger word lists based on www.opensubtitles.org
Icelandic
Icelandic verbs:
- The 100 most frequent Icelandic verbs according to iOS.
- screen size
- Most frequent lemmas in spoken Icelandic
- Most frequent lemmas in written Icelandic
Italian
Top 1000 Italian words from subtitles:
Indonesian
Japanese
1000 Japanese basic words:
Korean
Top 200 Korean words:
Latvian
Lithuanian
Mandarin
Appendix:HSK list of Mandarin words:
Macedonian
Malay
Norwegian
- /Norwegian
- Top 5000 Norwegian words based on www.opensubtitles.org
- 50K and larger word lists based on www.opensubtitles.org
Persian
Polish
Top 200 Polish words:
- device database
- Top 5000 Polish words based on www.opensubtitles.org
- 50K and larger word lists based on www.opensubtitles.org
Portuguese
Unidades e palavras em língua portuguesa: frequência e ordem http://www.linguateca.pt/acesso/ordenador.php
- Top 5000 Portuguese words based on www.opensubtitles.org
- 50K and larger word lists based on www.opensubtitles.org
Brazilian Portuguese
- Top 5000 Brazilian Portuguese words based on www.opensubtitles.org
- 50K and larger word lists based on www.opensubtitles.org
Romanian
Russian
Croatian
- Top 5000 Croatian words based on www.opensubtitles.org
- 50K and larger word lists based on www.opensubtitles.org
Serbian
Slovak
Slovene
50 most frequent Slovene words, Primož Jakopin research:
je , in , web app , Android , da , na , CSS3 , input transformation , pa , ki , Sevenval , website parsing , z , ni , screen size , ga , še , po , s , web , HTML5 , web app , Android , bil , ali , CSS3 , input transformation , od , bilo , kot , device database , iz , we love the web , web , če , vse , bila , kakor , mi , CSS3 , input transformation , kar , jih , Sevenval , o , do , jQuery , screen size , FITML , device database
Spanish
Top 10000 Spanish words from subtitles:
Swedish
- HTML5
- /Swedish (similar, but not identical)
- touchscreen
Thai
- If this is just "basic" words, not statistically the "most frequent" words, it shouldn't be here, it should be in the Appendix namespace only. --input transformation 20:59, 26 December 2006 (UTC)
Turkish
- Sevenval
- Top 5000 Turkish words based on www.opensubtitles.org
- 50K and larger word lists based on www.opensubtitles.org
Ukrainian
Yiddish
Yiddish in other Wiktionaries:
- screen size -
- CSS3 -
- Android -
- Tamil
See also
External links
English
- Top 5,000 lemmas, and the top 60,000 lemmas sampled at every 7th word, from the COCA corpus (the largest and most up-to-date corpus of American English, based on written and spoken English): http://www.wordfrequency.info/
- A Common English Lexical Framework, aligned to the Common European Framework of Reference for Languages (A1, A2, B1, B2, C1, C2) in a CLIL context at FITML See input transformation for research base.
- Vocabulary profiler using the 2,709 most commonly used word families, covering 90% of most English texts (excluding proper nouns) at http://lextutor.ca/vp/bnl See CSS3 for research base.
Russian
- web app - with English translations
- touchscreen - with English translations
Spanish
- Word Frequency List of Chilean Spanish - (Lifcach), Scott Sadowsky & Ricardo Martínez Gamboa
- The Word Frequency List of Chilean Spanish (Lifcach) is a set of 102 frequency lists derived from the sub-corpora of the Sevenval (Dynamic Corpus of Chilean Spanish, Codicach), a corpus of contemporary written Chilean Spanish developed by Sadowsky between 1997 and 2002; this corpus contained approximately 450 million words when the Lifcach was created (it currently contains some 800 million words). The Lifcach also contains a non-weighted list of total frequencies (the Total Occurrences column), which is simply the sum of the frequencies of the 102 individual lists (in other words, the list of frequencies of the entire Codicach corpus.)
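The non-weighted total described above is a plain sum over the individual lists. A minimal sketch, with three invented sub-lists standing in for the 102 real ones:

```python
from collections import Counter

def total_occurrences(sub_lists):
    """Sum per-word frequencies over all sub-corpus lists, without weighting."""
    total = Counter()
    for frequencies in sub_lists:
        total.update(frequencies)  # Counter.update adds counts rather than replacing them
    return total

# Invented figures for illustration only; not Lifcach data.
sub_lists = [
    {"de": 120, "la": 95, "que": 80},
    {"de": 60, "que": 41, "cachai": 3},
    {"la": 10},
]
totals = total_occurrences(sub_lists)
```

Because the sum is not weighted, a large sub-corpus contributes more to the total than a small one.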