This is a list of freely available linguistic databases: lexica, corpora, treebanks, and other resources. CLT members can edit entries at will: feel free to add or update links, remove dead or outdated ones, or create categories to make the list easier to read.
Don't forget to look at the CLT toolkit first!
List of online resources, in no particular order:
-
Språkbanken is of course the first place to look: http://spraakbanken.gu.se. One example is the SALDO Swedish lexicon: http://spraakbanken.gu.se/eng/saldo. Many other corpora and lexica are available on request.
-
The Wikipedia "Treebank" entry has a long list of available treebanks: http://en.wikipedia.org/wiki/Treebank
-
NLTK contains linguistic resources of all kinds: http://www.nltk.org/data
-
The Linguistic Data Consortium has lots of resources, although most of them are not free: http://www.ldc.upenn.edu
-
LT World also links to lots of resources: http://www.lt-world.org/kb/resources-and-tools/language-data
-
Dumps of all Wikimedia projects (including Wikipedia and Wiktionary), with or without editing history: http://dumps.wikimedia.org. For a specific Wikipedia language edition with code nn (e.g., sv for Swedish), you can go directly to http://dumps.wikimedia.org/nnwiki
-
Corpora for training semantic analysis systems: http://www.senseval.org
-
Corpora for training information retrieval systems: http://trec.nist.gov
-
The CoNLL shared tasks: http://ifarm.nl/signll/conll
-
The Bible in 56 languages: http://homepages.inf.ed.ac.uk/s0787820/bible
-
The Quran Corpus: http://corpus.quran.com/download
-
The Oxford Text Archive: http://ota.ahds.ac.uk
-
Project Gutenberg: http://www.gutenberg.org
-
Projekt Runeberg (Swedish literature): http://runeberg.org
-
A list of lots of Polish NLP resources: http://clip.ipipan.waw.pl/LRT
-
American National Corpus, open section: http://www.americannationalcorpus.org/OANC
-
JRC Acquis, Multilingual Parallel Corpus: http://optima.jrc.it/Acquis
-
The TIGER Treebank of German newspaper text: http://www.ims.uni-stuttgart.de/projekte/TIGER/TIGERCorpus. To download, go to the licence page.
-
Persian corpora: http://ece.ut.ac.ir/dbrg/Hamshahri, http://ece.ut.ac.ir/dbrg/Bijankhan
-
LTAG Spinal Treebank: http://www.cis.upenn.edu/~xtag/spinal
-
Turin University Treebank: http://www.di.unito.it/~tutreeb
-
Copenhagen Dependency Treebank: http://code.google.com/p/copenhagen-dependency-treebank
-
OPUS, a collection of open parallel corpora: http://opus.lingfil.uu.se
-
OpenSubtitles, a multilingual corpus of movie subtitles: http://opus.lingfil.uu.se/OpenSubtitles.php
-
Talbanken 05, a Swedish treebank: http://stp.lingfil.uu.se/~nivre/research/Talbanken05.html
-
Turku Dependency Treebank: http://bionlp.utu.fi/fintreebank.html
-
TreebankWiki: http://treebankwiki.org
-
Synlex, folkets synonymlexikon (Swedish synonym lexicon): http://folkets2.nada.kth.se/synlex.html
-
Folkets lexikon (Swedish-English lexicon): http://folkets-lexikon.csc.kth.se/folkets/om.html
-
DBPedia, an ontology and knowledge base extracted from Wikipedia: http://dbpedia.org
-
A list of existing HPSG grammars: http://moin.delph-in.net/GrammarCatalogue
-
The LinGO Redwoods HPSG treebank: http://moin.delph-in.net/RedwoodsTop
-
WikiWoods, a treebank of English Wikipedia: http://moin.delph-in.net/WikiWoods
-
WeScience corpus and treebank: http://moin.delph-in.net/WeScience
-
VoxForge, a free speech corpus and acoustic model repository: http://www.voxforge.org
-
The MapTask corpus: http://groups.inf.ed.ac.uk/maptask
-
TalkBank corpus: http://talkbank.org
-
CHILDES corpus: http://childes.psy.cmu.edu
-
AMI and AMIDA Meeting corpora: http://corpus.amiproject.org, http://corpus.amidaproject.org/
-
The "20 newsgroups" corpus: http://people.csail.mit.edu/jrennie/20Newsgroups
-
Three large context-free grammars, which can be used for parser comparison: http://www.informatics.sussex.ac.uk/research/groups/nlp/carroll/cfg-resources
-
Susanne, a small but comprehensive open-source treebank for English: http://www.grsampson.net/Resources.html
-
MOBY, mostly English lexica, and the complete works of Shakespeare: http://icon.shef.ac.uk/Moby
-
MULTEXT-East, Multilingual Text Tools and Corpora for Central and Eastern European Languages: http://nl.ijs.si/ME
-
The FraCaS textual inference test suite: http://www-nlp.stanford.edu/~wcmac/downloads
-
The GF FraCaS Treebank: http://projects.grammaticalframework.org/fracas
-
The Geoquery corpus: http://www.cs.utexas.edu/~ml/geo.html
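
The Wikimedia dumps entry in the list gives a URL pattern for reaching one Wikipedia language edition directly. A minimal Python sketch of that pattern follows; the function name `wikipedia_dump_url` is hypothetical and only illustrates the string substitution described in the entry. It does not check that the language code exists or that the page is reachable.

```python
def wikipedia_dump_url(lang_code: str) -> str:
    """Build the dump directory URL for one Wikipedia language edition.

    Follows the pattern from the list entry: replace "nn" in
    http://dumps.wikimedia.org/nnwiki with the language code.
    (Hypothetical helper, not part of any library.)
    """
    code = lang_code.strip().lower()
    if not code:
        raise ValueError("language code must be non-empty")
    return f"http://dumps.wikimedia.org/{code}wiki"


# Swedish Wikipedia, as in the list entry
print(wikipedia_dump_url("sv"))  # http://dumps.wikimedia.org/svwiki
```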