Here you will find a collection of tools and data that may be applied to language research. Data resources, corpora, and lexical databases are also organized by language.
Software & Tools
ASV Online Toolbox – A collection of tools that can be used to explore written language data. Part of Wortschatz.
Coh-Metrix – A tool for evaluating the difficulty of written text.
DMDX – Experiment presentation software for PC.
PsychoPy – Open-source experiment presentation software written in Python. Includes a visual Builder interface and a Coder interface.
PsyScope X – Experiment presentation software for Mac.
Phonotactic Probability Calculator (University of Kansas).
Worldlikeness: A Web-based Tool for Typological Psycholinguistics – Online wordlikeness judgments for experimenters, participants, and researchers.
Resources by Language
Chinese Character Writing – latency, accuracy, duration, & lexical characteristics for 1600 written characters
Chinese Lexicon Project – 2500 Chinese characters
The Chinese Lexicon Project – 25000 traditional Chinese two-character compounds
SUBTLEX-CH – Subtitle frequencies – Chinese, includes part of speech information
Baldey: a database of auditory lexical decision – 5541 Dutch words, includes auditory stimuli and associated Praat textgrids.
SUBTLEX-NL – Subtitle frequencies – Dutch, includes part of speech information
ARC Nonword Database – Generate nonwords using specified lexical characteristics.
Children’s Printed Word Database (University of Essex)
Concreteness Ratings in English – 40 thousand English lemmas.
corpus.byu.edu – Series of corpora with site-based search functions, including: Corpus of Contemporary American English, NOW Corpus, BNC, Strathy Corpus (Canada).
English Lexicon Project – Database of lexical decision and naming reaction times and error rates. The visual lexical decision experiment uses 40481 words and an equal number of nonwords.
Form Priming Project: A behavioral database for masked form priming.
Irvine Phonotactic Online Dictionary (IPhOD) – Words & pseudowords, organized by phonemes and searchable by criteria such as number of phonological neighbours.
LaDEC: Large database of English compounds – Database containing over 8000 English compounds along with psycholinguistic variables including family size, bigram frequency, sentiment (valence), and word frequency.
MRC Psycholinguistic Database – Generate stimuli lists based on lexical characteristics.
The past tense inflection project (PTIP): speeded past tense inflections, imageability ratings, and past tense consistency measures for 2,200 verbs
Phonetic biases in voice key response time measurements – Voice response time for English monosyllables
Semantic Priming Project (Montana State University) – Naming & lexical decision priming
SUBTLEX-UK – Subtitle frequencies – British English, includes part of speech information
SUBTLEX-US – Subtitle frequencies – American English
WordNet – Lexical database organized by word senses, synsets, and word relations.
Lexique – French corpus, including word frequencies, phonological and grammatical information, and more. Version 3.
MEGALEX – French database of visual and auditory lexical decision responses
The Developmental Lexicon Project – Visual word recognition in German for 1152 words across 800 children.
SUBTLEX-DE – Subtitle frequencies – German
SUBTLEX-GR – Subtitle frequencies – Greek
The Malay Lexicon Project – lexical and behavioural information for 9,592 Malay words
SUBTLEX-PL – Subtitle frequencies – Polish, includes part of speech information
SUBTLEX-PT – Subtitle frequencies – Portuguese
SUBTLEX-PT-BR – Subtitle frequencies – Brazilian Portuguese
SUBTLEX-ESP – Subtitle frequencies – Spanish
Celex Lexical Database (WebCelex) – Online version of Celex. English, Dutch, German.
CLEARPOND Database – Cross-Linguistic Easy-Access Resource for Phonological and Orthographic Neighborhood Densities. Dutch, English, French, German, Spanish.
Corpora Collection, Leipzig University – searchable in 252 languages. Part of Wortschatz.
- Dutch Lexicon Projects 1 & 2
- British Lexicon Project
- Dutch Crowdsourcing Project – word recognition times
- English Crowdsourcing Project
- Spanish Crowdsourcing Project
crr.ugent.be Megastudy listing – Comprehensive listing of word-related megastudies.
International Picture Naming Project – Studies, database of results, stimuli available for use.
The Language Goldmine – Collection of links to linguistic databases and datasets that are available online. Includes 235 data sources, searchable by keywords/tags.
University of Oxford Text Archive (OTA) – Corpora in Arabic, Chinese, English, French, German, Scots, Swedish, Welsh.
Wortschatz/Leipzig Corpora Collection – A project maintained by the Natural Language Processing Group at the Institute of Computer Science at Leipzig University. Includes searchable corpora for over 250 languages and tools for corpus linguistic analysis.