Here you will find a collection of tools and data that may be applied to language research. Data resources, corpora, and lexical databases are also organized by language.
Software & Tools
Coh-Metrix – A tool for evaluating the difficulty of written text.
DMDX – Experiment presentation software for PC.
PsychoPy3 – Open-source experiment presentation software written in Python. Includes a visual Builder interface and a Coder interface.
PsyScope X – Experiment presentation software for Mac.
Phonotactic Probability Calculator (University of Kansas).
Testable – Online experiment platform (requires account)
Worldlikeness: A Web-based Tool for Typological Psycholinguistics – Online wordlikeness judgments for experimenters, participants, and researchers.
Resources by Language
Chinese
Chinese Character Writing – latency, accuracy, duration, & lexical characteristics for 1600 written characters
Chinese Lexicon Project – 2500 Chinese characters
The Chinese Lexicon Project – 25000 traditional Chinese two-character compounds
Chinese Single-Character Word Database – crr.ugent.be also hosts naming latencies for this database; see Liu et al. (2007).
MELD-SCH (MEgastudy of Lexical Decision in Simplified CHinese)
Traditional Chinese Psycholinguistic Database
SUBTLEX-CH – Subtitle frequencies – Chinese, includes part of speech information
Dutch
Age of Acquisition Norms & Concreteness Norms for 30,000 Dutch Words
Baldey: a database of auditory lexical decision – 5541 Dutch words, includes auditory stimuli and associated Praat textgrids.
Speeded word fragment completion megastudy
SUBTLEX-NL – Subtitle frequencies – Dutch, includes part of speech information
English
Affective ratings for nearly 14 thousand English words
ARC Nonword Database – Generate nonwords using specified lexical characteristics.
Age of Acquisition Norms (English)
The Auditory English Lexicon Project: A multi-talker, multi-region psycholinguistic database of 10,170 spoken words and nonwords article
The Calgary semantic decision project: concrete/abstract decision data for 10,000 English words
Children’s Printed Word Database (University of Essex)
CompLex: An eye-movement database of compound word reading in English
Concreteness Ratings in English – 40 thousand English lemmas.
English-Corpora.org (formerly corpus.byu.edu) – Series of corpora with site-based search functions, including: Corpus of Contemporary American English, NOW Corpus, BNC, Strathy Corpus (Canada).
Disyllabic nonword reading database
English Lexicon Project – Database of lexical decision and naming reaction times and error rates. The visual lexical decision experiment uses 40481 words and an equal number of nonwords.
Form Priming Project: A behavioral database for masked form priming.
Irvine Phonotactic Online Dictionary (IPhOD) – Words & pseudowords, searchable by phonemes, number of phonological neighbours, and other criteria.
The kilo-word ERP database (lexical decision) article data
LaDEC: Large database of English compounds – Database containing over 8000 English compounds along with psycholinguistic variables including family size, bigram frequency, sentiment (valence), and word frequency. An open-access paper describes the database.
A behavioral database for masked form priming article data
The Massive Auditory Lexical Decision (MALD) database article data
MRC Psycholinguistic Database – Generate stimuli lists based on lexical characteristics.
The Natural Stories Corpus article data
The Provo Corpus: A large eye-tracking corpus with predictability norms article data
The past tense inflection project (PTIP): speeded past tense inflections, imageability ratings, and past tense consistency measures for 2,200 verbs
Phonetic biases in voice key response time measurements – Voice response time for English monosyllables
Reading aloud megastudy article data
Reading time data for evaluating broad-coverage models of English sentence processing
Response times in CVC naming study at WSU (Treiman et al., 1995)
Semantic Priming Project (Montana State University) – Naming & lexical decision priming
The Seidenberg and Waters (1989) Mega-study
SUBTLEX-UK – Subtitle frequencies – British English, includes part of speech information
SUBTLEX-US – Subtitle frequencies – American English
Sentence Completion Norms – online interface to data from "Completion norms for 3085 English sentence contexts" (Peelle, Miller, Rogers, Spehar, Sommers, & Van Engen). Article Database
University of South Florida Free Association Norms
WordNet – Lexical database organized by word senses, synsets, and word relations.
French
Lexique – French corpus, including word frequencies, phonological and grammatical information, and more. Version 3.
MEGALEX – French database of visual and auditory lexical decision responses
German
The Developmental Lexicon Project – Visual word recognition in German for 1152 words across 800 children.
dlexDB – German word frequencies. See also The Digital Dictionary of German (www.dwds.de).
SUBTLEX-DE – Subtitle frequencies – German
Greek
SUBTLEX-GR – Subtitle frequencies – Greek
Malay
The Malay Lexicon Project – lexical and behavioural information for 9,592 Malay words
The Malay Lexicon Project 2 (MLP2) – augmentation of the Malay Lexicon Project with morphological breakdowns & morphological properties. Article Data. Maziyah Mohamed, Yap, Chee & Jared (2022).
Polish
SUBTLEX-PL – Subtitle frequencies – Polish, includes part of speech information
Portuguese
SUBTLEX-PT – Subtitle frequencies – Portuguese
SUBTLEX-PT-BR – Subtitle frequencies – Brazilian Portuguese
Spanish
SUBTLEX-ESP – Subtitle frequencies – Spanish
Russian
Russian Sentence Corpus – Eye-movement data in Cyrillic script. Data on OSF.
Multiple languages
Celex Lexical Database (WebCelex) – Online version of Celex. English, Dutch, German.
Child Language Data Exchange System (CHILDES)
CLEARPOND Database – Cross-Linguistic Easy-Access Resource for Phonological and Orthographic Neighborhood Densities. Dutch, English, French, German, Spanish.
Corpora Collection, Leipzig University – searchable in 252 languages. Part of Wortschatz.
- Dutch Lexicon Projects 1 & 2
- British Lexicon Project
- Dutch Crowdsourcing Project – word recognition times
- English Crowdsourcing Project
- Spanish Crowdsourcing Project
crr.ugent.be Megastudy listing – Comprehensive listing of word-related megastudies.
GECO, the Ghent Eye-Tracking Corpus – eye-tracking measurements from bilingual and monolingual readers as they read a complete novel. Data available from crr.ugent.be.
International Picture Naming Project – Studies, database of results, stimuli available for use.
The Language Goldmine – Collection of links to linguistic databases and datasets that are available online. Includes 235 data sources, searchable by keywords/tags.
Norms of Valence, Arousal, Dominance, and Age of Acquisition in English and Dutch.
Small World of Words – Word Associations project
University of Oxford Text Archives (OTA) – Corpora in Arabic, Chinese, English, French, German, Scots, Swedish, Welsh.
Wortschatz/Leipzig Corpora Collection – A project maintained by the Natural Language Processing Group at the Institute of Computer Science at Leipzig University. Includes searchable corpora for over 250 languages and tools for corpus linguistic analysis.