Here you will find a collection of tools and data that may be applied to language research. Data resources, corpora, and lexical databases are also organized by language, here: Resources by Language.
Software & Tools
ASV Online Toolbox – A collection of tools that can be used to explore written language data. Part of Wortschatz.
Coh-Metrix – A tool for evaluating the difficulty of written text.
DMDX – Experiment presentation software for PC.
GloVe (Global Vectors for Word Representation) – unsupervised learning algorithm for obtaining vector representations for words.
LexOPS – R package & Shiny app for stimulus generation.
PsychoPy – Open-source experiment presentation software written in Python. Includes a visual Builder interface and a Coder interface.
PsyScope X – Experiment presentation software for Mac.
Phonotactic Probability Calculator (University of Kansas).
Worldlikeness: A Web-based Tool for Typological Psycholinguistics – Online wordlikeness judgments for experimenters, participants, and researchers.
Data Resources
Affective ratings for nearly 14 thousand English words
AffectVec – Database of 70,000 English words ranked by intensity of association with 200 emotions.
Age of Acquisition Norms & Concreteness norms for 30,000 Dutch Words
Age of Acquisition Norms (English)
ARC Nonword Database – Generate nonwords using specified lexical characteristics.
ASL-LEX – Database of more than 2000 ASL signs with lexical and phonological information for each. Web interface and download available.
ASL Signbank – Collection of ASL signs with ID glosses. Designed as an annotation tool for ASL videos.
The Auditory English Lexicon Project: A multi-talker, multi-region psycholinguistic database of 10,170 spoken words and nonwords – article
Baldey: a database of auditory lexical decision – 5541 Dutch words, includes auditory stimuli and associated Praat textgrids.
The Calgary semantic decision project: concrete/abstract decision data for 10,000 English words
Chinese Character Writing – latency, accuracy, duration, & lexical characteristics for 1600 written characters
Chinese Lexicon Project – 2500 Chinese characters
The Chinese Lexicon Project – 25000 traditional Chinese two-character compounds
Chinese Single-Character Word Database – crr.ugent.be also hosts naming latencies for this database, see Liu et al. (2007).
CLEARPOND Database – Cross-Linguistic Easy-Access Resource for Phonological and Orthographic Neighborhood Densities. Dutch, English, French, German, Spanish.
CompLex: An eye-movement database of compound word reading in English
Concreteness Ratings in English – 40 thousand English lemmas.
Concreteness ratings for 62 thousand English multiword expressions
Conversation Corpus (Reece et al., 2022) – 7+ million word conversation corpus (audio, video, transcripts) with extensive additional information (e.g., facial expressions, post-conversation surveys) included.
- Dutch Lexicon Projects 1 & 2
- British Lexicon Project
- Dutch Crowdsourcing Project – word recognition times
- English Crowdsourcing Project
- Spanish Crowdsourcing Project
crr.ugent.be Megastudy listing – Comprehensive listing of word-related megastudies.
The Developmental Lexicon Project – Visual word recognition in German for 1152 words across 800 children.
Disyllabic nonword reading database
Dutch Word Association Database
Emo-Lex/NRC Word-Emotion Association Lexicon – Database of English words rated for negative and positive sentiments and their associations with 8 emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust)
English-German Database of Idiom Norms (DIN) – 300 English idioms and descriptive norms for English and German (L1 & L2 users).
English Lexicon Project – Database of lexical decision and naming reaction times and error rates. The visual lexical decision experiment uses 40,481 words and an equal number of nonwords.
Form Priming Project: A behavioral database for masked form priming.
GECO, the Ghent Eye-Tracking Corpus – eye-movement measurements from bilingual and monolingual readers reading a complete novel. Data available from crr.ugent.be.
The Glasgow Norms: Ratings of 5,500 words on nine scales – ratings for 5,553 English words on arousal, valence, dominance, concreteness, imageability, familiarity, age of acquisition, semantic size, and gender association (link goes to paper).
Grievance dictionary: Language use in the context of grievance-fueled violence threat
International Picture Naming Project – Studies, database of results, stimuli available for use.
IRIS – Instruments and data for research in language studies
The kilo-word ERP database (lexical decision) – article, data
The Language Goldmine – Collection of links to linguistic databases and datasets that are available online. Includes 235 data sources, searchable by keywords/tags.
LaDEC: Large database of English compounds – Database containing over 8000 English compounds along with psycholinguistic variables including family size, bigram frequency, sentiment (valence), and word frequency. Read the open access paper here.
A behavioral database for masked form priming – article, data
Lancaster Sensorimotor Norms – article, data
L2: Norms for 2,668 English words for valence and arousal by non-native speakers
The Malay Lexicon Project – lexical and behavioural information for 9,592 Malay words
The Massive Auditory Lexical Decision (MALD) database – article, data
MEGALEX – French database of visual and auditory lexical decision responses
MELD-SCH (MEgastudy of Lexical Decision in Simplified CHinese)
MRC Psycholinguistic Database – Generate stimuli lists based on lexical characteristics.
The Natural Stories Corpus – article, data
Norms of Valence, Arousal, Dominance, and AoA in English and Dutch.
The Provo Corpus: A large eye-tracking corpus with predictability norms – article, data
The past tense inflection project (PTIP): speeded past tense inflections, imageability ratings, and past tense consistency measures for 2,200 verbs
Phonetic biases in voice key response time measurements – Voice response time for English monosyllables
Reading aloud megastudy – article, data
Reading time data for evaluating broad-coverage models of English sentence processing
Response times in CVC naming study at WSU (Treiman et al., 1995)
Russian Sentence Corpus – Eye-movement data in Cyrillic script. Data on OSF.
Semantic Priming Project (Montana State University) – Naming & lexical decision priming
The Seidenberg and Waters (1989) Mega-study
Speeded word fragment completion megastudy
Traditional Chinese Psycholinguistic Database
University of South Florida Free Association Norms
Wortschatz/Leipzig Corpora Collection – A project maintained by the Natural Language Processing Group at the Institute of Computer Science at Leipzig University. Includes searchable corpora for over 250 languages and tools for corpus linguistic analysis.
Corpora & Lexical Databases
Celex Lexical Database (WebCelex) – Online version of Celex. English, Dutch, German.
Child Language Data Exchange System (CHILDES)
Children’s Printed Word Database (University of Essex)
Corpora Collection, Leipzig University – searchable in 252 languages. Part of Wortschatz.
corpus.byu.edu – Series of corpora with site-based search functions, including: Corpus of Contemporary American English, NOW Corpus, BNC, Strathy Corpus (Canada).
dlexDB – German word frequencies. See also The Digital Dictionary of German (www.dwds.de).
Irvine Phonotactic Online Dictionary (IPhOD) – Words & pseudowords, searchable by phonemes, number of phonological neighbours, and other criteria.
Lexique – French corpus, including word frequencies, phonological and grammatical information, and more. Version 3.
LOCO: the 88-million word language of conspiracy corpus
MorphoLex-en – Lexical database for ~70k English words with morphological variables
Multi-LEX: A database of multi-word frequencies for French and English
SUBTITLE FREQUENCIES
SUBTLEX-CH – Chinese, includes part of speech information
SUBTLEX-DE – German
SUBTLEX-ESP – Spanish
SUBTLEX-GR – Greek
SUBTLEX-NL – Dutch, includes part of speech information
SUBTLEX-PL – Polish, includes part of speech information
SUBTLEX-PT – Portuguese
SUBTLEX-PT-BR – Brazilian Portuguese
SUBTLEX-UK – British English, includes part of speech information
SUBTLEX-US – American English
University of Oxford Text Archives (OTA) – Corpora in Arabic, Chinese, English, French, German, Scots, Swedish, Welsh.
WordNet – Lexical database organized by word senses, synsets, and word relations.