Resources – Words in the World

Here you will find a collection of tools and data that may be applied to language research. Data resources, corpora, and lexical databases are also organized by language.

Software & Tools

Coh-Metrix – A tool for evaluating the difficulty of written text.

DMDX – Experiment presentation software for PC.

PsychoPy3 – Open-source experiment presentation software written in Python. Includes a visual Builder interface and a Coder interface.

PsyScope X – Experiment presentation software for Mac.

Phonotactic Probability Calculator (University of Kansas).

Testable – Online experiment platform (requires account)

Worldlikeness: A Web-based Tool for Typological Psycholinguistics – Online wordlikeness judgments for experimenters, participants, and researchers.

Resources by Language

Chinese

Chinese Character Writing – latency, accuracy, duration, & lexical characteristics for 1600 written characters

Chinese Lexicon Project – 2500 Chinese characters

The Chinese Lexicon Project – 25000 traditional Chinese two-character compounds

Chinese Single-Character Word Database – crr.ugent.be also hosts naming latencies for this database; see Liu et al. (2007).

MELD-SCH (MEgastudy of Lexical Decision in Simplified CHinese)

Traditional Chinese Psycholinguistic Database

SUBTLEX-CH – Subtitle frequencies – Chinese, includes part of speech information

Dutch

Age of Acquisition Norms & Concreteness norms for 30,000 Dutch Words

Baldey: a database of auditory lexical decision – 5541 Dutch words, includes auditory stimuli and associated Praat textgrids.

Speeded word fragment completion megastudy

SUBTLEX-NL – Subtitle frequencies – Dutch, includes part of speech information

English

Affective ratings for nearly 14 thousand English words

ARC Nonword Database – Generate nonwords using specified lexical characteristics.

Age of Acquisition Norms (English)

The Auditory English Lexicon Project: A multi-talker, multi-region psycholinguistic database of 10,170 spoken words and nonwords – article

British National Corpus (BNC)

The Calgary semantic decision project: concrete/abstract decision data for 10,000 English words

Children’s Printed Word Database (University of Essex)

CompLex: An eye-movement database of compound word reading in English

Concreteness Ratings in English – 40 thousand English lemmas.

English-Corpora.org (formerly corpus.byu.edu) – Series of corpora with site-based search functions, including: Corpus of Contemporary American English, NOW Corpus, BNC, Strathy Corpus (Canada).

Disyllabic nonword reading database

English Lexicon Project – Database of lexical decision and naming reaction times and error rates. The visual lexical decision experiment uses 40481 words and an equal number of nonwords.

Form Priming Project: A behavioral database for masked form priming.

Irvine Phonotactic Online Dictionary (IPhOD) – Words & pseudowords, organized by phonemes (i.e., as search criteria, number of phonological neighbours, etc.).

The kilo-word ERP database (lexical decision) – article, data

LaDEC: Large database of English compounds – Database containing over 8000 English compounds along with psycholinguistic variables including family size, bigram frequency, sentiment (valence), and word frequency. An open access paper is available.

A behavioral database for masked form priming – article, data

The Massive Auditory Lexical Decision (MALD) database – article, data

MRC Psycholinguistic Database – Generate stimuli lists based on lexical characteristics.

The Natural Stories Corpus – article, data

The Provo Corpus: A large eye-tracking corpus with predictability norms – article, data

The past tense inflection project (PTIP): speeded past tense inflections, imageability ratings, and past tense consistency measures for 2,200 verbs

Phonetic biases in voice key response time measurements – Voice response time for English monosyllables

Reading aloud megastudy – article, data

Reading time data for evaluating broad-coverage models of English sentence processing

Response times in CVC naming study at WSU (Treiman et al., 1995)

Semantic Priming Project (Montana State University) – Naming & lexical decision priming

The Seidenberg and Waters (1989) Mega-study

SUBTLEX-UK – Subtitle frequencies – British English, includes part of speech information

SUBTLEX-US – Subtitle frequencies – American English

Sentence Completion Norms – online interface to data from "Completion norms for 3085 English sentence contexts" (Peelle, Miller, Rogers, Spehar, Sommers, & Van Engen). Article, database.

University of South Florida Free Association Norms

WordNet – Lexical database organized by word senses, synsets, and word relations.

French

French Lexicon Project

Lexique – French corpus, including word frequencies, phonological and grammatical information, and more. Version 3.

MEGALEX – French database of visual and auditory lexical decision responses

German

The Developmental Lexicon Project – Visual word recognition in German for 1152 words across 800 children.

dlexDB – German word frequencies. See also The Digital Dictionary of German (www.dwds.de).

SUBTLEX-DE – Subtitle frequencies – German

Greek

SUBTLEX-GR – Subtitle frequencies – Greek

Malay

The Malay Lexicon Project – lexical and behavioural information for 9,592 Malay words

The Malay Lexicon Project 2 (MLP2) – augmentation of the Malay Lexicon Project with morphological breakdowns & morphological properties. Article and data: Maziyah Mohamed, Yap, Chee & Jared (2022).

Polish

SUBTLEX-PL – Subtitle frequencies – Polish, includes part of speech information

Portuguese

SUBTLEX-PT – Subtitle frequencies – Portuguese

SUBTLEX-PT-BR – Subtitle frequencies – Brazilian Portuguese

Spanish

SUBTLEX-ESP – Subtitle frequencies – Spanish

Russian

Russian Sentence Corpus – Eye-movement data from reading sentences in Cyrillic script. Data on OSF.

Multiple languages

Celex Lexical Database (WebCelex) – Online version of Celex. English, Dutch, German.

Child Language Data Exchange System (CHILDES)

CLEARPOND Database – Cross-Linguistic Easy-Access Resource for Phonological and Orthographic Neighborhood Densities. Dutch, English, French, German, Spanish.

Corpora Collection, Leipzig University – searchable in 252 languages. Part of Wortschatz.

crr.ugent.be Lexicon Projects

Dutch Lexicon Projects 1 & 2
British Lexicon Project
Dutch Crowdsourcing Project – word recognition times
English Crowdsourcing Project
Spanish Crowdsourcing Project

crr.ugent.be Megastudy listing – Comprehensive listing of word-related megastudies.

GECO, the Ghent Eye-Tracking Corpus – eye-movement measurements from bilingual and monolingual readers while reading a complete novel. Data available from crr.ugent.be.

Google Books Ngram Viewer

International Picture Naming Project – Studies, database of results, stimuli available for use.

The Language Goldmine – Collection of links to linguistic databases and datasets that are available online. Includes 235 data sources, searchable by keywords/tags.

Norms of Valence, Arousal, Dominance, and Age of Acquisition in English and Dutch.

Small World of Words – Word Associations project

University of Oxford Text Archives (OTA) – Corpora in Arabic, Chinese, English, French, German, Scots, Swedish, Welsh.

Wiktionary Frequency Lists

Wortschatz/Leipzig Corpora Collection – A project maintained by the Natural Language Processing Group at the Institute of Computer Science at Leipzig University. Includes searchable corpora for over 250 languages and tools for corpus linguistic analysis.