Resources

Here you will find a collection of tools and data that may be applied to language research. Data resources, corpora, and lexical databases are also organized by language.

Software & Tools

ASV Online Toolbox – A collection of tools that can be used to explore written language data. Part of Wortschatz.

Coh-Metrix – A tool for evaluating the difficulty of written text.

DMDX – Experiment presentation software for PC.

PsychoPy – Open-source experiment presentation software written in Python. Includes a visual Builder interface and a Coder interface.

PsyScope X – Experiment presentation software for Mac.

Phonotactic Probability Calculator (University of Kansas).

Worldlikeness: A Web-based Tool for Typological Psycholinguistics – Online wordlikeness judgments for experimenters, participants, and researchers.

Resources by Language

Chinese

Chinese Character Writing – latency, accuracy, duration, & lexical characteristics for 1600 written characters

Chinese Lexicon Project – 2500 Chinese characters

The Chinese Lexicon Project – 25000 traditional Chinese two-character compounds

Chinese Single-Character Word Database – crr.ugent.be also hosts naming latencies for this database; see Liu et al. (2007).

MELD-SCH (MEgastudy of Lexical Decision in Simplified CHinese)

Traditional Chinese Psycholinguistic Database

SUBTLEX-CH – Subtitle frequencies – Chinese, includes part of speech information

Dutch

Age of Acquisition Norms & Concreteness norms for 30,000 Dutch Words

Baldey: a database of auditory lexical decision – 5541 Dutch words, includes auditory stimuli and associated Praat textgrids.

Dutch Word Association Database

Speeded word fragment completion megastudy

SUBTLEX-NL – Subtitle frequencies – Dutch, includes part of speech information

English

Affective ratings for nearly 14 thousand English words

ARC Nonword Database – Generate nonwords using specified lexical characteristics.

Age of Acquisition Norms (English)

The Auditory English Lexicon Project: A multi-talker, multi-region psycholinguistic database of 10,170 spoken words and nonwords – article

British National Corpus (BNC)

The Calgary semantic decision project: concrete/abstract decision data for 10,000 English words

Children’s Printed Word Database (University of Essex)

CompLex: An eye-movement database of compound word reading in English

Concreteness Ratings in English – 40 thousand English lemmas.

corpus.byu.edu – Series of corpora with site-based search functions, including: Corpus of Contemporary American English, NOW Corpus, BNC, Strathy Corpus (Canada).

Disyllabic nonword reading database

English Lexicon Project – Database of lexical decision and naming reaction times and error rates. The visual lexical decision experiment uses 40481 words and an equal number of nonwords.

Form Priming Project: A behavioral database for masked form priming.

Irvine Phonotactic Online Dictionary (IPhOD) – Words & pseudowords, organized by phonemes (i.e., as search criteria, number of phonological neighbours, etc.).

The kilo-word ERP database (lexical decision) – article, data

LaDEC: Large database of English compounds – Database containing over 8000 English compounds along with psycholinguistic variables including family size, bigram frequency, sentiment (valence), and word frequency. Described in an open-access paper.

A behavioral database for masked form priming – article, data

The Massive Auditory Lexical Decision (MALD) database – article, data

MRC Psycholinguistic Database – Generate stimuli lists based on lexical characteristics.

The Natural Stories Corpus – article, data

The Provo Corpus: A large eye-tracking corpus with predictability norms – article, data

The past tense inflection project (PTIP): speeded past tense inflections, imageability ratings, and past tense consistency measures for 2,200 verbs

Phonetic biases in voice key response time measurements – Voice response time for English monosyllables

Reading aloud megastudy – article, data

Reading time data for evaluating broad-coverage models of English sentence processing

Response times in CVC naming study at WSU (Treiman et al., 1995)

Semantic Priming Project (Montana State University) – Naming & lexical decision priming

The Seidenberg and Waters (1989) Mega-study

SUBTLEX-UK – Subtitle frequencies – British English, includes part of speech information

SUBTLEX-US – Subtitle frequencies – American English

University of South Florida Free Association Norms

WordNet – Lexical database organized by word senses, synsets, and word relations.

French

French Lexicon Project

Lexique – French corpus, including word frequencies, phonological and grammatical information, and more. Version 3.

MEGALEX – French database of visual and auditory lexical decision responses

German

The Developmental Lexicon Project – Visual word recognition in German for 1152 words across 800 children.

dlexDB – German word frequencies. See also The Digital Dictionary of German (www.dwds.de).

SUBTLEX-DE – Subtitle frequencies – German

Greek

SUBTLEX-GR – Subtitle frequencies – Greek

Malay

The Malay Lexicon Project – lexical and behavioural information for 9,592 Malay words

Polish

SUBTLEX-PL – Subtitle frequencies – Polish, includes part of speech information

Portuguese

SUBTLEX-PT – Subtitle frequencies – Portuguese

SUBTLEX-PT-BR – Subtitle frequencies – Brazilian Portuguese

Spanish

SUBTLEX-ESP – Subtitle frequencies – Spanish

Russian

Russian Sentence Corpus – Eye-movement data in Cyrillic script. Data on OSF.

Multiple languages

Celex Lexical Database (WebCelex) – Online version of Celex. English, Dutch, German.

Child Language Data Exchange System (CHILDES)

CLEARPOND Database – Cross-Linguistic Easy-Access Resource for Phonological and Orthographic Neighborhood Densities. Dutch, English, French, German, Spanish.

Corpora Collection, Leipzig University – searchable in 252 languages. Part of Wortschatz.

crr.ugent.be Lexicon Projects

  • Dutch Lexicon Projects 1 & 2
  • British Lexicon Project
  • Dutch Crowdsourcing Project – word recognition times
  • English Crowdsourcing Project
  • Spanish Crowdsourcing Project

crr.ugent.be Megastudy listing – Comprehensive listing of word-related megastudies.

GECO, the Ghent Eye-Tracking Corpus – eye-movement measurements from bilingual and monolingual participants reading a complete novel. Data available from crr.ugent.be.

Google Books Ngram Viewer

International Picture Naming Project – Studies, database of results, stimuli available for use.

The Language Goldmine – Collection of links to linguistic databases and datasets that are available online. Includes 235 data sources, searchable by keywords/tags.

Norms of Valence, Arousal, Dominance, and Age of Acquisition in English and Dutch.

University of Oxford Text Archives (OTA) – Corpora in Arabic, Chinese, English, French, German, Scots, Swedish, Welsh.

Wiktionary Frequency Lists

Wortschatz/Leipzig Corpora Collection – A project maintained by the Natural Language Processing Group at the Institute of Computer Science at Leipzig University. Includes searchable corpora for over 250 languages and tools for corpus linguistic analysis.
