One of my research interests is to create new resources that can be used for research in linguistics and NLP. Here you can find some of them.
If you use any of these resources in your research, please refer to the respective description paper, available as a pdf.
Colonia is a Portuguese historical corpus containing texts from the 16th to the early 20th century, lemmatized and annotated with POS tags. The corpus is available for download and through a graphical CQPWeb-based interface. Since May 2014, thanks to Diana Santos (University of Oslo), Colonia has also been available at Linguateca. Since October 2014, thanks to Eckhard Bick (University of Southern Denmark), a version of Colonia tagged using the PALAVRAS parsing system has been available through CorpusEye. Since August 2017, thanks to Rachael Tatman, Colonia has been available on Kaggle.
CompLex is an English multi-domain corpus compiled for lexical complexity prediction and annotated using a five-point Likert scale. It was the official dataset of the Lexical Complexity Prediction (LCP) shared task at SemEval 2021.
DSL Corpus Collection (DSLCC) DSLCC pdf
A collection of journalistic corpora written in closely related languages and language varieties. The dataset has been used in the DSL Shared Tasks in 2014, 2015, 2016, and 2017.
LIdioms: A Multilingual Linked Idioms Data Set in Five Different Languages LIdioms pdf
This is a multilingual linked idioms data set in five different languages (English, Portuguese, Italian, German, Russian). It is currently being expanded to other languages.
NLI-PT: A Portuguese Native Language Identification Dataset NLI-PT pdf
A collection of 1,868 student essays written by learners of European Portuguese whose native languages (L1s) are Chinese, English, Spanish, German, Russian, French, Japanese, Italian, Dutch, Tetum, Arabic, Polish, Korean, Romanian, and Swedish.
OGTD is an offensive language dataset for Greek annotated following the OLID guidelines. OGTD was used in the OffensEval 2020: Multilingual Offensive Language Identification in Social Media (SemEval 2020 - Task 12) shared task.
Offensive Language Identification Dataset (OLID) OLID
OLID contains a collection of tweets annotated using a hierarchical annotation model that encompasses the following three levels: A: Offensive Language Detection; B: Categorization of Offensive Language; C: Offensive Language Target Identification. OLID was used in the OffensEval 2019: Identifying and Categorizing Offensive Language in Social Media (SemEval 2019 - Task 6) shared task.
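To make the three-level hierarchy concrete, here is a minimal sketch of how the levels depend on one another. The label strings (OFF/NOT, TIN/UNT, IND/GRP/OTH) are assumed from the OLID description paper and are not listed on this page; the function name `is_valid_annotation` is hypothetical and is not part of any released OLID tooling.

```python
from typing import Optional

# Label sets per level. These strings are an assumption taken from the
# OLID description paper, not from this page.
LEVEL_A = {"OFF", "NOT"}         # A: offensive vs. not offensive
LEVEL_B = {"TIN", "UNT"}         # B: targeted insult vs. untargeted
LEVEL_C = {"IND", "GRP", "OTH"}  # C: individual, group, or other target


def is_valid_annotation(a: str,
                        b: Optional[str] = None,
                        c: Optional[str] = None) -> bool:
    """Check that a label triple respects the A -> B -> C hierarchy."""
    if a not in LEVEL_A:
        return False
    if a == "NOT":
        # Non-offensive tweets carry no level B or level C label.
        return b is None and c is None
    if b not in LEVEL_B:
        # Offensive tweets need a level B label.
        return False
    if b == "UNT":
        # Untargeted offense has no target, so no level C label.
        return c is None
    # Targeted offense needs a level C target label.
    return c in LEVEL_C
```

Under these assumptions, a level B label only applies to tweets marked offensive at level A, and a level C label only applies to tweets marked as targeted at level B.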
Semi-Supervised Offensive Language Identification Dataset (SOLID) SOLID
SOLID contains over 9 million tweets annotated following OLID's three-level taxonomy. SOLID was used in the OffensEval 2020: Multilingual Offensive Language Identification in Social Media (SemEval 2020 - Task 12) shared task.
Frequency lists from comparable Spanish corpora Word Unigrams POS and Morphology pdf
These two frequency lists were produced to compare linguistic features of four Spanish varieties (Argentina, Mexico, Peru, and Spain) as described in this 2013 paper.