ajdavidl/Portuguese-NLP

Latest commit

6891855 · Feb 11, 2025

Portuguese-NLP

List of resources and tools developed with focus on Portuguese.

Datasets

  • #PraCegoVer - multi-modal dataset with Portuguese captions based on posts from Instagram.
  • 18th-century Portuguese medical texts
  • AG_news pt - automatic translation of the AG's corpus of news articles.
  • Alpaca data pt-br - Stanford Alpaca dataset translated into Brazilian Portuguese using the Helsinki-NLP/opus-mt-tc-big-en-pt model.
  • AspectBR - Aspect-based annotated dataset of web consumer reviews.
  • ASSIN - a dataset with semantic similarity score and entailment annotations. (HuggingFace)
  • ASSIN 2 - successor to ASSIN. (HuggingFace)
  • Automated Essay Score (AES) ENEM Dataset - Benchmark for automatic essay scoring in Portuguese (HuggingFace)
  • Aya Dataset PT - CohereForAI Aya Dataset filtered for Portuguese (PT).
  • BlogSet-BR - a collection of posts gathered from the Blogspot platform, written by Brazilian users.
  • BLUEX - A benchmark based on Brazilian Leading Universities Entrance eXams.
  • BoolQ - automatic translation of BoolQ.
  • br-quad-2.0 - Stanford Question Answering Dataset (SQuAD) 2.0 translated into Brazilian Portuguese (PT-BR).
  • Brands.Br - a Portuguese Reviews Corpus
  • Brazilian Court Decisions - collection of 4043 Ementa (summary) court decisions and their metadata from the Tribunal de Justiça de Alagoas (TJAL), the State Supreme Court of Alagoas (Brazil).
  • Brazilian E-Commerce - Brazilian E-Commerce Public Dataset by Olist store.
  • Brazilian Headlines Sentiments - Dataset containing sentiment analysis of Brazilian news agencies headlines.
  • Brazilian Portuguese Literature Corpus - 3.7 million word corpus of Brazilian literature published between 1840 and 1908.
  • Brazilian Portuguese Narrative Essays Dataset - Dataset for Automatic Essay Scoring of Brazilian Portuguese Narrative Essays.
  • Brazilian Portuguese Sentiment Analysis Datasets.
  • Brazilian TCU's judgments - Judgments of the Federal Court of Accounts of Brazil (TCU).
  • BrWaC - Brazilian Portuguese Web as Corpus.
  • BrWac2Wiki - a dataset for multi-document summarization in Portuguese.
  • B2W-Reviews01 - product reviews.
  • Canarim - A Large-Scale Dataset of Web Pages in the Portuguese Language (HuggingFace).
  • Carolina - Corpus Geral do Português Brasileiro Contemporâneo (HuggingFace).
  • Capes - parallel corpus of theses and dissertations abstracts in English and Portuguese.
  • CC100-Portuguese - Created by Conneau & Wenzek et al. in 2020. This dataset is one of the 100 corpora of monolingual data that were processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository.
  • CETENFolha - news from the newspaper Folha de S. Paulo.
  • CHAVE - collection for Information Retrieval and Question Answering.
  • CINTIL Corpus - a linguistically interpreted corpus of Portuguese.
  • ClinicalNER - Clinical Named Entity Recognition in Portuguese.
  • Textual Complexity for School Stages of the Brazilian Educational System (Complexidade Textual para Estágios Escolares do Sistema Educacional Brasileiro).
  • CORAA - dataset for Automatic Speech Recognition.
  • CORAA SER - Emotion Recognition from Brazilian Portuguese Informal Spontaneous Speech.
  • CrawlPT_dedup - CrawlPT (deduplicated) is composed of three corpora: brWaC, C100-PT, and OSCAR-2301.
  • CSTNews - a corpus with 50 clusters of news texts with their multi-document summaries, as well as several discourse and semantic annotations.
  • C-ORAL-BRASIL - This project is dedicated to the study of Brazilian Portuguese spontaneous speech and, more broadly, to the compilation of spoken corpora.
  • DANTEStocks - Corpus of stock market tweets written in Brazilian Portuguese and annotated with named entities according to HAREM's taxonomy.
  • DEEPAGÉ - Answering Questions in Portuguese about the Brazilian Environment.
  • DNLT-BP - Datasets of Neuropsychological Language Tests in Brazilian Portuguese.
  • ENEM Challenge - The exam consists of an essay and an objective part containing 180 multiple-choice questions.
  • ENEM-2022 and ENEM-2023 - These projects encompass all multiple-choice questions from the last two editions of the Exame Nacional do Ensino Médio (ENEM), the main standardized entrance examination adopted by Brazilian universities.
  • Essay-BR - a corpus of essays for the Brazilian Portuguese language.
  • Extended Essay-BR - Extended version of the Essay-BR corpus.
  • FACTCK.BR - A dataset to study Fake News in Portuguese.
  • FactNews - dataset to predict sentence-level factuality of news reporting.
  • fake voices - deepfakes in Brazilian Portuguese created with XTTS model.
  • Fake.Br - aligned true and fake news written in Brazilian Portuguese (HuggingFace).
  • Central_de_fatos - (HuggingFace).
  • FakeNewsSet - (HuggingFace).
  • Fakepedia-Corpus - fake news dataset.
  • FakeRecogna - dataset comprised of real and fake news (HuggingFace).
  • FakeWhatsApp.Br - An annotated Corpus of WhatsApp messages in PT-BR for automatic detection of textual misinformation.
  • FKTC - FaKe news Text Collections.
  • Floresta Sintá(c)tica - treebank for Portuguese.
  • HAREM first - evaluation contest for named entity recognizers in Portuguese.
  • HAREM second - evaluation contest for named entity recognizers in Portuguese.
  • HateBR - large-scale expert annotated corpus of Brazilian Instagram comments for hate speech and offensive language detection on the web and social media.
  • Historical Portuguese Corpora - tools and resources for manipulation of historical corpora and management of historical dictionaries.
  • IMDB pt - automatic translation of IMDB.
  • InferBR - Natural Language Inference dataset.
  • Iudicium Textum Dataset - contains legal documents created by the Brazilian Federal Supreme Court in its integral composition (paper).
  • LeNER-Br - a Dataset for Named Entity Recognition in Brazilian Legal Text.
  • LegalPT_dedup - LegalPT (deduplicated) aggregates the maximum amount of publicly available legal data in Portuguese.
  • Lex2Kids - lexicon of the Portuguese words most heard by children.
  • Mac-Morpho - Brazilian Portuguese texts annotated with part-of-speech tags.
  • MilkQA - a dataset of dense questions for the task of answer selection.
  • Minutes of Central Bank of Brazil - Minutes of the Monetary Policy Committee of the Central Bank of Brazil.
  • NER in Brazilian Portuguese tweets - Twitter messages in pt-br annotated for the entities PER, LOC and ORG.
  • NERDE - Documents from CADE's jurisprudence annotated for the entities ORG, PER, TEMPO, LOC, LEG (legislation), DOCS (documents), VALOR.
  • News-Crawl-PT - Monolingual News Crawl used for WMT.
  • News of the site Folha de São Paulo - news of the Brazilian Newspaper Folha de São Paulo.
  • News published in Brazil - news compilation of the Globo group.

Multilingual datasets

  • A Multilingual Dataset for Investigating Stereotypes and Negative Attitudes Towards Migrant Groups in Large Language Models
  • askD - ELI5 dataset adapted to the Medical Questions (AskDocs) subreddit.
  • English-Portuguese Sentences - English-Portuguese Sentences from the Tatoeba Project.
  • EUR-Lex - multilingual corpus in all the official languages of the European Union.
  • Europarl - European Parliament Proceedings Parallel Corpus 1996-2011.
  • Europarl-ST - Multilingual Speech Translation Corpus that contains paired audio-text samples for Speech Translation, constructed from the debates carried out in the European Parliament between 2008 and 2012.
  • mc4 - a multilingual, colossal, cleaned version of Common Crawl's web crawl corpus.
  • mfaq - multilingual corpus of Frequently Asked Questions parsed from the Common Crawl.
  • MKQA - Multilingual Knowledge Questions & Answers (github).
  • MQA - multilingual corpus of Questions and Answers (MQA) parsed from the Common Crawl.
  • MMARCO - Multilingual version of the MS MARCO passage ranking dataset.
  • mRobust - Multilingual version of the TREC 2004 Robust passage ranking dataset
  • MultiCoNER - a large multilingual dataset for Named Entity Recognition.
  • MuST-C - multilingual speech translation corpus.
  • OpenSubtitles - collection of translated movie subtitles.
  • OSCAR - Open Super-large Crawled Aggregated coRpus.
  • Tatoeba - a large database of sentences and translations.
  • TED2020 - contains a crawl of nearly 4000 TED and TED-X transcripts from July 2020.
  • TSAR-2022-Shared-Task - TSAR2022 Shared Task on Lexical Simplification.
  • WikiANN - multilingual named entity recognition dataset consisting of Wikipedia articles annotated with LOC (location), PER (person), and ORG (organisation) tags in the IOB2 format.
  • WikiLingua - Multilingual abstractive summarization dataset extracted from WikiHow.
  • WikiMatrix - Parallel Sentences in 1620 Language Pairs from Wikipedia.
  • Wikiner - Learning multilingual named entity recognition from Wikipedia.
  • WikiNEuRal - Combined Neural and Knowledge-based Silver Data Creation for Multilingual NER (EMNLP 2021).
  • Wikipedia - Wikipedia dataset containing cleaned articles of all languages.
  • XFORMAL - A Benchmark for Multilingual Formality Style Transfer.
  • XLSUM - 1.35 million professionally annotated article-summary pairs from BBC.
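
The IOB2 format mentioned in the WikiANN entry marks the first token of each entity with `B-<type>`, continuation tokens with `I-<type>`, and everything else with `O`. As a minimal sketch (the function name `iob2_to_spans` and the sample sentence are illustrative assumptions, not part of any dataset's tooling), tagged tokens can be grouped into entity spans like this:

```python
def iob2_to_spans(tokens, tags):
    """Group IOB2-tagged tokens into (entity_type, text) spans.

    Hypothetical helper for illustration; an I- tag that does not
    continue an open entity of the same type is treated like O.
    """
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity; close any open one first.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # Continuation of the currently open entity.
            current_tokens.append(token)
        else:
            # O tag (or a stray I- tag): close any open entity.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


# Made-up example sentence with PER and LOC tags.
tokens = ["Fernando", "Pessoa", "nasceu", "em", "Lisboa", "."]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC", "O"]
spans = iob2_to_spans(tokens, tags)
print(spans)
```

The same grouping applies to any of the NER datasets above that ship IOB2 tags, although each dataset's own label set and loading code may differ.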

Lexicon

  • BATS-PT - manual translation of the lexicographic portion of the Bigger Analogy Test Set (BATS) to Portuguese
  • br.ispell - Ispell dictionary for Brazilian Portuguese (github).
  • Conceptnet - an open, multilingual knowledge graph.
  • DicSin - Dictionary of synonyms and antonyms.
  • lexiconPT - R package that provides lexicons for Portuguese Text Analysis.
  • lexicons - Dictionaries of names, surnames, acronyms and their extensions, stop-words, etc.
  • LIWC - Linguistic Inquiry and Word Count (dictionary)
  • Onto.PT - Ontologia Lexical para o Português.
  • OpenWordnet-PT - an open access wordnet for Portuguese (site).
  • OpLexicon - a sentiment lexicon for the Portuguese language.
  • palavras - Word list of Brazilian Portuguese.
  • PAPEL.
  • pt-br - Wordlist, verbs, conjugations, term frequencies.
  • PT-LKB - Large Portuguese Lexical-Semantic Knowledge Base
  • PULO - Portuguese Unified Lexical Ontology.
  • SentiLex-PT - a sentiment lexicon for Portuguese.
  • Stopwords - Portuguese stopwords collection.
  • Tep2.
  • Unitex-PB - lexical resources.
  • VaLexPB - a lexicon of Brazilian Portuguese verb valences.
  • VerbNet.Br 1.0 - verbal lexicon of Brazilian Portuguese.
  • wikidict-dsl-pt - Wikidata Bilingual DSL Dictionaries.
  • Wordnetaffectbr - vocabulary of emotion words.
  • Wordnet.Br - Portuguese WordNet.

Models

  • Albertina PT-BR - It is an encoder of the BERT family for the Portuguese language - the American variant from Brazil.
  • Albertina PT-PT - It is an encoder of the BERT family for the Portuguese language - the European variant from Portugal.
  • Alpaca-LoRA-PTBR - Low-Rank LLaMA Instruct-Tuning.
  • BART - BART pre-trained on Portuguese.
  • BERTimbau - BERTimbau Base is a pretrained BERT model for Brazilian Portuguese that achieves state-of-the-art performances on three downstream NLP tasks: Named Entity Recognition, Sentence Textual Similarity and Recognizing Textual Entailment (Github).
  • BioBERTpt - fine-tuned BERT models trained on the clinical domain for Portuguese language (Github).
  • Bode - a fine-tuned LLaMA 2-based model for Portuguese prompts (13b).
  • Cabrita - A Portuguese instruction-finetuned LLaMA (Github).
  • DeBERTinha - A DeBERTa V3 XSmall adapted to the Brazilian Portuguese language (Github).
  • Electra - Electra model trained on BRWAC.
  • FinBERT-PT-BR - a pre-trained NLP model to analyze sentiment of Brazilian Portuguese financial texts.
  • Gervasio-PT-BR - It is a decoder of the GPT family for the Portuguese language - the American variant from Brazil.
  • Gervasio-PT-PT - It is a decoder of the GPT family for the Portuguese language - the European variant from Portugal.
  • GlórIA 1.3B - A Large Language Model focused on European Portuguese (HuggingFace).
  • GPT2 small - GPorTuguese-2 (Portuguese GPT-2 small) is a state-of-the-art language model for Portuguese based on the GPT-2 small model.
  • GPT-Neo small - a version of GPT-Neo 125M by EleutherAI fine-tuned for the Portuguese language.
  • GPT2-Bio-PT - a biomedical fine-tuned version of GPorTuguese-2 (Github).
  • NERDE-base - BERTimbau finetuned to NER on Judicial Documents.
  • roberta-pt-br
  • RoBERTaCrawlPT-base - a generic Portuguese Masked Language Model pretrained from scratch on the CrawlPT corpora.
  • RoBERTaLexPT-base - Portuguese Masked Language Model pretrained from scratch on the LegalPT and CrawlPT corpora.
  • Sabiá - Sabiá-7B is a Portuguese language model developed by Maritaca AI.
  • Sabiá 2 - Language model trained on Portuguese text, especially in the Brazilian domain.
  • T5 - T5 model on Brazilian Portuguese data.
  • tgf-xlm-roberta-base-pt-br - a fine-tuned version of xlm-roberta-base on the BrWac dataset (Github).
  • Wav2vec - Fine-tuned facebook/wav2vec2-large-xlsr-53 on Portuguese using the train and validation splits of Common Voice 6.1.

Multilingual Models

  • Bloom - BigScience Large Open-science Open-access Multilingual Language Model.
  • mBert - Pretrained model on the top 104 languages with the largest Wikipedia using a masked language modeling (MLM) objective.
  • mDeBERTa - multilingual version of DeBERTa, which improves the BERT and RoBERTa models using disentangled attention and an enhanced mask decoder.
  • mGPT - Multilingual GPT model. An autoregressive GPT-like model.
  • mMiniLM - mMiniLM-L6-v2 Reranker finetuned on mMARCO
  • mT5 - Multilingual T5. A massively multilingual pre-trained text-to-text transformer.
  • XLM-RoBERTa - XLM-RoBERTa model pre-trained on 2.5TB of filtered CommonCrawl data containing 100 languages.
  • LaBSE - Language-agnostic BERT Sentence Encoder (LaBSE) is a BERT-based model trained for sentence embedding for 109 languages.

Word Embeddings

  • fastText - Multi-lingual word vectors.
  • LASER - Language-Agnostic SEntence Representations.
  • NILC-Embeddings - Word embeddings trained on Portuguese by USP.
  • MUSE - Multilingual Unsupervised and Supervised Embeddings.
  • word vectors - Pre-trained word vectors of 30+ languages.

Metrics

  • Coh-Metrix-Port - an adaptation of the Coh-Metrix text analysis tool to the Brazilian Portuguese language.
  • NILC-Metrix - gathers the metrics developed over more than a decade at the NILC Lab.

Leaderboards

  • Open PT LLM Leaderboard - aims to provide a benchmark for the evaluation of Large Language Models (LLMs) in the Portuguese language across a variety of tasks and datasets.

Frameworks

Institutions

Tools

  • Apertium-por - Apertium linguistic data for Portuguese.
  • Autocorrect - Spelling corrector in Python.
  • BrGram - Computational grammar fragment of Brazilian Portuguese in the LFG formalism implemented in XLE.
  • Dicio API - Portuguese dictionary API.
  • dict-pt-br - dictionary for Brazilian Portuguese.
  • Languagetool - Style and Grammar Checker for 25+ Languages.
  • LegalNLP - Natural Language Processing Methods for the Brazilian Legal Language.
  • LexML Parser - parser for legal documents.
  • LX parser - statistical constituency parser for Portuguese.
  • metaphone-ptbr - Metaphone algorithm for the Portuguese language.
  • mlconjug3 - a Python library to conjugate verbs in Portuguese and other languages.
  • MorphoBr - Resources for morphological analysis of Portuguese.
  • OpCluster - Automatic extraction and clustering of fine-grained opinions.
  • Phonemizer - Simple text to phones converter for multiple languages.
  • PorGram - Open source computational grammar for Portuguese in the HPSG formalism.
  • pymetaphone-br - Metaphone algorithm package for the Portuguese language.
  • pysentimiento - Multilingual toolkit for Sentiment Analysis and Social NLP tasks.
  • pyspellchecker - Multilingual Spell Checking.
  • RBAMR - A Rule-Based AMR Parser for Portuguese.
  • Verbecc - Complete Conjugation of any Verb using Machine Learning for French, Spanish, Portuguese, Italian and Romanian.

Other lists

Other links
