The DANTEStocks Corpus: an analysis of the distribution of Universal Dependencies-based Part-of-Speech tags

Ariani Di Felippo; Norton Trevisan Roman; Thiago Alexandre Salgueiro Pardo; Lucas Panta de Moura

doi:10.25189/rabralin.v22i2.2119

V. 22, N. 2 (2023)
Submitted: Nov 20, 2022
Published: Sep 9, 2024
DOI 10.25189/rabralin.v22i2.2119

Abstract

In Natural Language Processing (NLP), Part-of-Speech (PoS) tagging is one of the first processes applied to input data (speech or written text). It assigns the proper part of speech (or word class) to each word in a text. User-Generated Content (UGC), such as tweets, poses additional challenges that undermine current approaches to PoS tagging and call for NLP resources. These resources, however, have so far focused only on the orthographic and lexical phenomena of UGC (e.g., truncated words and graphical stretching), leaving PoS itself aside. To help fill this gap, in this article we characterise DANTEStocks - a corpus of stock market tweets annotated with morphosyntactic information - in terms of the distribution of the PoS tags present in it. With this effort, we intend to provide researchers with a starting point for other investigations, along with a benchmark against which to compare other corpora. Specifically, correctly characterising the corpus according to its PoS tags may support the investigation of the syntactic relations called dependencies, since some of them usually co-occur with specific PoS tags.
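As an illustration of what characterising a corpus by its PoS tag distribution can involve, the sketch below computes the relative frequency of each Universal Dependencies PoS tag (UPOS) in a small sample. It assumes the annotation is available in the standard CoNLL-U format (ten tab-separated columns, with UPOS in the fourth); the abstract does not state how DANTEStocks is distributed. The function name `upos_distribution`, the helper `make_line`, and the sample sentence are hypothetical and made up for this example; they are not taken from the corpus or from the authors' method.

```python
from collections import Counter


def upos_distribution(conllu_text):
    """Return the relative frequency of each UPOS tag in CoNLL-U text."""
    counts = Counter()
    for line in conllu_text.splitlines():
        # Skip sentence separators and comment lines such as "# text = ...".
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed CoNLL-U token line
        token_id, upos = cols[0], cols[3]
        # Skip multiword-token ranges ("3-4") and empty nodes ("5.1"),
        # which do not carry a PoS tag of their own.
        if "-" in token_id or "." in token_id:
            continue
        counts[upos] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tag: n / total for tag, n in counts.items()}


def make_line(token_id, form, upos):
    """Build a CoNLL-U line, leaving the columns not needed here as '_'."""
    return "\t".join([token_id, form, "_", upos, "_", "_", "_", "_", "_", "_"])


# Invented sample sentence (not from DANTEStocks): "PETR4 subiu na bolsa !"
SAMPLE = "\n".join([
    "# text = PETR4 subiu na bolsa !",
    make_line("1", "PETR4", "PROPN"),
    make_line("2", "subiu", "VERB"),
    make_line("3-4", "na", "_"),   # contraction "na" = "em" + "a"
    make_line("3", "em", "ADP"),
    make_line("4", "a", "DET"),
    make_line("5", "bolsa", "NOUN"),
    make_line("6", "!", "PUNCT"),
    "",
])

distribution = upos_distribution(SAMPLE)
for tag, share in sorted(distribution.items()):
    print(f"{tag}\t{share:.3f}")
```

Counting syntactic words rather than surface tokens (the contraction "na" contributes one ADP and one DET) follows the CoNLL-U convention; a comparison across corpora would need the same choice applied to each of them.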

Authorship

Ariani Di Felippo

Holds a degree in Literature from the Federal University of São Carlos (2000) and a master's (2004) and a doctorate (2008) in Linguistics and Portuguese Language from the Universidade Estadual Paulista Júlio de Mesquita Filho. Since 2009, she has been a professor at the Department of Letters at the Federal University of São Carlos, currently holding the position of Associate Professor (Level 4). She completed a post-doctoral internship between Sep/2015 and Sep/2016 at the Department of Computing and Information Science at the University of Pennsylvania (UPenn) (Philadelphia/USA), in which she addressed issues related to coreference with application in Automatic Multi-document Summarization. She is currently working on topics related to syntax and syntactic processing in Portuguese.

Universidade Federal de São Carlos

https://orcid.org/0000-0002-4566-9352

Norton Trevisan Roman

Holds a bachelor's degree in Physics from the Universidade Estadual de Campinas (1998) and a master's and a doctorate in Computer Science, also from the Universidade Estadual de Campinas (2001 and 2007, respectively). He is currently a Professor Livre-Docente and researcher at EACH/USP in the area of Artificial Intelligence (with an emphasis on Computational Linguistics), working mainly on the following topics: computational treatment of sentiment and emotion, dialogue processing, automatic summarization, and applications of Artificial Intelligence techniques to the stock market.

University of São Paulo

https://orcid.org/0000-0002-0563-2045

Thiago Alexandre Salgueiro Pardo

Holds a bachelor's degree in Computer Science from the Federal University of São Carlos (1999), a master's degree in Computer Science from the Federal University of São Carlos (2002), and a doctorate in Computer Science and Computational Mathematics from the University of São Paulo (2005), where he also completed a post-doctoral internship (2005). He is currently an associate professor at the University of São Paulo. He has experience in Artificial Intelligence, working mainly on natural language processing, or computational linguistics, and more specifically on sentiment analysis and opinion mining, on syntactic, semantic, and discourse modelling, and on parsing methods.

University of São Paulo

https://orcid.org/0000-0003-2111-1319

Lucas Panta de Moura

Student of Information Systems at the University of São Paulo. Has experience in Natural Language Processing and Machine Learning.

University of São Paulo

https://orcid.org/0000-0002-0383-6159