The DANTEStocks Corpus: an analysis of the distribution of Universal Dependencies-based Part-of-Speech tags

Ariani Di Felippo,
Norton Trevisan Roman,
Thiago Alexandre Salgueiro Pardo,
Lucas Panta de Moura

Abstract

In Natural Language Processing (NLP), Part-of-Speech (PoS) tagging is one of the first processes applied to input data (speech or written text). It assigns a part of speech (or word class) to each word in a text. User-Generated Content (UGC) such as tweets, however, poses additional challenges that undermine current approaches to PoS tagging and call for dedicated NLP resources. Such resources have so far focused only on the orthographic and lexical phenomena of UGC (e.g., truncated words and graphical stretching), leaving PoS itself aside. To help fill this gap, in this article we characterise DANTEStocks, a corpus of stock market tweets annotated with morphosyntactic information, in terms of the distribution of its PoS tags. With this effort, we intend to provide researchers with a starting point for further investigations, along with a benchmark against which to compare other corpora. In particular, an accurate characterisation of the corpus by PoS tag may support the investigation of the syntactic relations known as dependencies, since some of them usually co-occur with specific PoS tags.
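As a rough illustration of what characterising a corpus by PoS tag distribution involves, the sketch below counts tags in text assumed to be in CoNLL-U, the standard Universal Dependencies exchange format. The abstract does not state how DANTEStocks is distributed, so the format is an assumption here; the function name `upos_distribution` is hypothetical, and the sample sentence is invented for the example rather than taken from the corpus.

```python
from collections import Counter


def upos_distribution(conllu_text):
    """Return {tag: (count, relative_frequency)} for the UPOS column.

    Assumes CoNLL-U input: 10 tab-separated columns per token line,
    with the universal PoS tag in the fourth column.
    """
    counts = Counter()
    for line in conllu_text.splitlines():
        line = line.strip()
        # Skip blank sentence separators and '#' comment lines.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue
        token_id, upos = cols[0], cols[3]
        # Skip multiword-token ranges (e.g. "3-4") and empty nodes (e.g. "5.1"),
        # which carry no PoS tag of their own.
        if "-" in token_id or "." in token_id:
            continue
        counts[upos] += 1
    total = sum(counts.values())
    return {tag: (n, n / total) for tag, n in counts.most_common()}


# Invented sample (not from DANTEStocks): "PETR4 subiu no pregão",
# where the contraction "no" is split into "em" + "o".
rows = [
    ["1", "PETR4", "PETR4", "PROPN", "_", "_", "2", "nsubj", "_", "_"],
    ["2", "subiu", "subir", "VERB", "_", "_", "0", "root", "_", "_"],
    ["3-4", "no", "_", "_", "_", "_", "_", "_", "_", "_"],
    ["3", "em", "em", "ADP", "_", "_", "5", "case", "_", "_"],
    ["4", "o", "o", "DET", "_", "_", "5", "det", "_", "_"],
    ["5", "pregão", "pregão", "NOUN", "_", "_", "2", "obl", "_", "_"],
]
sample = "# text = PETR4 subiu no pregão\n" + "\n".join("\t".join(r) for r in rows)

dist = upos_distribution(sample)
for tag, (count, share) in dist.items():
    print(f"{tag}\t{count}\t{share:.1%}")
```

Counting syntactic words rather than surface tokens means a contraction contributes one tag per component, which matters when comparing distributions across corpora that tokenise differently.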
