Annotating alphanumeric expressions in clinical narratives

Carlos Antônio de Souza Perini,
Ana Luisa dos Anjos Resende Guimarães

Abstract

This paper analyzes expressions containing numerals in clinical narrative texts in order to identify potential challenges for their computational processing and to draw up guidelines for their annotation under the Universal Dependencies (UD) framework. From a corpus of 1,000 clinical narratives, tokens containing at least one numeral written in digits were selected and classified according to their presentation format and to how they would be annotated under the UD guidelines. Occurrences of tokens from the ten most frequent classes in the corpus were then studied, and annotation guidelines were drawn up for each of these classes. The guidelines were recorded and will later be used to compile an annotation guide for clinical narrative treebank projects.
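The selection and classification step described above can be illustrated with a minimal sketch. It is not the procedure used in the study: the function names (`has_digit`, `shape`, `top_classes`), the "shape" notation for format classes, and the sample tokens are all hypothetical, and the classes actually adopted in the paper may differ.

```python
from collections import Counter


def has_digit(token: str) -> bool:
    """Return True if the token contains at least one digit character."""
    return any(ch.isdigit() for ch in token)


def shape(token: str) -> str:
    """Map a token to a coarse format class (hypothetical notation).

    Runs of digits collapse to "9", runs of letters collapse to "a",
    and every other character (punctuation, symbols) is kept as-is.
    """
    out = []
    for ch in token:
        if ch.isdigit():
            sym = "9"
        elif ch.isalpha():
            sym = "a"
        else:
            sym = ch
        # Collapse only repeated "9" or "a"; always keep punctuation.
        if not out or out[-1] != sym or sym not in ("9", "a"):
            out.append(sym)
    return "".join(out)


def top_classes(tokens, n=10):
    """Count format classes among digit-bearing tokens, most frequent first."""
    counts = Counter(shape(t) for t in tokens if has_digit(t))
    return counts.most_common(n)


if __name__ == "__main__":
    # Invented tokens in the style of a clinical note (not from the corpus).
    sample = ["PA", "120x80", "mmHg", "T", "37,5", "dipirona", "500mg", "2x/dia"]
    for cls, freq in top_classes(sample):
        print(cls, freq)
```

Under this sketch, a token such as `120x80` falls into the class `9a9` and `37,5` into `9,9`; ranking the classes by frequency then gives a candidate list of the formats most in need of annotation guidelines.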
