COVID-19 Named Entity Recognition for Vietnamese
Thinh Hung Truong, Mai Hoang Dao, Dat Quoc NguyenVinAI Research, Hanoi, Vietnam
Along with the outbreak of the pandemic, information about the COVID-19 is aggregated rapidly through different types of texts in different languages. Particularly, in Vietnam, text reports containing official information from the government about COVID-19 cases are presented in great detail, including de-identified personal information, travel history, as well as information of people who come into contact with the cases. The reports are frequently kept up to date at reputable online news sources, playing a significant role to help the country combat the pandemic. It is thus essential to building systems to retrieve and condense information from those official sources so that related people and organizations can promptly grasp the key information for epidemic prevention tasks, and the systems should also be able to adapt and sync quickly with epidemics that take place in the future. One of the first steps to develop such systems is to recognize relevant named entities mentioned in the texts, which is also known as the named entity recognition (NER) task.
Compared to other languages, data resources for the Vietnamese NER task are limited, including only two public datasets from the VLSP 2016 and 2018 NER shared tasks. These two datasets only focus on recognizing generic entities of person names, organizations, and locations in online news articles. Thus, making them difficult to adapt to the context of extracting key entity information related to COVID-19 patients.
These two concerns lead to our work’s main goals that are:
- To develop a NER task in the COVID-19 specified domain, that potentially impacts research and downstream applications.
- To provide the research community with a new dataset for recognizing COVID-19 related named entities in Vietnamese.
We define 10 entity types to extract key information related to COVID-19 patients, which are especially useful in downstream applications. In general, these entity types can be used in the context of not only the COVID-19 pandemic but also in other future epidemics. The description of each entity type is briefly described in Table 1.
In summary, our dataset construction consists of the following phases:
- COVID-19 related data collection: We crawl articles tagged with “COVID-19” or “COVID” keywords from the reputable Vietnamese online news sites, and segment the crawled news articles’ primary text content into sentences using RDRSegmenter from VnCoreNLP. Relevant sentences related to COVID-19 patients are selected using BM25Plus. We then manually filter out that do not contain information related to patients in Vietnam, resulting in 10027 raw sentences.
- Annotation process:
- We first develop an initial annotation guideline and randomly sample a pilot set of 1000 sentences to annotate manually to use for quality control.
- We divide the whole dataset of 10027 sentences into 10 non-overlapping and equal subset, each of which contains 100 sentences from the pilot set and employ 10 annotators. The annotation quality of each annotator is measured by F1 calculated over the 100 sentences that already have gold annotations from the pilot set. All annotators are asked to revise their annotations until they achieve an F1 of at least 0.92, we then revisit each sentence and make further corrections if needed.
- The resulting dataset consists of 35K entities over 10027 sentences.
How we evaluate
We conduct experiments on our dataset using strong baselines including: BiLSTM-CNN-CRF (denoted as BiL-CRF) and the pre-trained language models XLM-R and PhoBERT. XLM-R is a multi-lingual variant of RoBERTa, pre-trained on a 2.5TB multilingual dataset that contains 137GB of syllable-level Vietnamese texts. PhoBERT is a monolingual variant of RoBERTa, pre-trained on a 20GB word-level Vietnamese dataset. Our main findings are:
- The performances of word-level models are higher than their syllable-level counterparts, showing that automatic Vietnamese word segmentation helps improve NER.
- Fine-tuning the pre-trained language models XLM-R and PhoBERT helps produce better performances than BiLSTM-CNN-CRF.
Why it matters
We presented the first manually-annotated Vietnamese dataset in the COVID-19 domain, focusing on the named entity recognition task. We empirically conduct experiments on our dataset to compare strong baselines and find that the input representations and the pre-trained language models all have influences on this COVID-19 related NER task. We publicly release our dataset and hope that it can serve as the starting point for further Vietnamese NLP research and applications in fighting the COVID-19 and other future epidemics.
See our NAACL 2021 paper for details of the dataset construction and experimental results: https://arxiv.org/abs/2104.03879
Our dataset is publicly released: https://github.com/VinAIResearch/PhoNER_COVID19