CAMeLBERT-CA POS-EGY Model is an Egyptian Arabic POS tagging model that was built by fine-tuning the CAMeLBERT-CA model. For the fine-tuning, we used the ARZTB dataset. Our fine-tuning procedure and the hyperparameters we used can be found in our paper "The Interplay of Variant, Size, and Task Type in Arabic Pre-trained Language Models." Our fine-tuning code can be found here.

You can use the CAMeLBERT-CA POS-EGY model as part of the transformers pipeline. This model will also be available in CAMeL Tools soon. To use the model with a transformers pipeline:

> from transformers import pipeline
> pos = pipeline('token-classification', model='CAMeL-Lab/bert-base-arabic-camelbert-ca-pos-egy')

Citation:

title = "The Interplay of Variant, Size, and Task Type in Arabic Pre-trained Language Models",
booktitle = "Proceedings of the Sixth Arabic Natural Language Processing Workshop",
publisher = "Association for Computational Linguistics",
abstract = "In this paper, we explore the effects of language variants, data sizes, and fine-tuning task types in Arabic pre-trained language models. To do so, we build three pre-trained language models across three variants of Arabic: Modern Standard Arabic (MSA), dialectal Arabic, and classical Arabic, in addition to a fourth language model which is pre-trained on a mix of the three. We also examine the importance of pre-training data size by building additional models that are pre-trained on a scaled-down set of the MSA variant. We compare our different models to each other, as well as to eight publicly available models, by fine-tuning them on five NLP tasks spanning 12 datasets. Our results suggest that the variant proximity of pre-training data to fine-tuning data is more important than the pre-training data size. We exploit this insight in defining an optimized system selection model for the studied tasks."

Over the past few years, Twitter has experienced massive growth, and the volume of its online content has increased rapidly. This content has been a rich source for several studies focused on natural language processing (NLP) research. However, Twitter data pose numerous challenges and obstacles to NLP tasks. For the English language, Twitter content is supported by an NLP tool that provides tweet-specific NLP tasks, presenting significant opportunities for English NLP research and applications; part-of-speech (POS) tagging for English tweets is one of the tasks offered and facilitated by such a tool. In contrast, only a few attempts have been made to develop POS taggers for Arabic content on Twitter. In this paper, we consider POS tagging, one of the NLP tasks that directly affects the performance of subsequent text processing tasks. We introduce three manually annotated datasets for the POS tagging of Arabic tweets: the 'Mixed,' 'MSA,' and 'GLF' datasets, with 3000, 1000, and 1000 Arabic tweets, respectively. In addition, we present an exploratory analysis of the behavior of hashtag use in Arabic tweets, a phenomenon that affects the task of POS tagging. We also present two supervised POS taggers developed using two approaches: Conditional Random Fields (CRF) and Bidirectional Long Short-Term Memory (Bi-LSTM) models. We conclude that the Bi-LSTM-based POS tagger achieves state-of-the-art results on the 'Mixed' dataset, with 96.5% accuracy.
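The paper's own code is not included here, but as a rough illustration of the Bi-LSTM approach described in the abstract above, a minimal tagger could be sketched as follows. Everything in this sketch (the class name, the vocabulary and tag-set sizes, and the dimensions) is an illustrative assumption, not the authors' implementation:

# Minimal Bi-LSTM POS tagger sketch (illustrative only; not the paper's code).
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size, tagset_size, emb_dim=100, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        # The bidirectional LSTM reads each tweet both left-to-right and
        # right-to-left, so every token's state sees its full context.
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        # Map the concatenated forward/backward states to per-token tag scores.
        self.out = nn.Linear(2 * hidden_dim, tagset_size)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)   # (batch, seq_len, emb_dim)
        states, _ = self.lstm(embedded)        # (batch, seq_len, 2 * hidden_dim)
        return self.out(states)                # (batch, seq_len, tagset_size)

# Toy usage with assumed sizes: one "tweet" of five token ids.
model = BiLSTMTagger(vocab_size=10000, tagset_size=20)
tokens = torch.randint(1, 10000, (1, 5))
predicted_tags = model(tokens).argmax(dim=-1)  # (1, 5) predicted tag indices

A CRF-based tagger, by contrast, scores the tag sequence jointly rather than classifying each token independently, which is the main design difference between the paper's two approaches.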
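Finally, as a quick end-to-end check of the CAMeLBERT-CA POS-EGY pipeline constructed in the model card section above, the pipeline can be called directly on a raw Arabic string. The example sentence below (Egyptian Arabic, roughly "how are you doing?") is an illustrative input, and the tags returned follow the ARZTB annotation scheme:

> text = 'عامل ايه ؟'
> pos(text)

Each element of the returned list is a dict in the standard transformers token-classification format, giving the token ('word'), its predicted tag ('entity'), and a confidence score ('score').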