Leveraging Machine-Labeled Data and Cross-Lingual Transfer for NER in Urdu and Sindhi
Volume 19, Issue 1, 2025
Author(s) | Nazish Basir* (University of Sindh, Jamshoro, Pakistan, nazish.basir@usindh.edu.pk); Dil Nawaz Hakro (University of Sindh, Jamshoro, Pakistan, dilnawaz@usindh.edu.pk); Khalil-Ur-Rehman Khoumbati (University of Sindh, Jamshoro, Pakistan, khalil.khoumbati@usindh.edu.pk); Zeeshan Bhatti (University of Sindh, Jamshoro, Pakistan, zeeshan.bhatti@usindh.edu.pk) |
Abstract | Named Entity Recognition (NER) is a fundamental task in Natural Language Processing (NLP) that involves identifying and classifying entities within text. In low-resource languages such as Urdu and Sindhi, researchers face significant challenges due to limited annotated datasets and complex linguistic features, including rich morphology, agglutination, and the absence of capitalization cues. In this study, we introduce an approach that combines machine-labeled data generation with multilingual transformer models and applies cross-lingual transfer learning from Urdu to improve NER performance in Sindhi. To the best of our knowledge, this is the first work exploring cross-lingual NER transfer from Urdu to Sindhi. We also introduce two new entity types, colors and foods, which have not previously been explored in Urdu and Sindhi NER research. To reduce the need for extensive manual annotation, we used a bagging-based ensemble of Conditional Random Field (CRF) models to generate high-confidence machine-labeled datasets (an illustrative sketch of this step appears after the metadata table below). These models were trained on subsets of a smaller dataset annotated by language experts. The machine-labeled data substantially increased the volume of training data, which is essential for low-resource languages. We pre-trained two models, Multilingual BERT (mBERT) and XLM-RoBERTa, on the machine-labeled data and fine-tuned them on the human-annotated datasets. Our experiments demonstrated improvements in NER performance for both languages. In particular, for Sindhi, the XLM-RoBERTa model's F1 score increased from 0.302 (without pre-training) to 0.681 after pre-training on combined machine-labeled Urdu and Sindhi data, an increase of approximately 125%. Our results demonstrate the effectiveness of incorporating machine-labeled data and cross-lingual knowledge transfer from Urdu to Sindhi. |
Keywords | Named Entity Recognition; NER; Urdu; Sindhi; Machine-Labeled Data; Cross-Lingual; Transfer Learning; mBERT; XLM-RoBERTa |
Year | 2025 |
Volume | 19 |
Issue | 1 |
Type | Research paper, manuscript, article |
Journal Name | Journal of Information & Communication Technology |
Publisher Name | ILMA University |
JEL Classification | - |
DOI | - |
ISSN no (E, Electronic) | 2075-7239 |
ISSN no (P, Print) | 2415-0169 |
Country | Pakistan |
City | Karachi |
Institution Type | University |
Journal Type | Open Access |
Manuscript Processing | Blind Peer Reviewed |
Format | |
Paper Link | https://jict.ilmauniversity.edu.pk/journal/jict/19.1/1.pdf |
Page | 1-8 |
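
The abstract describes a bagging-based ensemble of CRF models that machine-labels additional sentences before transformer pre-training. The paper page does not give the feature templates, the number of bagged models, or the agreement criterion, so the following is a minimal sketch under assumed choices (sklearn-crfsuite token features, five bootstrap CRFs, and full-agreement filtering); names such as `word_features` and `label_with_ensemble` are illustrative, not the authors' code.

```python
# Minimal sketch (not the authors' code): bag CRFs trained on bootstrap
# samples of a small expert-annotated set, then keep only sentences on
# which all ensemble members agree as "high-confidence" machine labels.
import random
import sklearn_crfsuite  # pip install sklearn-crfsuite


def word_features(sentence, i):
    """Simple token-level features; the paper's actual feature set is not given."""
    word = sentence[i]
    return {
        "word": word,
        "prefix3": word[:3],
        "suffix3": word[-3:],
        "is_digit": word.isdigit(),
        "prev_word": sentence[i - 1] if i > 0 else "<BOS>",
        "next_word": sentence[i + 1] if i < len(sentence) - 1 else "<EOS>",
    }


def featurize(sentences):
    return [[word_features(s, i) for i in range(len(s))] for s in sentences]


def train_bagged_crfs(sentences, labels, n_models=5, seed=13):
    """Train n_models CRFs, each on a bootstrap sample of the annotated data."""
    rng = random.Random(seed)
    X, models = featurize(sentences), []
    for _ in range(n_models):
        idx = [rng.randrange(len(sentences)) for _ in range(len(sentences))]
        crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1,
                                   max_iterations=100)
        crf.fit([X[i] for i in idx], [labels[i] for i in idx])
        models.append(crf)
    return models


def label_with_ensemble(models, unlabeled_sentences):
    """Keep only sentences where every CRF in the ensemble predicts the same tags."""
    X = featurize(unlabeled_sentences)
    all_preds = [m.predict(X) for m in models]
    machine_labeled = []
    for j, sent in enumerate(unlabeled_sentences):
        candidate = all_preds[0][j]
        if all(pred[j] == candidate for pred in all_preds[1:]):
            machine_labeled.append((sent, candidate))
    return machine_labeled
```

As the abstract states, the resulting machine-labeled Urdu and Sindhi corpus is then used to continue pre-training mBERT and XLM-RoBERTa for token classification before fine-tuning on the human-annotated datasets; the agreement-filtering threshold above is an assumption, and a majority vote could be substituted.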