Leveraging Machine-Labeled Data and Cross-Lingual Transfer for NER in Urdu and Sindhi

Volume 19, Issue 1, 2025


Author(s):

Nazish Basir* University of Sindh, Jamshoro, Pakistan, nazish.basir@usindh.edu.pk

Dil Nawaz Hakro University of Sindh, Jamshoro, Pakistan, dilnawaz@usindh.edu.pk

Khalil-Ur-Rehman Khoumbati University of Sindh, Jamshoro, Pakistan, khalil.khoumbati@usindh.edu.pk

Zeeshan Bhatti University of Sindh, Jamshoro, Pakistan, zeeshan.bhatti@usindh.edu.pk

Abstract Named Entity Recognition (NER) is a fundamental task in Natural Language Processing (NLP) that involves identifying and classifying entities within text. In low-resource languages such as Urdu and Sindhi, researchers face many challenges due to limited annotated datasets and complex linguistic features, including rich morphology, agglutination, and the absence of capitalization cues. In this study, we introduce a distinct approach that combines machine-labeled data generation with advanced multilingual transformer models and cross-lingual transfer learning to improve NER performance in these low-resource languages, particularly Sindhi. To the best of our knowledge, this is the first work to explore cross-lingual NER transfer from Urdu to Sindhi. We also introduce two new entity types, colors and foods, which have not previously been explored in Urdu and Sindhi NER research. To reduce the need for extensive manual annotation, we used a bagging-based ensemble of Conditional Random Field (CRF) models to generate high-confidence machine-labeled datasets. These models were trained on subsets of a smaller dataset annotated by language experts. The machine-labeled data substantially increased the volume of training data, which is essential for low-resource languages. We pre-trained two models, Multilingual BERT (mBERT) and XLM-RoBERTa, on the machine-labeled data and fine-tuned them on the human-annotated datasets. Our experiments demonstrate improvements in NER performance for both languages. In particular, for Sindhi, the XLM-RoBERTa model's F1 score increased from 0.302 (without pre-training) to 0.681 after pre-training on combined machine-labeled Urdu and Sindhi data, an increase of approximately 125%. Our results show the effectiveness of incorporating machine-labeled data and cross-lingual knowledge transfer from Urdu to Sindhi.
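The sketch below illustrates the general idea of the bagging-based CRF ensemble for machine labeling described in the abstract. It is a minimal illustration, not the authors' implementation: the feature template, number of models, sampling fraction, and agreement threshold are assumptions, and sklearn-crfsuite is used only as a convenient CRF library.

```python
# Minimal sketch of bagging-based CRF machine labeling (illustrative assumptions:
# feature template, n_models, sample_frac, and the agreement threshold are not
# the authors' exact settings).
import random
import sklearn_crfsuite


def token_features(sentence, i):
    """Very small per-token feature dict; a real template would add more context."""
    word = sentence[i]
    return {
        "word": word,
        "prefix2": word[:2],
        "suffix2": word[-2:],
        "prev_word": sentence[i - 1] if i > 0 else "<BOS>",
        "next_word": sentence[i + 1] if i < len(sentence) - 1 else "<EOS>",
    }


def featurize(sentences):
    return [[token_features(s, i) for i in range(len(s))] for s in sentences]


def bagged_crf_labels(train_sents, train_tags, unlabeled_sents,
                      n_models=5, sample_frac=0.8, agreement=1.0, seed=0):
    """Train CRFs on bootstrap subsets of the expert-annotated data, then keep
    only unlabeled sentences where the models agree on every tag."""
    rng = random.Random(seed)
    X_unlab = featurize(unlabeled_sents)
    predictions = []
    for _ in range(n_models):
        # Bootstrap sample of the small human-annotated dataset.
        idx = [rng.randrange(len(train_sents))
               for _ in range(int(sample_frac * len(train_sents)))]
        crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1,
                                   max_iterations=100)
        crf.fit(featurize([train_sents[i] for i in idx]),
                [train_tags[i] for i in idx])
        predictions.append(crf.predict(X_unlab))

    machine_labeled = []
    for s_idx, sent in enumerate(unlabeled_sents):
        per_token = list(zip(*[p[s_idx] for p in predictions]))  # tags per token
        consensus, keep = [], True
        for tags in per_token:
            tag, count = max(((t, tags.count(t)) for t in set(tags)),
                             key=lambda x: x[1])
            if count / n_models < agreement:
                keep = False  # models disagree: drop the whole sentence
                break
            consensus.append(tag)
        if keep:  # only high-confidence sentences enter the machine-labeled set
            machine_labeled.append((sent, consensus))
    return machine_labeled
```

The resulting machine-labeled sentences would then be used as additional pre-training data for the multilingual transformer models (mBERT, XLM-RoBERTa) before fine-tuning on the human-annotated sets, as the abstract describes.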
Keywords Named Entity Recognition; NER; Urdu; Sindhi; Machine-Labeled Data; Cross-Lingual; Transfer Learning; mBERT; XLM-RoBERTa
Year 2025
Volume 19
Issue 1
Type Research paper, manuscript, article
Journal Name Journal of Information & Communication Technology
Publisher Name ILMA University
Jel Classification -
DOI -
ISSN no (E, Electronic) 2075-7239
ISSN no (P, Print) 2415-0169
Country Pakistan
City Karachi
Institution Type University
Journal Type Open Access
Manuscript Processing Blind Peer Reviewed
Format PDF
Paper Link https://jict.ilmauniversity.edu.pk/journal/jict/19.1/1.pdf
Page 1-8