Leveraging Machine-Labeled Data and Cross-Lingual Transfer for NER in Urdu and Sindhi

Volume 19, Issue 1, 2025


Author(s):

Nazish Basir* University of Sindh, Jamshoro, Pakistan, nazish.basir@usindh.edu.pk

Dil Nawaz Hakro University of Sindh, Jamshoro, Pakistan, dilnawaz@usindh.edu.pk

Khalil-Ur-Rehman Khoumbati University of Sindh, Jamshoro, Pakistan, khalil.khoumbati@usindh.edu.pk

Zeeshan Bhatti University of Sindh, Jamshoro, Pakistan, zeeshan.bhatti@usindh.edu.pk

Abstract Named Entity Recognition (NER) is a fundamental task in Natural Language Processing (NLP) that involves identifying and classifying entities within text. In low-resource languages such as Urdu and Sindhi, researchers face many challenges due to limited annotated datasets and complex linguistic features, including rich morphology, agglutination, and the absence of capitalization cues. In this study, we introduce a distinct approach that combines machine-labeled data generation with advanced multilingual transformer models and cross-lingual transfer learning to improve NER performance in these low-resource languages, particularly Sindhi. To the best of our knowledge, this is the first work to explore cross-lingual NER transfer from Urdu to Sindhi. We also introduce two new entity types, colors and foods, which have not previously been explored in Urdu and Sindhi NER research. To reduce the need for extensive manual annotation, we used a bagging-based ensemble of Conditional Random Field (CRF) models to generate high-confidence machine-labeled datasets. These models were trained on subsets of a smaller dataset annotated by language experts. The machine-labeled data substantially increased the volume of training data, which is essential for low-resource languages. We pre-trained two models, Multilingual BERT (mBERT) and XLM-RoBERTa, on the machine-labeled data and fine-tuned them on the human-annotated datasets. Our experiments demonstrate improvements in NER performance for both languages. In particular, for Sindhi, the XLM-RoBERTa model's F1 score increased from 0.302 (without pre-training) to 0.681 after pre-training on combined machine-labeled Urdu and Sindhi data, an increase of approximately 125%. Our results show the effectiveness of incorporating machine-labeled data and cross-lingual knowledge transfer from Urdu to Sindhi.
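The sketch below illustrates the general idea of the bagging-based CRF ensemble for machine labeling described in the abstract. It is a minimal illustration, not the authors' implementation: the feature template, number of models, sampling fraction, and agreement threshold are assumptions, and sklearn-crfsuite is used only as a convenient CRF library.

```python
# Minimal sketch of bagging-based CRF machine labeling (illustrative assumptions:
# feature template, n_models, sample_frac, and the agreement threshold are not
# the authors' exact settings).
import random
import sklearn_crfsuite


def token_features(sentence, i):
    """Very small per-token feature dict; a real template would add more context."""
    word = sentence[i]
    return {
        "word": word,
        "prefix2": word[:2],
        "suffix2": word[-2:],
        "prev_word": sentence[i - 1] if i > 0 else "<BOS>",
        "next_word": sentence[i + 1] if i < len(sentence) - 1 else "<EOS>",
    }


def featurize(sentences):
    return [[token_features(s, i) for i in range(len(s))] for s in sentences]


def bagged_crf_labels(train_sents, train_tags, unlabeled_sents,
                      n_models=5, sample_frac=0.8, agreement=1.0, seed=0):
    """Train CRFs on bootstrap subsets of the expert-annotated data, then keep
    only unlabeled sentences where the models agree on every tag."""
    rng = random.Random(seed)
    X_unlab = featurize(unlabeled_sents)
    predictions = []
    for _ in range(n_models):
        # Bootstrap sample of the small human-annotated dataset.
        idx = [rng.randrange(len(train_sents))
               for _ in range(int(sample_frac * len(train_sents)))]
        crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1,
                                   max_iterations=100)
        crf.fit(featurize([train_sents[i] for i in idx]),
                [train_tags[i] for i in idx])
        predictions.append(crf.predict(X_unlab))

    machine_labeled = []
    for s_idx, sent in enumerate(unlabeled_sents):
        per_token = list(zip(*[p[s_idx] for p in predictions]))  # tags per token
        consensus, keep = [], True
        for tags in per_token:
            tag, count = max(((t, tags.count(t)) for t in set(tags)),
                             key=lambda x: x[1])
            if count / n_models < agreement:
                keep = False  # models disagree: drop the whole sentence
                break
            consensus.append(tag)
        if keep:  # only high-confidence sentences enter the machine-labeled set
            machine_labeled.append((sent, consensus))
    return machine_labeled
```

The resulting machine-labeled sentences would then be used as additional pre-training data for the multilingual transformer models (mBERT, XLM-RoBERTa) before fine-tuning on the human-annotated sets, as the abstract describes.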
Keywords Named Entity Recognition; NER; Urdu; Sindhi; Machine-Labeled Data; Cross-Lingual; Transfer Learning; mBERT; XLM-RoBERTa
Year 2025
Volume 19
Issue 1
Type Research paper, manuscript, article
Journal Name Journal of Information & Communication Technology
Publisher Name ILMA University
Jel Classification -
DOI -
ISSN no (E, Electronic) 2075-7239
ISSN no (P, Print) 2415-0169
Country Pakistan
City Karachi
Institution Type University
Journal Type Open Access
Manuscript Processing Blind Peer Reviewed
Format PDF
Paper Link https://jict.ilmauniversity.edu.pk/journal/jict/19.1/1.pdf
Page 1-8