Sentiment Analysis of Transliterated Hindi and Marathi Using Lexicon-Enriched Transformer Models

Authors

  • Rishikesh Janardan Sutar
  • Kamalakar Ravindra Desai

DOI:

https://doi.org/10.64252/atdz1d85

Keywords:

Sentiment Analysis, Transliterated Languages, Lexicon-based Approach, Transformer Models, Low-resource Languages

Abstract

This research introduces a structured approach to sentiment analysis in transliterated Hindi and Marathi, two low-resource Indian languages, combining lexicon-driven data generation with enhanced transformer-based modeling. We began by manually curating sentiment lexicons from two authoritative bilingual dictionaries, the Oxford Hindi-English and SalaamChaus Marathi-English dictionaries, selecting 13,231 Hindi and 9,712 Marathi sentiment-bearing words. Each word was manually annotated with a sentiment weight. To address spelling variability in transliterated text, extensive variant forms were generated (176,755 for Hindi, 159,804 for Marathi). Using these, 53,211 Hindi and 30,659 Marathi synthetic sentences were created, with sentence-level sentiment scores derived by averaging the weights of the included sentiment words.
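The sentence-level scoring described above can be illustrated with a minimal Python sketch. The toy lexicon, its weights, the variant table, and the function name `sentence_score` are all hypothetical; the abstract does not state the weight scale or how variant spellings are resolved, so mapping each variant back to a canonical entry is an assumption made here for illustration.

```python
from statistics import mean

# Hypothetical toy lexicon: canonical transliterated word -> sentiment weight.
# The real lexicons hold 13,231 Hindi and 9,712 Marathi entries; the weight
# scale used here (-1 to 1) is an assumption.
LEXICON = {"accha": 0.8, "bura": -0.7, "sundar": 0.9}

# Hypothetical spelling variants mapped back to their canonical form.
VARIANTS = {"acha": "accha", "achha": "accha", "boora": "bura", "sunder": "sundar"}


def sentence_score(sentence, lexicon=LEXICON, variants=VARIANTS):
    """Average the weights of the sentiment words found in the sentence.

    Returns None when the sentence contains no lexicon word.
    """
    weights = []
    for token in sentence.lower().split():
        token = token.strip(".,!?")               # drop surrounding punctuation
        canonical = variants.get(token, token)    # resolve a spelling variant
        if canonical in lexicon:
            weights.append(lexicon[canonical])
    return mean(weights) if weights else None


if __name__ == "__main__":
    print(sentence_score("film sunder thi par ant boora tha"))
```

In this sketch, a sentence with one positive and one negative word receives the mean of the two weights, and a sentence with no lexicon word receives no score at all rather than a neutral zero.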

We also created a parallel version of these datasets using publicly available Kaggle sentiment word lists for Hindi and Marathi. Sentence sentiment scores were recalculated based on the Kaggle weights, allowing direct performance comparisons between our manually curated lexicons and an external resource. Additionally, we extracted 11,679 transliterated Hindi comments from YouTube and annotated them with sentiment scores using both our dictionary-based resource and the Kaggle word list, producing two real-world evaluation sets.

To evaluate sentiment classification, we fine-tuned four transformer models (MuRIL, XLM-RoBERTa-base, XLM-RoBERTa-large, and IndicBERT) under two experimental setups. In the first, we integrated numerical linguistic features with each transformer model. In the second, we enhanced the models further by incorporating graph-based structural embeddings (via Node2Vec) and applying rank-based feature selection. Results show that our dictionary-based datasets significantly outperformed the Kaggle-derived versions for Hindi, mixed Hindi-Marathi, and YouTube comments. For Marathi-only sentences, both resources performed comparably. Notably, incorporating graph embeddings and feature selection further improved accuracy, particularly for the Marathi and YouTube datasets. This study highlights the impact of handcrafted lexical resources and structural augmentation in advancing sentiment analysis for underrepresented, transliterated languages.
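As a rough illustration of the second setup, the sketch below concatenates a text embedding with numerical features and a graph embedding, then keeps the top-ranked feature columns. The abstract does not specify the ranking criterion, so absolute Pearson correlation with the label is an assumption, as are the function names `fuse`, `rank_features`, and `select_top_k`. In practice the text vector would come from a fine-tuned transformer and the graph vector from Node2Vec; plain lists stand in for both here.

```python
from math import sqrt


def fuse(text_vec, numeric_feats, graph_vec):
    """Concatenate the three feature groups into one flat vector."""
    return list(text_vec) + list(numeric_feats) + list(graph_vec)


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def rank_features(rows, labels):
    """Return column indices ordered from most to least label-correlated.

    The criterion (absolute Pearson correlation) is an assumption.
    """
    n_features = len(rows[0])
    scores = [abs(pearson([r[j] for r in rows], labels)) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)


def select_top_k(rows, labels, k):
    """Keep the k best-ranked columns, preserving their original order."""
    keep = sorted(rank_features(rows, labels)[:k])
    return keep, [[r[j] for j in keep] for r in rows]


if __name__ == "__main__":
    rows = [[1.0, 5.0, 0.2], [2.0, 5.0, 0.9], [3.0, 5.0, 0.1], [4.0, 5.0, 0.8]]
    labels = [0, 0, 1, 1]
    print(select_top_k(rows, labels, k=2))
```

A constant column carries no information about the label, so it ranks last and is the first to be dropped.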

Published

2025-06-02

How to Cite

Sentiment Analysis of Transliterated Hindi and Marathi Using Lexicon-Enriched Transformer Models. (2025). International Journal of Environmental Sciences, 11(7s), 1228-1238. https://doi.org/10.64252/atdz1d85