Automated Generation Of Multiple-Choice Questions In Kazakh Using Transformer Architecture
DOI:
https://doi.org/10.64252/w2mk3f33
Keywords:
Kazakh language processing, question generation, multiple-choice assessment, transformers, educational NLP.
Abstract
This study presents a multi-stage pipeline for the automated generation of multiple-choice questions (MCQs) in Kazakh, a low-resource, agglutinative language with limited NLP tooling and datasets. We cast question generation as a sequence-to-sequence problem and fine-tune a T5 model on a Kazakh adaptation of SQuAD and a geography-themed SQuAD-style set (85/15 train/validation split; 50 epochs). Given a passage and an answer span, the generator produces a candidate question, after which a BERT-based semantic verifier filters out incoherent or tautological question-answer pairs, improving validation accuracy from 44% (pre-trained) to 78% (fine-tuned). To construct MCQs, we integrate a spaCy NER module that samples distractors from entities of the same type as the correct answer, increasing plausibility while preserving linguistic coherence. Automatic evaluation yields BLEU-1/2/3/4 scores of 42.57/25.78/18.46/13.42, a METEOR score of 17.81, and a ROUGE-L score of 41.09, indicating good lexical coverage, the expected degradation at higher-order n-grams, and adequate retention of key content. A qualitative comparison with GPT-4o suggests that our system generally produces contextually relevant questions, with some tendency toward broader questions when the context is ambiguous. The contribution is an end-to-end, replicable framework that combines transformer-based generation, semantic verification, and entity-aware distractor synthesis; it is tailored to Kazakh and extensible to other low-resource educational settings.
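The entity-aware distractor step described in the abstract can be illustrated with a minimal sketch. It assumes the named entities have already been extracted as (text, label) pairs, for example by an NER module; the function names `sample_distractors` and `build_mcq`, the `GPE`/`PERSON` labels, and the example question are hypothetical and are not taken from the paper's implementation.

```python
import random
from typing import Dict, List, Optional, Tuple

# Assumed input format: (surface text, entity label) pairs from an NER step.
Entity = Tuple[str, str]


def sample_distractors(answer: str, answer_label: str, entities: List[Entity],
                       k: int = 3, seed: Optional[int] = None) -> List[str]:
    """Pick up to k distinct entities that share the answer's entity type."""
    rng = random.Random(seed)
    seen = {answer.casefold()}  # never offer the correct answer as a distractor
    pool: List[str] = []
    for text, label in entities:
        key = text.casefold()
        if label == answer_label and key not in seen:
            seen.add(key)  # drop case-insensitive duplicates
            pool.append(text)
    return rng.sample(pool, min(k, len(pool)))


def build_mcq(question: str, answer: str, distractors: List[str],
              seed: Optional[int] = None) -> Dict[str, object]:
    """Shuffle the answer in with the distractors and record its position."""
    rng = random.Random(seed)
    options = [answer] + list(distractors)
    rng.shuffle(options)
    return {"question": question, "options": options,
            "answer_index": options.index(answer)}


if __name__ == "__main__":
    # Illustrative entities only; not drawn from the paper's dataset.
    entities = [("Астана", "GPE"), ("Алматы", "GPE"), ("Шымкент", "GPE"),
                ("Қарағанды", "GPE"), ("астана", "GPE"),
                ("Абай Құнанбайұлы", "PERSON")]
    distractors = sample_distractors("Астана", "GPE", entities, k=3, seed=0)
    print(build_mcq("Қазақстанның астанасы қай қала?", "Астана",
                    distractors, seed=0))
```

Restricting the pool to the answer's entity type is what keeps the options plausible: a person's name would be trivially eliminated for a question whose answer is a city. When fewer than `k` same-type entities are available, the sketch returns a shorter list rather than padding with entities of another type.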