Audio-Driven Prediction of Child Language Proficiency: A Hybrid Transformer-LightGBM Ensemble Framework for Automated Evaluation
DOI:
https://doi.org/10.64252/w4y45395
Keywords:
Child ESL, Speech-to-text, Transformer, Wav2Vec, Whisper Model
Abstract
This paper presents a novel approach to the automated assessment of child English as a Second Language (ESL) proficiency using transcript-based analysis of speech recordings. Starting from 5,000 raw audio files, the proposed pipeline first extracts transcript-based linguistic features using Whisper, along with prosodic and acoustic characteristics using Wav2Vec. These features are combined with transformer embeddings and scored by a hybrid Transformer-LightGBM model. Experimental results show strong performance, with an accuracy of 0.972 and a Pearson correlation of 0.98, outperforming baseline machine learning approaches. A comparative analysis with state-of-the-art methods, including ASR-driven GPT classifiers, highlights the advantages of the proposed pipeline: it runs offline and is cost-efficient while maintaining high predictive fidelity. The system further supports real-time user feedback by analyzing key linguistic and syntactic indicators, enabling practical applications in educational and language learning environments. Overall, this study demonstrates the effectiveness of combining speech-driven embeddings with ensemble machine learning for precise, scalable child ESL proficiency assessment.
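The two reported metrics, accuracy and Pearson correlation, can be illustrated with a minimal sketch. The proficiency labels below are invented for illustration and are not taken from the study; the 1-5 scale, the variable names, and the helper functions `accuracy` and `pearson` are assumptions, since the abstract does not specify the label scheme or the evaluation code.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the reference labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical proficiency levels on an assumed 1-5 scale (illustrative only).
reference = [1, 2, 3, 4, 5, 3, 2, 4]
predicted = [1, 2, 3, 4, 5, 3, 3, 4]

acc = accuracy(reference, predicted)
r = pearson(reference, predicted)
print(f"accuracy={acc:.3f}, pearson={r:.3f}")
```

Accuracy counts only exact label matches, whereas the Pearson correlation also rewards predictions that are close in rank to the reference, which is why the two figures can differ for the same set of predictions.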