Sign Language Translator Using Transformer Model
DOI:
https://doi.org/10.64252/9j23fm84
Keywords:
Sign Language Recognition (SLR), Deep Learning, Convolutional Neural Networks (CNNs), Long Short-Term Memory (LSTM) networks, Spatiotemporal dependencies, Transformer-based framework, Self-attention mechanism, Raw image frames, MediaPipe-extracted skeletal keypoints, Self-supervised learning
Abstract
Sign Language Recognition (SLR) is important for building communication bridges between deaf and hearing populations. Traditional CNN and LSTM models struggle to capture spatiotemporal complexity, particularly in continuous sign language, so we introduce a Transformer-based dual-stream approach that uses self-attention to extract spatial and temporal relationships. Our method processes raw video frames and MediaPipe-extracted skeletal keypoints, using self-supervised masked feature prediction and contrastive learning to improve generalization in low-resource settings. Motivated by SLGTformer, we incorporate hierarchical attention layers to capture fine-grained gesture subtleties. Tested on the ISL-CSLTR dataset, our model outperforms CNN-LSTM and state-of-the-art SLR baselines on both isolated and continuous gesture recognition, and it generalizes well to out-of-distribution signs with small amounts of labelled data. This research advances real-time, accessible AI for scalable and practical sign language translation.
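As a rough point of reference for the dual-stream design summarized above, the sketch below illustrates only the keypoint stream: MediaPipe Holistic landmarks extracted per frame and fed through a PyTorch Transformer encoder with self-attention. The layer sizes, the 100-class output head, and the file name `example_sign.mp4` are illustrative assumptions rather than the paper's published configuration; the raw-frame stream, its fusion with the keypoint stream, and the self-supervised pretraining objectives are not shown.

```python
# Minimal sketch of the keypoint stream, assuming MediaPipe Holistic and a
# PyTorch TransformerEncoder. Dimensions and class count are illustrative.
import cv2
import mediapipe as mp
import torch
import torch.nn as nn

mp_holistic = mp.solutions.holistic


def extract_keypoints(video_path: str) -> torch.Tensor:
    """Return a (num_frames, 225) tensor of pose + both-hand keypoints."""
    frames = []
    cap = cv2.VideoCapture(video_path)
    with mp_holistic.Holistic(static_image_mode=False) as holistic:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            results = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            feats = []
            # 33 pose + 21 left-hand + 21 right-hand landmarks, (x, y, z) each.
            for lm_set, count in (
                (results.pose_landmarks, 33),
                (results.left_hand_landmarks, 21),
                (results.right_hand_landmarks, 21),
            ):
                if lm_set is not None:
                    feats.extend(c for lm in lm_set.landmark for c in (lm.x, lm.y, lm.z))
                else:
                    feats.extend([0.0] * count * 3)  # pad missing detections
            frames.append(feats)
    cap.release()
    return torch.tensor(frames, dtype=torch.float32)


class KeypointTransformer(nn.Module):
    """Self-attention encoder over the skeletal-keypoint sequence."""

    def __init__(self, in_dim: int = 225, d_model: int = 256,
                 nhead: int = 8, num_layers: int = 4, num_classes: int = 100):
        super().__init__()
        self.embed = nn.Linear(in_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, in_dim) -> gloss logits via mean-pooled frame embeddings.
        h = self.encoder(self.embed(x))
        return self.head(h.mean(dim=1))


if __name__ == "__main__":
    keypoints = extract_keypoints("example_sign.mp4")   # hypothetical clip
    model = KeypointTransformer(num_classes=100)        # assumed gloss vocabulary size
    logits = model(keypoints.unsqueeze(0))              # add batch dimension
    print(logits.shape)                                 # torch.Size([1, 100])
```

In a full dual-stream setup, a parallel encoder over the raw frames would be fused with this keypoint encoder before the classification head; the sketch keeps only the keypoint half to stay compact.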