Self-Supervised Learning For Global Health Insights Using Multilingual Large Language Models
DOI: https://doi.org/10.64252/bt5y4v73

Keywords: Self-Supervised Learning, Multilingual Models, Global Health Data, Clinical NLP, Unsupervised Clustering

Abstract
The global distribution of healthcare records across diverse languages presents a substantial barrier to effective real-time public health surveillance. This paper presents a methodology that uses self-supervised learning on multimodal health data to train multilingual large language models at a global scale. It outlines a proposed unsupervised clustering pipeline for processing multilingual symptom narratives in a clinical dataset. By processing patient symptoms and diagnoses with text cleaning, TF-IDF vectorization, and KMeans clustering, the method identifies latent structures in the data even in the absence of manual annotations. This forms the basis for real-time, multilingual extraction of insights from global health records, which could be transformational in how public health institutions manage, monitor, and respond to health crises.
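The pipeline sketched below illustrates the clustering step described in the abstract: text cleaning, TF-IDF vectorization, and KMeans clustering of symptom and diagnosis text. It is a minimal sketch, not the paper's implementation; the file name, column names, n-gram settings, and number of clusters are assumptions introduced for illustration.

```python
# Illustrative sketch of the unsupervised clustering step: clean the text,
# vectorize with TF-IDF, and cluster with KMeans. All data-specific details
# (file name, column names, cluster count) are assumptions, not from the paper.
import re
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def clean_text(text: str) -> str:
    """Lowercase, strip punctuation and digits, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)
    text = re.sub(r"\d+", " ", text)
    return re.sub(r"\s+", " ", text).strip()

# Hypothetical clinical dataset with free-text symptom and diagnosis fields.
df = pd.read_csv("clinical_records.csv")  # assumed file name
df["combined"] = (
    df["symptoms"].fillna("") + " " + df["diagnosis"].fillna("")
).map(clean_text)

# TF-IDF vectorization; character n-grams are one way to handle multilingual
# text without language-specific tokenizers (a choice assumed here).
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5), max_features=20000)
X = vectorizer.fit_transform(df["combined"])

# KMeans clustering to surface latent groupings without manual annotations.
kmeans = KMeans(n_clusters=8, n_init=10, random_state=42)
df["cluster"] = kmeans.fit_predict(X)

# Inspect cluster sizes as a quick sanity check.
print(df.groupby("cluster").size())
```

The resulting cluster assignments can then be inspected per language or per region to surface recurring symptom patterns, which is the kind of annotation-free insight extraction the abstract describes.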