Identification of Regional Languages in Indonesia Using Multinomial Naïve Bayes
Abstract
Language identification has been studied extensively, but few results are available for the regional languages of Indonesia. This research therefore addresses the identification of seven languages: Indonesian, Javanese, Sundanese, Minang, Muna, Bugis and Madurese. The languages are identified with the Multinomial Naïve Bayes method, which calculates the probability of each word pattern (a single word or a sequence of words) appearing in a labeled sentence. The resulting probability model is then used to determine the language class of new sentences. The performance of the method is measured in two test scenarios. The first examines the effect of the n-gram pattern on the F-measure, and the second examines the effect of the amount of training data on the F-measure. The test results show that the unigram and bigram patterns provide the highest accuracy, 98.86%. With 1500 training sentences per language, the method reaches an accuracy of 98%.
Keywords: language identification, local languages, multinomial naïve bayes
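The approach summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy sentences, the labels "id" and "jv", the add-one (Laplace) smoothing, the choice to skip n-grams unseen in training, and all function and class names are assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict


def ngrams(sentence, n_values=(1, 2)):
    """Return the word n-grams of a sentence (unigrams and bigrams by default)."""
    tokens = sentence.lower().split()
    feats = []
    for n in n_values:
        feats.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return feats


class MultinomialNB:
    """Multinomial Naive Bayes over word n-gram counts (illustrative sketch)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing constant; add-one smoothing is an assumption

    def fit(self, sentences, labels):
        self.counts = defaultdict(Counter)   # n-gram counts per language
        self.class_docs = Counter(labels)    # number of sentences per language
        for sentence, label in zip(sentences, labels):
            self.counts[label].update(ngrams(sentence))
        self.vocab = {f for c in self.counts.values() for f in c}
        self.totals = {y: sum(c.values()) for y, c in self.counts.items()}
        return self

    def log_scores(self, sentence):
        """Log prior plus summed log likelihoods of the known n-grams, per language."""
        n_docs = sum(self.class_docs.values())
        scores = {}
        for y in self.class_docs:
            score = math.log(self.class_docs[y] / n_docs)
            denom = self.totals[y] + self.alpha * len(self.vocab)
            for f in ngrams(sentence):
                if f in self.vocab:  # n-grams never seen in training are skipped
                    score += math.log((self.counts[y][f] + self.alpha) / denom)
            scores[y] = score
        return scores

    def predict(self, sentence):
        scores = self.log_scores(sentence)
        return max(scores, key=scores.get)


def f_measure(gold, predicted, label):
    """F-measure (harmonic mean of precision and recall) for one language."""
    tp = sum(g == label and p == label for g, p in zip(gold, predicted))
    fp = sum(g != label and p == label for g, p in zip(gold, predicted))
    fn = sum(g == label and p != label for g, p in zip(gold, predicted))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy training data invented for this sketch (not taken from the paper's corpus).
train_sentences = [
    "saya makan nasi", "saya pergi ke pasar",        # Indonesian
    "aku mangan sega", "aku lunga menyang pasar",    # Javanese
]
train_labels = ["id", "id", "jv", "jv"]

model = MultinomialNB().fit(train_sentences, train_labels)
test_sentences = ["saya makan di pasar", "aku mangan sega"]
gold = ["id", "jv"]
predicted = [model.predict(s) for s in test_sentences]
print(predicted, f_measure(gold, predicted, "id"))
```

Scores are accumulated in log space to avoid numerical underflow on long sentences. Restricting `n_values` to `(1,)` or `(2,)` would reproduce, in spirit, a comparison between n-gram patterns such as the one described in the first test scenario.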



