Acta Informatica Pragensia 2023, 12(2), 243-259 | DOI: 10.18267/j.aip.210

Multi-Class Text Classification on Khmer News Using Ensemble Method in Machine Learning Algorithms

Raksmey Phann, Chitsutha Soomlek, Pusadee Seresangtakul
Department of Computer Science, College of Computing, Khon Kaen University, Khon Kaen, Kingdom of Thailand

This research applies text classification to categorize Khmer news articles. News articles were collected from three websites through web scraping and grouped into nine categories. After text preprocessing, the dataset was split into training and testing sets. We then evaluated the performance of the ensemble learning method across machine learning classifiers using k-fold validation. The classifiers employed were logistic regression, Complement Naive Bayes, Bernoulli Naive Bayes, k-nearest neighbours, perceptron, support vector machines, stochastic gradient descent, AdaBoost, decision tree, and random forest. To improve accuracy in categorizing Khmer news articles, Grid Search CV was used to find the optimal hyperparameters for each classifier with TF-IDF and Delta TF-IDF feature extraction. The highest accuracy was achieved by the ensemble learning method with the support vector machine using the optimal hyperparameters (C = 10, kernel = rbf): 83.47% with TF-IDF features and 83.40% with Delta TF-IDF features. These results indicate that Khmer news articles can be categorized accurately.
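
As an illustration of the TF-IDF feature extraction named above, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes one common variant of the weighting (term frequency as count divided by document length, inverse document frequency as the natural logarithm of N / df); the abstract does not state which variant was used. The function name `tf_idf` and the toy documents are hypothetical, and the input is assumed to be already segmented into words, a step that Khmer text needs because the script does not separate words with spaces.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = ln(N / df).
    """
    n_docs = len(docs)

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical, already-segmented toy documents (not from the paper's dataset).
docs = [
    ["economy", "market", "bank"],
    ["sport", "football", "match"],
    ["economy", "sport", "budget"],
]
weights = tf_idf(docs)
```

In this toy corpus, "market" occurs in a single document and therefore receives a higher weight than "economy", which occurs in two; a term present in every document would receive a weight of zero under this variant.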

Keywords: Text classification; Khmer news; Machine learning; Feature extraction; Optimal hyperparameters; News categorization; Ensemble learning method.

Received: October 28, 2022; Revised: February 20, 2023; Accepted: February 26, 2023; Prepublished online: March 10, 2023; Published: October 10, 2023

Phann, R., Soomlek, C., & Seresangtakul, P. (2023). Multi-Class Text Classification on Khmer News Using Ensemble Method in Machine Learning Algorithms. Acta Informatica Pragensia, 12(2), 243-259. doi: 10.18267/j.aip.210

This is an open access article distributed under the terms of the Creative Commons Attribution 4.0 International License (CC BY 4.0), which permits use, distribution, and reproduction in any medium, provided the original publication is properly cited. No use, distribution or reproduction is permitted which does not comply with these terms.