Acta Informatica Pragensia 2021, 10(2), 155-171 | DOI: 10.18267/j.aip.1555822
Sentiment Analysis for Thai Language in Hotel Domain Using Machine Learning Algorithms
- Natural Language and Speech Processing Laboratory, Department of Computer Science, Faculty of Science, Khon Kaen University, 123 Mittraparb Road, Nai-Meuang, Meuang, Khon Kaen 40002, Kingdom of Thailand
Sentiment analysis is one of the most frequently used applications of Natural Language Processing (NLP); it classifies the polarity of opinions expressed at the aspect, sentence or document level. Many businesses and organizations use this technique to improve production, as well as employee and service efficiency. However, the users’ reviews in our study were unstructured and contained spelling errors, which makes classification difficult for both human readers and machines. To address this problem, supervised Machine Learning (ML) algorithms can be applied to the extracted data to classify polarity as positive, negative or neutral. In this research, we compared nine ML algorithms to determine the most suitable one for sentiment polarity classification of customer reviews in Thai, which is a low-resource language. The dataset was collected manually from two online agencies (Agoda.com and Booking.com) and comprises reviews written in Thai. We employed 11 preprocessing steps to clean the data and handle the large amount of noise. Next, the Delta TF-IDF, TF-IDF, N-Gram and Word2Vec techniques were applied to convert the text reviews into vectors, which were then processed with the different ML algorithms to determine sentiment polarity and to allow an accurate comparison. All ML algorithms were evaluated with ten-fold cross-validation, comparing recall, precision, F1-score and accuracy. The experimental results show that the Support Vector Machine (SVM) using the Delta TF-IDF technique was the best ML algorithm for polarity classification of hotel reviews in the Thai language, with the highest accuracy of 89.96%. The results of this research can serve as a tool for small and medium-sized enterprises performing sentiment analysis of the Thai language in the hotel domain.
Keywords: Feature extraction; Machine learning algorithms; Natural language processing; Sentiment analysis.
Received: April 15, 2021; Revised: July 28, 2021; Accepted: July 29, 2021; Prepublished online: July 31, 2021; Published: September 10, 2021
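The abstract names Delta TF-IDF as the best-performing feature weighting. As a rough illustration of the general idea only, the sketch below weights each term of a review by its count multiplied by the difference between its inverse document frequency in the positive and in the negative training reviews. The function name `delta_tfidf`, the add-one smoothing, the sign convention and the English toy tokens are assumptions made for this example; the abstract does not specify the exact variant the authors used, and real Thai reviews would first need word segmentation.

```python
import math
from collections import Counter


def delta_tfidf(tokens, pos_docs, neg_docs):
    """Return a smoothed Delta TF-IDF weight for each term of one review.

    tokens   : list of tokens of the review to score
    pos_docs : list of token lists (positive training reviews)
    neg_docs : list of token lists (negative training reviews)
    """
    # Document frequency of every term, counted separately per class.
    pos_df = Counter(t for doc in pos_docs for t in set(doc))
    neg_df = Counter(t for doc in neg_docs for t in set(doc))
    n_pos, n_neg = len(pos_docs), len(neg_docs)

    weights = {}
    for term, count in Counter(tokens).items():
        # Add-one smoothing (an assumption of this sketch) avoids
        # division by zero for terms unseen in one of the classes.
        idf_pos = math.log2((n_pos + 1) / (pos_df[term] + 1))
        idf_neg = math.log2((n_neg + 1) / (neg_df[term] + 1))
        # With this sign convention, terms that occur mostly in negative
        # reviews get positive weights, terms that occur mostly in
        # positive reviews get negative weights, and evenly distributed
        # terms score 0 when the two classes are the same size.
        weights[term] = count * (idf_pos - idf_neg)
    return weights


# Toy, already-tokenized reviews (hypothetical data).
positive = [["clean", "room"], ["clean", "staff"]]
negative = [["dirty", "room"], ["noisy", "room"]]
scores = delta_tfidf(["clean", "room", "dirty"], positive, negative)
```

In this toy run, `clean` receives a negative weight and `dirty` a positive one, so the sign of a weight carries class information that plain TF-IDF does not. The resulting vectors could then be passed to a classifier such as an SVM and evaluated with ten-fold cross-validation, as the abstract describes.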
This is an open access article distributed under the terms of the Creative Commons Attribution 4.0 International License (CC BY 4.0), which permits use, distribution, and reproduction in any medium, provided the original publication is properly cited. No use, distribution or reproduction is permitted which does not comply with these terms.