Acta Informatica Pragensia X:X | DOI: 10.18267/j.aip.309104
Effect of Dimension Size and Window Size on Word Embedding in Classification Tasks
- 1 Faculty of Natural Sciences and Informatics, Constantine the Philosopher University in Nitra, Nitra, Slovakia
- 2 Institute of Security and Computer Science, University of the National Education Commission, Krakow, Poland
Background: Static word embedding models such as Word2Vec and GloVe remain widely used in natural language processing, yet key hyperparameters are often selected heuristically rather than through systematic validation.
Objective: This study provides an extrinsic evaluation of context window size and embedding dimensionality for Word2Vec (CBOW and Skip-gram) and GloVe embeddings in a downstream spam classification task.
Methods: Embeddings were trained on a large external corpus and evaluated using a neural network and several classical machine learning classifiers.
Results: The results show that context window size has a moderate influence on performance, whereas embedding dimensionality has a clearer effect: values below approximately 50 degrade performance, while increases beyond moderate ranges (approximately 100–150) yield diminishing returns. Across all experiments, Word2Vec achieves higher stability and performance than GloVe.
Conclusion: Overall, the findings suggest that robust classification performance can be achieved with moderate embedding dimensionalities and smaller context windows, providing practical guidance for efficient embedding configuration.
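The pipeline the abstract describes — train static word embeddings, pool them into document vectors, feed those to a classifier — can be sketched in miniature. This is an illustrative toy, not the authors' code: the 4-dimensional vectors and the vocabulary below are made up (real Word2Vec/GloVe vectors would have the ~50–300 dimensions studied here, trained with a chosen context window), and a simple nearest-centroid rule stands in for the neural network and classical classifiers used in the paper.

```python
# Toy sketch of extrinsic embedding evaluation for spam classification.
# The word vectors are hypothetical stand-ins for trained Word2Vec/GloVe
# embeddings; DIM plays the role of the embedding dimensionality.

DIM = 4  # embedding dimensionality (the paper varies this, e.g. 50-300)

# Hypothetical "pretrained" vectors; a real pipeline would load these from
# a model trained on a large corpus with a given context window size.
word_vectors = {
    "free":  [0.9, 0.8, 0.1, 0.0],
    "win":   [0.8, 0.9, 0.0, 0.1],
    "prize": [0.9, 0.7, 0.2, 0.0],
    "meet":  [0.1, 0.0, 0.9, 0.8],
    "lunch": [0.0, 0.1, 0.8, 0.9],
    "today": [0.2, 0.1, 0.7, 0.8],
}

def doc_embedding(tokens, vectors, dim=DIM):
    """Average the vectors of in-vocabulary tokens (zeros if none)."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(col) / len(known) for col in zip(*known)]

def centroid(embeddings):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(embeddings) for col in zip(*embeddings)]

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# Tiny labelled training set: spam vs. ham messages as token lists.
train = [
    (["free", "prize", "win"], "spam"),
    (["win", "free"], "spam"),
    (["meet", "lunch", "today"], "ham"),
    (["lunch", "today"], "ham"),
]

# One centroid per class in embedding space.
centroids = {}
for label in {"spam", "ham"}:
    docs = [doc_embedding(t, word_vectors) for t, lab in train if lab == label]
    centroids[label] = centroid(docs)

def classify(tokens):
    """Assign the class whose centroid is nearest to the document vector."""
    emb = doc_embedding(tokens, word_vectors)
    return min(centroids, key=lambda lbl: euclidean(emb, centroids[lbl]))

print(classify(["free", "win", "today"]))  # prints: spam
print(classify(["meet", "today"]))         # prints: ham
```

Varying `DIM` (and, upstream, the context window used when training the vectors) while holding the classifier fixed is exactly the kind of extrinsic evaluation the study performs at scale.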
Keywords: Word embeddings; Word2Vec; GloVe; Vector dimension; Context window size.
Received: July 17, 2025; Revised: February 18, 2026; Accepted: February 23, 2026; Prepublished online: March 14, 2026
References
- Abayomi-Alli, O., Misra, S., Abayomi-Alli, A., & Odusami, M. (2019). A review of soft techniques for SMS spam classification: Methods, approaches and applications. Engineering Applications of Artificial Intelligence, 86, 197-212. https://doi.org/10.1016/j.engappai.2019.08.024
- Abubakar, H. D., Umar, M., & Bakale, M. A. (2022). Sentiment Classification: Review of Text Vectorization Methods: Bag of Words, Tf-Idf, Word2vec and Doc2vec. SLU Journal of Science and Technology, 4(1-2), 27-33. https://doi.org/10.56471/slujst.v4i.266
- Adewumi, T. P., Liwicki, F., & Liwicki, M. (2020). Word2Vec: Optimal Hyper-Parameters and Their Impact on NLP Downstream Tasks. arXiv:2003.11645. https://doi.org/10.48550/arXiv.2003.11645
- Almeida, T. A., Hidalgo, J. M. G., & Yamakami, A. (2011). Contributions to the study of SMS spam filtering. In Proceedings of the 11th ACM Symposium on Document Engineering, (pp. 259-262). ACM. https://doi.org/10.1145/2034691.2034742
- Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., … Amodei, D. (2020). Language Models are Few-Shot Learners. In 34th Conference on Neural Information Processing Systems (NeurIPS 2020). NeurIPS.
- Chugh, M., Whigham, P. A., & Dick, G. (2018). Stability of Word Embeddings Using Word2Vec. In AI 2018: Advances in Artificial Intelligence (pp. 812-818). Springer. https://doi.org/10.1007/978-3-030-03991-2_73
- Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, (pp. 4171-4186). ACL. https://doi.org/10.18653/v1/N19-1423
- Dharma, E. M., Lumban Gaol, F., Leslie, H., Warnars, H. S., & Soewito, B. (2022). The Accuracy Comparison Among word2vec, Glove, and Fasttext Towards Convolution Neural Network (CNN) Text Classification. Journal of Theoretical and Applied Information Technology, 100(2), 349-359.
- Dutta, S., Das, A. K., Ghosh, S., & Samanta, D. (2023). Attribute selection to improve spam classification. In Data Analytics for Social Microblogging Platforms (pp. 95-127). Elsevier. https://doi.org/10.1016/B978-0-32-391785-8.00016-0
- Khurana, D., Koli, A., Khatter, K., & Singh, S. (2023). Natural language processing: state of the art, current trends and challenges. Multimedia Tools and Applications, 82(3), 3713-3744. https://doi.org/10.1007/s11042-022-13428-4
- Klimt, B., & Yang, Y. (2004). The Enron Corpus: A New Dataset for Email Classification Research. In Machine Learning: ECML 2004 (pp. 217-226). Springer. https://doi.org/10.1007/978-3-540-30115-8_22
- Levy, O., & Goldberg, Y. (2014). Dependency-Based Word Embeddings. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, (pp. 302-308). ACL. https://doi.org/10.3115/v1/P14-2050
- Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. In 1st International Conference on Learning Representations, ICLR 2013. ICLR.
- Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality. In NIPS'13: Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, (pp. 3111-3119). NIPS.
- Nazir, S., Asif, M., Sahi, S. A., Ahmad, S., Ghadi, Y. Y., & Aziz, M. H. (2022). Toward the Development of Large-Scale Word Embedding for Low-Resourced Language. IEEE Access, 10, 54091-54097. https://doi.org/10.1109/ACCESS.2022.3173259
- Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, E. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825-2830.
- Pennington, J., Socher, R., & Manning, C. (2014). Glove: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, (pp. 1532-1543). ACL. https://doi.org/10.3115/v1/D14-1162
- Řehůřek, R., & Sojka, P. (2010). Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, (pp. 45-50). University of Malta.
- Waja, G., Patil, G., Mehta, C., & Patil, S. (2023). How AI Can be Used for Governance of Messaging Services: A Study on Spam Classification Leveraging Multi-Channel Convolutional Neural Network. International Journal of Information Management Data Insights, 3(1), 100147. https://doi.org/10.1016/j.jjimei.2022.100147
- Wyatt, J. M., Booth, G. J., & Goldman, A. H. (2021). Natural Language Processing and Its Use in Orthopaedic Research. Current Reviews in Musculoskeletal Medicine, 14(6), 392-396. https://doi.org/10.1007/s12178-021-09734-3
- Yang, X., Macdonald, C., & Ounis, I. (2018). Using word embeddings in Twitter election classification. Information Retrieval Journal, 21(2-3), 183-207. https://doi.org/10.1007/s10791-017-9319-5
This is an open access article distributed under the terms of the Creative Commons Attribution 4.0 International License (CC BY 4.0), which permits use, distribution, and reproduction in any medium, provided the original publication is properly cited. No use, distribution or reproduction is permitted which does not comply with these terms.
