In-Memory Versus Disk-Based Computing with Random Forest for Stock Analysis: A Comparative Study

doi:10.18267/j.aip.275

Acta Informatica Pragensia 2025, 14(3), 460-473 | DOI: 10.18267/j.aip.275

Chitra Joshi, Chitrakant Banchorr, Omkaresh Kulkarni, Kirti Wanjale
Vishwakarma Institute of Information Technology, Pune, India

Background: The advancement of big data analytics calls for careful selection of processing frameworks to optimize machine learning effectiveness. Choosing the appropriate framework can significantly influence the speed and accuracy of data analysis, ultimately leading to more informed decision making. In adapting to this changing landscape, businesses should focus on factors such as how well a system scales, how easily it can be used and how effectively it integrates with their existing tools. The effectiveness of these frameworks plays a crucial role in determining data processing speed, model training efficiency and predictive accuracy. As data become increasingly large, diverse and fast-moving, conventional processing systems often fall short of the performance required for modern analytics.

Objective: This research assesses the performance of two prominent big data processing frameworks, Apache Spark (in-memory computing) and MapReduce (disk-based computing), when random forest algorithms are applied to predict stock prices. The primary objective is to compare their effectiveness in handling large-scale financial datasets in terms of predictive accuracy, processing speed and scalability.

Methods: The investigation applied MapReduce and Apache Spark independently to analyse a substantial stock price dataset and to train a random forest regressor. Mean squared error (MSE) and root mean squared error (RMSE) served as the primary performance indicators of the models, while mean absolute error (MAE) and the R-squared value were used to evaluate the goodness of fit of the models.
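The four evaluation metrics named in the Methods can be illustrated with a minimal, framework-independent sketch. It assumes the standard textbook definitions of MSE, RMSE, MAE and R-squared; the function name `regression_metrics` and the sample prices are hypothetical and are not taken from the study's code or data.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE and R-squared for paired actual/predicted values.

    Hypothetical helper for illustration; standard definitions are assumed.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    # Mean squared error and its square root
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)

    # Mean absolute error
    mae = sum(abs(e) for e in errors) / n

    # R-squared: 1 - (residual sum of squares / total sum of squares)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")

    return {"mse": mse, "rmse": rmse, "mae": mae, "r2": r2}


# Made-up closing prices and predictions, for illustration only
actual = [100.0, 102.0, 104.0, 106.0]
predicted = [101.0, 101.0, 105.0, 105.0]
print(regression_metrics(actual, predicted))
```

Lower MSE, RMSE and MAE indicate smaller prediction errors, while an R-squared value closer to 1 indicates a better fit. The same quantities could be computed for the outputs of either framework to compare them on equal terms.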

Results: The Spark-based implementation obtained lower RMSE, MAE and MSE values than the MapReduce-based implementation, indicating higher prediction accuracy. Spark also substantially reduced the time needed to train and run the models, owing to its optimized in-memory processing. In contrast, the MapReduce approach showed higher latency and lower accuracy, reflecting its disk-based constraints and reduced efficiency for iterative machine learning tasks.

Conclusion: The findings indicate that Spark is the better option for complex machine learning tasks such as stock price prediction, as it handles large amounts of data well. MapReduce remains a reliable framework, but it is neither fast enough nor lightweight enough for rapid, iterative analytics. The outcomes of this study can help data scientists and financial analysts to choose the most appropriate framework for big data machine learning applications.

Keywords: Apache Spark; MapReduce; Big data; Random forest; Performance comparison; Data processing; In-memory processing; Disk-based processing.

Received: February 16, 2025; Revised: June 6, 2025; Accepted: June 8, 2025; Prepublished online: August 6, 2025; Published: August 19, 2025

Joshi, C., Banchorr, C., Kulkarni, O., & Wanjale, K. (2025). In-Memory Versus Disk-Based Computing with Random Forest for Stock Analysis: A Comparative Study. Acta Informatica Pragensia, 14(3), 460-473. doi: 10.18267/j.aip.275

This is an open access article distributed under the terms of the Creative Commons Attribution 4.0 International License (CC BY 4.0), which permits use, distribution, and reproduction in any medium, provided the original publication is properly cited. No use, distribution or reproduction is permitted which does not comply with these terms.