Acta Informatica Pragensia 2023, 12(2), 275-295 | DOI: 10.18267/j.aip.215

Use of Data Mining for Analysis of Czech Real Estate Market

Ilya Tsakunov, David Chudán
Faculty of Informatics and Statistics, Prague University of Economics and Business, Prague, Czech Republic

This paper analyses Czech real estate market data scraped from the bezrealitky.cz portal, covering both sales and rental listings. A total of 3546 records with 54 attributes were obtained. Exploratory data analysis provided a basic overview of the data and identified basic characteristics such as the average price of sold and rented flats. More specific results were obtained by applying data mining methods: regression (linear, lasso and ridge regression) for predicting flat prices and payments for utilities, and classification (support vector machines, KNN, Gaussian naïve Bayes, decision tree and random forest) for estimating the PENB class (building energy performance certificate) and building condition. Lasso regression was the most successful, with R² = 0.76 in predicting the rent price. Among the classification tasks, the best result was achieved with random forest, which reached an accuracy of over 80% in some cases. Other tasks included clustering (k-means and k-modes) and anomaly detection (isolation forest). The main focus was on descriptive data mining, especially clustering. Clustering flats by geographic coordinates with the k-means algorithm (silhouette score of 0.78) showed that the most expensive flats are on average in the Bohemian regions, followed by Silesia, while the cheapest are in central Moravia. Another clustering application (silhouette score of 0.56) identified flats in the Moravian-Silesian region with very high payments for utilities. The models can help estimate the value of flats based on their attributes as well as their location.
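
The k-means clustering by geographic coordinates and the silhouette score mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' code. It assumes synthetic coordinates scattered around three Czech cities in place of the scraped data, a plain-Python k-means with a deterministic farthest-point initialization, and Euclidean distance on raw latitude and longitude; all names in it are hypothetical.

```python
import math
import random


def kmeans(points, k, iters=50):
    """Lloyd's algorithm with farthest-point initialization (an assumption
    made here for determinism; the paper does not state its initialization)."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next seed: the point farthest from all centroids chosen so far.
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: index of the nearest centroid for each point.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (needs at least 2 clusters)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Synthetic (latitude, longitude) pairs near Prague, Brno and Ostrava.
# These are illustrative values, not records from the paper's dataset.
rng = random.Random(42)
city_centers = [(50.08, 14.43), (49.19, 16.61), (49.83, 18.28)]
points = [
    (rng.gauss(lat, 0.1), rng.gauss(lon, 0.1))
    for lat, lon in city_centers
    for _ in range(30)
]

labels, centroids = kmeans(points, k=3)
score = silhouette_score(points, labels)
print(f"silhouette score: {score:.2f}")
```

A silhouette score close to 1 indicates compact, well-separated clusters, which is the sense in which the reported values of 0.78 and 0.56 can be compared. On real listings, a library implementation and a geodesic distance would likely be preferable to this simplified version.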

Keywords: Data mining; Web scraping; Real estate market; Exploratory analysis.

Received: November 23, 2022; Revised: March 23, 2023; Accepted: April 19, 2023; Prepublished online: April 21, 2023; Published: October 10, 2023

Tsakunov, I., & Chudán, D. (2023). Use of Data Mining for Analysis of Czech Real Estate Market. Acta Informatica Pragensia, 12(2), 275-295. doi: 10.18267/j.aip.215

This is an open access article distributed under the terms of the Creative Commons Attribution 4.0 International License (CC BY 4.0), which permits use, distribution, and reproduction in any medium, provided the original publication is properly cited. No use, distribution or reproduction is permitted which does not comply with these terms.