Acta Informatica Pragensia 2018, 7(1), 58-73 | DOI: 10.18267/j.aip.114

Mining Open Data on Budgets and Spending

David Chudán1, Vojtěch Svátek1, Jaroslav Kuchař1,2, Stanislav Vojíř1
1 Department of Information and Knowledge Engineering, Faculty of Informatics and Statistics, University of Economics, Prague, W. Churchill Sq. 4, 130 67 Prague 3, Czech Republic
2 Department of Software Engineering, Faculty of Information Technology, Czech Technical University in Prague, Thakurova 9, 160 00 Prague, Czech Republic

Data mining methods are being applied ever more widely, including in domains that traditionally lack strong support from analytical tools and where the analyst's manual work prevails. Applying these methods to fiscal data enables deeper analysis and can yield new findings. The deployment of advanced data mining methods is one part of the OpenBudgets.eu project, which focuses on transparency and accountability in the handling of public funds. This overview article summarizes some of the authors' experience from the project, gained while developing, implementing and applying selected methods for mining fiscal data, chiefly anomaly detection and association rule mining. These methods are integrated into the project's central platform, which is available to both advanced and ordinary users interested in analysing fiscal data. Pilot analyses showed that a problem for data mining analysis in this domain is the large volume of rules found and the diverse origins from which they arise.

Keywords: Data mining, open data, fiscal data, European project, OpenBudgets.eu

Data Mining from Open Fiscal Data

Data mining methods are increasingly popular, even in domains that traditionally have limited support from analytical tools and where the analyst's manual work still prevails. Using these methods in the fiscal domain enables deeper analysis and can bring new findings. The deployment of data mining methods is one part of the OpenBudgets.eu project, which focuses on transparency and accountability in the management of public funds. This overview article summarizes selected experiences of the authors from the project, gained during the development, implementation and application of selected data mining methods to fiscal data, mainly anomaly detection and association rule mining. These methods are integrated into the central platform of the project, which is available to both advanced and ordinary users interested in fiscal data analysis. The pilot analyses showed that a problem of data mining in this domain is the large number of rules found, together with their heterogeneous origin.

Keywords: Data mining, Open data, Fiscal data, European project, OpenBudgets.eu

Accepted: December 7, 2017; Prepublished online: December 7, 2017; Published: June 30, 2018

Chudán, D., Svátek, V., Kuchař, J., & Vojíř, S. (2018). Data Mining from Open Fiscal Data. Acta Informatica Pragensia, 7(1), 58-73. doi: 10.18267/j.aip.114

This is an open access article distributed under the terms of the Creative Commons Attribution 4.0 International License (CC BY 4.0), which permits use, distribution, and reproduction in any medium, provided the original publication is properly cited. No use, distribution or reproduction is permitted which does not comply with these terms.