Acta Informatica Pragensia 2021, 10(1), 38-60 | DOI: 10.18267/j.aip.145

Three Metric-Based Method for Data Compatibility Calculation

Daniel Vodňanský
Faculty of Informatics and Statistics, Prague University of Economics and Business, W. Churchill Sq. 1938/4, 130 67 Prague 3, Czech Republic

This article analyzes ways of calculating characteristics of data and of the most common data structure types that allow them to be compared with one another or over time. To this end, it studies the key aspects of the relational database, XML, JSON and RDF structure types. These data structure types are compared with several isolated approaches to measuring data quality and other data characteristics. The goals of the article are the calculation method itself and a storage structure for the calculated values. The article presents a method for characterizing data and data structure types based on the calculation of three metrics: the amount of structuredness, the amount of hierarchicallity and the amount of information. This triad of metrics allows comparison between various data sets (objects), for example to evaluate the complexity of transforming data from one data object to another, as well as comparison with the data structure types mentioned above. Based on the vector of the three metrics, a method for calculating the compatibility between data and a data structure type is proposed. This method can help select the most compatible data format for existing data. The calculated metric values can also reveal non-optimal storage design and classify data transformations. The method was evaluated in a case study, which showed its usability on a demonstration data set. It can be used in data modelling to help select the optimal data structure type, to design a data transformation process and to optimize existing data stores.

Keywords: Data metrics, Amount of information, Metadata, Relational database, XML, JSON, RDF, Ontology, Transformation, Structuredness, Hierarchicallity, Normalization, Visualization.

Received: March 6, 2021; Revised: May 2, 2021; Accepted: May 2, 2021; Prepublished online: May 2, 2021; Published: June 30, 2021

Vodňanský, D. (2021). Three Metric-Based Method for Data Compatibility Calculation. Acta Informatica Pragensia, 10(1), 38-60. doi: 10.18267/j.aip.145

This is an open access article distributed under the terms of the Creative Commons Attribution 4.0 International License (CC BY 4.0), which permits use, distribution, and reproduction in any medium, provided the original publication is properly cited. No use, distribution or reproduction is permitted which does not comply with these terms.