research-article

Data Quality and Explainable AI

Published: 03 May 2020

Abstract

In this work, we offer insights and ideas, with few technical details, about the role of explanations in data quality in the context of data-based machine learning (ML) models. As expected, causality and explainable artificial intelligence both have roles to play here. Explainable AI sheds light not only on the models, but also on the data that support model construction. There is also room for defining, identifying, and explaining errors in data, particularly in ML, and for suggesting repair actions. More generally, explanations can serve as a basis for defining dirty data in the context of ML, and for measuring or quantifying it. We think of dirtiness as relative to the ML task at hand, e.g., classification.

    • Published in

      Journal of Data and Information Quality, Volume 12, Issue 2
      Special Issue on Quality Assessment of Knowledge Graphs and On the Horizon
      June 2020
      105 pages
      ISSN: 1936-1955
      EISSN: 1936-1963
      DOI: 10.1145/3397186

      Copyright © 2020 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 3 May 2020
      • Received: 1 March 2020
      • Accepted: 1 March 2020

      Qualifiers

      • research-article
      • Research
      • Refereed
