The Case for Empirical Evaluation of Methods for Causal Modeling
David Jensen
College of Information & Computer Sciences and Computational Social Science Institute, University of Massachusetts Amherst
Center for Causal Discovery, University of Pittsburgh, 15 February 2018
This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.

The idea
Our understanding of causal inference, and the impact of practical techniques for causal modeling, would improve substantially if we conducted empirical evaluations of those techniques more regularly.

Specifically, we should apply a broader and more consistent set of methods to evaluate the performance of techniques for causal modeling: theoretical analysis, simulation, and empirical evaluation.

In contrast to theory and simulation, empirical methods are conspicuously absent from most current work. Further development and adoption of such methods offers a major opportunity to improve both the quality of our research and its external impact.

Topics
- Example
- Challenges of evaluating techniques for causal modeling
- Current dominant method: structural evaluation
- Limitations
- Alternative methods
- Interventions
- Potential outcomes

Example
Goals for empirical evaluation
- Empirical: a pre-existing system created by someone other than the researchers.
- Stochastic: produces experimental results that are non-deterministic through some combination of epistemic and aleatory factors.
- Identifiable: amenable to direct experimental investigation, so that interventional distributions can be estimated accurately.
- Recoverable: lacks memory or irreversible effects, which enables complete state recovery during experiments.
- Efficient: capable of generating large amounts of data that can be recorded with relatively little effort.
- Reproducible: allows future investigators to recreate nearly identical data sets with reasonable resources and without access to one-of-a-kind equipment.

Simple example: Database configuration

ML for database configuration (setup)
Assume a fixed database and DB server hardware.
Questions:
- For a given query, what is the expected performance under each set of configuration parameters?
- For a given query, which configuration will give the best performance?
Data: 11,252 queries actually run against the Stack Exchange Data Explorer, each executed under one of many different joint settings of the configuration parameters, using Postgres 9.2.2.

ML for database configuration (variables)
Configuration parameters (treatments):
- Indexing: index primary-key/foreign-key fields?
- Page cost: is random access relatively fast or slow?
- Memory level: small or large working memory?
Performance measures (outcomes):
- Runtime: total runtime
- Disk reads: blocks read from disk during execution
- Cache hits: number of memory reads from cache
Other variables of query and user (covariates):
- Query: length, table count, join count, group-by count, total rows returned
- User: total queries by user, year created
For simplicity, all treatments are binary, and all other variables are discretized into five categories by density-based binning.
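The density-based binning step can be sketched with equal-frequency (quantile) bins; the exact scheme is an assumption here, since the talk does not specify it:

```python
import numpy as np

def quantile_bin(values, k=5):
    """Discretize a continuous variable into k roughly equal-frequency bins.
    A simple stand-in for the density-based binning described above."""
    edges = np.quantile(values, np.linspace(0, 1, k + 1)[1:-1])
    return np.searchsorted(edges, values, side="right")

# Hypothetical runtimes for 1,000 queries (lognormal, as runtimes often are)
runtimes = np.random.default_rng(0).lognormal(size=1000)
bins = quantile_bin(runtimes)
print(np.bincount(bins))  # five bins of roughly 200 queries each
```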
CGM for database configuration

Comparing associational and causal models
Compare a state-of-the-art associational model (a random forest) to a CGM constructed using greedy equivalence search (GES) (Chickering & Meek 2002). Evaluate by comparing to ground truth: experimental results for all queries obtained under a specific joint setting of the configuration parameters. Results are shown for Cache Hits, Disk Reads, and Runtime (Garant & Jensen 2016).

Some conclusions
- Existence proof: existing methods can learn a reasonably accurate causal graphical model of a real system from observational data.
- Comparative performance: GES appears to outperform MMHC and PC on this task, in terms of both expected value and variance.
- Practical observations: not all structural errors are created equal. Errors of existence and direction on many edges do not substantially reduce the accuracy of causal-effect estimates on this task.

Challenges of evaluating techniques for causal modeling

Predictive accuracy is insufficient
Predictive accuracy on held-out observational data underdetermines the accuracy of causal inference. It is nearly a necessary condition, but it is not a sufficient one: the interventional distribution typically differs substantially from the observational distribution, so observational data alone are insufficient. This renders insufficient a set of empirical-evaluation methods long developed and used within the machine learning community (e.g., cross-validation and specific accuracy measures).

In addition to this technical barrier, there are sociological and cultural barriers as well.

Diverse research communities
Several diverse research communities have developed methods for causal inference and modeling from observational data:
- CS, statistics, philosophy: causal graphical models (CGMs)
- Statistics, social science, economics: the potential outcomes framework (propensity scores, instrumental variables, interrupted time series, etc.)
- Physics, CS: distributional methods
Each of these communities favors specific task definitions, standards, and traditions.

Example: Machine learning
Traditional work in ML makes empirical evaluation look relatively easy.
- Tasks: long-studied and widely applicable tasks (e.g., conditional probability estimation)
- Evaluation methods: well-established methods, including cross-validation, temporal partitioning of test sets, and ROC analysis
- Resources: large repositories of real-world data sets and software for performing empirical evaluation
Many other examples exist in CS: IR (TREC), robotics (RoboCup), RL (backgammon), deep RL (Atari 2600).

What is the task of causal modeling?
One formulation: given an arbitrary intervention on X in a system of interest, estimate the effect on some set of outcomes Y. However, there are many alternative formulations:
- Potential outcomes framework: given an intervention on a single binary treatment X, estimate the effect on a single continuous outcome Y.
- Reinforcement learning: given an action policy π, estimate the long-term discounted reward of following that policy.
- Structure only: determine whether X causes Y.

Structure-only is not well-specified
The structure-only task ("Does X cause Y?") is not a well-specified task.
An answer ("yes") for a specific variable pair {X, Y} specifies neither the magnitude of the effect nor the conditions under which the effect occurs. Any answer can be valid given a particular effect-size threshold, sample size, and marginal distribution of other potential causes of Y. A well-specified answer would also specify the strength of the effect and the conditioning set under which it occurs.
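Observational underdetermination can be made concrete with a linear-Gaussian pair (the coefficients and noise scales below are illustrative, not from the talk): the models X→Y and Y→X fit the observational data essentially identically, yet disagree completely about P(Y|do(X)).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
x = rng.normal(0.0, 1.0, n)
y = 0.8 * x + rng.normal(0.0, 0.6, n)   # true mechanism: X -> Y

def gauss_ll(resid, var):
    # pointwise Gaussian log-likelihood of residuals under N(0, var)
    return -0.5 * (np.log(2 * np.pi * var) + resid ** 2 / var)

# Model A (X -> Y): factorize p(x) p(y|x), fit by least squares
a = np.cov(x, y)[0, 1] / np.var(x)
ll_a = gauss_ll(x - x.mean(), np.var(x)) + gauss_ll(y - a * x, np.var(y - a * x))

# Model B (Y -> X): factorize p(y) p(x|y)
b = np.cov(x, y)[0, 1] / np.var(y)
ll_b = gauss_ll(y - y.mean(), np.var(y)) + gauss_ll(x - b * y, np.var(x - b * y))

print(ll_a.mean(), ll_b.mean())  # observational fit: essentially identical
print(a)                         # Model A: E[Y | do(X = 1)] is about 0.8
print(0.0)                       # Model B: do(X) has no effect on Y at all
```

Both factorizations describe the same bivariate Gaussian, so no amount of held-out observational likelihood can separate them; only an intervention can.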
Current dominant method: Structural evaluation

Current evaluation methods
Multiple types of methods are currently used to evaluate practical techniques for causal modeling, including theoretical analysis (e.g., soundness and completeness theorems) and innovative forms of empirical evidence (stay tuned). However, at least for work on CGMs, one method of evaluation is dominant: assessing structural accuracy on models learned from synthetic data.

Structural accuracy on synthetic data
- Given: a technique T that constructs a DAG from observational data, and a fully specified causal graphical model M* consisting of a DAG D* and a set of CPDs.
- Generate data: draw a data sample S from M*.
- Learn model: apply T to S to learn a DAG D_T.
- Evaluate model: compare D_T and D* using some graph-distance function f(D_T, D*).

Structural measures
Evaluate the existence of false positive and false negative edges, as well as edge-orientation errors.
- Simple measures: false positive and false negative rates; rates of edge-orientation errors; precision and recall; oriented precision and recall
- Summary measures: structural Hamming distance; structural intervention distance

Structural Hamming Distance (SHD)
(Tsamardinos et al. 2006; Acid and de Campos 2003)

Structural Intervention Distance (SID)
Graph mis-specification is not fundamentally related to the quality of a causal model (Peters & Bühlmann 2015): including superfluous edges does not necessarily bias a causal model, while reversing or omitting edges can induce bias in many interventional distributions. Core idea: count the number of mis-specified pairwise interventional distributions, e.g., P(Z|do(X)), P(Y|do(X)), P(Z|do(Y)), P(Y|do(Z)).
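For concreteness, here is a minimal implementation of SHD, the simpler of the two summary measures, over DAG adjacency matrices. Variants in the literature differ on how edge reversals are charged; this sketch counts one per mismatched vertex pair:

```python
import numpy as np

def shd(d_true, d_est):
    """Structural Hamming Distance between two DAGs given as adjacency
    matrices (a[i, j] = 1 means an edge i -> j). Each vertex pair whose
    edge status differs (missing, extra, or reversed) counts once."""
    t = np.asarray(d_true, dtype=bool)
    e = np.asarray(d_est, dtype=bool)
    n = t.shape[0]
    return sum(
        (t[i, j], t[j, i]) != (e[i, j], e[j, i])
        for i in range(n) for j in range(i + 1, n)
    )

# True chain X -> Y -> Z vs. an estimate that reverses the edge Y -> Z
d_star = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]])
d_hat  = np.array([[0, 1, 0], [0, 0, 0], [0, 1, 0]])
print(shd(d_star, d_hat))  # → 1 (one reversed edge)
```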
Limitations and detrimental effects of our current approach

Loosely coupled to a well-specified task
Structural accuracy is neither necessary nor sufficient for high-accuracy estimates of causal effects.
- Not necessary: many edges may have no effect, or only very small effects, on causal paths from interventions to outcomes.
- Not sufficient: errors on very strong edges, or an entirely correct structure with incorrect parameters, can still produce poor estimates of causal effect.
SHD is also unfocused, in that you cannot specify the treatments and outcomes of interest (although SID can).

CGM for database configuration
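The claim that structural errors differ in their consequences can be illustrated with a small linear-Gaussian simulation (the graph and coefficients are illustrative, not from the Postgres study): adjusting for a superfluous variable leaves the estimated effect of X on Y unbiased, while omitting a confounding edge biases it.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
z = rng.normal(size=n)                      # confounder: Z -> X, Z -> Y
w = rng.normal(size=n)                      # irrelevant variable
x = 0.9 * z + rng.normal(size=n)
y = 1.5 * x + 0.7 * z + rng.normal(size=n)  # true causal effect of X on Y: 1.5

def effect_of_x(y, covariates):
    """OLS coefficient on X (the first covariate), with adjustment."""
    design = np.column_stack([np.ones(len(y))] + covariates)
    return np.linalg.lstsq(design, y, rcond=None)[0][1]

print(effect_of_x(y, [x, z]))     # correct adjustment set {Z}: about 1.5
print(effect_of_x(y, [x, z, w]))  # superfluous W included: still about 1.5
print(effect_of_x(y, [x]))        # confounding edge omitted: biased upward
```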
Limits comparison of representations
Structural measures artificially limit the model representation to Bayesian networks; they essentially bake in the DAG representation of a causal model. An ideal evaluation method would be representation-agnostic (e.g., cross-validated accuracy in conditional probability estimation).

Prevents use of empirical data sets
Structural measures require knowledge of the correct structure, meaning either an artificial data generator in the form of a DAG, or extraordinary knowledge of a real-world generative process.

Prevents impressive applications
No use of real data sets that speak to researchers outside the community. Examples from elsewhere in CS: Atari 2600, TD-Gammon, SkyCat.

No surprises about real-world effects
With essentially no use of real data sets, we forgo an important check: in some cases, intuitively plausible learning methods actually lead to worse performance (Minton 1985; Langley 1988). Essentially all techniques for causal modeling are sufficiently complex that formal analysis cannot tell us everything about their behavior.

Alternatives
- Compare interventional distributions on synthetic data (e.g., total variation (TV) distance)
- Compare interventional distributions on synthetic data intended to represent real data: BNs learned from real data; DARPA programs with social-science simulations
- Compare inferences in empirical cases with known causality (e.g., the Cause-Effect Pairs Challenge)
- Compare interventional distributions on real data with limited sets of real-world interventions: Arabidopsis data, DREAM data
- Compare interventional distributions on real data with exhaustive experimental data from closed systems (e.g., Postgres, HTTP, JDK)
- Doubly randomized controlled trials (Shadish)

Interventions
Where should we go?
- Existence of alternatives: develop data sets from systems in which extensive experimentation is possible (cyber-physical systems, small-scale biological systems, and even standard synthetic data generators), along with standardized evaluation protocols
- Availability: create shared repositories, with version numbers, etc.
- Use: authors, reviewers, and readers who actually use these approaches

Potential outcomes

Where could this lead?
- Wider recognition of the utility of our methods outside the field (e.g., DeepMind's Atari paper, SkyCat)
- A clearer sense of the comparative effectiveness of alternative methods:
  - Most QEDs can't represent the effects of combined interventions
  - CGMs don't exploit many types of prior knowledge about the data-generating process (e.g., within-subjects designs)
  - Many distributional methods don't estimate effect size or account for confounding
- Identification of the relative importance of key assumptions (e.g., the naive Bayes classifier and its conditional independence assumption; the effects of measurement error (Scheines))
- Identification of additional, unrecognized assumptions
- Development of methods that unify multiple traditions in causal inference

Conclusions

The idea
Our understanding of causal inference, and the impact of practical techniques for causal modeling, would improve substantially if we conducted empirical evaluations of those techniques more regularly. Specifically, we should apply a broader and more consistent set of methods to evaluate the performance of techniques for causal modeling: theoretical analysis, simulation, and empirical evaluation. In contrast to theory and simulation, empirical methods are conspicuously absent from most current work; further development and adoption of such methods offers a major opportunity to improve both the quality of our research and its external impact.