The Case for Empirical Evaluation of Methods for Causal Modeling

David Jensen
College of Information & Computer Sciences
Computational Social Science Institute
University of Massachusetts Amherst

15 February 2018
Center for Causal Discovery, University of Pittsburgh

This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.

[Figure: xkcd comic]

The idea

Our understanding of causal inference, and the impact of practical techniques for causal modeling, would substantially improve if we more regularly conducted empirical evaluation of our practical techniques.

The idea

Specifically, we should apply a broader and more consistent set of methods to evaluate the performance of techniques for causal modeling. These should include theoretical analysis, simulation, and empirical evaluation.

The idea

In contrast to theory and simulation, empirical methods are conspicuously absent from most current work. Further development and adoption of such methods offers a major opportunity for improving both the quality of our research and its external impact.

Topics

- Example
- Challenges of evaluating techniques for causal modeling
- Current dominant method: structural evaluation
- Limitations
- Alternative methods
- Interventions
- Potential outcomes

Example

Goals for empirical evaluation

- Empirical: a pre-existing system created by someone other than the researchers.
- Stochastic: produces experimental results that are non-deterministic through some combination of epistemic and aleatory factors.
- Identifiable: amenable to direct experimental investigation to accurately estimate interventional distributions.
- Recoverable: lacks memory or irreversible effects, which enables complete state recovery during experiments.
- Efficient: capable of generating large amounts of data that can be recorded with relatively little effort.
- Reproducible: allows future investigators to recreate nearly identical data sets with reasonable resources and without access to one-of-a-kind equipment.

Simple example: Database configuration

ML for database configuration (setup)

Assume a fixed database and DB server hardware.

Questions
- For a given query, what is the expected performance under each set of configuration parameters?
- For a given query, which configuration will give me the best performance?

Data
- 11,252 queries actually run against the Stack Exchange Data Explorer
- Each query run under one of many different joint values of the configuration parameters, using Postgres 9.2.2

ML for database configuration (variables)

Configuration parameters (treatments)
- Indexing: index primary-key/foreign-key fields?
- Page cost: random access relatively fast or slow?
- Memory level: small or large working memory?

Performance measures (outcomes)
- Runtime: total runtime
- Disk reads: blocks read from disk during execution
- Cache hits: number of memory reads from cache

Other variables of query and user (covariates)
- Query: length, table count, join count, group-by count, total rows returned
- User: total queries by user, year created

For simplicity, all treatments are binary, and all other variables are discretized to five categories by density-based binning.
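The discretization step just described can be sketched with quantile (equal-frequency) binning, which is one plausible reading of "density-based binning"; the column name and values below are illustrative, not the actual Stack Exchange data.

```python
import pandas as pd

# Hypothetical query covariate; values are invented for illustration.
df = pd.DataFrame({"total_rows_returned": [3, 12, 57, 240, 1100, 8, 95, 430, 22, 6700]})

# Discretize into five categories. pandas.qcut performs quantile
# (equal-frequency) binning, so each bin holds roughly equal mass --
# an assumption about what "density-based binning" means here.
df["rows_binned"] = pd.qcut(df["total_rows_returned"], q=5, labels=[0, 1, 2, 3, 4])

print(df["rows_binned"].value_counts().sort_index())
```

With ten distinct values and five quantile bins, each bin receives two queries; heavily skewed covariates such as row counts are exactly where equal-width binning would fail and equal-frequency binning helps.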

CGM for database configuration

[Figure: learned causal graphical model]

Comparing associational and causal models

Compare a state-of-the-art associational model (a random forest) to a CGM constructed using greedy equivalence search (GES) (Chickering & Meek 2002). Evaluate by comparing to ground truth (experimental results for all queries obtained using a specific joint setting of the configuration parameters).

[Figures: Cache Hits, Disk Reads, Runtime (Garant & Jensen 2016)]

Some conclusions

Existence proof: existing methods can learn a reasonably accurate causal graphical model of a real system using observational data.

Comparative performance: GES appears to outperform MMHC and PC on this task, both in terms of expected value and variance.

Practical observations: not all structure errors are created equal. On this task, errors of existence and direction on many edges do not substantially reduce accuracy of causal-effect estimates.

Challenges of evaluating techniques for causal modeling

Predictive accuracy is insufficient

Predictive accuracy on held-out observational data underdetermines the accuracy of causal inference. Predictive accuracy is nearly a necessary condition, but it is not a sufficient one. Typically, the interventional distribution differs substantially from the observational distribution, so observational data alone are insufficient. This renders insufficient a set of empirical evaluation methods long developed and used within the machine learning community (e.g., cross-validation, specific accuracy measures, etc.).

In addition to this technical barrier, there are sociological and cultural barriers as well.

Diverse research communities

Several diverse research communities have developed methods for causal inference and modeling from observational data. These include:

- CS, statistics, philosophy > causal graphical models (CGMs)
- Statistics, social science, economics > potential outcomes framework (POF: PS, IV, ITS, etc.)
- Physics, CS > distributional methods

Each of these communities favors specific task definitions, standards, and traditions.

Example: Machine learning

Traditional work in ML makes empirical evaluation look relatively easy.

- Tasks: long-studied and widely applicable tasks (e.g., conditional probability estimation)
- Evaluation methods: well-established methods including cross-validation, temporal partitioning of test sets, ROC analysis, etc.
- Resources: large repositories of real-world data sets and software for performing empirical evaluation

Many other examples in CS: IR (TREC); robotics (RoboCup); RL (backgammon); deep RL (Atari 2600).

What is the task of causal modeling?

One formulation: given an arbitrary intervention X in a system of interest, estimate the effect on some set of outcomes Y. However, there are many alternative formulations:

- Potential outcomes framework: given an intervention on a single binary treatment X, estimate the effect on a single continuous outcome Y.
- Reinforcement learning: given an action policy, estimate the long-term discounted reward of following that policy.
- Structure only: determine whether X causes Y.

Structure-only is not well-specified

The structure-only task ("Does X cause Y?") is not a well-specified task.

An answer (Yes) for a specific variable pair {x,y} specifies neither the magnitude of effect nor the conditions under which the effect occurs. Any answer can be valid, given a particular effect-size threshold, sample size, and marginal distribution of other potential causes of y. A well-specified answer would also specify the strength-of-effect and the conditioning set under which those effects occur.
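The potential-outcomes formulation above (a single binary treatment, a single continuous outcome) can be made concrete with a minimal simulated sketch; the data-generating process and true effect size here are invented for illustration.

```python
import numpy as np

# Potential-outcomes task sketch: estimate the average treatment effect
# (ATE) of a binary treatment X on a continuous outcome Y.
# Treatment is randomized, so a simple difference of means is unbiased.
rng = np.random.default_rng(42)
n = 10_000
x = rng.integers(0, 2, size=n)       # randomized binary treatment
y = 2.0 * x + rng.normal(size=n)     # outcome; true ATE = 2.0 by construction

ate_hat = y[x == 1].mean() - y[x == 0].mean()
print(f"Estimated ATE: {ate_hat:.2f}")  # close to the true effect of 2.0
```

This also illustrates why the task is well-specified: the treatment, outcome, and estimand (the ATE) are all fixed in advance, unlike the structure-only question.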

Current dominant method: Structural evaluation

Current evaluation methods

Multiple types of methods are currently used to evaluate practical techniques for causal modeling. These include:

- Theoretical analysis (e.g., soundness and completeness theorems)
- Innovative forms of empirical evidence (stay tuned)

However, at least for work on CGMs, one method of evaluation is dominant: assessing structural accuracy on models learned from synthetic data.

Structural accuracy on synthetic data

- Given: a technique T that constructs a DAG from observational data, and a fully specified causal graphical model M* consisting of a DAG D* and a set of CPDs
- Generate data: generate a data sample S from M*
- Learn model: apply T to learn a DAG D_T from S
- Evaluate model: compare D_T and D* using some function f(D1, D2)

Structural measures

Evaluate the existence of false positive and false negative edges as well as edge orientation errors.

Simple measures
- False positive and false negative rates
- Rates of edge orientation errors
- Precision and recall; oriented precision and recall

Summary measures
- Structural Hamming distance
- Structural intervention distance

Structural Hamming Distance (SHD)
(Tsamardinos et al. 2006; Acid and de Campos 2003)

Structural intervention distance (SID)
(Peters & Buhlmann 2015)

Graph mis-specification is not fundamentally related to the quality of a causal model:
- Including superfluous edges does not necessarily bias a causal model.
- Reversing or omitting edges can potentially induce bias in many interventional distributions.

Core idea: count the number of mis-specified pairwise interventional distributions.

[Figure: example pairwise interventional distributions P(Z|do(X)), P(Y|do(X)), P(Z|do(Y)), P(Y|do(Z))]
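To make the structural measures above concrete, here is a minimal sketch of SHD over two DAGs given as adjacency matrices. It follows the convention of Tsamardinos et al. (2006), counting each added, deleted, or reversed edge as one error; other definitions weight reversals differently.

```python
import numpy as np

def shd(a, b):
    """Structural Hamming distance between two DAGs.

    a, b: square 0/1 adjacency matrices; a[i, j] = 1 means edge i -> j.
    Counts each vertex pair whose edge status differs (missing, extra,
    or reversed) as a single error.
    """
    a, b = np.asarray(a), np.asarray(b)
    d = 0
    n = a.shape[0]
    for i in range(n):
        for j in range(i + 1, n):
            if (a[i, j], a[j, i]) != (b[i, j], b[j, i]):
                d += 1  # one operation fixes this pair: add, delete, or reverse
    return d

# True DAG: X -> Y -> Z.  Learned DAG: X -> Y, Z -> Y (one edge reversed).
d_true    = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]
d_learned = [[0, 1, 0], [0, 0, 0], [0, 1, 0]]
print(shd(d_true, d_learned))  # prints 1
```

Note that this SHD of 1 says nothing about which interventional distributions the reversal corrupts; that is exactly the gap SID is designed to address.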

Limitations and detrimental effects of our current approach

Loosely coupled to a well-specified task

Structural accuracy is neither necessary nor sufficient for high-accuracy estimates of causal effects.

- Not necessary: many edges may have no effects or very small effects on causal paths from interventions to outcomes.
- Not sufficient: either errors on very strong edges or entirely correct structure with incorrect parameters can still produce poor estimates of causal effect.

SHD is also unfocused, in that you cannot specify the treatments and outcomes of interest (although SID can do this).

[Figure: CGM for database configuration]

Limits comparison of representations

Structural measures artificially limit model representation to Bayesian networks; they essentially bake in the DAG representation of a causal model. An ideal evaluation method would be representation-agnostic (e.g., cross-validated accuracy in conditional probability estimation).
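A representation-agnostic evaluation of the kind suggested above might score any model that outputs conditional probabilities by cross-validated log-loss, regardless of whether the model is a DAG, a random forest, or anything else. The data and the toy frequency "model" below are placeholders of my own, not part of the talk.

```python
import numpy as np

# Representation-agnostic scoring sketch: anything that estimates
# P(y | x) can be compared on cross-validated log-loss.
rng = np.random.default_rng(1)
x = rng.integers(0, 2, size=1000)
flip = rng.random(1000) < 0.2
y = np.where(flip, 1 - x, x)  # y follows x, flipped 20% of the time

def cv_log_loss(x, y, k=5):
    idx = np.arange(len(x))
    losses = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        # Toy "model": conditional frequency estimate of P(y=1 | x)
        p1 = {v: (y[train][x[train] == v].mean()
                  if np.any(x[train] == v) else 0.5)
              for v in (0, 1)}
        p = np.array([p1[v] for v in x[fold]])
        eps = 1e-12  # guard against log(0)
        losses.append(-np.mean(y[fold] * np.log(p + eps)
                               + (1 - y[fold]) * np.log(1 - p + eps)))
    return float(np.mean(losses))

print(f"CV log-loss: {cv_log_loss(x, y):.3f}")
```

With a 20% flip rate, the achievable log-loss is the conditional entropy, about 0.50 nats; any representation's estimates can be dropped into the same scorer.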

Prevents use of empirical data sets

Structural measures require knowledge of the correct structure, meaning either an artificial data generator in the form of a DAG, or extraordinary knowledge of a real-world generative process.

Prevents impressive applications

No use of real data sets that speak to researchers outside of the community. Examples: Atari 2600, TD-Gammon, SkyCat.

No surprises about the real-world effects

Essentially no use of real data sets. Empirical evaluation is important because, in some cases, intuitively plausible learning methods actually lead to worse performance (Minton 1985; Langley 1988). Essentially all techniques for causal modeling are sufficiently complex that formal analysis cannot tell us everything about their behavior.

Alternatives

- Compare interventional distributions on synthetic data (e.g., TV)
- Compare interventional distributions on synthetic data intended to represent real data: BNs learned from real data; DARPA program with social science simulations
- Compare inferences in empirical cases with known causality (e.g., Cause-Effect Pairs Challenge)

Alternatives (continued)

- Compare interventional distributions on real data with limited sets of real-world interventions: Arabidopsis data, DREAM data
- Compare interventional distributions on real data with exhaustive experimental data from closed systems (e.g., Postgres, HTTP, JDK)
- Doubly randomized controlled trials (Shadish)
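The first alternative above compares interventional distributions on synthetic data. Reading "TV" as total variation distance (my assumption), the comparison for discrete distributions is a one-liner:

```python
import numpy as np

# Total variation distance between a true and an estimated
# interventional distribution: TV(P, Q) = 0.5 * sum_y |P(y) - Q(y)|.
# The two distributions below are invented for illustration.
p = np.array([0.5, 0.3, 0.2])   # true P(Y | do(X=1))
q = np.array([0.4, 0.4, 0.2])   # a model's estimate of the same distribution
tv = 0.5 * np.abs(p - q).sum()
print(tv)  # 0.5 * (0.1 + 0.1 + 0.0) ~= 0.1
```

Because TV is computed on the interventional distributions themselves, it directly scores the quantity a causal model is supposed to get right, unlike purely structural measures.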

Interventions

Where should we go?

- Existence of alternatives: data set development from systems in which extensive experimentation is possible (cyber-physical systems, small-scale biological systems, and even standard synthetic data generators); standardized evaluation protocols
- Availability: creation of shared repositories with version numbers, etc.
- Use: authors, reviewers, and readers who use these approaches

Potential outcomes

Where could this lead?

- Wider recognition of the utility of these methods outside of the field (e.g., DeepMind's Atari paper, SkyCat)
- A clearer sense of the comparative effectiveness of alternative methods: most QEDs can't represent effects of combined interventions; CGMs don't exploit many types of prior knowledge about the data-generating process (e.g., within-subjects designs); many distributional methods don't estimate effect size or account for confounding

Where could this lead? (continued)

- Identification of the relative importance of key assumptions (e.g., NBC and the conditional independence assumption; effects of measurement error (Scheines))
- Identification of additional, unrecognized assumptions
- Development of methods that unify multiple traditions in causal inference

Conclusions

The idea

Our understanding of causal inference, and the impact of practical techniques for causal modeling, would substantially improve if we more regularly conducted empirical evaluation of our practical techniques.

The idea

Specifically, we should apply a broader and more consistent set of methods to evaluate the performance of techniques for causal modeling. These should include theoretical analysis, simulation, and empirical evaluation.

The idea

In contrast to theory and simulation, empirical methods are conspicuously absent from most current work. Further development and adoption of such methods offers a major opportunity for improving both the quality of our research and its external impact.
