Hello World - University of Colorado Boulder | University of ...

Population Stratification
Benjamin Neale
Leuven, August 2008

Objectives
  • Population Stratification: What & Why?
  • Dealing with PS in association studies
  • Revisiting Genomic Control (small studies)
  • EIGENSTRAT
  • PLINK practical
  • Other methods

Population Stratification: What & Why?

What is population stratification/structure (PS)?

This just in! Human beings don't mate at random:
  • Physical barriers
  • Political barriers
  • Socio-cultural barriers
  • Isolation by distance

None of these barriers is absolute, and in fact by primate standards we are remarkably homogeneous.
  • Most human variation is within populations
  • This reflects recent common ancestry (Out of Africa)

Between-population variation still exists, even though the vast majority of human variation is shared.

[Figure: PC1 vs. PC2 scores for the Human Genetic Diversity Panel, Illumina 650Y SNP chip (Li et al. 2008, Science 319: 1100). Populations in the legend: Colombians, Karitiana, Maya, Pima, Surui, Balochi, Brahui, Burusho, Cambodians, Dai, Daur, Han, Hazara, Hezhen, Japanese, Kalash, Lahu, Makrani, Miaozu, Mongola, Naxi, Oroqen, Pathan, She, Sindhi, Tu, Tujia, Uygur, Xibo, Yakut, Yizu, Adygei, French, French_Basque, North_Italian, Orcadian, Russian, Sardinian, Tuscan, NAN_Melanesian, Papuan, Bantu_NE, Bantu_S, Biaka_Pygmies, Mandenka, Mbuti_Pygmies, San, Yoruba, Bedouin, Druze, Palestinian, Mozabite.]

[Figure: Human Genetic Diversity Panel, Europeans only.]

Why is hidden PS a problem for association studies?
  • Reduced power

      • Lower chance of detecting true effects
  • Confounding
      • Higher chance of spurious association findings

Requirements of stratification (both conditions are necessary):
  • Variation in disease rates across groups
  • Variation in allele frequencies across groups

Visualization of stratification conditions

Suppose that a disease is more common in one subgroup (say, Group 2) than in another (Group 1). Then the cases will tend to be over-sampled from that group, relative to controls, and this can lead to false positive associations: any allele that is more common in Group 2 will appear to be associated with the disease.
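The two requirements can be checked with a little arithmetic. The sketch below uses made-up frequencies and a hypothetical helper, `expected_allele_freqs`; it assumes the allele has no effect on disease within either group, so any case/control difference comes purely from how the two groups are mixed.

```python
def expected_allele_freqs(p1, p2, frac_g2_cases, frac_g2_controls):
    """Expected allele frequency among cases and among controls.

    p1, p2            -- allele frequency in Group 1 and Group 2
    frac_g2_cases     -- fraction of cases drawn from Group 2
    frac_g2_controls  -- fraction of controls drawn from Group 2
    The allele is assumed to be non-causal (illustrative assumption).
    """
    cases = (1 - frac_g2_cases) * p1 + frac_g2_cases * p2
    controls = (1 - frac_g2_controls) * p1 + frac_g2_controls * p2
    return cases, controls


# Hypothetical numbers: the allele is commoner in Group 2 (0.5 vs 0.2),
# and Group 2 is over-represented among cases (70% vs 50% of controls).
cases, controls = expected_allele_freqs(0.2, 0.5, 0.7, 0.5)
print("cases:", round(cases, 3), "controls:", round(controls, 3))

# Remove either condition and the spurious difference disappears.
no_freq_diff = expected_allele_freqs(0.3, 0.3, 0.7, 0.5)
no_rate_diff = expected_allele_freqs(0.2, 0.5, 0.5, 0.5)
print(no_freq_diff, no_rate_diff)
```

With these numbers the allele frequency is roughly 0.41 in cases against 0.35 in controls, even though the allele does nothing. Setting the allele frequencies equal, or sampling cases and controls from the groups in the same proportions, gives identical frequencies in cases and controls.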

[Figure: Group 1 and Group 2.]

This will happen if Groups 1 and 2 are hidden; if they are known, then they can be accounted for. Discrete groups are not required: admixture yields the same problem.

Dealing with PS in association studies

Family-based association studies
  • Transmission conditional on known parental (founder) genotypes, e.g. TDT
  • Recent review: Tiwari et al. (2008, Hum. Hered. 66: 67)
  • Pros
      • Cast-iron PS protection
  • Cons
      • 50% more genotyping needed (if using trios)
      • Not all trios are informative
      • Families more difficult to collect

Genomic Control (GC)
  • Devlin and Roeder (1999) used theoretical arguments to propose that, with population structure, the distribution of chi-square tests is inflated by a constant multiplicative factor $\lambda$.
  • To estimate $\lambda$, add a separate "GC set" of neutral loci to genotype, and calculate chi-square tests for association in these.
  • Now perform an adjusted test of association that takes account of any mismatching of cases/controls: $\chi^2_{GC} = \chi^2_{Raw} / \lambda$

Genomic Control (GC): correct the $\chi^2$ test statistic by the inflation factor $\lambda$
  • Pros
      • Easy to use
      • Doesn't need many SNPs
      • Can handle highly mismatched case/control designs
  • Cons
      • Less powerful than other methods when many SNPs are available
      • Can't handle "lactase-type" false positives
      • The $\lambda$-scaling assumption breaks down for large $\lambda$

Genomic Control variants (in the estimators below, $\chi^2_{GC}$ denotes the statistics at the GC-set loci)
  • GCmed (Devlin & Roeder 1999, Biometrics 55: 997): $\lambda = \mathrm{median}(\chi^2_{GC}) / 0.455$
  • GCmean (Reich & Goldstein 2001, Gen Epi 20: 4): $\lambda = \mathrm{mean}(\chi^2_{GC})$; the upper 95% CI of $\lambda$ is used as a conservative measure
  • GCF (Devlin et al. 2004, Nat Genet 36: 1129): test $\chi^2_{Raw} / \lambda$ as an F-statistic
  • Recent work (Dadd, Weale & Lewis, submitted) confirms GCF as the best choice

More variants on the theme
  • Use Q-Q plot to remove GC-SNP outliers (Clayton et al. 2005, Nat Genet 37: 1243)
  • Ancestry Informative Markers (review: Barnholtz-Sloan et al. 2008, Cancer Epi Bio Prev 17: 471)
  • Frequency matching (Reich & Goldstein 2001, Gen Epi 20: 4)

Other methods

Structured Association
  • E.g. strat (Pritchard et al. 2000, Am J Hum Genet 67: 170)
  • Fits explicit model of discrete ancestral sub-populations

  • Breaks down for small datasets; too computationally costly for large datasets

Mixed modelling
  • Fits error structure based on a matrix of estimated pairwise relatedness among all individuals (e.g. Yu et al. 2006, Nat Genet 38: 203)
  • Requires many SNPs to estimate relatedness well
  • Can't handle binary phenotypes (e.g. case/control) well

Still an active area of methodological development
  • Delta-centralization (Gorroochurn et al. 2006, Gen Epi 30: 277)
  • Logistic Regression (Setakis et al. 2006, Genome Res 16: 290)
  • Stratification Score (Epstein et al. 2007, Am J Hum Genet 80: 921)
  • Review: Barnholtz-Sloan et al. (2008, Cancer Epi Bio Prev 17: 471)

EIGENSTRAT

Genomic Control fails if stratification affects certain SNPs more than the average.

[Figure: LCT and height; Campbell et al. (2005, Nat Genet 37: 868).]
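To see why a single inflation factor cannot rescue a SNP that is far more differentiated than average, here is a minimal sketch of the GCmed correction. The chi-square values are invented for illustration, the function names are hypothetical, and flooring the factor at 1 is a common convention assumed here rather than something taken from the slides.

```python
from statistics import median

# Approximate median of a 1-d.f. chi-square distribution, as used for GCmed.
CHI2_1DF_MEDIAN = 0.455


def gc_lambda_median(null_chi2):
    """Estimate the inflation factor from chi-square statistics at null (GC-set) loci."""
    lam = median(null_chi2) / CHI2_1DF_MEDIAN
    return max(lam, 1.0)  # assumption: never deflate the test statistics


def gc_adjust(raw_chi2, lam):
    """Divide every raw statistic by the same constant factor."""
    return [x / lam for x in raw_chi2]


# Invented statistics at five null loci (median 0.91, so the factor is about 2).
null_chi2 = [0.2, 0.5, 0.91, 1.3, 2.4]
lam = gc_lambda_median(null_chi2)

# Invented test statistics; the last stands in for a "lactase-type" SNP whose
# allele frequency differs between subpopulations far more than average.
raw = [3.0, 8.0, 40.0]
adjusted = gc_adjust(raw, lam)
print("lambda:", round(lam, 3))
print("adjusted:", [round(x, 2) for x in adjusted])
```

Because every statistic is divided by the same constant, the strongly differentiated SNP keeps a large adjusted statistic and would still be declared significant.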

An example: height associates with a lactase persistence SNP in a US-European sample (a false positive).

The EIGENSTRAT solution

PCA for SNP data (EIGENSTRAT)

[Figure: data matrix X with n individuals (rows) and m SNPs (columns), entries $x_{11}, \dots, x_{ij}, \dots, x_{nm}$ coded 0, 1 or 2, where $x_{ij}$ is the genotype of individual i at SNP j. Individuals (Indiv1, Indiv2, Indiv3) are plotted on SNP axes (SNP 1, SNP 2, SNP 3), with the PC 1 and PC 2 axes drawn in.]

PCA properties
  • Each axis is a linear equation, defining individual scores or SNP loadings:
      $Z_i = a_1 x_{i1} + \dots + a_j x_{ij} + \dots + a_m x_{im}$
      $Z_j = b_1 x_{1j} + \dots + b_i x_{ij} + \dots + b_n x_{nj}$
  • Axes can be created in either projection
  • Maximum number of axes = min(n-1, m-1)
  • Each axis is at right angles to all others (orthogonal)
  • Eigenvectors define the axes, and eigenvalues define the variance explained by each axis

PC axis types
  • PCA dissects and ranks the correlation structure of multivariate data
  • Stratification is one way that correlations in SNPs can be set up

Effects that can give rise to a PC axis:
  • Stratification
  • Systematic genotyping artefacts
  • Local LD
  • (Theoretical) Many high-effect causal SNPs in a case-control study

Inspection of PC axis properties can determine which type of effect is at work for each axis.

Original EIGENSTRAT procedure
  1) Code all SNP data {0,1,2}, where 1 = het
  2) Normalize by subtracting the mean and dividing by $\sqrt{p(1-p)}$, where p is the allele frequency of the SNP
  3) Recode missing genotypes as 0
  4) Apply PCA to the matrix of coded SNP data
  5) Extract scores for the first 10 PC axes
  6) Calculate modified Armitage trend statistic using the first 10 PC scores as covariates

Price et al. (2006, Nat Genet 38: 904)
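Steps 1 to 5 can be sketched on a toy genotype matrix. This is not the EIGENSTRAT software: it uses a plain allele-frequency estimate (mean genotype divided by 2), extracts only the first axis, and swaps a full eigen-decomposition for simple power iteration. The function names and the data are hypothetical, and step 6 (the trend test) is not shown.

```python
import math


def normalize_genotypes(G):
    """Centre each SNP column and divide by sqrt(p(1-p)); missing (None) becomes 0."""
    n, m = len(G), len(G[0])
    X = [[0.0] * m for _ in range(n)]
    for j in range(m):
        observed = [G[i][j] for i in range(n) if G[i][j] is not None]
        mean = sum(observed) / len(observed)
        p = mean / 2.0  # simple allele-frequency estimate (assumption)
        sd = math.sqrt(p * (1.0 - p))
        if sd == 0.0:
            continue  # a monomorphic SNP carries no information
        for i in range(n):
            if G[i][j] is not None:
                X[i][j] = (G[i][j] - mean) / sd
    return X


def first_pc_scores(X, iterations=200):
    """Individual scores on the first axis, by power iteration on X X^T."""
    n, m = len(X), len(X[0])
    K = [[sum(X[a][j] * X[b][j] for j in range(m)) for b in range(n)]
         for a in range(n)]
    v = [float(i + 1) for i in range(n)]  # arbitrary non-zero starting vector
    for _ in range(iterations):
        w = [sum(K[a][b] * v[b] for b in range(n)) for a in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# Toy data: rows 0-2 are from one hypothetical subpopulation, rows 3-5 from another.
G = [
    [0, 0, 1, 2],
    [0, 1, 0, 2],
    [1, 0, 0, 1],
    [2, 2, 1, 0],
    [2, 1, 2, 0],
    [1, 2, 2, None],
]
scores = first_pc_scores(normalize_genotypes(G))
print([round(s, 3) for s in scores])
```

On this toy matrix the first-axis scores of the two subpopulations take opposite signs, which is the separation that the covariates in step 6 are meant to absorb.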

Patterson et al. (2006, PLoS Genet 2: e190)
Earlier, more general structure: Zhang et al. (2003, Gen Epi 24: 44)

Identifying PC axis types

[Figure: PC individual scores, PC 1 vs. PC 2. EIGENSTRAT applied to genomewide SNP data typed in two populations. Black = Munich controls, red = Munich schizophrenia cases, green = Aberdeen controls, blue = Aberdeen schizophrenia cases.]

[Figure: SNP loadings, PC1, with the PC1 SNP loading distribution: the whole genome contributes.]

[Figure: SNP loadings, PC1, with the PC1 SNP loading Q-Q plot: the whole genome contributes.]

[Figure: SNP loadings, PC2: only part of the genome contributes. PC2 is driven by a known ~4 Mb inversion polymorphism on Chr8.]

[Figure: characteristic LD pattern revealed by SNP loadings.]

[Figure: PC axis types revealed by SNP loading Q-Q plots in the Illumina iControl dataset.]

Extended EIGENSTRAT procedure corrects for local LD
  1) Known high-LD regions excluded
  2) SNPs thinned using LD criterion
      • $r^2 < 0.2$
      • Window size = 1500 contiguous SNPs
      • Step size = 150
  3) Each SNP regressed on the previous 5 SNPs, and the residual entered into the PCA analysis
  4) Iterative removal of outlier SNPs and/or outlier individuals
  5) Nomination of axes to use as covariates based on Tracy-Widom statistics
  6) Enter significant PC axes as covariates in a logistic or linear regression:
      $\text{Phenotype} = g(\text{const.} + \beta \cdot \text{covariates} + \gamma \cdot \text{SNP}_j\ \text{genotype}) + \varepsilon$

Guidance on use of EIGENSTRAT

Phase change in ability to detect structure: $F_{ST} = 1/\sqrt{nm}$ (Patterson et al. 2006, PLoS Genet 2: e190)

Number of SNPs needed for EIGENSTRAT to work

[Figure: N=1000, FST=0.01, =0.0001, lactase-type SNPs; Price et al. (2006, Nat Genet 38: 904).]

Take-home messages
  • EIGENSTRAT works very well with >2000 SNPs

  • Clinal/admixture model seems to work well in practice
  • Other, more computationally demanding methods don't achieve huge power increases
  • Genomic Control works well with <200 SNPs
      • Still has a place in smaller studies (GWAS replication, candidate gene)
      • Also copes with mismatched case/control designs (e.g. centralized control resources)

PLINK Practical

Genomic control

[Figure: the $\chi^2$ statistic and its expectation $E(\chi^2)$ at the test locus and at unlinked null markers, shown under no stratification and under stratification; under stratification, adjust the test statistic.]

Structured association

[Figure: LD observed under stratification between unlinked null markers, with Subpopulation A and Subpopulation B.]

Identity-by-state (IBS) sharing

[Figure: genotypes at a series of loci for a pair from the same population (Individual 1, Individual 2) and for a pair from different populations (Individual 3, Individual 4), with the number of alleles shared IBS (0, 1 or 2) at each locus. For example, A/C vs. C/C gives IBS 1, G/G vs. T/T gives IBS 0, and A/G vs. A/G gives IBS 2.]

Empirical assessment of ancestry
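Empirical assessment of ancestry starts from pairwise IBS sharing of the kind illustrated above. As a rough sketch, with hypothetical function names and a handful of genotypes chosen for illustration, the per-locus count and a pairwise average can be computed as follows.

```python
def ibs_count(genotype_a, genotype_b):
    """Number of alleles (0, 1 or 2) shared identical-by-state at one locus.

    Genotypes are strings such as 'A/C'; allele order does not matter.
    """
    remaining = genotype_b.split("/")
    shared = 0
    for allele in genotype_a.split("/"):
        if allele in remaining:
            remaining.remove(allele)  # each allele can be matched only once
            shared += 1
    return shared


def mean_ibs(person_a, person_b):
    """Average IBS count over the loci typed in both individuals."""
    counts = [ibs_count(a, b) for a, b in zip(person_a, person_b)]
    return sum(counts) / len(counts)


# Illustrative genotypes only: one pair that shares more, one that shares less.
same_population = (["A/C", "G/T", "A/G"], ["C/C", "T/T", "A/G"])
different_population = (["A/C", "G/G", "A/A"], ["C/C", "T/T", "G/G"])

print(mean_ibs(*same_population))
print(mean_ibs(*different_population))
```

Pairs drawn from the same population tend to have the higher average, and clustering methods operate on the matrix of such pairwise values.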

[Figure: Han Chinese and Japanese samples. Panels: complete-linkage IBS-based hierarchical clustering; multidimensional scaling plot (~10K random SNPs).]

Population stratification: LD pruning

Perform LD-based pruning:

    plink --bfile example --indep 50 5 2

  • 50: window size in SNPs
  • 5: number of SNPs to shift the window
  • 2: VIF threshold

Spawns two files: plink.prune.in (SNPs to be kept) and plink.prune.out (SNPs to be removed).

PLINK tutorial, October 2006; Shaun Purcell, [email protected]

Population stratification: Genome file

    plink --bfile example --genome --extract plink.prune.in

  • Generates plink.genome
  • Extracts only the LD-pruned SNPs from the previous command
  • The genome file that is created is the basis for all subsequent population-based comparisons

Population stratification: IBS clustering

Perform IBS-based cluster analysis for 2 clusters:

    plink --bfile example --cluster --K 2 --extract plink.prune.in --read-genome plink.genome

  • In this case, we are reading the genome file we generated
  • Clustering can be constrained in a number of other ways: cluster size, phenotype, external matching criteria, patterns of missing data, test of absolute similarity between individuals

Population stratification: MDS plotting

    plink --bfile example --cluster --mds-plot 4 --K 2 --extract plink.prune.in --read-genome plink.genome

  • --cluster tells plink to run cluster analysis
  • --mds-plot 4 calculates 4 MDS axes of variation, similar to PCA
  • Including the --K 2 option supplies the clustering solution in the MDS plot file
  • We will now use R to visualize the MDS plots

Plotting the results in R

CHANGE DIR: this is the menu item you must use to change where the simulated data will be placed. Note that you must have the R console highlighted. [Screenshot: dialog box.] Either type the path name or browse to where you saved plink.mds.

Running the R script

SOURCE R CODE: this is where we load the R program that simulates data. [Screenshot: source code selection.] This is the file rprog.R for the source code.
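The contents of rprog.R are not reproduced here. As a stand-in for inspecting the MDS output without R, the sketch below reads MDS results into per-cluster coordinate lists. It assumes the file is whitespace-delimited with a header of the form FID IID SOL C1 C2 ..., where SOL is the cluster assignment; check this against your own plink.mds. The function names and sample values are hypothetical.

```python
def read_mds(lines):
    """Parse MDS output lines into a list of dicts (assumed columns: FID IID SOL C1 C2 ...)."""
    rows = [line.split() for line in lines if line.strip()]
    header, body = rows[0], rows[1:]
    records = []
    for fields in body:
        record = dict(zip(header, fields))
        record["SOL"] = int(record["SOL"])      # cluster assignment
        for name in header[3:]:
            record[name] = float(record[name])  # MDS coordinates
        records.append(record)
    return records


def group_by_cluster(records):
    """Collect (C1, C2) coordinates per cluster, ready for a scatter plot."""
    clusters = {}
    for record in records:
        clusters.setdefault(record["SOL"], []).append((record["C1"], record["C2"]))
    return clusters


# Invented sample in the assumed layout; real use would pass an open file instead.
sample = """\
 FID IID SOL C1 C2
 F1 I1 0 0.05 0.01
 F2 I2 0 0.04 -0.02
 F3 I3 1 -0.06 0.00
""".splitlines()

clusters = group_by_cluster(read_mds(sample))
for label, points in sorted(clusters.items()):
    print(label, points)
```

For a real run, something like `with open("plink.mds") as handle: records = read_mds(handle)` would replace the in-memory sample, and the grouped coordinates could then be passed to any plotting tool.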
