TIPP: Taxon Identification and Phylogenetic Profiling

Ultra-large Multiple Sequence Alignment Tandy Warnow Founder Professor of Engineering The University of Illinois at Urbana-Champaign http://tandy.cs.illinois.edu Phylogeny (evolutionary tree) Orangutan From the Tree of the Life Website, University of Arizona Gorilla Chimpanzee Human

Phylogenies and Applications Basic Biology: How did life evolve? Applications of phylogenies to: protein structure and function population genetics human migrations metagenomics Hard Computational Problems NP-hard problems Large datasets 100,000+ sequences thousands of genes Big data complexity: model misspecification fragmentary sequences errors in input data

streaming data Phylogeny Problem U AGGGCAT V W TAGCCCA X TAGACTT Y

TGCACAA X U Y V W TGCGCTT Much is known about this problem from a mathematical and empirical viewpoint U AGGGCAT V

W TAGCCCA X TAGACTT Y TGCACAA X U Y V

W TGCGCTT However U V W AGGGCATGA AGAT X TAGACTT

Y TGCACAA X U Y V W TGCGCTT Indels (insertions and deletions) Deletion Mutation

ACGGTGCAGTTACCA ACCAGTCACCA Deletion Substitution ACGGTGCAGTTACCA Insertion ACCAGTCACCTA ACGGTGCAGTTACC-A AC----CAGTCACCTA The true multiple alignment Reflects historical substitution, insertion, and deletion events Defined using transitive closure of pairwise alignments computed on edges of the true tree

Input: unaligned sequences S1 S2 S3 S4 = = = = AGGCTATCACCTGACCTCCA TAGCTATCACGACCGC TAGCTGACCGC TCACGACCGACA Phase 1: Alignment S1

S2 S3 S4 = = = = AGGCTATCACCTGACCTCCA TAGCTATCACGACCGC TAGCTGACCGC TCACGACCGACA S1 S2 S3 S4

= = = = -AGGCTATCACCTGACCTCCA TAG-CTATCAC--GACCGC-TAG-CT-------GACCGC--------TCAC--GACCGACA Phase 2: Construct tree S1 S2 S3 S4 = = = =

AGGCTATCACCTGACCTCCA TAGCTATCACGACCGC TAGCTGACCGC TCACGACCGACA S1 S4 S1 S2 S3 S4 S2 S3 =

= = = -AGGCTATCACCTGACCTCCA TAG-CTATCAC--GACCGC-TAG-CT-------GACCGC--------TCAC--GACCGACA Phylogenomic pipeline Select taxon set and markers Gather and screen sequence data, possibly identify orthologs Compute multiple sequence alignments for each locus, and construct gene

trees Compute species tree or network: Combine the estimated gene trees, OR Estimate a tree from a concatenation of the multiple sequence alignments Get statistical support on each branch (e.g., bootstrapping) Estimate dates on the nodes of the phylogeny

Use species tree with branch support and dates to understand biology Phylogenomic pipeline Select taxon set and markers Gather and screen sequence data, possibly identify orthologs Compute multiple sequence alignments for each locus, and construct gene trees Compute species tree or network:

Combine the estimated gene trees, OR Estimate a tree from a concatenation of the multiple sequence alignments Get statistical support on each branch (e.g., bootstrapping) Estimate dates on the nodes of the phylogeny Use species tree with branch support and dates to understand biology Multiple Sequence Alignment (MSA): a scientific grand challenge1 S1 = AGGCTATCACCTGACCTCCA

S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC Sn = TCACGACCGACA S1 S2 S3 Sn = -AGGCTATCACCTGACCTCCA = TAG-CTATCAC--GACCGC-= TAG-CT-------GACCGC-= -------TCAC--GACCGACA Novel techniques needed for scalability and accuracy NP-hard problems and large datasets Current methods do not provide good accuracy Few methods can analyze even moderately large datasets Many important applications besides phylogenetic estimation

1 Frontiers in Massive Data Analysis, National Academies Press, 2013 This talk Big data multiple sequence alignment SAT (Science 2009, Systematic Biology 2012) and PASTA (RECOMB and J Comp Biol 2015), methods for co-estimation of alignments and trees UPP (Genome Biology 2015): ultra-large multiple sequence alignment, using the Ensemble of HMMs technique. Evaluating BAli-Phy on biological and simulated datasets First Align, then Compute the Tree S1 S2 S3

S4 = = = = AGGCTATCACCTGACCTCCA TAGCTATCACGACCGC TAGCTGACCGC TCACGACCGACA S1 S4 S1 S2 S3

S4 S2 S3 = = = = -AGGCTATCACCTGACCTCCA TAG-CTATCAC--GACCGC-TAG-CT-------GACCGC--------TCAC--GACCGACA Simulation Studies S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC

S4 = TCACGACCGACA Unaligned Sequences S1 = AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-S3 = TAG-CT-------GACCGC-S4 = -------TCAC--GACCGACA S1 S2 S4 S3 True tree and alignment Compare S1 = AGGCTATCACCTGACCTCCA

S2 = TAG-CTATCAC--GACCGC-S3 = TAG-C--T-----GACCGC-S4 = T---C-A-CGACCGA----CA S1 S4 S2 S3 Estimated tree and alignment FN FN: false negative (missing edge) FP: false positive (incorrect edge) FP 50% error rate

Two-phase estimation Alignment methods Clustal POY (and POY*) Probcons (and Probtree) Probalign MAFFT Muscle Di-align T-Coffee Prank (PNAS 2005, Science 2008) Opal (ISMB and Bioinf. 2007) FSA (PLoS Comp. Bio. 2009) Infernal (Bioinf. 2009) Etc. Phylogeny methods

Bayesian MCMC Maximum parsimony Maximum likelihood Neighbor joining FastME UPGMA Quartet puzzling Etc. RAxML: heuristic for large-scale ML optimization 1000-taxon models, ordered by difficulty (Liu et al., 2009) SAT Family of methods Iterative divide-and-conquer methods Each iteration re-aligns the sequences using the current tree, running preferred MSA methods on small local subsets, and merging subset alignments Each iteration computes an ML tree on the current

alignment, under the GTR (Generalized Time Reversible) Markov model of evolution Note: these methods are MSA boosters, designed to improve accuracy and/or scalability of the base method We show results using MAFFT-l-ins-i to align subsets Re-aligning on a tree A B C D Decompose dataset

Estimate ML tree on merged alignment A B C D Align subsets ABCD A

B C D Merge subalignments SAT and PASTA Algorithms Obtain initial alignment and estimated ML tree Tree Use tree to compute new alignment Estimate ML tree on new alignment

Alignment Repeat until termination condition, and return the alignment/tree pair with the best ML score SAT-1 (Science 2009) performance 1000-taxon models, ordered by difficulty rate of evolution generally increases from left to right SAT-1 24 hour analysis, on desktop machines (Similar improvements for biological datasets) SAT-1 can analyze up to about 8,000 sequences. SAT-1 and SAT-2 (Systematic Biology, 2012) SAT-1: up to 8K SAT-2: up to ~50K 1000-taxon models ranked by difficulty

SAT variants differ only in the decomposition strategy A B C D Decompose dataset Estimate ML tree on merged alignment

A B C D Align subsets ABCD A B C

D Merge subalignments PASTA merging: Step 1 C D B A Compute a spanning tree connecting alignment subsets E PASTA merging: Step 2 C

CD CD BD D AB AB B BD DE A Use Opal (or Muscle) to merge adjacent subset alignments in the spanning tree

DE E PASTA merging: Step 3 C AB + BD = ABD ABD + CD = ABCD ABCD + DE = ABCDE CD BD D AB B DE A

Use transitivity to merge all pairwise-merged alignments from Step 2 into final an alignment on entire dataset Overall: O(n log(n) + L) E 1kp: Thousand Transcriptome Project G. Ka-Shu Wong U Alberta J. Leebens-Mack U Georgia N. Wickett Northwestern N. Matasci iPlant

T. Warnow, UIUC S. Mirarab, UT-Austin N. Nguyen, UT-Austin Plus many many other people First study (Wickett, Mirarab, et al., PNAS 2014) had ~100 species and ~800 genes, gene trees and alignments estimated using SAT, and a coalescent-based species tree estimated using ASTRAL Second study: Plant Tree of Life based on transcriptomes of ~1200 species, and more than 13,000 gene families (most not single copy)

Gene Tree Incongruence Challenges: Species tree estimation from conflicting gene trees Gene tree estimation of datasets with > 100,000 sequences Mean:317 Median:266 12000 10000 Counts 8000 6000

4000 2000 0 0 500 1000 Length 1500 2000 1KP dataset: more than 100,000 p450 amino-acid

sequences, many fragmentary Mean:317 Median:266 12000 10000 1KP dataset: more than 100,000 p450 amino-acid sequences, many fragmentary Counts 8000 6000

All standard multiple sequence alignment methods we tested performed poorly on datasets with fragments. 4000 2000 0 0 500 1000 Length

1500 2000 1kp: Thousand Transcriptome Project G. Ka-Shu Wong U Alberta J. Leebens-Mack U Georgia N. Wickett Northwestern N. Matasci iPlant T. Warnow, UIUC

S. Mirarab, UT-Austin N. Nguyen UT-Austin Plus many many other people Plant Tree of Life based on transcriptomes of ~1200 species More than 13,000 gene families (most not single copy) Gene Tree Incongruence Challenge: Alignment of datasets with > 100,000 sequences with many fragmentary sequences

UPP UPP = Ultra-large multiple sequence alignment using Phylogeny-aware Profiles Nguyen, Mirarab, and Warnow. Genome Biology, 2014. Purpose: highly accurate large-scale multiple sequence alignments, even in the presence of fragmentary sequences. UPP UPP = Ultra-large multiple sequence alignment using Phylogeny-aware Profiles Nguyen, Mirarab, and Warnow. Genome Biology, 2014. Purpose: highly accurate large-scale multiple sequence alignments, even in the presence of fragmentary sequences. Uses an ensemble of HMMs Simple idea (not UPP) Select random subset of sequences, and build backbone alignment

Construct a Hidden Markov Model (HMM) on the backbone alignment Add all remaining sequences to the backbone alignment using the HMM One Hidden Markov Model for the entire alignment? Simple idea (not UPP) Select random subset of sequences, and build backbone alignment Construct a Hidden Markov Model (HMM) on the backbone alignment Add all remaining sequences to the backbone alignment using the HMM This approach works well if the dataset is small and has low evolutionary rates, but is not very accurate otherwise.

Select random subset of sequences, and build backbone alignment Construct a Hidden Markov Model (HMM) on the backbone alignment Add all remaining sequences to the backbone alignment using the HMM One Hidden Markov Model for the entire alignment? HMM 1 Or 2 HMMs? HMM 1 HMM 2 Or 4 HMMs?

HMM 1 HMM 3 HMM 2 HMM 4 Or all 7 HMMs? HMM 1 HMM 2 HMM 4 m HMM 7

HMM 3 HMM 5 HMM 6 UPP Algorithmic Approach 1. Select random subset of full-length sequences, and build backbone alignment 2. Construct an Ensemble of Hidden Markov Models on the backbone alignment 3. Add all remaining sequences to the backbone alignment using the Ensemble of HMMs Evaluation Simulated datasets (some have fragmentary sequences): 10K to 1,000,000 sequences in RNASim complex RNA sequence evolution simulation 1000-sequence nucleotide datasets from SAT papers 5000-sequence AA datasets (from FastTree paper)

10,000-sequence Indelible nucleotide simulation Biological datasets: Proteins: largest BaliBASE and HomFam RNA: 3 CRW datasets up to 28,000 sequences RNASim: alignment error All methods given 24 hrs on a 12-core machine Note: Mafft was run under default settings for 10K and 50K sequences and under Parttree for 100K sequences, and fails to complete under any setting For 200K sequences. Clustal-Omega only completes on 10K dataset. RNASim: tree error All methods given 24 hrs

on a 12-core machine Note: Mafft was run under default settings for 10K and 50K sequences and under Parttree for 100K sequences, and fails to complete under any setting For 200K sequences. Clustal-Omega only completes on 10K dataset. RNASim Million Sequences: alignment error Notes: We show alignment error using average of SP-FN and SP-FP. UPP variants have better alignment scores than PASTA.

(Not shown: Total Column Scores PASTA more accurate than UPP) No other methods tested could complete on these data PASTA under-aligns: its alignment is 43 times wider than true alignment (~900 Gb of disk space). UPP RNASim Million Sequences: tree error Using 12 processors:

UPP(Fast,NoDecomp) took 2.2 days, UPP(Fast) took 11.9 days, and PASTA took 10.3 days UPP vs. PASTA: impact of fragmentation Under high rates of evolution, PASTA is badly impacted by fragmentary sequences (the same is true for other methods). Under low rates of evolution, PASTA can still be highly accurate (data not shown). UPP continues to have good accuracy even on datasets with many fragments under all rates of evolution.

Performance on fragmentary datasets of the 1000M2 model condition Wall clock align time (hr) UPP Running Time 15 10 5 0 50000

100000 Number of sequences 150000 UPP(Fast) Wall-clock time used (in hours) given 12 processors 200000 Co-estimation would be much better!!! S1 S2 S3 S4

= = = = AGGCTATCACCTGACCTCCA TAGCTATCACGACCGC TAGCTGACCGC TCACGACCGACA S1 S4 S1 S2 S3 S4

S2 S3 = = = = -AGGCTATCACCTGACCTCCA TAG-CTATCAC--GACCGC-TAG-CT-------GACCGC--------TCAC--GACCGACA What about BAli-Phy? BAli-Phy (Redelings and Suchard): leading method for statistical co-estimation of alignments and trees Like Bayesian phylogeny estimation, it is expected to be the most rigorous and accurate technique for estimating trees and alignments!

BAli-Phy: Better than PASTA! Total-Column Score 40% Alignment Accuracy (TC score) MAFFT PASTA 30% BAli-Phy 20%

10% 0% # Taxa: Simulator 100 Indelible (DNA) 200 100 200 RNAsim (RNA)

Simulated nucleotide datasets with 100 or 200 sequences (unpublished data from Mike Nutes PhD *Averages overdissertation). 10 replicates But: BAli-Phy is limited to small datasets From www.bali-phy.org/README.html, 5.2.1. Too many taxa? BAli-Phy is quite CPU intensive, and so we recommend using 50 or fewer taxa in order to limit the time required to accumulate enough MCMC samples. (Despite this recommendation, data sets with more than 100 taxa have occasionally been known to converge.) We recommend initially pruning as many taxa as possible from your data set, then adding some back if the MCMC is not too slow.

Re-aligning on a tree A B C D Decompose dataset Estimate ML tree on merged alignment ABCD

A B C D Align subsets: MAFFT A B C D

Merge subalignments Re-aligning on a tree A B C D Decompose dataset Estimate ML tree on merged alignment

ABCD A B C D Align subsets: BAli-Phy?? A B C

D Merge subalignments Results on 1000-sequence datasets (Comparing default PASTA to PASTA+BAli-Phy) Decomposition to 100-sequence subsets, one iteration of PASTA+BAli-Phy Results on 10,000-sequence datasets (Comparing UPP variants where the backbone alignment is computed using either default PASTA or PASTA+BAli-Phy)

Benchmarking Statistical Multiple Sequence Alignment Nute, Saleh, and Warnow 2018 Systematic Biology syy068, 2018, doi:10.1093/sysbio/syy068. Study design Goal: Evaluate Bali-Phy (Redelings and Suchard) on both biological and simulated datasets, in comparison to leading alignment methods on small protein sequence datasets (at most 27 sequences) Metrics: Modeller score (precision), SP-score (recall), Expansion ratio (normalized alignment length), and running time Datasets: 120 simulated datasets (6 model conditions) and biological datasets (4 biological benchmarks) 1192 Specific note: For each dataset, Bali-Phy was run independently on 32 processors for 48 hours, the burn-in was discarded, and the posterior decoding (PD) alignment was then computed.

These Bali-Phy analyses used 230 CPU years on Blue Waters (supercomputer at NCSA). Modeler vs SP-Score on 120 Simulated Datasets BAli-Phy is best! Systematic Biology, Volume 68, Issue 3, May 2019, Pages 396411, https://doi.org/10.1093/sysbio/syy068 The content of this slide may be subject to copyright: please see the slide notes for details. Clustal Method PROBCONS Probalign PRIME

PRANK MUSCLE MAFFT GINSi Indel_Rate: Med Clustal BAliPhy PROBCONS Probalign PRIME

PRANK MUSCLE MAFFT GINSi Indel_Rate: Low Clustal BAliPhy PROBCONS Probalign PRIME PRANK

MUSCLE 1.2 1.0 0.8 Mutation_Rate: Low MAFFT GINSi 1.2 1.0 0.8 Mutation_Rate: High

BAliPhy Expansion Ratio Expansion Ratios on 120 Simulated Datasets Indel_Rate: High BAli-Phy is best! Modeler score vs SP-score on 1192 biological datasets T-Coffee and PROMALS are best! BAli-Phy good for Modeler score, but not so good for SP-Score (e.g., MAFFT better) Systematic Biology, Volume 68, Issue 3, May 2019, Pages 396411, https://doi.org/10.1093/sysbio/syy068

The content of this slide may be subject to copyright: please see the slide notes for details. Modeler Score on 1192 Biological datasets BAli-Phy has the best modeler score SP-score on 1192 Biological datasets BAli-Phy not competitive for SP-score (but best method depends on % ID) Expansion Ratio on 1192 Biological datasets BAli-Phy under-aligns Running Time on 4 biological datasets with 17 sequences each BAli-Phy benefits from a long running time.

Therefore, we used >2 months for each dataset. Observations Bali-Phy is much more accurate than all other methods on simulated datasets Bali-Phy is generally less accurate than the top half of these methods on biological datasets, especially with respect to SP-score (recall) Average percent pairwise ID impacts all the measures of accuracy for all methods, and changes relative performance We do not know why there is a difference in accuracy. Most likely not an issue of failure of the MCMC analyses to converge (48 hours, 32 processors, small numbers of sequences). Possible explanations: 1. Model misspecification (proteins dont evolve under the Bali-Phy model) 2. Structural alignments and evolutionary alignments are different

3. The structural alignments are not correct (perhaps over-aligned) All these explanations are likely true, but the relative contributions are unknown. Final comments MSA is challenging, but algorithmic techniques can improve accuracy and scalability: Dataset size can be addressed using good divide-andconquer approaches. Heterogeneity in sequence length can be addressed using local alignment approaches, such as profile HMMs, with ensembles of profile HMMs providing improved accuracy. Yet the differences between performance on biological and simulated datasets is troubling. The Tree of Life: Multiple Challenges Scientific challenges:

Ultra-large multiple-sequence alignment Gene tree estimation Metagenomic classification Alignment-free phylogeny estimation Supertree estimation Estimating species trees from many gene trees Genome rearrangement phylogeny Reticulate evolution Visualization of large trees and alignments

Data mining techniques to explore multiple optima Theoretical guarantees under Markov models of evolution Techniques: applied probability theory, graph theory, supercomputing, and heuristics Testing: simulations and real data Acknowledgments PASTA and UPP: Nam Nguyen (now postdoc at UIUC) and Siavash Mirarab (now faculty at UCSD), undergrad: Keerthana Kumar (at UT-Austin) PASTA+BAli-Phy: Mike Nute (PhD student at UIUC) Evaluating BAli-Phy: Mike Nute and Ehsan Saleh (PhD students at UIUC) Current NSF grants: ABI-1458652 (multiple sequence alignment) Grainger Foundation (at UIUC), and UIUC TACC, UTCS, Blue Waters, and UIUC campus cluster PASTA, UPP, SEPP, and TIPP are available on github at https://github.com/smirarab/; see also PASTA+BAli-Phy at http://github.com/MGNute/pasta Papers available at http://tandy.cs.illinois.edu/MSAproject.html

Recently Viewed Presentations

  • BSC OISD AWARD IMS ISO 9001 ISO 14001

    BSC OISD AWARD IMS ISO 9001 ISO 14001

    BSC IMS ISO 9001 ISO 14001 OHSAS 18001 OISD AWARD SARVASHRESHTA SAFETY AWARD SWORD OF HONOR FIVE STAR By : Sukant Ku. Singh, Manager(F&S), GAIL- Vijaipur, MP.
  • Empire Building in Africa

    Empire Building in Africa

    Ali began a number of reforms to modernize Egypt. He established schools, small scale industrialization & he created a modern military. ... Cecil Rhodes, an entrepanuer who had made a fortune off of Gold & Diamond mines in South Africa...
  • The Georgia State University copyright case (Cambridge ...

    The Georgia State University copyright case (Cambridge ...

    For prose items, "Either a complete article, story or essay of less than 2,500 words, or (b) an excerpt from any prose work of not more than 1,000 words or 10% of the work, whichever is less, but in any...
  • The Science of Water

    The Science of Water

    Water in our atmosphere helps to keep the planet warm. Our bodies are composed of ~70% dependent on water. Although a person can live without food for more than a month, a person can only live without water for approximately...
  • Diapositive 1 - WordPress.com

    Diapositive 1 - WordPress.com

    1.5. Le toucher. Le toucher créé la familiarité. Etudes : Lorsqu'un client est touché par un vendeur, il développe une attitude plus positive vis-à-vis du magasin (cet effet est plus important chez les clientes).
  • CRAYFISH DISSECTION - local.brookings.k12.sd.us

    CRAYFISH DISSECTION - local.brookings.k12.sd.us

    crayfish dissection integumentary/skeletal exoskeleton carapace (exoskeleton that covers cephalothorax) molting respiratory system respiratory system principle of diffusion going from an area of high concentration to low concentration circulatory system circulatory system function: to carry oxygen to the tissues and carry...
  • PowerPoint Presentation

    PowerPoint Presentation

    Word recall, the dependent measure, was hypothesized to be affected only by orienting task (Craik & Lockhart, 1972). A more specific last sentence might be: Word recall, the dependent measure, was hypothesized to only be a function of orienting task...
  • PowerPoint Presentation

    PowerPoint Presentation

    RCC or FCC- Accuplacer will be administered at RHS. CSU or UC - SAT and ACT exams must be taken by December 2016. ... UNX, UNV, TRU, UIS. 90% attendance to walk at graduation. Walking is a privilege. If you...