Microbiome Analysis: from sample to data
MGL Users Group
June 18, 2014

Assaying Microbial Content
One of the most common approaches is to sequence 16S ribosomal RNA amplicons.
Another option is shotgun sequencing of the community, assembling the sequences, and assigning the identified genes to metabolic pathways.
If a finer level of detail is required, 16S is most often sequenced first, followed by generalized sequencing for finer species resolution (or validation).

16S rRNA Sequencing
Databases of all known 16S sequences have been compiled (Silva, GreenGenes, others).
Sequence either targeted amplicons of the variable regions or the whole 16S gene.
Isolate gDNA and PCR amplify using universal 16S primers.
[Figure: 16S primers; primer pair 1, primer pair 2, primer pair 3]

16S rRNA Sequencing
Shear the amplicon using Covaris focused acoustics.

The Microbiome Library
Libraries have adaptor sequences at both ends, used for PCR and sequencing priming.
P1 is the universal forward primer sequence.

P2 has an embedded barcode sequence.
Between the two adapter ends is the DNA insert, which is sequenced from the P1 forward and barcode regions (green arrows).
Note: The adapter sequences differ from Illumina's, which matters if other preparations are to be adapted to this platform.

Bead Preparation from Libraries
The pool of libraries is subjected to emulsion PCR to populate beads.
Oil micro-reactors are titrated such that each bead is populated by a single template.
Unpopulated beads are removed in subsequent cleanup.

16S rRNA Sequencing
Nick translate, amplify, quantitate.

Bead Preparation from Libraries
Slide deposition of enriched beads.
Beads are flowed into, and then adhered to, the FlowChip lanes.
Optimum density is 160 million beads per lane.

16S rRNA Sequencing
ABI SOLiD 5500xl

16S rRNA Sequencing
The resulting library is sequenced.
We sequence 75 bp from one end using Exact Call Chemistry (this kind of study is most commonly done on a long-read platform such as 454 or MiSeq).

We generate millions of reads (most studies generate thousands).
Reads are aligned to the database of 16S sequences, to the finest possible level of resolution.
We keep only uniquely aligned reads.

Data Analysis - OTUs
Sequences are often reported as OTUs (Operational Taxonomic Units).
Because related 16S sequences share high levels of identity, an identity threshold is typically applied and similar sequences are collapsed into a single OTU sequence (commonly at 97% identity).
As a result, the level of taxonomic resolution for individual OTU sequences can vary, even at the same identity threshold.

OTU examples

OTU ID   Taxonomy
367523   k__Bacteria; p__Bacteroidetes; c__Flavobacteriia; o__Flavobacteriales; f__Flavobacteriaceae; g__Flavobacterium; s__
187144   k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__; g__; s__
836974   k__Bacteria; p__Cyanobacteria; c__Chloroplast; o__Cercozoa; f__; g__; s__
310669   k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__; g__; s__
823916   k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pseudomonadales; f__Moraxellaceae; g__Enhydrobacter; s__
878161   k__Bacteria; p__Acidobacteria; c__Acidobacteriia; o__Acidobacteriales; f__Acidobacteriaceae; g__Terriglobus; s__
3064251  k__Bacteria; p__Verrucomicrobia; c__Opitutae; o__Puniceicoccales; f__Puniceicoccaceae; g__Puniceicoccus; s__
1138555  k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Caldicoprobacteraceae; g__Caldicoprobacter; s__
3918     k__Bacteria; p__Spirochaetes; c__Spirochaetes; o__Spirochaetales; f__Spirochaetaceae; g__Treponema; s__
339472   k__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Rhodospirillales; f__Rhodospirillaceae; g__; s__
4457583  k__Bacteria; p__; c__; o__; f__; g__; s__

(k=kingdom; p=phylum; c=class; o=order; f=family; g=genus; s=species)
OTUs typically lack species-level resolution (as seen in the example subset), but some get down to order, family, or even genus.
A few are not identifiable at all.

Example using a public dataset, treated to simulate short 75 bp random reads
3 simulations on the same dataset
[Figure panels: TestLib1, TestLib2, TestLib3]

Gut microbiota
Primarily comprised of Firmicutes and Bacteroidetes.
The balance between these populations has been linked to obesity in mice.

Initial trial run
Sample                      E1          E2          E3          E4
Input reads         17,084,015  14,509,122  14,296,175  15,206,899
Trimmed, surviving  11,284,983  10,057,220   9,596,474  10,485,923
  %                         66          69          67          69
Dropped              5,799,032   4,451,902   4,699,701   4,720,976
  %                         34          31          33          31
Aligned             11,199,365   9,975,097   9,526,501  10,395,189
  %                         99          99          99          99
Uniquely aligned     2,078,178   2,275,014   1,890,678   1,265,548
  %                         18          23          20          12

Initial trial run
#OTU     Kingdom   Phylum           Class                  Order               Family               Genus          Species            E1         E2         E3         E4      TOTAL
3013444  Bacteria  Bacteroidetes    Bacteroidia            Bacteroidales       -                    -              -             952,690  1,157,324    153,818     57,347  2,321,179
1104817  Bacteria  Firmicutes       Clostridia             Clostridiales       -                    -              -                 298        123    272,922     53,820    327,163
275139   Bacteria  Firmicutes       Clostridia             Clostridiales       -                    -              -                 237        115    167,229    155,815    323,396
1571092  Bacteria  Firmicutes       Clostridia             Clostridiales       -                    -              -               2,103      1,974    183,397     89,837    277,311
193418   Bacteria  Proteobacteria   Deltaproteobacteria    Desulfovibrionales  Desulfovibrionaceae  -              -              87,501     79,909     18,387     25,968    211,765
1141335  Bacteria  Proteobacteria   Epsilonproteobacteria  Campylobacterales   Helicobacteraceae    Helicobacter   -              80,931     36,646     14,440     11,072    143,089
1844565  Bacteria  Firmicutes       Clostridia             Clostridiales       -                    -              -              10,006      2,718     23,696     66,709    103,129
4374042  Bacteria  Deferribacteres  Deferribacteres        Deferribacterales   Deferribacteraceae   Mucispirillum  schaedleri     10,548     63,496     14,904      8,756     97,704
306885   Bacteria  Bacteroidetes    Bacteroidia            Bacteroidales       -                    -              -              36,900     40,268      4,992      1,776     83,936
4405128  Bacteria  Cyanobacteria    4C0d-2                 YS2                 -                    -              -              14,367     37,964     19,056     10,872     82,259
381666   Bacteria  Cyanobacteria    4C0d-2                 YS2                 -                    -              -                  50         17     73,590      5,969     79,626
... (hundreds of lines)

Initial run results

4 mice run: 2 WT; 2 KO
Phylum resolution: N=7.5 million [Figure panels: E1, E2, E3, E4]
Class resolution: N=7.5 million
Order resolution: N=7.4 million
Family resolution: N=2.2 million
Genus resolution: N=1.1 million

Species resolution: N=340K (more complex)

For more information:
Website: mgl.nichd.nih.gov
List serv: MGL-USERS-L
Email: [email protected]
Phone: 301-402-4563
Walk-in: Bldg 10/Rm 9D41

Typical Approach
Use long reads from a whole, intact amplicon (a few thousand reads are typically used).
Perform trimming, remove chimeric sequences, join overlaps in paired ends, etc.
Compare sequences to a database through BLAST, or by comparison to a prepared multi-sequence alignment.
Compare / clean data.

Assign / resolve taxonomy, describe distribution.
Compare populations across conditions, etc. (statistical digging).

Alternative approach using short reads
Amplify the 16S gene or target amplicon as normal.
Randomly shear to construct a typical short-read library comprising random starts/ends.
Generate millions of reads.
Assign only reads that map unambiguously to OTUs, using short-read aligners.
Analyze normally from OTU populations.
A more wasteful approach, but in practice it performs just as well.
Utilizes higher-throughput instruments rather than lower-capacity long-read platforms.
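The random-shear step of the short-read approach can be mimicked in silico, in the spirit of the public-dataset example that was treated to simulate short 75 bp random reads. The following is only a minimal sketch: it assumes uniformly random start positions, a fixed 75 bp read length, a single strand, and no sequencing errors. The function name `simulate_sheared_reads` and the random stand-in amplicon are hypothetical and are not part of the workflow described in the slides.

```python
import random


def simulate_sheared_reads(amplicon, n_reads, read_len=75, seed=0):
    """Draw fixed-length reads with uniformly random start positions.

    Simplifying assumptions: forward strand only, no sequencing errors,
    and every start position is equally likely.
    """
    if len(amplicon) < read_len:
        raise ValueError("amplicon is shorter than the read length")
    rng = random.Random(seed)  # seeded so the simulation is repeatable
    last_start = len(amplicon) - read_len
    reads = []
    for _ in range(n_reads):
        start = rng.randint(0, last_start)  # inclusive on both ends
        reads.append((start, amplicon[start:start + read_len]))
    return reads


# Hypothetical 300 bp stand-in for an amplicon (not a real 16S sequence).
base_rng = random.Random(1)
amplicon = "".join(base_rng.choice("ACGT") for _ in range(300))

reads = simulate_sheared_reads(amplicon, n_reads=1000)
print(len(reads), "simulated reads of length", len(reads[0][1]))
```

Each returned tuple keeps the start position alongside the read, which makes it easy to check coverage across the amplicon before feeding simulated reads to an aligner.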

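The rule of keeping only reads that map unambiguously can be sketched as a small counting step. The example assumes the aligner output has already been reduced to (read ID, OTU ID) pairs, one pair per reported hit, and that every hit of a multi-mapping read is listed. Real short-read aligners report multi-mapping in aligner-specific ways, so this input format, the function name `count_unique_hits`, and the toy read and OTU names are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def count_unique_hits(alignments):
    """Count reads per OTU, keeping only reads that hit exactly one OTU.

    alignments: iterable of (read_id, otu_id) pairs, one per reported hit.
    """
    hits = defaultdict(set)
    for read_id, otu_id in alignments:
        hits[read_id].add(otu_id)

    counts = Counter()
    for otus in hits.values():
        if len(otus) == 1:  # unambiguous: a single distinct OTU
            counts[next(iter(otus))] += 1
    return counts


# Toy input with hypothetical read and OTU names.
alignments = [
    ("read1", "OTU_A"),
    ("read2", "OTU_A"),
    ("read2", "OTU_B"),  # read2 hits two OTUs, so it is discarded
    ("read3", "OTU_B"),
    ("read4", "OTU_A"),
]
otu_counts = count_unique_hits(alignments)
for otu_id, n in sorted(otu_counts.items()):
    print(otu_id, n)
```

Discarding ambiguous reads is what makes the approach "wasteful": in the trial run, only 12 to 23% of reads were uniquely aligned.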

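Analyzing "from OTU populations" at a given taxonomic rank amounts to summing OTU counts by the name assigned at that rank and setting aside OTUs with no assignment there, which is why N shrinks from phylum to genus in the initial run results. The sketch below illustrates that collapse under stated assumptions: lineages are written in the prefix style of the OTU examples (k__, p__, ..., s__), the five rows and their TOTAL counts are taken from the trial-run OTU table, and blank table cells are treated as unassigned. The helper names `name_at_rank` and `collapse_to_rank` are hypothetical.

```python
from collections import Counter


def name_at_rank(lineage, rank):
    """Return the name at `rank` (e.g. "p", "g"), or "" if unassigned."""
    for field in lineage.split(";"):
        prefix, _, name = field.strip().partition("__")
        if prefix == rank:
            return name
    return ""


def collapse_to_rank(otu_table, rank):
    """Sum counts per name at `rank`; return (totals, unresolved count)."""
    totals = Counter()
    unresolved = 0
    for lineage, count in otu_table:
        name = name_at_rank(lineage, rank)
        if name:
            totals[name] += count
        else:
            unresolved += count  # OTU has no assignment at this rank
    return totals, unresolved


# Five rows from the trial-run OTU table (TOTAL column), rewritten as
# prefix-style lineages; blank cells are assumed to mean "unassigned".
otu_table = [
    ("k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__; g__; s__", 2321179),
    ("k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__; g__; s__", 327163),
    ("k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__; g__; s__", 323396),
    ("k__Bacteria; p__Proteobacteria; c__Epsilonproteobacteria; o__Campylobacterales; "
     "f__Helicobacteraceae; g__Helicobacter; s__", 143089),
    ("k__Bacteria; p__Deferribacteres; c__Deferribacteres; o__Deferribacterales; "
     "f__Deferribacteraceae; g__Mucispirillum; s__schaedleri", 97704),
]

phylum_totals, phylum_unresolved = collapse_to_rank(otu_table, "p")
genus_totals, genus_unresolved = collapse_to_rank(otu_table, "g")
print("phylum:", dict(phylum_totals), "unresolved:", phylum_unresolved)
print("genus:", dict(genus_totals), "unresolved:", genus_unresolved)
```

In this subset every read is resolved at the phylum level, while most reads drop out at the genus level, mirroring the pattern in the per-rank totals.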