The cSRA file format

example.csra

- A cSRA file contains a serialized file structure.
- It is a read-only archive file format, similar to a tar file.
- The tool kar can extract the directories and files.
- All tools in the sra-toolkit can access the data inside directly, without prior extraction.
- cSRA compresses the sequence data by replacing aligned base pairs with reference data.

Tables in example.csra: SEQUENCE, PRIMARY_ALIGNMENT, SECONDARY_ALIGNMENT, REFERENCE

- The cSRA archive contains the above 4 tables.
- The SEQUENCE table is mandatory for the archive.
- The PRIMARY_ALIGNMENT and SECONDARY_ALIGNMENT tables can be missing (if there is no aligned data in the archive).
- The PRIMARY_ALIGNMENT and SECONDARY_ALIGNMENT tables depend on the REFERENCE table.
- An archive that has ALIGNMENT tables but no REFERENCE table is broken.
- The SECONDARY_ALIGNMENT table can be missing (if the archive does not contain secondary aligned data).

The vdb-dump tool can display what tables are inside a cSRA archive.
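The table rules listed above can be written down as a small consistency check. This is only a sketch of those rules; the function name `check_csra_tables` and the wording of the messages are assumptions made for illustration and are not part of the sra-toolkit.

```python
# Sketch of the table rules stated above; names are hypothetical.
ALIGNMENT_TABLES = {"PRIMARY_ALIGNMENT", "SECONDARY_ALIGNMENT"}


def check_csra_tables(tables):
    """Return a list of rule violations for the given table names."""
    present = set(tables)
    problems = []
    # Rule: the SEQUENCE table is mandatory.
    if "SEQUENCE" not in present:
        problems.append("SEQUENCE table is missing")
    # Rule: alignment tables depend on the REFERENCE table.
    if present & ALIGNMENT_TABLES and "REFERENCE" not in present:
        problems.append("ALIGNMENT tables present but REFERENCE table is missing")
    return problems


if __name__ == "__main__":
    # An archive with aligned data but no secondary alignments is fine.
    print(check_csra_tables(["PRIMARY_ALIGNMENT", "REFERENCE", "SEQUENCE"]))
    # An archive with alignments but no reference is reported as broken.
    print(check_csra_tables(["SEQUENCE", "PRIMARY_ALIGNMENT"]))
```

The table names passed in would typically come from a table listing of the archive.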

$ vdb-dump example.csra -E
>>> enumerating the tables of database >example.csra<
tbl #1: PRIMARY_ALIGNMENT
tbl #2: REFERENCE
tbl #3: SEQUENCE

Tables in example.csra:

SEQUENCE: reassembles most of the data that came from the original spot
PRIMARY_ALIGNMENT: information about where and how a primary alignment occurs
SECONDARY_ALIGNMENT: information about where and how secondary alignments occur
REFERENCE: contains the reference locally or points to an external reference

[Figure: spots A, B and C in the SEQUENCE table pointing to rows in the primary (PRIM.) and secondary (SEC.) alignment tables]

- Spot A has 2 reads: both are primary aligned.
- Spot B has 2 reads: the 1st read has a primary and a secondary alignment, the 2nd read is primary aligned only.
- Spot C has 2 reads: the 1st read is primary aligned, the 2nd read is not aligned.

This slide shows how the SEQUENCE table points to data in the primary and secondary tables, with three different use cases.

The following slides explain the columns in the cSRA file format. The more important columns are highlighted and all other columns support those.

SEQUENCE part 1

ALIGNMENT_COUNT: vector of integers, how many alignments per read
CS_NATIVE: flag, to say that the sequence was produced in color-space
BASE_COUNT: how many bases are in the whole table
FIXED_SPOT_LEN: flag, set if all reads have the same length
BIO_BASE_COUNT: how many bases (excluding adapters) are in the whole table
MAX_SPOT_ID: id of the last spot in the table
CMP_BASE_COUNT: how many unaligned bases are in the whole table
MIN_SPOT_ID: id of the first spot in the table
CMP_READ: compressed read, only the unaligned reads
NAME: name of the spot, generated from the row-id
COLOR_MATRIX: static field to describe the translation between color-space and base-space
PLATFORM: name of the platform used to sequence the table

CSREAD: READ column translated into color-space
PRIMARY_ALIGNMENT_ID: pointer back to the primary alignment table, through the row-id in that table
CS_KEY: key for translation between color-space and base-space
QUALITY: stored quality values, in the direction in which it was sequenced

Legend on the column slides: usually a static column (same value for all rows in the table)

SEQUENCE part 2

READ: assembled or stored bases, in the direction it was sequenced
READ_FILTER: vector of flags, one for each read, compatibility with SRA
READ_LEN: vector of integers, one for each read, length of each read
READ_SEG: vector of integer-pairs, one for each read [zero-based start offset of read, length of read]
READ_START: vector of integers, one for each read, zero-based start offset of read
SPOT_GROUP: describes grouping in the reads, equivalent to read-group in BAM
SPOT_ID: row-id of each spot (1-based)
SPOT_LEN: how many bases are in this spot
TRIM_LEN: length of the section not subject to be trimmed
TRIM_START: start of the section not subject to be trimmed
READ_TYPE: vector of flags, one for each read, tells if read is biological or adapter and the direction it was sequenced
SIGNAL_LEN: compatibility with SRA, lengths of recorded signal
SPOT_COUNT: how many spots are in the table (== MAX_SPOT_ID)

PRIMARY_ALIGNMENT / SECONDARY_ALIGNMENT part 1

ALIGN_ID: row-id
EDIT_DISTANCE: number of mismatches
BASE_COUNT: how many bases are in the whole table
GLOBAL_REF_START: global position in the reference table
BIO_BASE_COUNT: how many bases (excluding adapters) are in the whole table
HAS_MISMATCH: bitfield of mismatches
CIGAR_LONG: long form of the cigar-string
HAS_REF_OFFSET: bitfield of offsets in the reference, used to represent indels
CIGAR_SHORT: short form of the cigar-string
LABEL: label of this alignment, for future compatibility to represent multi-ploid alignment
COLOR_MATRIX: static field to describe the translation between color-space and base-space
LABEL_LEN: length of the label-part to be used
CS_KEY: key for translation between color-space and base-space
LABEL_START: start offset of the label-part to be used
CS_NATIVE: flag, to say that the sequence was produced in color-space
MAPQ: mapping quality

PRIMARY_ALIGNMENT / SECONDARY_ALIGNMENT part 2

MATE_ALIGN_ID: row-id of the mate of this read (if any)
MATE_REF_POS: mate position on the reference
MATE_CIGAR_LONG: long form of the cigar-string of the mate
MAX_SPOT_ID: id of the last spot in the table
MATE_CIGAR_SHORT: short form of the cigar-string of the mate
MIN_SPOT_ID: id of the first spot in the table

MATE_EDIT_DISTANCE: number of mismatches in the mate
MISMATCH: base values of the mismatches
MATE_REF_ID: row-id in the reference table of the mate
MISMATCH_QUAL: qualities of the mismatches
MATE_REF_LEN: mate alignment length in reference coordinates
NAME: auto-generated name of the alignment from row-id
MATE_REF_NAME: reference-name to which the mate is aligned
PLATFORM: name of the platform used to sequence the table
MATE_REF_ORIENTATION: orientation of the mate
QUALITY: quality of the aligned sequence in the direction of the reference

PRIMARY_ALIGNMENT / SECONDARY_ALIGNMENT part 3

RAW_READ: original sequence read in the direction of sequencing
REF_LEN: length of alignment in reference coordinates
RD_FILTER: vector of flags, one for each read, compatibility with SRA
REF_NAME: name of the reference
READ: sequence read in the direction of the reference
REF_OFFSET: orientation of original sequence to the reference
READ_FILTER: vector of flags, one for each read, compatibility with SRA
REF_POS: position on the reference to the start of alignment
READ_LEN: vector of integers, one for each read, length of each read
REF_READ: chunk of the reference on which alignment is projected
READ_START: vector of integers, one for each read, zero-based start offset of read
REF_SEQ_ID: sequence id of the reference
READ_TYPE: vector of flags, one for each read, tells if read is biological or adapter and the direction it was sequenced
REF_ID: row-id in the reference table
REF_START: offset in the row-id of the reference where alignment starts
REF_TABLE: name of the reference table

PRIMARY_ALIGNMENT / SECONDARY_ALIGNMENT part 4

SAM_FLAGS: flags to be used in SAM-format
SPOT_LEN: how many bases are in this spot
SAM_QUALITY: quality converted to ascii presentation from sequence-row-id
TEMPLATE_LEN: size of the template

SEQ_NAME: auto-generated name of the sequence from sequence-row-id
TRIM_LEN: length of the section not subject to be trimmed
SEQ_READ_ID: read-id of sequence being aligned
TRIM_START: start of the section not subject to be trimmed
SPOT_COUNT: how many spots are in the table (== MAX_SPOT_ID)
SPOT_GROUP: describes grouping in the reads, equivalent to read-group in BAM
SEQ_SPOT_ID: sequence spot id

REFERENCE part 1

BASE_COUNT: how many bases are in the whole table
CMP_READ: locally stored reference
BIO_BASE_COUNT: how many bases (excluding adapters) are in the whole table

COLOR_MATRIX: static field to describe the translation between color-space and base-space
CGRAPH_HIGH: maximum depth of coverage in this chunk
CGRAPH_INDELS: total number of indels in this chunk
CGRAPH_LOW: minimum depth of coverage in this chunk
CGRAPH_MISMATCHES: total number of mismatches between sequence and this chunk
CIRCULAR: flag if this reference is circular
CMP_BASE_COUNT: number of bases stored locally
CSREAD: READ column translated into color-space
CS_KEY: key for translation between color-space and base-space
CS_NATIVE: flag, to say that the sequence was produced in color-space
LABEL: description of this chunk
LABEL_LEN: length of description
LABEL_START: start offset of description

REFERENCE part 2

MAX_SEQ_LEN: maximum size for the chunks in this table
MAX_SPOT_ID: id of the last chunk in this table
MIN_SPOT_ID: id of the first chunk in this table
NAME: name of the sequence, equivalent to what BAM uses in the reference sequence name field
NAME_RANGE: technical column, used for index lookup internally
PRIMARY_ALIGNMENT_IDS: list of row-ids from primary alignment table which start their alignment in this chunk
QUALITY: stores the quality of the reference, auto-generated when not available
RD_FILTER: vector of flags, one for each read, compatibility with SRA
READ: the sequence of the reference, merges remote and local reference into one column
READ_FILTER: vector of flags, one for each read, compatibility with SRA
READ_LEN: vector of integers, one for each read, length of each read
READ_START: vector of integers, one for each read, zero-based start offset of read
READ_TYPE: vector of flags, one for each read, tells if read is biological or adapter and the direction it was sequenced
SECONDARY_ALIGNMENT_IDS: list of row-ids from secondary alignment table which start their alignment in this chunk
SEQ_ID: id of remotely stored sequence, used as a key to find the sequence

REFERENCE part 3

SEQ_LEN: the length of the chunk from the remotely stored sequence
SEQ_START: the start of this chunk on the remote sequence
SPOT_COUNT: number of spots
SPOT_GROUP: describes grouping in the reads, equivalent to read-group in BAM
SPOT_ID: row-id of current chunk
SPOT_LEN: length of this chunk, used for compatibility with SRA
TRIM_LEN: length of the section not subject to be trimmed
TRIM_START: start of the section not subject to be trimmed

The following slides show how the sequences are reconstructed from the data stored in cSRA. Play the PowerPoint slides to see the full animation effect.

case: MISMATCH
reference       A C G T A C G
sequence        A C G A A C G
HAS_MISMATCH    0 0 0 1 0 0 0
HAS_REF_OFFSET  0 0 0 0 0 0 0
MISMATCH        A
REF_OFFSET      (none)

case: INSERT
reference       A C G T A C G
sequence        A C G A T A C
HAS_MISMATCH    0 0 0 1 0 0 0
HAS_REF_OFFSET  0 0 0 0 1 0 0
MISMATCH        A
REF_OFFSET      -1

case: DELETE
reference       A C G T A C G
sequence        A C G A C G
HAS_MISMATCH    0 0 0 0 0 0
HAS_REF_OFFSET  0 0 0 1 0 0
MISMATCH        (none)
REF_OFFSET      +1

case: COMBINED
reference       A C G T A C G
sequence        A A G A T C G
HAS_MISMATCH    0 1 0 1 0 0 0
HAS_REF_OFFSET  0 0 0 0 1 1 0
MISMATCH        A A
REF_OFFSET      -1 +1

case: SOFTCLIP (defined by ref_pos)
reference       A C G T A C G
sequence        T A G T A T A
HAS_MISMATCH    1 1 0 0 0 1 1
HAS_REF_OFFSET  1 0 0 0 0 0 0
MISMATCH        T A T A
REF_OFFSET      -2

The next slides show the conversion between the exploded file structure (created by the loader) and the kar format.

exploded storage

- more storage space used
- many directories and files
- readable and writable

static kar storage

- less storage space used
- only one file
- read-only

From exploded storage to static kar storage:
kar -c karfile_to_create -d path_of_exploded_storage

From static kar storage to exploded storage:
kar -x karfile_to_extract_from -d path_to_be_created

Difference between SRA and cSRA formats

SRA

- One table
- Containing one submission
- Available as exploded storage or as kar file
- Self-contained, no need of external files to extract data

cSRA

- Up to 4 tables
- Containing one BAM file
- Available as exploded storage or as kar file
- Requires external / remote files to extract all data

How to use vdb-dump to inspect a cSRA archive (part 1)

What tables are in the cSRA archive?

$ vdb-dump example.csra -E
>>> enumerating the tables of database >example.csra<
tbl #1: PRIMARY_ALIGNMENT
tbl #2: REFERENCE
tbl #3: SEQUENCE

What columns are available in a table?

$ vdb-dump example.csra -T SEQUENCE -o
ALIGNMENT_COUNT (U8)
BASE_COUNT (U64)
BIO_BASE_COUNT (U64)
CMP_BASE_COUNT (U64)
CMP_READ (INSDC:dna:text)
COLOR_MATRIX (U8)
CSREAD (INSDC:color:text)
CS_KEY (INSDC:dna:text)
CS_NATIVE (bool)
FIXED_SPOT_LEN (INSDC:coord:len)
MAX_SPOT_ID (INSDC:SRA:spotid_t)

How to use vdb-dump to inspect a cSRA archive (part 2)

How to restrict the output to certain columns?

$ vdb-dump example.csra -T SEQUENCE -C READ,QUALITY
READ: CAGGGCGGGCAGCGGGCCTGCCCCCCACCCCCGCGCCCCATGACCCGC
QUALITY: 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30,
READ: AGGACACAATTACAAGGTGCTGGCCCAACTACTTTCAGTGTACCGTCT
QUALITY: 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30,

How to restrict the row-range of the output?

$ vdb-dump example.csra -T SEQUENCE -R 10-20 -C READ
READ: TGATCCATCAGCATCGGCCTCCCAAAGTGCTGGGATTACAGGTGT...
READ: AGCCAGGCGTGGTGGTGCGACCCTGTAATCCCAGCTACTTGGGAG...
READ: TAGTGGAGGCCGGCGCAGGAACAGGTTGAACAGTCTACCCTCCCT...
READ: ACTCCAGCCTGGGCAACAGAGCAAGATTCTGACTCAAAAAAAAAA...
READ: TTCTTTCTAAGACAGGGTCTCACTCTGTCGCCCAGGCTGGAGTGC...
READ: TTTCTTTCTCTCTCTCTCTCTTTTTTTTTTTTTTTGAGACAGGGT...
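The numeric QUALITY values printed above can also be rendered as phred33 text, in which each quality value q is written as the ASCII character with code q + 33. The following is a minimal sketch of that mapping; the helper names `parse_quality_line` and `to_phred33` are assumptions for illustration and are not part of the sra-toolkit.

```python
def parse_quality_line(line):
    """Parse a 'QUALITY: 30, 30, ...' line into a list of integers.

    A trailing comma, as in the truncated output above, is tolerated.
    """
    _, _, values = line.partition(":")
    return [int(token) for token in values.split(",") if token.strip()]


def to_phred33(qualities):
    """Encode numeric quality values as phred33 text (ASCII code = value + 33)."""
    return "".join(chr(q + 33) for q in qualities)


if __name__ == "__main__":
    quals = parse_quality_line("QUALITY: 30, 30, 20, 10,")
    print(to_phred33(quals))  # prints ??5+
```

A quality of 30 therefore appears as `?`, 20 as `5`, and 10 as `+`.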

How to use vdb-dump to inspect a cSRA archive (part 3)

How many rows are in a table?

$ vdb-dump example.csra -T SEQUENCE -r
id-range: first-row = 1, row-count = 51863105

How to create tab-separated output?

$ vdb-dump example.csra -T SEQUENCE -C READ,QUALITY -f tab
CATGTGACTGAACTCTTCACCCCAGTC    30, 30, 30, 30, 30, 30
AAGAGATCCGACATCAAGTGCCCACCT    30, 30, 30, 30, 30, 30
CTCTGTCTCTGCCCCCAGCATCACATT    30, 30, 30, 30, 30, 30
TCCCACAGCTTTAATCACCATCTAAAA    30, 30, 30, 30, 30, 30
TGACTCCCACCTTCACTCTCCCATGTC    30, 30, 30, 30, 30, 30

How to output phred33 quality?

$ vdb-dump example.csra -T SEQUENCE -C "(INSDC:quality:text:phred_33)QUALITY"
???????????????5???????5???????5?????????+????5
???????????????????????????????????????????????
???????????????????????????????????????????????
???????????5?+???55+55????5?+??5?55???5+??5++5+
+?+?+++555+55+?++555?+??++++55++55+5?+?++?5?5++

General BAM Alignment Process

[Figure: REFERENCE, PRIMARY ALIGNMENT, SEQUENCE and SECONDARY ALIGNMENT tables, with aligned and unaligned READS]

The reference feeds into the primary alignment table, which in turn feeds data into the sequence table. The secondary alignment table takes data from the reference and the sequence tables to form the alignment data. The sequence table can also include unaligned reads.

Complete Genomics BAM Alignment Process

[Figure: REFERENCE, EVIDENCE INTERNALS, PRIMARY ALIGNMENT, SEQUENCE, SECONDARY ALIGNMENT, ALLELES and EVIDENCE ALIGNMENT tables, with aligned and unaligned READS]
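The reconstruction cases shown earlier (MISMATCH, INSERT, DELETE, COMBINED, SOFTCLIP) can be reproduced with a short routine. The sketch below is inferred from those five cases only and is not the toolkit's actual implementation: the function name, the argument layout, and the assumption that the SOFTCLIP alignment starts at reference index 2 are all assumptions made for illustration.

```python
def reconstruct_read(reference, has_mismatch, has_ref_offset,
                     mismatch, ref_offset, ref_pos=0):
    """Rebuild a read from the reference and the alignment columns.

    For each read position: if HAS_REF_OFFSET is set, shift the reference
    cursor by the next REF_OFFSET value; then take the next MISMATCH base
    if HAS_MISMATCH is set, otherwise copy the reference base. The cursor
    advances by one after every read position.
    """
    mismatch_bases = iter(mismatch)
    offsets = iter(ref_offset)
    pos = ref_pos  # assumed: zero-based start of the alignment on the reference
    bases = []
    for is_mismatch, is_offset in zip(has_mismatch, has_ref_offset):
        if is_offset:
            pos += next(offsets)
        if is_mismatch:
            bases.append(next(mismatch_bases))
        else:
            bases.append(reference[pos])
        pos += 1
    return "".join(bases)


if __name__ == "__main__":
    ref = "ACGTACG"
    # case: INSERT
    print(reconstruct_read(ref, [0, 0, 0, 1, 0, 0, 0],
                           [0, 0, 0, 0, 1, 0, 0], "A", [-1]))  # ACGATAC
    # case: SOFTCLIP, assuming the alignment starts at reference index 2
    print(reconstruct_read(ref, [1, 1, 0, 0, 0, 1, 1],
                           [1, 0, 0, 0, 0, 0, 0], "TATA", [-2], ref_pos=2))  # TAGTATA
```

In this reading, an inserted base is stored like a mismatch, and the following REF_OFFSET of -1 moves the reference cursor back so that no reference base is skipped; a deletion is a REF_OFFSET of +1 with no mismatch.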
