The cSRA file format

example.csra

- A cSRA file contains a serialized file structure.
- It is a read-only archive file format, similar to a tar file.
- The tool kar can extract the directories and files.
- All tools in the sra-toolkit can access the data inside directly, without extracting it first.
- cSRA compresses the sequence data by replacing aligned base pairs with reference data.

Tables in example.csra: SEQUENCE, PRIMARY_ALIGNMENT, SECONDARY_ALIGNMENT, REFERENCE

- The cSRA archive contains the above 4 tables.
- The SEQUENCE table is mandatory for the archive.
- The PRIMARY_ALIGNMENT and SECONDARY_ALIGNMENT tables can be missing (if there is no aligned data in the archive).
- The PRIMARY_ALIGNMENT and SECONDARY_ALIGNMENT tables depend on the REFERENCE table.
- An archive that has ALIGNMENT tables but no REFERENCE table is broken.
- The SECONDARY_ALIGNMENT table can be missing (if the archive does not contain secondary aligned data).

The vdb-dump tool can display what tables are inside a cSRA archive:

    >vdb-dump example.csra -E
    >>> enumerating the tables of database >example.csra<
    tbl #1: PRIMARY_ALIGNMENT
    tbl #2: REFERENCE
    tbl #3: SEQUENCE

What each table in example.csra holds:

SEQUENCE: Reassembles most of the data that came from the original spot
PRIMARY_ALIGNMENT: Information about where and how a primary alignment occurs
SECONDARY_ALIGNMENT: Information about where and how secondary alignments occur
REFERENCE: Contains the reference locally or points to an external reference

[Diagram: rows of the SEQUENCE table for spots A, B and C pointing into the primary (PRIM.) and secondary (SEC.) alignment tables]

- Spot A has 2 reads: both are primary aligned.
- Spot B has 2 reads: the 1st read has a primary and a secondary alignment; the 2nd read is primary aligned only.
- Spot C has 2 reads: the 1st read is primary aligned; the 2nd read is not aligned.

This slide shows how the SEQUENCE table points to data in the primary and secondary tables, with three different use cases.

The following slides explain the columns in the cSRA file format. The more important columns are highlighted and all other columns support those.

SEQUENCE part 1

ALIGNMENT_COUNT: vector of integers, how many alignments per read
CS_NATIVE: flag, to say that the sequence was produced in color-space
BASE_COUNT: how many bases are in the whole table
FIXED_SPOT_LEN: flag, set if all reads have the same length
BIO_BASE_COUNT: how many bases (excluding adapters) are in the whole table
MAX_SPOT_ID: id of the last spot in the table
CMP_BASE_COUNT: how many unaligned bases are in the whole table
MIN_SPOT_ID: id of the first spot in the table
CMP_READ: compressed read, only the unaligned reads
NAME: name of the spot, generated from the row-id
COLOR_MATRIX: static field to describe the translation between color-space and base-space
PLATFORM: name of the platform used to sequence the table

CSREAD: translated READ-column into color-space
PRIMARY_ALIGNMENT_ID: pointer back to the primary alignment table, through the row-id in that table
CS_KEY: key for translation between color-space and base-space
QUALITY: stored quality values, in the direction in which it was sequenced

Slide legend: usually a static column (same value for all rows in the table)

SEQUENCE part 2

READ: assembled or stored bases, in the direction it was sequenced
READ_FILTER: vector of flags, one for each read, compatibility with SRA
READ_LEN: vector of integers, one for each read, length of each read
READ_SEG: vector of integer-pairs, one for each read [zero-based start offset of read, length of read]
READ_START: vector of integers, one for each read, zero-based start offset of read
SPOT_GROUP: describes grouping in the reads, equivalent to read-group in BAM
SPOT_ID: row-id of each spot (1-based)
SPOT_LEN: how many bases are in this spot
TRIM_LEN: length of the section not subject to be trimmed
TRIM_START: start of the section not subject to be trimmed
READ_TYPE: vector of flags, one for each read, tells if read is biological or adapter and the direction it was sequenced
SIGNAL_LEN: compatibility with SRA, lengths of recorded signal
SPOT_COUNT: how many spots are in the table (== MAX_SPOT_ID)

PRIMARY_ALIGNMENT / SECONDARY_ALIGNMENT part 1

ALIGN_ID: row-id
EDIT_DISTANCE: number of mismatches
BASE_COUNT: how many bases are in the whole table
GLOBAL_REF_START: global position in the reference table
BIO_BASE_COUNT: how many bases (excluding adapters) are in the whole table
HAS_MISMATCH: bitfield of mismatches
CIGAR_LONG: long form of the cigar-string
HAS_REF_OFFSET: bitfield of offsets in the reference, used to represent indels
CIGAR_SHORT: short form of the cigar-string
LABEL: label of this alignment, for future compatibility to represent multi-ploid alignment
COLOR_MATRIX: static field to describe the translation between color-space and base-space
LABEL_LEN: length of the label-part to be used
CS_KEY: key for translation between color-space and base-space
LABEL_START: start offset of the label-part to be used
CS_NATIVE: flag, to say that the sequence was produced in color-space
MAPQ: mapping quality

PRIMARY_ALIGNMENT / SECONDARY_ALIGNMENT part 2

MATE_ALIGN_ID: row-id of the mate of this read (if any)
MATE_REF_POS: mate position on the reference
MATE_CIGAR_LONG: long form of the cigar-string of the mate
MAX_SPOT_ID: id of the last spot in the table
MATE_CIGAR_SHORT: short form of the cigar-string of the mate
MIN_SPOT_ID: id of the first spot in the table

MATE_EDIT_DISTANCE: number of mismatches in the mate
MISMATCH: base values of the mismatches
MATE_REF_ID: row-id in the reference table of the mate
MISMATCH_QUAL: qualities of the mismatches
MATE_REF_LEN: mate alignment length in reference coordinates
NAME: auto-generated name of the alignment from row-id
MATE_REF_NAME: reference-name to which the mate is aligned
PLATFORM: name of the platform used to sequence the table
MATE_REF_ORIENTATION: orientation of the mate
QUALITY: quality of the aligned sequence in the direction of the reference

PRIMARY_ALIGNMENT / SECONDARY_ALIGNMENT part 3

RAW_READ: original sequence read in the direction of sequencing
REF_LEN: length of alignment in reference coordinates
RD_FILTER: vector of flags, one for each read, compatibility with SRA
REF_NAME: name of the reference
READ: sequence read in the direction of the reference
REF_OFFSET: orientation of original sequence to the reference
READ_FILTER: vector of flags, one for each read, compatibility with SRA
REF_POS: position on the reference to the start of alignment
READ_LEN: vector of integers, one for each read, length of each read
REF_READ: chunk of the reference on which alignment is projected
READ_START: vector of integers, one for each read, zero-based start offset of read
REF_SEQ_ID: sequence id of the reference
READ_TYPE: vector of flags, one for each read, tells if read is biological or adapter and the direction it was sequenced
REF_ID: row-id in the reference table
REF_START: offset in the row-id of the reference where alignment starts
REF_TABLE: name of the reference table

PRIMARY_ALIGNMENT / SECONDARY_ALIGNMENT part 4

SAM_FLAGS: flags to be used in SAM-format
SPOT_LEN: how many bases are in this spot
SAM_QUALITY: quality converted to ASCII representation
TEMPLATE_LEN: size of the template

SEQ_NAME: auto-generated name of the sequence from sequence-row-id
TRIM_LEN: length of the section not subject to be trimmed
SEQ_READ_ID: read-id of sequence being aligned
TRIM_START: start of the section not subject to be trimmed
SPOT_COUNT: how many spots are in the table (== MAX_SPOT_ID)
SPOT_GROUP: describes grouping in the reads, equivalent to read-group in BAM
SEQ_SPOT_ID: sequence spot id

REFERENCE part 1

BASE_COUNT: how many bases are in the whole table
CMP_READ: locally stored reference
BIO_BASE_COUNT: how many bases (excluding adapters) are in the whole table

COLOR_MATRIX: static field to describe the translation between color-space and base-space
CGRAPH_HIGH: maximum depth of coverage in this chunk
CGRAPH_INDELS: total number of indels in this chunk
CGRAPH_LOW: minimum depth of coverage in this chunk
CGRAPH_MISMATCHES: total number of mismatches between sequence and this chunk
CIRCULAR: flag if this reference is circular
CMP_BASE_COUNT: number of bases stored locally
CSREAD: translated READ-column into color-space
CS_KEY: key for translation between color-space and base-space
CS_NATIVE: flag, to say that the sequence was produced in color-space
LABEL: description of this chunk
LABEL_LEN: length of description
LABEL_START: start offset of description

REFERENCE part 2

MAX_SEQ_LEN: maximum size for the chunks in this table
MAX_SPOT_ID: id of the last chunk in this table
MIN_SPOT_ID: id of the first chunk in this table
NAME: name of the sequence, equivalent to what BAM uses in the reference-sequence-name field
NAME_RANGE: technical column, used for index lookup internally
PRIMARY_ALIGNMENT_IDS: list of row-ids from primary alignment table which start their alignment in this chunk
QUALITY: stores the quality of the reference, auto-generated when not available
RD_FILTER: vector of flags, one for each read, compatibility with SRA
READ: the sequence of the reference, merges remote and local reference into one column
READ_FILTER: vector of flags, one for each read, compatibility with SRA
READ_LEN: vector of integers, one for each read, length of each read
READ_START: vector of integers, one for each read, zero-based start offset of read
READ_TYPE: vector of flags, one for each read, tells if read is biological or adapter and the direction it was sequenced
SECONDARY_ALIGNMENT_IDS: list of row-ids from secondary alignment table which start their alignment in this chunk
SEQ_ID: id of remotely stored sequence, used as a key to find the sequence

REFERENCE part 3

SEQ_LEN: the length of the chunk from the remotely stored sequence
SEQ_START: the start of this chunk on the remote sequence
SPOT_COUNT: number of spots
SPOT_GROUP: describes grouping in the reads, equivalent to read-group in BAM
SPOT_ID: row-id of current chunk
SPOT_LEN: length of this chunk, used for compatibility with SRA
TRIM_LEN: length of the section not subject to be trimmed
TRIM_START: start of the section not subject to be trimmed

The following slides show how the sequences are reconstructed from the data stored in cSRA. Play the PowerPoint slides to see the full animation effect.

case: MISMATCH

reference:       A C G T A C G
sequence:        A C G A A C G
HAS_MISMATCH:    0 0 0 1 0 0 0
HAS_REF_OFFSET:  0 0 0 0 0 0 0
MISMATCH:        A
REF_OFFSET:      (empty)

case: INSERT

reference:       A C G T A C G
sequence:        A C G A T A C
HAS_MISMATCH:    0 0 0 1 0 0 0
HAS_REF_OFFSET:  0 0 0 0 1 0 0
MISMATCH:        A
REF_OFFSET:      -1

case: DELETE

reference:       A C G T A C G
sequence:        A C G A C G
HAS_MISMATCH:    0 0 0 0 0 0
HAS_REF_OFFSET:  0 0 0 1 0 0
MISMATCH:        (empty)
REF_OFFSET:      +1

case: COMBINED

reference:       A C G T A C G
sequence:        A A G A T C G
HAS_MISMATCH:    0 1 0 1 0 0 0
HAS_REF_OFFSET:  0 0 0 0 1 1 0
MISMATCH:        A A
REF_OFFSET:      -1 +1

case: SOFTCLIP (defined by ref_pos)

reference:       A C G T A C G
sequence:        T A G T A T A
HAS_MISMATCH:    1 1 0 0 0 1 1
HAS_REF_OFFSET:  1 0 0 0 0 0 0
MISMATCH:        T A T A
REF_OFFSET:      -2

The next slides show the conversion between the exploded file structure (created by the loader) and the kar format.

exploded storage
- more storage space used
- many directories and files
- read- and writable

static kar storage

- less storage space used
- only one file
- read only

exploded storage to static kar storage:

    kar -c karfile_to_create -d path_of_exploded_storage

static kar storage to exploded storage:

    kar -x karfile_to_extract_from -d path_to_be_created

Difference between SRA and cSRA formats

SRA
- One table
- Containing one submission
- Available as exploded storage or as kar-file
- Self-contained, no need for external files to extract data

cSRA
- Up to 4 tables
- Containing one BAM-file
- Available as exploded storage or as kar-file
- Requires external / remote files to extract all data

How to use vdb-dump to inspect a cSRA archive (part 1)

What tables are in the cSRA archive?

    $ vdb-dump example.csra -E
    >>> enumerating the tables of database >example.csra<
    tbl #1: PRIMARY_ALIGNMENT

    tbl #2: REFERENCE
    tbl #3: SEQUENCE

What columns are available in a table?

    $ vdb-dump example.csra -T SEQUENCE -o
    ALIGNMENT_COUNT (U8)
    BASE_COUNT (U64)
    BIO_BASE_COUNT (U64)
    CMP_BASE_COUNT (U64)
    CMP_READ (INSDC:dna:text)
    COLOR_MATRIX (U8)
    CSREAD (INSDC:color:text)
    CS_KEY (INSDC:dna:text)
    CS_NATIVE (bool)
    FIXED_SPOT_LEN (INSDC:coord:len)
    MAX_SPOT_ID (INSDC:SRA:spotid_t)

How to use vdb-dump to inspect a cSRA archive (part 2)

How to restrict the output to certain columns?

    $ vdb-dump example.csra -T SEQUENCE -C READ,QUALITY
    READ: CAGGGCGGGCAGCGGGCCTGCCCCCCACCCCCGCGCCCCATGACCCGC
    QUALITY: 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30,
    READ: AGGACACAATTACAAGGTGCTGGCCCAACTACTTTCAGTGTACCGTCT
    QUALITY: 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30,

How to restrict the row-range of the output?

    $ vdb-dump example.csra -T SEQUENCE -R 10-20 -C READ
    READ: TGATCCATCAGCATCGGCCTCCCAAAGTGCTGGGATTACAGGTGT...
    READ: AGCCAGGCGTGGTGGTGCGACCCTGTAATCCCAGCTACTTGGGAG...
    READ: TAGTGGAGGCCGGCGCAGGAACAGGTTGAACAGTCTACCCTCCCT...
    READ: ACTCCAGCCTGGGCAACAGAGCAAGATTCTGACTCAAAAAAAAAA...
    READ: TTCTTTCTAAGACAGGGTCTCACTCTGTCGCCCAGGCTGGAGTGC...
    READ: TTTCTTTCTCTCTCTCTCTCTTTTTTTTTTTTTTTGAGACAGGGT...
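The table listing printed by vdb-dump -E can be checked against the table rules given at the start of the deck (SEQUENCE is mandatory; ALIGNMENT tables need a REFERENCE table). The following is a minimal sketch, not part of the sra-toolkit: the names parse_table_listing and check_csra_tables are hypothetical, and the parser only assumes text shaped like the "tbl #N: NAME" sample output shown in this deck.

```python
import re

# Hypothetical helpers; the input is assumed to look like the
# "tbl #N: NAME" lines of the sample `vdb-dump example.csra -E` output.
TABLE_LINE = re.compile(r"tbl #\d+:\s*(\S+)")


def parse_table_listing(text):
    """Return the table names found in a vdb-dump -E style listing."""
    return TABLE_LINE.findall(text)


def check_csra_tables(tables):
    """Apply the table rules from the slides; return a list of problems."""
    present = set(tables)
    problems = []
    if "SEQUENCE" not in present:
        problems.append("SEQUENCE table is mandatory but missing")
    alignment = present & {"PRIMARY_ALIGNMENT", "SECONDARY_ALIGNMENT"}
    if alignment and "REFERENCE" not in present:
        problems.append("ALIGNMENT tables present but REFERENCE is missing")
    return problems


sample = """>>> enumerating the tables of database >example.csra<
tbl #1: PRIMARY_ALIGNMENT
tbl #2: REFERENCE
tbl #3: SEQUENCE"""

tables = parse_table_listing(sample)
print(tables)                     # table names in listing order
print(check_csra_tables(tables))  # empty list: no rule is violated
```

An archive without a SECONDARY_ALIGNMENT table, like the sample, passes the check, which matches the statement that this table can be missing.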

How to use vdb-dump to inspect a cSRA archive (part 3)

How many rows are in a table?

    $ vdb-dump example.csra -T SEQUENCE -r
    id-range: first-row = 1, row-count = 51863105

How to create tab-separated output?

    $ vdb-dump example.csra -T SEQUENCE -C READ,QUALITY -f tab
    CATGTGACTGAACTCTTCACCCCAGTC    30, 30, 30, 30, 30, 30
    AAGAGATCCGACATCAAGTGCCCACCT    30, 30, 30, 30, 30, 30
    CTCTGTCTCTGCCCCCAGCATCACATT    30, 30, 30, 30, 30, 30
    TCCCACAGCTTTAATCACCATCTAAAA    30, 30, 30, 30, 30, 30
    TGACTCCCACCTTCACTCTCCCATGTC    30, 30, 30, 30, 30, 30

How to output phred33 quality?

    $ vdb-dump example.csra -T SEQUENCE -C "(INSDC:quality:text:phred_33)QUALITY"
    ???????????????5???????5???????5?????????+????5
    ???????????????????????????????????????????????
    ???????????????????????????????????????????????
    ???????????5?+???55+55????5?+??5?55???5+??5++5+
    +?+?+++555+55+?++555?+??++++55++55+5?+?++?5?5++

General BAM Alignment Process

[Diagram: REFERENCE, PRIMARY ALIGNMENT, SEQUENCE, SECONDARY ALIGNMENT, READS, unaligned READS]

The reference feeds into the primary alignment table, which in turn feeds data into the sequence table. The secondary alignment table takes data from the reference and the sequence tables to form the alignment data. The sequence table can also include unaligned reads.

Complete Genomics BAM Alignment Process

[Diagram: REFERENCE, EVIDENCE INTERNALS, PRIMARY ALIGNMENT, SEQUENCE, SECONDARY ALIGNMENT, unaligned READS, ALLELES, EVIDENCE ALIGNMENT, READS]
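The flow from reference and alignment data to reads can be illustrated with the reconstruction cases shown earlier (MISMATCH, INSERT, DELETE, COMBINED, SOFTCLIP). The sketch below is inferred from those slide examples only and is not taken from the sra-toolkit sources. The function name restore_read is hypothetical, and so is the ref_pos parameter (the index in the displayed reference where the walk starts); the SOFTCLIP case is reproduced under the assumption that ref_pos is 2, since the slide only says the clip is "defined by ref_pos".

```python
def restore_read(reference, has_mismatch, has_ref_offset,
                 mismatch, ref_offset, ref_pos=0):
    """Rebuild a read from reference bases plus four alignment columns.

    Hypothetical helper inferred from the slide cases: has_mismatch and
    has_ref_offset hold one flag per read base; the MISMATCH and REF_OFFSET
    values are consumed in order whenever the matching flag is set.
    """
    mismatches = iter(mismatch)
    offsets = iter(ref_offset)
    pos = ref_pos
    read = []
    for mm, ro in zip(has_mismatch, has_ref_offset):
        if ro:
            pos += next(offsets)           # shift the reference cursor
        if mm:
            read.append(next(mismatches))  # base stored explicitly
        else:
            read.append(reference[pos])    # base copied from the reference
        pos += 1
    return "".join(read)


ref = "ACGTACG"
cases = {
    "MISMATCH": restore_read(ref, [0, 0, 0, 1, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0], "A", []),
    "INSERT": restore_read(ref, [0, 0, 0, 1, 0, 0, 0], [0, 0, 0, 0, 1, 0, 0], "A", [-1]),
    "DELETE": restore_read(ref, [0, 0, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0], "", [1]),
    "COMBINED": restore_read(ref, [0, 1, 0, 1, 0, 0, 0], [0, 0, 0, 0, 1, 1, 0], "AA", [-1, 1]),
    # Assumption: the aligned part starts at index 2 of the displayed reference.
    "SOFTCLIP": restore_read(ref, [1, 1, 0, 0, 0, 1, 1], [1, 0, 0, 0, 0, 0, 0], "TATA", [-2], ref_pos=2),
}
for name, read in cases.items():
    print(name, read)
```

Each result matches the "sequence" row of the corresponding case: ACGAACG, ACGATAC, ACGACG, AAGATCG and TAGTATA.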

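The numeric QUALITY values and the phred33 text shown in part 3 of the vdb-dump walkthrough are related by the common phred+33 encoding: a quality value q is written as the ASCII character with code q + 33, so 30 becomes "?". A minimal sketch of that mapping follows; the helper names are hypothetical and this is the general encoding, not vdb-dump's own code.

```python
def to_phred33(qualities):
    """Encode numeric quality values as a phred+33 ASCII string."""
    return "".join(chr(q + 33) for q in qualities)


def from_phred33(text):
    """Decode a phred+33 ASCII string back into numeric quality values."""
    return [ord(c) - 33 for c in text]


print(to_phred33([30, 30, 30, 20, 10]))  # ???5+
print(from_phred33("?5+"))               # [30, 20, 10]
```

Read this way, the "5" and "+" characters in the sample output correspond to quality values 20 and 10.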