DNA Classifications with Self-Organizing Maps (SOMs)
Thanakorn Naenna, Mark J. Embrechts, Robert A. Bress
May 2003
IEEE International Workshop on Soft Computing in Industrial Application

Presentation Outline
Introduction to DNA Splice Junctions
Data Collection
Introduction to SOMs
SOM for DNA Splice Junction Classification
Results
Conclusions

Human genome in a nutshell
Human: 23 pairs of chromosomes
Chromosomes: thousands of genes
Gene: info = exons, comments = introns
Splice junctions are like /* comment flags */ in C code
Exons and introns: codons
Codon: bases

DNA Splice Junctions
DNA: billions of nucleotides (A, C, G, T)
Genes: coding sequences (exons) that are often interrupted by non-coding nucleotides (introns)
<.1% of human DNA is made up of exons
99% of splice junctions have the same motif:
Exon to intron: GT
Intron to exon: AG
[Figure: Intron | Splice Junction | Exon | Splice Junction | Intron, with example sequence .GTGAAGGTTAA AGATGTAGAT GT ATTG]

Data Collection: HTML Browser + Perl scripts
BioBrowser
Download HTML
ExtractLinks()
Download HTML - data
ExtractData()
TranslateData()

DNA Splice Junction (Cont.)
A complete gene is made up of different exons
Splice junction identification aids in the discovery of new genes
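The GT and AG motifs named on the DNA Splice Junctions slide can be located with a simple scan. The sketch below is only an illustration: the function names and the toy sequence are assumptions, not part of the original Perl tooling. It also shows why a classifier is needed, since both dinucleotides occur at many positions that are not real junctions.

```python
def find_motif_sites(seq, motif):
    """Return every 0-based index at which `motif` starts in `seq`."""
    seq = seq.upper()
    width = len(motif)
    return [i for i in range(len(seq) - width + 1) if seq[i:i + width] == motif]


def candidate_junctions(seq):
    """List candidate splice sites: GT for exon-to-intron, AG for intron-to-exon."""
    return {
        "exon_to_intron": find_motif_sites(seq, "GT"),
        "intron_to_exon": find_motif_sites(seq, "AG"),
    }


if __name__ == "__main__":
    # Toy sequence (hypothetical); every hit is only a candidate junction.
    print(candidate_junctions("TGTAAGGAGACGAGTT"))
```

Each reported index is a candidate only; deciding which candidates are true junctions is the classification problem the rest of the deck addresses.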
The dataset used for this study is made up of 1,424 sequences
Data were created ab initio from GenBank
Each sequence is 32 nucleotides long, with regions comprising -15 to +15 nucleotides from the splice junction

Left Region    Splice Junction    Right Region    Splice Junction Class
Intron         AG                 Exon            A
Exon           GT                 Intron          B
Unknown        AG or GT           Unknown         C

[Figure: Intron | TGTAAGG AG ACGAGTT | Exon]

Self-Organizing Maps (SOM) Network
Unsupervised learning neural network
Projects high-dimensional input data onto a two-dimensional output map
Preserves the topology of the input data
Visualizes structures and clusters of the data
[Figure: SOM network; input Components 1 to 4 connect to map neuron i through weights wi1, wi2, wi3]
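A minimal sketch of SOM training on encoded sequences, assuming a one-hot encoding with four values per base, a linearly decaying learning rate, and a shrinking "bubble" neighborhood. The slides do not state the encoding or the update schedule, so all of these are assumptions; the parameter names merely echo those listed on the demo slide.

```python
import math
import random

BASES = "ACGT"


def one_hot(seq):
    """Encode a nucleotide string as a flat 0/1 vector (assumed encoding)."""
    vec = []
    for base in seq.upper():
        vec.extend(1.0 if base == b else 0.0 for b in BASES)
    return vec


def best_matching_unit(weights, x):
    """Return the grid position whose weight vector is closest to x."""
    best, best_dist = None, math.inf
    for pos, w in weights.items():
        dist = sum((wi - xi) ** 2 for wi, xi in zip(w, x))
        if dist < best_dist:
            best, best_dist = pos, dist
    return best


def train_som(data, rows, cols, num_its=2000, alpha_max=0.9, alpha_min=0.05,
              max_neighborhood=2, seed=0):
    """Train a rows x cols map; returns {(row, col): weight vector}."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    for t in range(num_its):
        frac = t / num_its
        alpha = alpha_max + (alpha_min - alpha_max) * frac  # linear decay
        radius = max_neighborhood * (1.0 - frac)            # shrinking bubble
        x = rng.choice(data)
        br, bc = best_matching_unit(weights, x)
        for (r, c), w in weights.items():
            # Move the winner and its grid neighbors toward the sample.
            if math.hypot(r - br, c - bc) <= radius:
                for i in range(dim):
                    w[i] += alpha * (x[i] - w[i])
    return weights


if __name__ == "__main__":
    toy = [one_hot(s) for s in ["ACGT", "ACGA", "TTGG", "TTGC"]]
    som = train_som(toy, rows=3, cols=3, num_its=500)
    print(best_matching_unit(som, toy[0]), best_matching_unit(som, toy[2]))
```

Because neighbors of the winning neuron are updated together, similar sequences tend to land on nearby map positions, which is the topology-preserving property listed above.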
Use of SOM for DNA Splice Junction Classification
[Diagram labels: DNA training set, SOM, U-Matrix, Map, Model, SOM Classification Map (regions A, B, C), DNA test set, Classification]
Neuron identification methods
- Highest frequency class
- Closest neuron
Class A: intron to exon
Class B: exon to intron
Class C: no transition

The U-matrix of the DNA Training Set
[Figure: U-matrix]

SOM Results for DNA Splice Junction Data
[Figure: the U-matrix of the DNA training set, with regions A, B, C]

Confusion matrix of 424-DNA test set (columns: DNA sequences; rows: classified to)

Classified to    Class A      Class B     Class C      Total
Class A          102 (93%)    0 (0%)      4 (2%)       106
Class B          2 (2%)       90 (91%)    6 (3%)       98
Class C          6 (5%)       9 (9%)      205 (95%)    220
Total            110          99          215          424

Conclusions
SOM is effective in DNA splice junction classification
SOM is a powerful visualization tool for high-dimensional data

Demo with Analyze Code
800 training data, 324 test data (160 features)
96% correct overall classification on test data

Confusion Matrix
         IE     FALSE    EI
IE       98     5        2
FALSE    0      111      3
EI       0      3        102

9        // K
18       // L
6        // max_neighborhood
20000    // num_its
50000    // num_fine_its
0.9      // alpha_max
0.05     // alpha_min
1        // LVQ_flag

THE END
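The classification step named on the slides (highest frequency class, closest neuron) and the reported accuracies can be sketched as follows. This is one plausible reading, not the authors' code: each neuron takes the majority class of the training vectors it wins, neurons that win nothing stay unlabeled, and a test vector receives the label of its closest labeled neuron. All function names are hypothetical; the two matrices are copied from the slides.

```python
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def closest_neuron(weights, x):
    """Grid position of the neuron whose weight vector is nearest to x."""
    return min(weights, key=lambda pos: sq_dist(weights[pos], x))


def label_neurons(weights, samples, labels):
    """Give each neuron the most frequent class among the samples it wins."""
    votes = {pos: Counter() for pos in weights}
    for x, y in zip(samples, labels):
        votes[closest_neuron(weights, x)][y] += 1
    return {pos: v.most_common(1)[0][0] for pos, v in votes.items() if v}


def classify(weights, neuron_labels, x):
    """Label x with the class of its closest labeled neuron."""
    labeled = {pos: weights[pos] for pos in neuron_labels}
    return neuron_labels[closest_neuron(labeled, x)]


def accuracy(matrix):
    """Fraction of cases on the diagonal of a square confusion matrix."""
    total = sum(sum(row) for row in matrix)
    return sum(matrix[i][i] for i in range(len(matrix))) / total


# Confusion matrices as reported on the slides.
REPORTED_TEST = [[102, 0, 4], [2, 90, 6], [6, 9, 205]]
REPORTED_DEMO = [[98, 5, 2], [0, 111, 3], [0, 3, 102]]

if __name__ == "__main__":
    print(f"424-sequence test set: {accuracy(REPORTED_TEST):.1%}")
    print(f"324-sequence demo set: {accuracy(REPORTED_DEMO):.1%}")
```

The diagonal sums are 397 of 424 and 311 of 324, which matches the per-class percentages in the first table and the 96% overall figure quoted for the demo.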
EXTRACTING KNOWLEDGE
[Slide background: raw DNA sequence text]

NUCLEOTIDES
DNA is double-stranded
A & T are complements
C & G are complements
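Because DNA is double-stranded, one strand determines the other: under standard Watson-Crick pairing, A pairs with T and C pairs with G. A minimal sketch of deriving the opposite strand, with a hypothetical function name:

```python
# Translation table for Watson-Crick base pairing: A<->T, C<->G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the opposite strand, read in its own 5'-to-3' direction."""
    return seq.upper().translate(COMPLEMENT)[::-1]


if __name__ == "__main__":
    print(reverse_complement("AAGT"))
```

Characters other than A, C, G, and T pass through unchanged in this sketch.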
AMINO ACIDS
Sequences of three nucleotides (codons) code for amino acids
There are 20 different amino acids
Each amino acid can be encoded by between 1 and 6 different codons
Exons are the parts of DNA that code for amino acids
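The codon-to-amino-acid mapping can be written compactly. The sketch below assumes the standard genetic code, laid out in the usual T, C, A, G base order with one-letter amino acid codes and `*` for stop; the function names are illustrative.

```python
from collections import Counter

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order; '*' marks a stop.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna):
    """Translate codon by codon from position 0, stopping at the first stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino = CODON_TABLE[dna[i:i + 3]]  # raises KeyError on non-ACGT input
        if amino == "*":
            break
        protein.append(amino)
    return "".join(protein)


def codons_per_amino_acid():
    """Count how many codons encode each amino acid (stops excluded)."""
    return Counter(a for a in CODON_TABLE.values() if a != "*")


if __name__ == "__main__":
    print(translate("ATGGCCTAA"))
    print(codons_per_amino_acid())
```

Counting the table confirms the figures on the slide: 20 amino acids, each encoded by between 1 codon (M, W) and 6 codons (L, S, R).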
PROTEINS
Proteins are made up of sequences of amino acids
Generally responsible for some biological function
May have complicated folding patterns that are difficult to predict

GENES
An estimated 30,000 to 100,000 genes exist in the human genome
Most genes have not yet been discovered
Genes are made up of coding sequences (exons) that specify amino acids
Genes are interrupted by non-coding regions of DNA: introns

CHROMOSOMES
[Figure]

READING FRAMES
Reading frames may be difficult to determine
Example: ACG TAGAT
Reading frames may be shifted by splice junctions

GENE STRUCTURE
Start codon (ATG)
Exon sequence (amino acid string)
Intron sequence (junk DNA)
Stop codon (3 possible)

SPLICE JUNCTIONS
Segments of DNA that join coding and non-coding regions
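The reading-frame ambiguity and the start/stop structure described above can be made concrete with a small sketch. It assumes the forward strand only and ignores introns, so it illustrates the idea of frames rather than a gene finder; the function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # the three possible stop codons


def reading_frames(seq):
    """Split seq into codons for each of the three forward reading frames."""
    return [
        [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]
        for offset in range(3)
    ]


def find_orfs(seq):
    """Return (start, end) slices that run from an ATG to the next in-frame stop."""
    orfs = []
    for offset in range(3):
        start = None
        for i in range(offset, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                orfs.append((start, i + 3))
                start = None
    return orfs


if __name__ == "__main__":
    print(reading_frames("ACGTAGAT"))
    print(find_orfs("CATGAAATGA"))
```

Shifting the same string by a single base changes which codons are read, which is why an inserted or removed intron can move a downstream exon into a different frame.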