Data Mining Applications for Self-Organizing Maps

DNA Classifications with Self-Organizing Maps (SOMs)
Thanakorn Naenna, Mark J. Embrechts, Robert A. Bress
May 2003, IEEE International Workshop on Soft Computing in Industrial Application

Presentation Outline
  • Introduction to DNA Splice Junctions
  • Data Collection
  • Introduction to SOMs
  • SOM for DNA Splice Junction Classification
  • Results
  • Conclusions

Human genome in a nutshell
  • Human: 23 chromosome pairs
  • Chromosomes: thousands of genes
  • Gene: info = exons, comments = introns
  • Splice junctions are like /* comment flags */ in C code
  • Exons and introns: codons
  • Codons: bases

DNA Splice Junctions
  • DNA: billions of nucleotides (A, C, G, T)
  • Genes: sequences coding for amino acids (exons), often interrupted by non-coding nucleotides (introns)
  • <.1% of human DNA is made up of exons
  • 99% of splice junctions share the same motif: GT for exon to intron, AG for intron to exon
  • Figure: Intron | Splice Junction | Exon | Splice Junction | Intron, illustrated with the sequence .GTGAAGGTTAA AGATGTAGAT GT ATTG

Data Collection: HTML Browser + Perl scripts
  • Flow diagram steps (in slide order): BioBrowser, Download HTML, ExtractLinks(), Download HTML data, ExtractData(), TranslateData()

DNA Splice Junction (Cont.)
  • A complete gene is made up of different exons
  • Splice junction identification aids in the discovery of new genes

  • The dataset used for this study is made up of 1,424 sequences
  • Data were created ab initio from GENBANK
  • Each sequence is 32 nucleotides long: the splice junction plus the regions from -15 to +15 nucleotides around it
  • Splice junction classes:

    Class   Left region   Splice junction   Right region
    A       Intron        AG                Exon
    B       Exon          GT                Intron
    C       Unknown       AG or GT          Unknown

  • Example: Intron TGTAAGG AG ACGAGTT Exon

Self-Organizing Maps (SOM) Network
  • Unsupervised learning neural network
  • Projects high-dimensional input data onto a two-dimensional output map
  • Preserves the topology of the input data
  • Visualizes structures and clusters of the data
  • Figure: input layer (Components 1 to 5) connected to the output layer; output neurons i and c carry the weight vectors (wi1, wi2, wi3, wi4, wi5) and (wc1, wc2, wc3, wc4, wc5)
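The mapping described on the SOM slide can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the Gaussian neighborhood, the linear decay schedules, and the function names `train_som` and `best_matching_unit` are assumptions. The parameter names `num_its`, `alpha_max`, `alpha_min` and `max_neighborhood` echo the demo slide's parameter list, but their interpretation here is also an assumption.

```python
import numpy as np

def train_som(data, rows, cols, num_its=2000, alpha_max=0.9, alpha_min=0.05,
              max_neighborhood=None, seed=0):
    """Train a rows x cols SOM (illustrative sketch, assumed schedules)."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    # One weight vector per output neuron, initialised from random training samples.
    weights = data[rng.integers(0, len(data), rows * cols)].copy()
    # Grid coordinates of each neuron, used to measure distance on the output map.
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    if max_neighborhood is None:
        max_neighborhood = max(rows, cols) / 2.0
    for t in range(num_its):
        frac = t / max(num_its - 1, 1)
        alpha = alpha_max + frac * (alpha_min - alpha_max)   # learning rate decays linearly
        radius = max(max_neighborhood * (1.0 - frac), 0.5)   # neighborhood shrinks over time
        x = data[rng.integers(0, len(data))]                 # pick one training vector
        bmu = np.argmin(((weights - x) ** 2).sum(axis=1))    # best-matching unit
        grid_dist2 = ((grid - grid[bmu]) ** 2).sum(axis=1)
        h = np.exp(-grid_dist2 / (2.0 * radius ** 2))        # Gaussian neighborhood on the map
        weights += alpha * h[:, None] * (x - weights)        # pull neighbors toward x
    return weights.reshape(rows, cols, -1)

def best_matching_unit(weights, x):
    """Return the (row, col) of the neuron whose weight vector is closest to x."""
    rows, cols, _ = weights.shape
    flat = weights.reshape(rows * cols, -1)
    idx = int(np.argmin(((flat - np.asarray(x, dtype=float)) ** 2).sum(axis=1)))
    return divmod(idx, cols)
```

Because neighboring neurons are updated together, inputs that are close in the high-dimensional space tend to land on nearby map positions, which is the topology-preserving property the slide refers to.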

Use of SOM for DNA Splice Junction Classification
  • Diagram labels: Model, DNA training set, SOM, U-Matrix Map, SOM Classification Map (regions A, B, C), DNA test set, Classification
  • Neuron identification methods: highest-frequency class; closest neuron
  • Class A: intron to exon
  • Class B: exon to intron
  • Class C: no transition

The U-matrix of the DNA Training Set
  • [Figure: U-matrix of the DNA training set]

SOM Results for DNA Splice Junction Data
  • [Figure: U-matrix of the DNA training set, with regions A, B and C marked]
  • Confusion matrix of the 424-sequence DNA test set (columns: DNA sequences; rows: classified to):

    Classified to   Class A     Class B    Class C     Total
    Class A         102 (93%)   0 (0%)     4 (2%)      106
    Class B         2 (2%)      90 (91%)   6 (3%)      98
    Class C         6 (5%)      9 (9%)     205 (95%)   220
    Total           110         99         215         424

Conclusions
  • SOM is effective in DNA splice junction classification
  • SOM is a powerful visualization tool for high-dimensional data

Demo with Analyze Code
  • 800 training data, 324 test data (160 features)
  • 96% correct overall classification on test data
  • Confusion matrix:

            IE    FALSE   EI
    IE      98    5       2
    FALSE   0     111     3
    EI      0     3       102

  • Parameter file:

    9       // K
    18      // L
    6       // max_neighborhood
    20000   // num_its
    50000   // num_fine_its
    0.9     // alpha_max
    0.05    // alpha_min
    1       // LVQ_flag

THE END
  • [Slide background: a long run of raw DNA sequence text]
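The neuron identification step and the reported confusion matrices can be illustrated with a short sketch. It reads "highest-frequency class" as a majority vote over the training sequences that each neuron wins, and "closest neuron" as assigning a test sequence the label of its nearest labeled neuron; both readings are assumptions, as are the function names. The two matrices are copied from the results and demo slides.

```python
from collections import Counter
import numpy as np

def nearest_neuron(weights, x, candidates=None):
    """Index of the weight vector closest to x, optionally restricted to candidates."""
    idx = np.arange(len(weights)) if candidates is None else np.asarray(candidates)
    dist2 = ((weights[idx] - x) ** 2).sum(axis=1)
    return int(idx[np.argmin(dist2)])

def label_neurons(weights, X_train, y_train):
    """Give each neuron the most frequent class among the training items it wins."""
    votes = {}
    for x, y in zip(X_train, y_train):
        votes.setdefault(nearest_neuron(weights, x), Counter())[y] += 1
    return {n: c.most_common(1)[0][0] for n, c in votes.items()}

def classify(weights, neuron_labels, x):
    """Label of the closest neuron that received a label during training."""
    return neuron_labels[nearest_neuron(weights, x, sorted(neuron_labels))]

def overall_accuracy(confusion):
    """Fraction of items on the diagonal of a square confusion matrix."""
    confusion = np.asarray(confusion, dtype=float)
    return np.trace(confusion) / confusion.sum()

# Confusion matrices as reported on the slides (class order as listed there).
som_test = [[102, 0, 4], [2, 90, 6], [6, 9, 205]]
demo_test = [[98, 5, 2], [0, 111, 3], [0, 3, 102]]

print(f"{overall_accuracy(som_test):.3f}")   # 397 / 424 -> 0.936
print(f"{overall_accuracy(demo_test):.3f}")  # 311 / 324 -> 0.960
```

Here `weights` is a flat array with one row per neuron. The second figure agrees with the 96% overall classification rate quoted on the demo slide.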

EXTRACTING KNOWLEDGE
  • [Slide background: a long run of raw DNA sequence text, with the fragments AAAAGCATTGGGAA, GGTTC, CCGTTGAAC and GGTCAGGTTAGACTA overlaid]

NUCLEOTIDES
  • DNA is double-stranded
  • A & T are complements
  • C & G are complements
  • Figure: base pairs A-T and C-G

AMINO ACIDS
  • Sequences of three nucleotides (codons) code for amino acids
  • There are 20 different amino acids
  • Each amino acid can be encoded by between 1 and 6 different codons
  • Exons are the part of DNA that codes for amino acids
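The codon statements above can be checked against the standard genetic code. The sketch below builds the usual 64-entry codon table from its compact one-letter form, counts how many codons encode each amino acid, and translates a sequence in a chosen reading frame. The names `CODON_TABLE`, `degeneracy` and `translate` are illustrative only.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order; "*" marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def degeneracy():
    """Number of codons that encode each amino acid (stop codons excluded)."""
    return Counter(aa for aa in CODON_TABLE.values() if aa != "*")

def translate(seq, frame=0):
    """Translate complete codons of seq starting at offset frame (0, 1 or 2)."""
    seq = seq.upper()
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return "".join(CODON_TABLE[c] for c in codons)

counts = degeneracy()
print(len(counts), min(counts.values()), max(counts.values()))  # 20 1 6
print(translate("ATGGCCTGA"))  # M, A, then a stop codon: "MA*"
```

The three codons marked `*` are TAA, TAG and TGA, and ATG (methionine) is the start codon.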

PROTEINS
  • Proteins are made up of sequences of amino acids
  • Generally responsible for some biological function
  • May have complicated folding patterns that are difficult to predict

GENES
  • 30,000 to 100,000 genes exist in the human genome
  • Most genes have not yet been discovered
  • Genes encode sequences of amino acids
  • Genes are interrupted by non-coding regions of DNA: introns

CHROMOSOMES

READING FRAMES
  • Reading frames may be difficult to determine
  • Example: ACG TAGAT
  • Reading frames may be shifted by splice junctions

GENE STRUCTURE
  • Start codon (ATG)
  • Exon sequence (amino acid string)
  • Intron sequence (junk DNA)
  • Stop codon (3 possible)

SPLICE JUNCTIONS
  • Segments of DNA that join coding and non-coding regions
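A sketch of how candidate splice-junction windows might be cut from a longer sequence, using the GT and AG motifs and the 32-nucleotide window (15 nucleotides on either side of the dinucleotide) described earlier in the deck. The function names are hypothetical. The one-hot encoding over five symbols (A, C, G, T, plus N for unknown) is an assumption chosen only because 32 x 5 matches the 160 features on the demo slide; the source does not state how its features were built.

```python
def candidate_windows(seq, flank=15):
    """Yield (position, motif, window) for each GT or AG with full flanks on both sides."""
    seq = seq.upper()
    for i in range(flank, len(seq) - flank - 1):
        motif = seq[i:i + 2]
        if motif in ("GT", "AG"):
            # Window = left flank + dinucleotide + right flank (15 + 2 + 15 = 32).
            yield i, motif, seq[i - flank:i + 2 + flank]

def one_hot(window, alphabet="ACGTN"):
    """Encode each nucleotide as len(alphabet) indicator values (assumed encoding)."""
    features = []
    for base in window:
        symbol = base if base in alphabet else "N"   # anything unexpected counts as unknown
        features.extend(1 if symbol == a else 0 for a in alphabet)
    return features

example = "C" * 20 + "AG" + "C" * 20
for pos, motif, window in candidate_windows(example):
    print(pos, motif, len(window), len(one_hot(window)))  # 20 AG 32 160
```

Every GT or AG in a sequence is only a candidate: a classifier such as the SOM is still needed to separate true junctions (classes A and B) from windows with no transition (class C).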
