ACM KDD Cup A Survey: 1997-2011 Qiang Yang

ACM KDD Cup A Survey: 1997-2011 Qiang Yang

ACM KDD Cup A Survey: 1997-2011 Qiang Yang (partly based on Xinyue Lius slides @SFU, and Nathan Lius slides @hkust) Hong Kong University of Science and Technology 1 About KDD Cup (1997 2011) Competition is a strong mover for Science and Engineering: ACM Programming Contest

World College level Programming skills ROBOCUP World Robotics Competition 2 About ACM KDDCUP ACM KDD: Premiere Conference in knowledge discovery and data mining ACM KDDCUP: Worldwide competition in conjunction with ACM KDD conferences. It aims at:

showcase the best methods for discovering higher-level knowledge from data. Helping to close the gap between research and industry Stimulating further KDD research and development 3 Statistics Participation in KDD Cup grew steadily Average person-hours per submission: 204 Max person-hours per submission: 910 Year Submissions 97 98 16 21 99

24 2000 2005 2011 30 32 1000+ 4 Algorithms (up to 2000) Algorithms Tried vs Submitted 20 18 16 Entries 14 12 Tried 10 Submitted 8 6

4 2 0 Algorithm 5 KDD Cup 97 A classification task to predict financial services industry (direct mail response) Winners Charles Elkan, a Prof from UC-San Diego with his Boosted Naive

Bayesian (BNB) Silicon Graphics, Inc with their software MineSet Urban Science Applicatio ns, Inc . with their software gain, Direct Marketing Selection System 6 MineSet (Silicon Graphics Inc.) A KDD tool that combines data access, transformation, classification, and visualization. 7 KDD Cup 98: CRM Benchmark URL:

www.kdnuggets.com/meetings /kdd98/kdd-cup-98.html A classification task to analyze fund raising mail responses to a non-profit organization Winners Urban Science Applications, Inc. with their software GainSmarts. SAS Institute, Inc. with their software SAS Enterprise Miner Quadstone Limited with their software Decisionhouse 8

KDDCUP 1998 Results $70,000 $65,000 $60,000 $55,000 $50,000 100% Maximum Possible Profit Line ($72,776 in profits with 4,873 mailed) 80% Mail to Everyone Solution ($10,560 in profits with 96,367 mailed) $45,000 $40,000 $35,000 GainSmarts SAS/Enterprise Miner $30,000 $25,000 $20,000 $15,000 $10,000 $5,000 $-

90% Quadstone/Decisionhouse 70% 60% 50% 40% 30% 20% 10% 0% ACM KDD Cup 1999 URL: www.cse.ucsd.edu/users/elk an/kdresults.html Problem To detect network intrusion and protect a computer

network from unauthorized users, including perhaps insiders Data: from DoD Winners SAS Institute Inc. with their software Enterprise Miner. Amdocs with their Information Analysis Environment 10 KDDCUP 2000: Data Set and Goal: Data collected from Gazelle.com, a legwear and legcare Web retailer Pre-processed Training set: 2 months Test sets: one month Data collected includes:

Click streams Order information The goal to design models to support website personalization and to improve the profitability of the site by increasing customer response. Questions - When given a set of page views, characterize heavy spenders characterize killer pages characterize which product brand a visitor will view in

the remainder of the session? 11 KDDCUP 2000: The Winners Question 1 & 5 Winner: Amdocs Question 2 & 3 Winner: Salford Systems Question 4 Winner: e-steam 12 KDD Cup 2001 3 Bioinformatics Tasks

Dataset 1: Prediction of Molecular Bioactivity for Drug Design half a gigabyte when uncompressed Dataset 2: Prediction of Gene/Protein Function (task 2) and Localization (task 3) Dataset 2 is smaller and easier to understand 7 megabytes uncompressed

A total of 136 groups participated to produce a total of 200 submitted predictions over the 3 tasks: 114 for Thrombin, 41 for Function, and 45 for Localization. 13 2001 Winners Task 1, Thrombin: Jie Cheng (Canadian Imperial Bank of Commerce). Bayesian network learner and classifier

Task 2, Function: Mark-A. Krogel (University of Magdeburg). Task 2: the genes of one particular type of organism A gene/protein can have more than one function, but only one localization. Inductive Logic programming Task 3, Localization: Hisashi Hayashi, Jun Sese, and Shinichi Morishita

(University of Tokyo). K nearest neighbor 14 molecular biology : Two Winners: tasks Task 1: Document extraction from biological articles Task 2: Classification of proteins based on

gene deletion experiments Task 1: ClearForest and Celera, USA Yizhar Regev and Michal Finkelstein Task 2: Telstra Research Laboratories , Australia Adam Kowalczyk and Bhavani Raskutti 15 2003 KDDCUP Information Retrieval/Citation Mining of

Scientific research papers based on a very large archive of research papers First Task: predict how many citations each paper will receive during the three months leading up to the KDD 2003 conference Second Task: a citation graph of a large subset of the archive from only the LaTex sources Third Task: each paper's popularity will be estimated based on partial download logs Last Task: devise their own questions

16 2003 KDDCUP: Results Task 1: Task 2: 1st place: David Vogel AI Insight Inc. Task 3:

Claudia Perlich, Foster Provost, Sofus Kacskassy New York University Janez Brank and Jure Leskovec Jozef Stefan Institute, Slovenija Task 4: Amy McGovern, Lisa Friedland, Michael Hay, Brian Gallagher, Andrew Fast, Jennifer Neville, and David Jensen University of Massachusetts Amherst, USA 17 2004 Tasks and Results

Particle physics; plus protein homology prediction David S. Vogel, Eric Gottschalk, and Morgan C. Wang Bernhard Pfahringer, Yan Fu ( ), RuiXiang Sun, Qiang Yang ( ), Simin He, Chunli Wang, Haipeng Wang, Shiguang Shan, Junfa Liu, Wen Gao. 18 Past KDDCUP Overview: 20052010 Year Host

Task Technique Winner 2005 Microsoft Web query categorization Feature Engineering, Ensemble HKUST 2006 Siemens Pulmonary emboli detection Multi-instance, NonIID sample, Cost

sensitive, Class Imbalance, Noisy data AT&T, Budapest University of Technology & Economics 2007 Netflix Consumer recommendation Collaborative Filtering, Time series, Ensemble IBM Research, Hungarian Academy of Sciences 2008

Siemens Breast cancer detection from medical images Ensemble, Class imbalance, Score calibration IBM Research, National Taiwan University 2009 Orange Customer relationship prediction in telecom Feature selection, Ensemble

IBM Research, University of Melbourne 2010 PSLC Data Student Feature engineering, National KDDCUP11 Dataset 11 years of data Rated items are Tracks Albums Artists

Genres Items arranges in a taxonomy Two tasks Track 1 Track 2 #ratings 263M 63M #items 625K 296K #users 1M 249K Items in a Taxonomy Track 1 Details

Track 1 Highlights Largest publicly available dataset Large number of items (50 times more than Netflix) Extreme rating sparsity (20 times more sparse than Netflix) Taxonomy can help in combating sparsely rated items. Fine time stamps with both date and time allow sophisticated temporal modeling. Track 2 Details Track 2 Highlights

Performance metric focus on ranking/ classification, which differs from traditional collaborative filtering. No validation data provided, need to selfconstruct binary labeled data from rating data. Unlike track 1, track 2 removed time stamps to focus more than long term preference rather than short term behaviors. Submission Stats Winners Track 1 Track 2 1st place National Taiwan University National Taiwan University 2nd place Commendo (Netflix Prize Winnder)

Chinese Academy of Science, Hulu Labs 3rd place Hong Kong University of Science and Technology, Shanghai Jiaotong University Commendo (Netflix Prize Winnder) Chinese Teams at KDDCUP (NTU, CAS, HKUST) Key Techniques Track 1:

Blending of multiple techniques Matrix factorization models Nearest neighbor models Restricted Bolzmann machines Temporal modelings Track 2: Importance sampling of negative instances Taxonomical modelings Use of pairwise ranking objective functions Summary To place on top of KDDCUP requires

Team work Expertise in domain knowledge as well as mathematical tools Often done by world famous institutes and companies Recent trends: Dataset increasingly more realistic Participants increasingly more professional Tasks are increasingly more difficult 30 Summary KDD Cup is an excellent source to learn the state-of-art KDD techniques

KDDCUP dataset often becomes the standard benchmark for future research, development and teaching Top winners are highly regarded and respected 31 References Elkan C. (1997). Boosting and Naive Bayesian Learning. Technical Report No. CS97-557, September 1997, UCSD. Decisionhouse (1998). KDD Cup 98: Quadstone Take Bronze Miner Award. Retrieved March 15, 2001 from http://www.kdnuggets.com/meetings/kdd98/quadston e/index.html Urbane Science (1998). Urbane Science wins the KDD-98 Cup. Retrieved March 15, 2001 from http://www.kdnuggets.com/meetings/kdd98/gain-kddc up98-release.html Georges, J. & Milley, A. (1999). KDD99 Competition: Knowledge Discovery Contest. Retrieved March 15, 2001 from http://www.cse.ucsd.edu/users/elkan/saskdd99.pdf Rosset, S. & Inger A. (1999). KDD-Cup 99 : Knowledge Discovery In a Charitable Organizations Donor Database. Retrieved March 15, 2001 from http://www.cse.ucsd.edu/users/elkan/KDD2.doc

32 References (Cont.) Sebastiani P., Ramoni M. & Crea A. (1999). Profiling your Customers using Bayesian Networks. Retrieved March 15, 2001 from http://bayesware.com/resources/tutorials/ kddcup99/kddcup99.pdf Inger A., Vatnik N., Rosset S. & Neumann E. (2000). KDDCup 2000: Question 1 Winners Report. Retrieved March 18, 2000 from http://www.ecn.purdue.edu/KDDCUP/amdocs-slides-1.ppt Neumann E., Vatnik N., Rosset S., Duenias M., Sasson I. & Inger A. (2000). KDD-Cup 2000: Question 5 Winners Report. Retrieved March 18, 2000 from http://www.ecn.purdue.edu/KDDCUP/amdocs-slides-5.ppt Salford System white papers: http://www.salford-systems.com/whitepaper.html Summary talk presented at KDD (2000) http://robotics.stanford.edu/~ronnyk/kddCupTalk.ppt 33 References (cont)

http://www.cs.wisc.edu/~dpage/kddcup2001/Cheng.pdf http://www.cs.wisc.edu/~dpage/kddcup2001/Krogel.pdf http://www.cs.wisc.edu/~dpage/kddcup2001/Hayashi.pdf 34

Recently Viewed Presentations

  • Jeopardy - Mrs. W-B\'s Science

    Jeopardy - Mrs. W-B\'s Science

    $500 Answer Cytokinesis and Mitosis $100 Question The division of the cytoplasm is called _____ $100 Answer Cytokinesis $200 Question In plants, a _____ forms to divide the nuclei $200 Answer Cell wall $300 Question In animal cells, the cell...
  • Communication and Aging Chp. 2 Attitudes and Ageism Chelsea ...

    Communication and Aging Chp. 2 Attitudes and Ageism Chelsea ...

    COMMUNICATION AND AGING CHP. 2 ATTITUDES AND AGEISM CHELSEA DAVIS COMM 165 VIEWS OF AGING PROCESS The difference between the way young adults view elderly people and the way elderly view themselves well over 200 empirical studies.
  • Branches of Government Jeopardy - Bryan High School

    Branches of Government Jeopardy - Bryan High School

    Branches of Government Jeopardy Take the Challenge!! Branches of Government Jeopardy 100 200 100 200 200 100 100 300 300 200 400 400 300 300 400 400 Executive Judicial Legislative General Executive for 100 Who is the head of the...
  • Objectives Learn the formal elements of graphic design

    Objectives Learn the formal elements of graphic design

    The Formal Elements of Design Any graphic designer must have a foundation in two-dimensional design and color. The formal elements are the building blocks of two-dimensional design. Line Shape Color Texture Line A point or dot is the smallest unit...
  • Vendor Management - NOSHE

    Vendor Management - NOSHE

    VENDOR MANAGEMENT Troy Kyle President / CEO Vendor Credentialing Service [email protected] OBJECTIVES Government Watch List (what they are, laws, and enforcement actions) HIPAA, how it affects facilities from a vendor perspective, BAA's, etc. Immunization testing, what is required, CDC and...
  • UCL Centre for Behaviour Change - Economic Pluralism

    UCL Centre for Behaviour Change - Economic Pluralism

    UCL's Grand Challenges. Cultivate cross-disciplinary collaborations to explore joined-up solutions in 6 areas related to matters of pressing societal concern: Global Health . Cultural Understanding . Sustainable Cities . Human Wellbeing . Justice and Equality . Transformative Technology
  • Notes on DMIMO Protocols

    Notes on DMIMO Protocols

    Similarly as in the multiple MAC sublayer management entity, the different CSMA/CA functions across the multiple co-located MAC sublayers of a multi-band capable device are uncoordinated, and do not share related information, such as. NAV state.
  • collinsed.com

    collinsed.com

    ©Bill Atwood Collins Education Associates 2017. Mini Lesson 1: Choosing the best words to Cite. Objective: After reading 2 poems and a prompt, students will work to identify which words and phrases to quote to help make their argument.