Building A Big Data Platform With A Hadoop Ecosystem


BIG DATA AND THE HADOOP ECOSYSTEM
Tom Rogers
Northwestern University Feinberg School of Medicine
Department of Anesthesiology

WHAT IS BIG DATA?
The 3 Vs
Volume: terabytes, petabytes, exabytes.
Velocity: system logs, medical monitors, machinery controls.
Variety: XML, social media, RDBMS, JSON documents, IoT.
Further Vs: veracity, variability, value.

How do we collect, store and process all this data?
Open source Apache software.
Distributed processing across clusters of computers.
Designed to scale to thousands of computers.
Local computation and storage.
Expects hardware failure, which is handled at the application layer.
A cute yellow elephant.

HADOOP ECOSYSTEM OVERVIEW
Distributed storage and processing.
Runs on commodity server hardware.
Scales horizontally for seamless failover.
Hadoop is open source software.

TRADITIONAL DATA REPOSITORIES
Highly structured, in 3NF or star-schema topologies.
Serve as the enterprise single source of truth.
Optimized for operational reporting requirements.
Scale vertically.
Limited interaction with external or unstructured data sources.
Complex management schemes and protocols.

TRADITIONAL DATA SOURCES IN HEALTHCARE
Data for the healthcare EDW originates from functional clinical and administrative responsibilities.
Sources can be as sophisticated as highly complex online systems or as simple as Excel spreadsheets.
Complex validation and transformation processes are required before data is included in the EDW.
Staging the data transformation requires separate storage and processing space, but is often done on the same physical hardware as the EDW.

INTEGRATION OF HADOOP AND TRADITIONAL IT
Hadoop does not replace traditional storage or processing technologies.
Hadoop can include data from traditional IT sources to discover new value.
Compared to traditional IT, setting up and operating a Hadoop platform can be very inexpensive.
It can be seen as very expensive when added to an existing traditional IT environment.

EMERGING AND NON-TRADITIONAL DATA
New knowledge is discovered by applying known experience in context with unknown or new experience.
New sources of data are being created at a seemingly unending pace.
Social media and mobile computing provide sources of new data unavailable in the past.
Monitors, system logs, and document corpora all provide new ways of capturing and expressing the human experience that cannot be captured or analyzed by traditional IT methodologies.

INTEGRATION OF HADOOP AND NON-TRADITIONAL DATA
Hadoop is designed to store and process non-traditional data sets.
Optimized for unstructured, file-based data sources.
Core applications are developed specifically for different storage, processing, analysis and display activities.
Metadata definitions and rules, combined with data from disparate sources, can be used for deeper analytic discovery.

DATA ANALYSIS
Inspecting, transforming and modeling data to discover knowledge, make predictions and suggest conclusions.
Third-party data analysis can be integrated into traditional IT environments or big data solutions.
Traditionally conducted by working on discrete data sets in isolation from the decision-making process.
Data scientists are now integrated into core business processes to create solutions for critical business problems using big data platforms.

COMPLETE HADOOP ECOSYSTEM
Integration between traditional and non-traditional data is facilitated by the Hadoop ecosystem.
Data is stored on a fault-tolerant distributed file system in the Hadoop cluster.
Data is processed close to where it is located, which reduces latency and avoids time-consuming transfers.
The Hadoop master controller, or NameNode, monitors the processes of the Hadoop cluster and automatically takes action to continue processing when a failure is detected.
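
The fault tolerance and data locality described above can be illustrated with a toy model. The sketch below is not HDFS code: the function names (place_blocks, readable_blocks, pick_local_node), the node names, and the round-robin placement are all assumptions made for illustration, and a real cluster uses rack-aware placement managed by the NameNode. It only shows why keeping three replicas of each block lets processing continue, on a node that already holds the data, after a DataNode fails.

```python
def place_blocks(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes (simple round-robin).

    Hypothetical placement policy for illustration only; HDFS itself
    uses a rack-aware policy.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + i) % len(nodes)] for i in range(replication)]
        for block in range(num_blocks)
    }


def readable_blocks(placement, failed):
    """Return the blocks that still have at least one replica on a live node."""
    return {
        block
        for block, holders in placement.items()
        if any(node not in failed for node in holders)
    }


def pick_local_node(placement, block, failed=frozenset()):
    """Pick a live node that already stores the block, so the computation
    moves to the data instead of the data moving to the computation."""
    for node in placement[block]:
        if node not in failed:
            return node
    return None  # every replica is on a failed node


nodes = ["dn1", "dn2", "dn3", "dn4", "dn5"]  # hypothetical DataNode names
placement = place_blocks(num_blocks=10, nodes=nodes)

# Simulate losing one DataNode: every block is still readable,
# and work for block 1 is scheduled on a surviving replica holder.
failed = {"dn2"}
still_readable = readable_blocks(placement, failed)
local_node = pick_local_node(placement, 1, failed)
print(len(still_readable), "of", len(placement), "blocks readable; block 1 runs on", local_node)
```

With a replication factor of 3, this toy cluster keeps every block readable even when two of its five nodes fail at once. A block becomes unreadable only when all three of its replica holders are down.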

HADOOP CORE COMPONENTS
Storage: HDFS, Hive, HBase
Management: ZooKeeper, Avro, Oozie, Whirr
Processing: MapReduce, Spark
Integration: Sqoop, Flume
Programming: Pig, HiveQL, Jaql
Insight: Mahout, Hue, Beeswax

CORE COMPONENT - STORAGE
HDFS: A distributed file system designed to run on commodity-grade hardware in the Hadoop computing ecosystem. The file system is highly fault tolerant, provides high-throughput access to data, and is suitable for very large data sets. Fault tolerance is achieved by making redundant copies of data blocks and distributing them throughout the Hadoop cluster.
Key characteristics include:
Streaming data access: designed for batch processing instead of interactive use.
Large data sets: typically gigabytes to terabytes in size.
Simple coherency model: enables high-throughput access.
Moving computation is cheaper than moving data.
Designed to be easily portable.

Hive: A data warehouse implementation in Hadoop that facilitates the query and management of large datasets kept in distributed storage.
Key features:
Tools for ETL.
A methodology for providing structure for multiple data formats.
Access to files stored in HDFS or HBase.
Executes queries via MapReduce.

CORE COMPONENT - STORAGE (continued)
HBase: A distributed, scalable big data database for random, real-time read/write access to big data.
Key features:
Modular scalability.
Strictly consistent reads and writes.
Automatic sharding of tables (partitioning tables into smaller, more manageable parts).
Automatic failover.

CORE COMPONENT - MANAGEMENT
ZooKeeper: A centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
Avro: A data serialization system.
Oozie: A Hadoop workflow scheduler.
Whirr: A cloud-neutral library for running cloud services.

CORE COMPONENT - PROCESSING
MapReduce: An implementation for processing and generating large data sets with a parallel, distributed algorithm on a Hadoop cluster.
Key features:
Automatic parallelization and distribution.
Fault tolerance.
I/O scheduling.
Status monitoring.

CORE COMPONENT - INTEGRATION
Sqoop: A utility designed to efficiently transfer bulk data between Hadoop and relational databases.
Flume: A service, based on streaming data flows, for collecting, aggregating and moving large amounts of system log data.

CORE COMPONENT - PROGRAMMING
Pig: A high-level language for analyzing very large data sets, designed to make efficient use of parallel processing.
Key properties:
Ease of programming: complex tasks are explicitly encoded as data flow sequences, making them easy to understand and implement.
Optimization opportunities: the system optimizes execution automatically.
Extensibility: users can encode their own functions.
HiveQL: A SQL-like query language for data stored in Hive tables, which converts queries into MapReduce jobs.
Jaql: A data processing and query language used to process JSON on Hadoop.

CORE COMPONENT - INSIGHT
Mahout: A library of callable machine learning algorithms which uses the MapReduce paradigm. Supports four main use cases:
Collaborative filtering: analyzes behavior and makes recommendations.
Clustering: organizes data into naturally occurring groups.
Classification: learns from known characteristics of existing categorizations and assigns unclassified items to a category.
Frequent item or market basket mining: analyzes data items in transactions and identifies items which typically occur together.
Hue: A set of web applications that enable a user to interact with a Hadoop cluster. It also lets the user browse and interact with Hive, Impala, MapReduce jobs and Oozie workflows.
Beeswax: An application which allows the user to run queries against the Hive data warehousing application. You can create Hive tables, load data, run queries and download results in Excel spreadsheet or CSV format.
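
To make the MapReduce paradigm and the market basket use case above more concrete, here is a minimal single-process sketch of the map, shuffle and reduce steps, applied to counting which items occur together in transactions. It is an illustration under stated assumptions, not Hadoop or Mahout code: the function names, the sample transactions and the min_support threshold are hypothetical, and Mahout's own frequent pattern mining uses a more sophisticated algorithm than plain pair counting.

```python
from collections import defaultdict
from itertools import combinations


def map_pairs(transaction):
    """Map step: emit ((item_a, item_b), 1) for every pair in one transaction."""
    items = sorted(set(transaction))
    for pair in combinations(items, 2):
        yield pair, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: sum the counts collected for one key."""
    return key, sum(values)


def count_pairs(transactions):
    """Run the three steps in sequence. On a cluster the map and reduce
    calls would run in parallel on the nodes that hold the data."""
    emitted = (kv for t in transactions for kv in map_pairs(t))
    grouped = shuffle(emitted)
    return dict(reduce_counts(k, v) for k, v in grouped.items())


def frequent_pairs(counts, min_support):
    """Keep only pairs seen in at least `min_support` transactions."""
    return sorted(pair for pair, n in counts.items() if n >= min_support)


# Hypothetical supply transactions, invented for this example.
transactions = [
    ["gauze", "tape", "gloves"],
    ["gauze", "tape"],
    ["gloves", "tape"],
    ["gauze", "gloves", "tape"],
]

pair_counts = count_pairs(transactions)
print(frequent_pairs(pair_counts, min_support=3))
```

Because each map call looks at one transaction and each reduce call looks at one key, the work can be split across machines without coordination beyond the shuffle. That property is what allows the automatic parallelization listed under MapReduce.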

HADOOP DISTRIBUTIONS
Amazon Web Services Elastic MapReduce
One of the first commercial Hadoop offerings.
Has the largest commercial Hadoop market share.
Includes strong integration with other AWS cloud products.
Auto scaling and support for NoSQL and BI integration.
Cloudera
Second-largest commercial market share.
Experience with very large deployments.
Revenue model based on software subscriptions.
Aggressive innovation to meet customer demands.
Hortonworks
Strong engineering partnerships with flagship companies.
Innovation driven through the open source community.
Is a key contributor to the Hadoop core project.
Commits corporate resources to jump-start Hadoop community projects.

HADOOP DISTRIBUTIONS
International Business Machines
Vast experience in distributed computing and data management.
Experience with very large deployments.
Has advanced analytic tools and global recognition.
Integration with a vast array of IBM management and productivity software.
MapR Technologies
Heavy focus on, and early adoption of, enterprise features.
Supports some legacy file systems such as NFS.
Adding performance enhancements for HBase, high availability and disaster recovery.
Pivotal
Spin-off from EMC and VMware.
Strong cadre of technical consultants and data scientists.
Focus on an MPP SQL engine and EDW with very high performance.
Has an appliance with integrated Hadoop, EDW and data management in a single rack.

HADOOP DISTRIBUTIONS
Teradata
Specialist with a strong background in EDW.
Has a strong technical partnership with Hortonworks.
Has very strong integration between Hadoop and Teradata's management and EDW tools.
Extensive financial and technical resources allow creation of unique and powerful appliances.
Microsoft Windows Azure HDInsight
A product designed specifically for the cloud in partnership with Hortonworks.
The only Hadoop distribution that runs in the Windows environment.
Allows SQL Server users to also execute queries that include data stored in Hadoop.
Unique marketing advantage for offering the Hadoop stack to traditional Windows customers.

RECOMMENDATION
Commitment and Leadership in the Open Source Community
Innovative Strong Engineering Partnerships
Innovation driven from the community
Secure Big Data/Health Research Collaboration

CLUSTER DIAGRAM
The NameNode is a single master server which manages the file system and file system operations.
DataNodes are slave servers that manage the data and the storage attached to them.
The NameNode is a single point of failure for the HDFS cluster.
A SecondaryNameNode can be configured on a separate server in the cluster, where it creates checkpoints of the namespace.
The SecondaryNameNode is not a failover NameNode.

CLUSTER HARDWARE CONFIGURATION AND COST
Factor/Specification                 Option 1                        Option 2
Replication Factor                   3                               3
Size of Data to Move                 500 TB                          500 TB
Workspace Factor                     1.25                            1.25
Compression                          1 (no compression)              3
Hadoop Storage Requirement           1875 TB                         625 TB
Storage Per Node                     16 TB                           16 TB
Rack Size                            42U                             42U
Node Unit Cost                       $4,000                          $4,000
Rack Unit Cost                       $1,500                          $1,500
Node Cost (1 NameNode + DataNodes)   119 nodes * $4,000 = $476,000   41 nodes * $4,000 = $164,000
Rack Cost                            3 racks * $1,500 = $4,500       1 rack * $1,500 = $1,500
Total Cost                           $480,500                        $165,500

HADOOP SANDBOX IN ORACLE VIRTUALBOX
Host Specification
Windows 10
Intel Core i7-4770 CPU @ 3.40GHz
16GB Installed RAM
64-bit OS, x64
1.65 TB Storage
VM Specification
Cloudera Quickstart Sandbox
Red Hat
Intel Core i7-4770 CPU @ 3.40GHz
10GB Allocated RAM
32MB Video Memory
64-bit OS
64GB Storage
Shared Clipboard: Bidirectional
Drag'n'Drop: Bidirectional

CLOUDERA HADOOP DESKTOP & INTERFACE
Screenshot: the Cloudera interface on opening, with a view of the CDC Healthy People 2010 data set that was uploaded to the Red Hat OS.

HUE FILE BROWSER
Screenshot: folder list and file contents. The displayed file content is from the Vulnerable Population and Environmental Health data of the Healthy People 2010 data set.

ADDING DATA TO HIVE
Screenshot: folder list and file contents. The displayed file content is from the Vulnerable Population and Environmental Health data of the Healthy People 2010 data set.

ADDING DATA TO HIVE
Choosing a delimiter type
Defining columns

ADDING DATA TO HIVE
Hive Table List
Table properties

HIVE QUERY EDITOR

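
The arithmetic behind the CLUSTER HARDWARE CONFIGURATION AND COST slide can be reproduced with a short calculation. The sketch below is an assumption-laden reconstruction, not a sizing tool. The formula (data size times replication factor times workspace factor, divided by the compression ratio) is inferred from the figures on the slide. The function and parameter names are hypothetical. The rack count assumes each node occupies 1U of a 42U rack, which the slide does not state. Under these assumptions the node subtotal for Option 1 comes to $476,000, which together with the $4,500 rack cost gives the $480,500 total shown on the slide.

```python
import math


def size_cluster(data_tb, replication, workspace, compression,
                 tb_per_node, node_cost, rack_cost,
                 rack_units=42, node_units=1):
    """Estimate storage, node count, rack count and hardware cost.

    node_units=1 (one rack unit per node) is an assumption made here;
    it is not given on the slide.
    """
    storage_tb = data_tb * replication * workspace / compression
    data_nodes = math.ceil(storage_tb / tb_per_node)
    nodes = data_nodes + 1  # one extra server for the NameNode
    racks = math.ceil(nodes * node_units / rack_units)
    node_total = nodes * node_cost
    rack_total = racks * rack_cost
    return {
        "storage_tb": storage_tb,
        "nodes": nodes,
        "racks": racks,
        "node_total": node_total,
        "rack_total": rack_total,
        "total": node_total + rack_total,
    }


# Inputs taken from the slide; only the compression ratio differs.
common = dict(data_tb=500, replication=3, workspace=1.25,
              tb_per_node=16, node_cost=4000, rack_cost=1500)
option_1 = size_cluster(compression=1, **common)  # no compression
option_2 = size_cluster(compression=3, **common)  # 3:1 compression

for name, result in (("Option 1", option_1), ("Option 2", option_2)):
    print(name, result)
```

The comparison shows how strongly the compression ratio drives cost in this model: cutting the stored volume to a third reduces the node count from 119 to 41 and the rack count from 3 to 1.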