
Expert Reference Series of White Papers

Learning How to Learn Hadoop

1-800-COURSES www.globalknowledge.com

Learning How to Learn Hadoop
Rich Morrow, IT Consultant, Developer, System Administrator, Trainer, Mentor, and Team Builder

Introduction

Hadoop's Value Proposition

Learning how to program and develop for the Hadoop platform can lead to lucrative new career opportunities in Big Data. But like the problems it solves, the Hadoop framework can be quite complex and challenging. Join Global Knowledge instructor and Technology Consultant Rich Morrow as he leads you through some of the hurdles and pitfalls students encounter on the Hadoop learning path. Building a strong foundation, leveraging online resources, and focusing on the basics with professional training can help neophytes across the Hadoop finish line.

If I've learned one thing in two decades of IT, it's that the learning never ends.

As the leader of an independent consulting firm, one of the most important decisions I make throughout the year is choosing what technologies I and other consultants need to learn. If we can identify and quickly ramp up on technologies that really move the needle for our clients, then everyone wins. In the following pages, I'd like to walk you through the path that I took in identifying Hadoop as a "must have" skill that our clients will need, and how I quickly got ramped up on the technology.

Hadoop is a paradigm-shifting technology that lets you do things you could not do before – namely, compile and analyze vast stores of data that your business has collected. "What would you want to analyze?" you may ask. How about customer click and/or buying patterns? How about buying recommendations? How about personalized ad targeting, or more efficient use of marketing dollars?

From a business perspective, Hadoop is often used to build deeper relationships with external customers, providing them with valuable features like recommendations, fraud detection, and social graph analysis.
In-house, Hadoop is used for log analysis, data mining, image processing, extract-transform-load (ETL), network monitoring – anywhere you'd want to process gigabytes, terabytes, or petabytes of data.

Hadoop allows businesses to find answers to questions they didn't even know how to ask, providing insights into daily operations, driving new product ideas, or putting compelling recommendations and/or advertisements in front of consumers who are likely to buy.

Copyright 2013 Global Knowledge Training LLC. All rights reserved.

The fact that Hadoop can do all of the above is not the compelling argument for its use. Other technologies have been around for a long, long while which can and do address everything we've listed so far. What makes Hadoop shine, however, is that it performs these tasks in minutes or hours, for little or no cost, versus the days or weeks and substantial costs (licensing, product, specialized hardware) of previous solutions.

Hadoop does this by abstracting out all of the difficult work in analyzing large data sets, performing its work on commodity hardware, and scaling linearly: add twice as many worker nodes, and your processing will generally complete twice as fast. With datasets growing larger and larger, Hadoop has become the solitary solution businesses turn to when they need fast, reliable processing of large, growing data sets for little cost.

Because it needs only commodity hardware to operate, Hadoop also works incredibly well with public cloud infrastructure: spin up a large cluster only when you need to, then turn everything off once the analysis is done. Some big success stories here are The New York Times using Hadoop to convert about 4 million entities to PDF in just under 36 hours, or the infamous story of Pete Warden using it to analyze 220 million Facebook profiles in just under 11 hours, for a total cost of $100. In the hands of a business-savvy technologist, Hadoop makes the impossible look trivial.

The Pillars of Hadoop - HDFS and MapReduce

Architecturally, Hadoop is just the combination of two technologies: the Hadoop Distributed File System (HDFS), which provides storage, and the MapReduce programming model, which provides processing [1] [2].

HDFS exists to split, distribute, and manage chunks of the overall data set, which could be a single file or a directory full of files. These chunks of data are pre-loaded onto the worker nodes, which later process them in the MapReduce phase.
By having the data local at process time, HDFS saves all of the headache and inefficiency of shuffling data back and forth across the network.

In the MapReduce phase, each worker node spins up one or more tasks (which can be either Map or Reduce). Map tasks are assigned based on data locality, if at all possible: a Map task will be assigned to the worker node where the data resides. Reduce tasks (which are optional) then typically aggregate the output of all of the dozens, hundreds, or thousands of Map tasks, and produce final output.

The Map and Reduce programs are where your specific logic lies, and seasoned programmers will immediately recognize Map as a common built-in function or data type in many languages, for example, map(function, iterable) in Python, or array_map(callback, array) in PHP. All map does is run a user-defined function (your logic) on every element of a given array. For example, we could define a function squareMe, which does nothing but return the square of a number. We could then pass an array of numbers to a map call, telling it to run squareMe on each. So an input array of (2,3,4,5) would return (4,9,16,25), and our call would look like (in Python) map(squareMe, [2,3,4,5]).

Hadoop will parse the data in HDFS into user-defined keys and values, and each key and value will then be passed to your Mapper code. In the case of image processing, each value may be the binary contents of your image file, and your Mapper may simply run a user-defined convertToPdf function against each file. In this case,

you wouldn't even need a Reducer, as the Mappers would simply write out the PDF file to some datastore (like HDFS or S3). This is what the New York Times did when converting their archives.

Consider, however, if you wished to count the occurrences of a list of "good/bad" keywords in all customer chat sessions, Twitter feeds, public Facebook posts, and/or e-mails in order to gauge customer satisfaction. Your good list may look like happy, appreciate, "great job", awesome, etc., while your bad list may look like unhappy, angry, mad, horrible, etc., and your total data set of all chat sessions and emails may be hundreds of GB. In this case, each Mapper would work only on a subset of that overall data, and the Reducer would be used to compile the final count, summing up the outputs of all the Map tasks.

At its core, Hadoop is really that simple. It takes care of all the underlying complexity, making sure that each record is processed, that the overall job runs quickly, and that failure of any individual task (or hardware/network failure) is handled gracefully. You simply bring your Map (and optionally Reduce) logic, and Hadoop processes every record in your dataset with that logic.

Solid Linux and Java Will Speed Your Success

That simplicity can be deceiving, though, and if you're assigned a project involving Hadoop, you may find yourself jumping a lot of hurdles that you didn't expect or foresee.

Hadoop is written in Java and is optimized to run Map and Reduce tasks that were written in Java as well. If your Java is rusty, you may want to spend a few hours with your Java for Dummies book before you even begin looking at Hadoop. Although Java familiarity also implies good coding practices (especially Object-Oriented Design (OOD) coding practices), you may want to additionally brush up on your Object-Oriented skills and have a clear understanding of concepts like interfaces, abstract objects, static methods and variables, etc.
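Going back to the keyword-count scenario above, the Map and Reduce logic can be sketched in Python as plain functions (a minimal sketch in the spirit of Hadoop Streaming; the word lists, sample records, and function names are illustrative, not part of any Hadoop API):

```python
# Illustrative "good" and "bad" keyword lists from the scenario above.
GOOD = {"happy", "appreciate", "awesome"}
BAD = {"unhappy", "angry", "mad", "horrible"}

def map_line(line):
    """Mapper logic: emit a (keyword, 1) pair for each good/bad word found
    in one input record (one chat line, tweet, e-mail line, etc.)."""
    pairs = []
    for word in line.lower().split():
        if word in GOOD or word in BAD:
            pairs.append((word, 1))
    return pairs

def reduce_counts(pairs):
    """Reducer logic: sum the 1s emitted for each keyword across all Mappers."""
    totals = {}
    for word, count in pairs:
        totals[word] = totals.get(word, 0) + count
    return totals

if __name__ == "__main__":
    # In a real Streaming job, mapper and reducer are separate scripts that
    # read stdin and write "word<TAB>1" lines to stdout; here we chain them
    # in-process on a tiny test dataset just to show the data flow.
    records = [
        "I am so happy with this awesome product",
        "the support made me angry and unhappy",
        "happy to help",
    ]
    all_pairs = []
    for record in records:
        all_pairs.extend(map_line(record))
    print(reduce_counts(all_pairs))
```

On a real cluster, Hadoop would run many copies of the Map step in parallel (one per chunk of data), sort the emitted pairs by key, and feed them to the Reduce step; the in-process chaining above only mimics that flow for local experimentation.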
Weekend afternoons at Barnes and Noble, or brownbag sessions with other developers at work, are the quickest ways I know to come up to speed on topics like this.

Although Hadoop does offer a Streaming API, which allows you to write your basic Map and Reduce code in any language of your choice, you'll find that most code examples and supporting packages are Java-based, that deeper development (such as writing Partitioners) still requires Java, and that your Streaming API code will run up to 25% slower than a Java implementation.

Although Hadoop can run on Windows, it was built initially on Linux, and Linux is the preferred platform for both installing and managing Hadoop. The Cloudera Distribution of Hadoop (CDH) is only officially supported on Linux derivatives like Ubuntu and Red Hat. Having a solid understanding of getting around in a Linux shell will also help you tremendously in digesting Hadoop, especially with regard to many of the HDFS command line parameters. Again, a Linux for Dummies book will probably be all you need.

Once you're comfortable in Java, Linux, and good OOD coding practices, the next logical step is getting your hands dirty by either installing a CDH virtual machine or a CDH distribution. When you install a Cloudera-provided CDH distribution of Hadoop, you're getting assurance that some of the best minds in the Hadoop community have carefully reviewed and chosen security, functionality, and supporting patches; tested them together; and provided a working, easy-to-install package.

Although you can install Hadoop from scratch, it is both a daunting and unnecessary task that could burn up several weeks of your time. Instead, download either a local virtual machine (VM), which you can run on your workstation, or install the CDH package (CDH4 is the latest) for your platform – either of which will only take 10 minutes. If you're running in a local VM, you can run the full Hadoop stack in pseudo-distributed mode, which basically mimics the operation of a real production cluster right on your workstation. This is fantastic for jumping right in and exploring.

Using Hadoop Like a Boss

Once you're doing real development, you'll want to get into the habit of using smaller test datasets on your local machine, and running your code iteratively in Local Jobrunner Mode (which lets you locally test and debug your Map and Reduce code); then Pseudo-Distributed Mode (which more closely mimics the production environment); then finally Fully-Distributed Mode (your real production cluster). By doing this iterative development, you'll be able to get bugs worked out on smaller subsets of the data, so that when you run on your full dataset with real production resources, you'll have all the kinks worked out, and your job won't crash three-quarters of the way in.

Remember that in Hadoop, your Map (and possibly Reduce) code will be running on dozens, hundreds, or thousands of nodes. Any bugs or inefficiencies will get amplified in the production environment.
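One cheap way to build that habit is to keep your Map logic in a plain function you can exercise on a handful of sample records before any cluster is involved, handling bad input defensively from the start. A minimal sketch (the record format, sample data, and helper names are all illustrative):

```python
def parse_sale(line):
    """Mapper helper: parse a 'region,amount' record, coding defensively
    so that one malformed line cannot crash a task handling millions."""
    try:
        region, amount = line.strip().split(",")
        return (region, float(amount))
    except ValueError:
        # Skip (and, in a real job, count) malformed or missing data.
        return None

def map_sales(lines):
    """Mapper logic: emit (region, amount) pairs, dropping bad records."""
    return [pair for pair in (parse_sale(l) for l in lines) if pair is not None]

if __name__ == "__main__":
    # Tiny local "dataset": two good records, two malformed ones.
    sample = ["US,250.00", "EU,99.50", "not-a-sale", "US,"]
    pairs = map_sales(sample)
    assert pairs == [("US", 250.0), ("EU", 99.5)]
    print("local checks passed; safe to try pseudo-distributed mode")
```

Once checks like these pass on your workstation, promoting the same logic through Local Jobrunner, Pseudo-Distributed, and finally Fully-Distributed Mode becomes a matter of scaling up the data, not debugging the code.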
In addition to performing iterative Local, Pseudo, Full development with increasingly larger subsets of test data, you'll also want to code defensively, making heavy use of try/catch blocks, and gracefully handling malformed or missing data (which you're sure to encounter).

Chances are also very high that once you or others in your company come across Pig or Hive, you'll never write another line of Java again. Pig and Hive represent two different approaches to the same issue: that writing good Java code to run on MapReduce is hard and unfamiliar to many. What these two supporting products provide are simplified interfaces into the MapReduce paradigm, making the power of Hadoop accessible to non-developers.

In the case of Hive, a SQL-like language called HiveQL provides this interface. Users simply submit HiveQL queries like SELECT * FROM SALES WHERE amount > 100 AND region = 'US', and Hive will translate that query into one or more MapReduce jobs, submit those jobs to your Hadoop cluster, and return results. Hive was heavily influenced by MySQL, and those familiar with that database will be quite at ease with HiveQL.

Pig takes a very similar approach, using a high-level programming language called PigLatin, which contains familiar constructs such as FOREACH, as well as arithmetic, comparison, and boolean comparators, and SQL-like MIN, MAX, and JOIN operations. When users run a PigLatin program, Pig converts the code into one or more MapReduce jobs and submits them to the Hadoop cluster, the same as Hive.

What these two interfaces have in common is that they are incredibly easy to use, and they both create highly optimized MapReduce jobs, often running even faster than similar code developed in a non-Java language via the Streaming API.

If you're not a developer, or you don't want to write your own Java code, mastery of Pig and Hive is probably where you want to spend your time and training budgets. Because of the value they provide, it's believed that the vast majority of Hadoop jobs are actually Pig or Hive jobs, even in such technology-savvy companies as Facebook.

It's next to impossible, in just a few pages, to both give a good introduction to Hadoop as well as a good path to successfully learning how to use it. I hope I've done justice to the latter, if not the former. As you dig deeper into the Hadoop ecosystem, you'll quickly trip across some other supporting products like Flume, Sqoop, Oozie, and ZooKeeper, which we didn't have time to mention here. To help in your Hadoop journey, we've included several reference resources, probably the most important of which is Hadoop: The Definitive Guide, 3rd edition, by Tom White. This is an excellent resource to flesh out all of the topics we've introduced here, and a must-have book if you expect to deploy Hadoop in production.

Applying the Knowledge at Work

At this point, your education could take one of many routes, and if at all possib