CSE 491/891 Lecture 23 (Pig and Hive)

Outline
Last lecture (on Pig Latin):
- Relational operators: FOREACH, GROUP, UNION, ORDER BY, DISTINCT, FILTER, SPLIT, JOIN
- Functions: TOKENIZE, FLATTEN, etc.
- How to count the number of words in documents using Pig
Today's lecture:
- Macros
- User-defined functions
- Intro to Hive

Macros
Macros provide a way to package reusable code:
  DEFINE <macro_name>(<parameters>) RETURNS <return_alias> {
    <Pig Latin code for the macro definition>
  };
Example (wordcount.pig):
  DEFINE wordcount(text) RETURNS counts {
    tokens   = foreach $text generate TOKENIZE($0) as terms;
    wordlist = foreach tokens generate FLATTEN(terms) as word, 1 as freq;
    groups   = group wordlist by word;
    $counts  = foreach groups generate group as word, SUM(wordlist.freq) as freq;
  };

Macros
- You can either type the macro definition into the grunt shell directly or load it from a script file
- At runtime, Pig will expand the macro using the macro definition above before executing it, as in the example below
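For example (a minimal sketch, assuming the macro definition above is saved in wordcount.pig and the input file is on HDFS; the file and alias names are illustrative):

  grunt> IMPORT 'wordcount.pig';                        -- load the macro definition
  grunt> data = LOAD 'datafile' AS (text:chararray);
  grunt> counts = wordcount(data);                      -- expanded using the definition above
  grunt> DUMP counts;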
User Defined Functions
Before writing your own functions, you may want to look at some examples first:
- Piggy Bank is a repository of Pig functions shared by the Pig community
- Pig functions are written as Java classes and archived in jar files
- You can download the source files (piggybank.tar) from the class webpage

User Defined Functions
There are different types of functions:
- Eval function: a function that takes one or more expressions and returns another expression (e.g., MAX)
    results = FOREACH grp GENERATE group, myEvalFn(data.val);
- Filter function: a function that returns a Boolean value (to be used with the FILTER operator)
    results = FILTER records BY myFilterFn(val);
- Store/Load function: a function that specifies how to load or store data
    mydata = LOAD 'mydata.txt' USING myLoadFn();
Writing Your Own Function
1. Write your function as a Java program that extends the corresponding Pig abstract class:

     Your Function Type     Pig Abstract Class
     Eval function          EvalFunc
     Filter function        FilterFunc
     Load function          LoadFunc
2. Compile your program:
     javac -cp <CLASSPATH> <YourFunction>.java
   For AWS, the CLASSPATH is
     /usr/lib/pig/pig-0.16.0-amzn-0-core-h2.jar:/usr/lib/hadoop/hadoop-aws-2.7.3-amzn-1.jar
   (check first to make sure the files exist)

Writing Your Own Function
3. Next, create a jar file to archive your program:
     shell> jar cvf <jarfile>.jar <class files>
4. To use the function, you must register it first:
     grunt> REGISTER myfunction.jar;
5. You can now start using the function in the grunt shell or in your Pig script file.
Example
Consider the problem of finding frequent words in a corpus of documents by filtering out infrequent words that appear fewer times than a threshold.
Pig Latin:
  data = LOAD 'datafile' AS (text:chararray);
  counts = wordcount(data);          -- this uses the wordcount macro defined earlier
  REGISTER PopularTerms.jar;
  results = FILTER counts BY PopularTerms(freq, 2);   -- PopularTerms is a user-defined filter
                                                      -- function; assume the threshold is 2
  dump results;

PopularTerms.java
This program takes 2 input arguments:
- the word frequency
- the minimum frequency threshold (user parameter)
It returns the value true if the word frequency is greater than or equal to the threshold:
  results = FILTER counts BY PopularTerms(freq, thres);

PopularTerms.java (Overall Structure)
  import java.io.*;
  import java.util.*;
  import org.apache.pig.FilterFunc;
  import org.apache.pig.backend.executionengine.ExecException;
  import org.apache.pig.data.Tuple;
  import org.apache.pig.impl.logicalLayer.FrontendException;

  public class PopularTerms extends FilterFunc {
      @Override
      public Boolean exec(Tuple tuple) throws IOException {
          // This code will filter out tuples with frequency below the threshold;
          // tuple contains the parameter values passed to the function
      }
  }

PopularTerms.java
  public Boolean exec(Tuple tuple) throws IOException {
      if (tuple == null || tuple.size() == 0) {
          return false;
      }
      try {
          Object field1 = tuple.get(0);   // get frequency
          Object field2 = tuple.get(1);   // get threshold
          if (field1 == null || field2 == null) {
              return false;
          }
          int count = ((Long) field1).intValue();
          int threshold = (Integer) field2;
          return (count >= threshold);
      } catch (ExecException e) {
          throw new IOException(e);
      }
  }
Make sure you check the types of the fields first in grunt (using DESCRIBE).

PopularTerms.java
To compile and archive:
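A minimal sketch, using the AWS classpath from the earlier slide (the jar name is just a choice):

  shell> javac -cp /usr/lib/pig/pig-0.16.0-amzn-0-core-h2.jar:/usr/lib/hadoop/hadoop-aws-2.7.3-amzn-1.jar PopularTerms.java
  shell> jar cvf PopularTerms.jar PopularTerms.class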
Script to run the job (filterWords.pig); you can also run it interactively by typing it into grunt.

PopularTerms.java
Submit the job:
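A sketch of submitting it, assuming the Pig Latin from the Example slide is saved as filterWords.pig:

  shell> pig filterWords.pig        (runs the whole script as a batch job)
  grunt> run filterWords.pig        (or run it interactively from the grunt shell)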
Summary
- Pig is a high-level programming framework built on top of Hadoop to simplify the task of writing programs to manipulate data (analogous to SQL)
- Limitation: the syntax is still quite different from SQL; it's a procedural language instead of a declarative language like SQL

CSE 491/891 Intro to Hive
Hive
- A data warehousing framework built on top of Hadoop
- Created to make it possible for analysts with strong SQL skills (but little Java programming skill) to run queries on large volumes of data
- Developed by Jeff Hammerbacher and his team at Facebook

Main Components of Hive
- Hive Server: a service that enables remote clients to execute Hive queries and retrieve the results
- HiveQL: an SQL-like query language for interacting with Hive
- Hive Compiler: converts Hive queries into map-reduce jobs
- Hive Client: a command line interface (CLI) to interact with Hive
  - Older Hive versions: HiveCLI (now deprecated)
  - Current Hive versions: Beeline

Hive
- Separation of data from the table schema:
  - Data is stored on HDFS (raw data files)
  - Schema is stored in a metastore
- Unlike a relational database (SQL), you can't simply update or delete an individual record in a table
- Hadoop is used for batch processing; it is not an online transaction database (where you can add/update/remove any rows in real time)

Launching Beeline Client
- After creating an AWS EMR cluster, you may need to wait more than 15 minutes to ensure Hive is fully installed; otherwise, you may run into different types of errors when executing Hive query commands
- To launch the beeline hive client (this usually takes about 10 minutes to get up and running):
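A minimal sketch, assuming you are logged in (via ssh) to the EMR master node; the exact command on your cluster may differ:

  shell> beeline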
Useful Commands
- !connect: connect to the Hive server
- !help: get help
- !close: close the server connection
- !close all: close all server connections
- !quit: exit

Launching Hive Client
- Use jdbc to connect to the Hive server
- You can use the default database
- Username and password can be blank (press Enter); for example:
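A sketch of the connection step; the host and port here assume HiveServer2 is running locally on its default port (10000):

  beeline> !connect jdbc:hive2://localhost:10000/default
  Enter username for jdbc:hive2://localhost:10000/default:   (press Enter)
  Enter password for jdbc:hive2://localhost:10000/default:   (press Enter)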
- Once connected, you can start entering SQL commands for creating, altering, querying, and dropping tables in the database

Other Example Commands
- To list all the beeline commands available: !help
- To exit the client program: !quit
Data Storage
- By default, Hive will store your data under the /user/hive/warehouse directory on HDFS
  - You may request Hive to add compression and special directory structures (such as partitions) to the data to improve query performance
- However, you can also store your data in another directory on HDFS
  - Hive will take the data as it is and will not try to optimize your data storage during query processing

Internal and External Tables
- Internal tables:
  - When a new table is created, the raw data will be moved from its original location to /user/hive/warehouse
  - When the table is dropped, both the metadata (its schema) and the actual raw data will be removed
- External tables:
  - When the table is created, a directory with the same name as the table is created at /user/hive/warehouse, but the data is kept at its original location
  - When the table is dropped, its directory under /user/hive/warehouse is removed (along with the metadata), but the original data (which is stored elsewhere) is not removed

CREATE TABLE
Internal table:
  CREATE TABLE <table_name> (<column definitions>) ... ;
  LOAD DATA INPATH '<filepath>' INTO TABLE <table_name>;
External table:
  CREATE EXTERNAL TABLE <table_name> (<column definitions>) ... LOCATION '<hdfs_directory>';
  -- specify the location of the original data

Example: Wikipedia Revisions
- Input data (wiki_edit.txt)
- Columns are separated by white space
- Schema: (RevisionID, Article, EditDate, Username)

CREATE TABLE
- The syntax looks very similar to SQL; each command is terminated by a semicolon
- A CREATE TABLE statement that only lists the column names and types is incomplete: you can't do an INSERT command like in SQL, so you need to specify the original data file and how the fields within each record are delimited (separated)

CREATE INTERNAL TABLE
- To load data from an existing file, you need to specify the delimiter character for the fields, as in the sketch below
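A minimal sketch for the wiki example (the column types are assumptions; the single-space delimiter matches the whitespace-separated input file):

  CREATE TABLE wiki (
    RevisionID  INT,
    Article     STRING,
    EditDate    STRING,
    Username    STRING
  )
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ' '
  STORED AS TEXTFILE;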
Example
- You can use hadoop fs to check the /user/hive/warehouse directory (e.g., hadoop fs -ls /user/hive/warehouse)
- Note that the CREATE TABLE command will create a subdirectory named wiki under /user/hive/warehouse
- But there is no data file in that directory yet, since we did not specify where the original data is located

LOAD DATA INTO INTERNAL TABLE
- Input data is wiki_edit.txt (make sure it is uploaded to HDFS in a directory named wiki); for example:
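A sketch of the load statement, assuming the file was uploaded to /user/hadoop/wiki on HDFS:

  LOAD DATA INPATH '/user/hadoop/wiki'     -- filepath on HDFS
  OVERWRITE INTO TABLE wiki;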
- This will move the data files in /user/hadoop/wiki on HDFS to the /user/hive/warehouse/wiki directory
- OVERWRITE forces the content of the target table to be deleted and replaced with the files referred to by the INPATH
- Hive does not parse the file into an internal database format, i.e., files are stored in their original format on HDFS
- Conversion from the raw bytes in a file to the appropriate fields occurs when a retrieval or update query is processed

CREATE EXTERNAL TABLE
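A sketch of the external-table version (the column types and the table name wiki_ext are illustrative; the data stays where it is):

  CREATE EXTERNAL TABLE wiki_ext (
    RevisionID  INT,
    Article     STRING,
    EditDate    STRING,
    Username    STRING
  )
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ' '
  LOCATION '/user/hadoop/wiki';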
- Data will NOT be moved to /user/hive/warehouse; it will remain in the external directory /user/hadoop/wiki

Data Types
(tables of Hive primitive and complex data types)
SHOW and DESCRIBE TABLES
- To display the list of tables available: SHOW TABLES
- To show the schema of a table: DESCRIBE

SQL Queries
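A minimal sketch of these commands, plus one illustrative query over the wiki table (any HiveQL query works the same way):

  SHOW TABLES;       -- list the tables in the current database
  DESCRIBE wiki;     -- show the schema of the wiki table

  -- e.g., the ten users with the most edits
  SELECT Username, COUNT(*) AS numEdits
  FROM wiki
  GROUP BY Username
  ORDER BY numEdits DESC
  LIMIT 10;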
Another Example: Bestbuy queries
- Upload data (bbuy.csv) to HDFS into a directory named bestbuy
- Schema: (user, sku, category, query, clicktime, querytime)

Another Example: Bestbuy queries
- Create an external table, for example:
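A sketch, assuming the data sits in /user/hadoop/bestbuy and reading every column as a string (the actual column types may differ); user and query are backquoted to avoid clashes with HiveQL keywords:

  CREATE EXTERNAL TABLE temp (
    `user`     STRING,
    sku        STRING,
    category   STRING,
    `query`    STRING,
    clicktime  STRING,
    querytime  STRING
  )
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  LOCATION '/user/hadoop/bestbuy';    -- file location on HDFS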
- The LOCATION clause gives the file location on HDFS
- If you drop the external table temp, the original data files still exist in /user/hadoop/bestbuy (but the table no longer exists)

DROP and ALTER TABLE
- Use DROP TABLE to delete the data and metadata for a table:
    DROP TABLE <table_name>;
- If you want to delete the data but keep the table definition, simply remove the files in HDFS (without removing the directory)
- Use ALTER TABLE to modify the table's schema:
    ALTER TABLE abc RENAME TO xyz;
    ALTER TABLE abc ADD COLUMNS (newcol STRING);