Deriving interesting data is always tied to a time period. We may want to extract information from the whole lifetime of a data set, or only from a given window, say the last month or week. To specify such options, Pig gives us user defined functions (UDFs), which let us write the filters and operations we want Pig to perform on each entry of the data set. This gives more control over data filtering and flow. Some of the nuances of Pig UDFs are explained in the example below.
Pig Version of this example: Apache Pig version 0.10.0 (r1328203)
Objective: You want to write a filter function in Pig that keeps only the data rows falling in a date range you are interested in, and you want to invoke the script from a scheduler that passes the date range in as command line parameters. The relevant parts of the pig script are shown below.
1) Passing command line parameters to a pig script: You pass command line arguments like this: pig -param datefrom='2012-10-20 00:00:00' -param dateto='2012-10-23 00:00:00' -x mapreduce user-based-analytics.pig. (I am actually calling the script from Python, which we will see in the next post.)
Here I am using these two date parameters to construct my Java UDF. If a parameter value contains a space character, it must be quoted exactly as shown above; otherwise Pig will throw an error saying it cannot create your UDF Java class.
2) Within the pig script: You refer to the command line parameters using the format '$paramname', e.g. '$datefrom' and '$dateto'.
3) Defining your UDF reference using the DEFINE pig keyword: This allows you to create a reference to your UDF which you can call through an alias. Note that Pig aliases may contain only letters, digits and underscores, so the alias must be date_based_filter rather than date-based-filter. The script defines the reference as follows:
define date_based_filter com.home.pig.udfs.TimeStampFilter('$datefrom', '$dateto');
where date_based_filter is the alias that I will use to call my UDF, the com.home.pig.udfs.TimeStampFilter java class.
4) Calling your UDF filter in a pig script using the FILTER keyword: You call the UDF through its alias inside a FILTER expression, which Pig evaluates to true or false for each row. Here we check date_based_filter(ts) == TRUE, i.e. does my UDF com.home.pig.udfs.TimeStampFilter, acting on the current row's ts field with the 'datefrom' and 'dateto' bounds, return Java Boolean true or false:
filter_by_date = filter site_data by date_based_filter(ts) == TRUE;
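Putting steps 1) through 4) together, a minimal pig script could look like the sketch below. The input path and the (ts, user, url) schema are assumptions made for illustration; only the DEFINE and FILTER lines come from the original script.

-- user-based-analytics.pig (sketch; LOAD path and schema are assumed)
DEFINE date_based_filter com.home.pig.udfs.TimeStampFilter('$datefrom', '$dateto');
site_data = LOAD '/logs/site-data' USING PigStorage('\t')
            AS (ts:chararray, user:chararray, url:chararray);
filter_by_date = FILTER site_data BY date_based_filter(ts) == TRUE;
STORE filter_by_date INTO '/logs/site-data-filtered';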
5) Now the Java class that does the filtering:
a) Extend org.apache.pig.FilterFunc, the base class for Pig filter UDFs, and add a constructor taking the two date strings that the DEFINE statement passes in.
b) Override the public Boolean exec(Tuple arg0) member function to define how this filter handles tuples from the script. Here I just parse the date from the string and check whether it falls within the range.
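The real class extends org.apache.pig.FilterFunc, which needs the Pig jar on the classpath, so the sketch below isolates just the date-range logic in a plain Java class that stands alone. The class name TimeStampFilter matches the UDF above, but the accept(String) helper is hypothetical: in the actual UDF the same check lives inside exec(Tuple), reading the timestamp string from the tuple.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

// Standalone sketch of the range check behind com.home.pig.udfs.TimeStampFilter.
// The real UDF extends org.apache.pig.FilterFunc and runs this logic inside
// exec(Tuple); accept(String) is a hypothetical helper for illustration.
public class TimeStampFilter {
    private SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
    private Date from;
    private Date to;

    // Pig invokes this constructor with the '$datefrom' and '$dateto'
    // strings supplied in the DEFINE statement.
    public TimeStampFilter(String datefrom, String dateto) {
        try {
            this.from = fmt.parse(datefrom);
            this.to = fmt.parse(dateto);
        } catch (ParseException e) {
            throw new IllegalArgumentException("bad date parameter", e);
        }
    }

    // Returns true when ts lies inside [from, to]; rows whose timestamp
    // fails to parse are simply filtered out.
    public boolean accept(String ts) {
        try {
            Date d = fmt.parse(ts);
            return !d.before(from) && !d.after(to);
        } catch (ParseException e) {
            return false;
        }
    }
}
```

In the real exec(Tuple) you would pull the timestamp out with arg0.get(0), run this same check, and return the result as a Boolean.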
Why use Pig and UDFs? Writing a UDF is easy and saves a lot of time compared to writing a raw MapReduce Java program or any other option. And if you have a ton of data, or expect to end up with one, this is the better option: Hadoop will scale, and Pig will do jobs like grouping and filtering the data for you.