Hortonworks Sandbox Pig tutorial

I just completed the Hortonworks Pig tutorial. It seemed very straightforward, yet I ran into one problem.

The Pig script as specified was:

batting = load 'Batting.csv' using PigStorage(',');
runs = FOREACH batting GENERATE $0 as playerID, $1 as year, $8 as runs;
grp_data = GROUP runs by (year);
max_runs = FOREACH grp_data GENERATE group as grp,MAX(runs.runs) as max_runs;
join_max_run = JOIN max_runs by ($0, max_runs), runs by (year,runs);
join_data = FOREACH join_max_run GENERATE $0 as year, $2 as playerID, $1 as runs;
dump join_data;

Yet it generated an error. I wasn’t able to understand the logs well enough (yet!) to debug it, so I fell back to Googling it and found this:

http://hortonworks.com/community/forums/topic/error-while-running-sand-box-tutorial-for-pig-script/

As best I can understand, the input data has a header row of column names, yet the script assumes there is none. So the fix is to keep only rows where runs is greater than 0, which filters out the non-numeric header row.

batting = load 'Batting.csv' using PigStorage(',');
runs_raw = FOREACH batting GENERATE $0 as playerID, $1 as year, $8 as runs;
runs = filter runs_raw by runs > 0;
grp_data = GROUP runs by (year);
max_runs = FOREACH grp_data GENERATE group as grp,MAX(runs.runs) as max_runs;
join_max_run = JOIN max_runs by ($0, max_runs), runs by (year,runs);
join_data = FOREACH join_max_run GENERATE $0 as year, $2 as playerID, $1 as runs;
dump join_data;

I suppose that there’s also a way to filter out the first row but my Pig isn’t anywhere near good enough for that.
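
One possible way to do that, as a rough sketch rather than a tested fix: match the header row on its first field and drop it. This assumes the header's first field is the literal text playerID, which I haven't checked against the file; the no_header alias is just a name made up for this example.

```pig
-- Assumption: the header row's first field is the literal text 'playerID'.
batting = load 'Batting.csv' using PigStorage(',');
-- Keep every row except the one whose first field matches the header text.
no_header = filter batting by (chararray)$0 != 'playerID';
runs = FOREACH no_header GENERATE $0 as playerID, $1 as year, $8 as runs;
```

The rest of the script (grouping, MAX, join, dump) would stay as it is. Unlike the runs > 0 filter, this keeps rows where a player scored zero runs, which makes no difference to the maximum here.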

Other than that, Pig seems interesting. Sort of a procedural programming language version of a subset of what the next tutorial shows us in Hive.
