I much preferred this tutorial in Hive, rather than the previous one using Pig. Using the same dataset in each example made the comparison clearer.
Pig makes sense for sequential steps, such as an ETL job. Hive seemed better suited for tasks comparable to ones in which we’d write stored procedures within a more traditional database server.
Another difference came with debugging.
- The Pig editor bundled into the Hortonworks sandbox isn’t very sophisticated as IDEs go. No breakpoints, viewing of data, etc. Perhaps there’s a way to accomplish this, but (thankfully) it isn’t covered in such an early stage tutorial. There’s a button to upload a UDF jar, so I’ve got to research how one develops that jar outside of the Pig script editor.
- The Hive tutorial makes it easier to view progress at each step, since you can think of each step as an independent SQL (actually HiveQL) statement. If the programming task were far more complex, I could see myself structuring the Pig scripts in a way that might be easier to debug than Hive.
- Hive seemed good for an ad-hoc query and Pig for a complex procedural task.
- The next tutorial combines Pig and Hive. I’ll see how that shapes my perceptions.