- Hive is a SQL-like layer on top of Hadoop
- Use it when you have some sort of structure to your data.
- You can use JDBC and ODBC drivers to interface with your traditional systems (see the JDBC sketch after this list). However, it's not high-performance: queries are compiled into batch jobs on the cluster rather than answered interactively.
- Originally built by (and still used by) Facebook to bring traditional database concepts into Hadoop in order to perform analytics. Also used by Netflix to run daily summaries.
- Pig is sometimes compared to Hive, in that both are "languages" layered on top of Hadoop. However, Pig is more analogous to a procedural language for writing applications, while Hive is aimed at traditional database programmers moving over to Hadoop.
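Since Hive speaks JDBC, connecting from plain Java looks much like talking to any other database. Below is a minimal sketch, assuming a HiveServer2 instance at hive-host.example.com:10000 and a web_logs table (both hypothetical placeholders); the driver class org.apache.hive.jdbc.HiveDriver comes from the Hive JDBC jar.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcSketch {
    public static void main(String[] args) throws Exception {
        // Hive's JDBC driver for HiveServer2 (ships with the Hive distribution).
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Host, port, database, and credentials are placeholders.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hive-host.example.com:10000/default", "hive", "");
             Statement stmt = conn.createStatement();
             // HiveQL looks like SQL, but the query is compiled into jobs
             // that run across the cluster -- convenient, not low-latency.
             ResultSet rs = stmt.executeQuery(
                 "SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page")) {
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
    }
}
```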
Sqoop is a tool for efficient, large-scale loads and extracts between an RDBMS and Hadoop.
This ecosystem has enough made-up words that it's important to get the commonplace, industry-standard terms right: "JDBC driver" and "JDBC connector".
- A driver is the database vendor's standard JDBC driver.
- A connector is the Sqoop component that uses that driver to move data; it can be generic or vendor-specific.
- Sqoop’s Generic JDBC connector is always available as part of the standard distribution.
- Also includes connectors for MySQL, PostgreSQL, Oracle, Microsoft SQL Server, IBM DB2, and Netezza. However, the DB vendors (or someone else) might offer customized/optimized connectors of their own.
- If the programmer doesn't select a connector, or if the data source is not known until runtime, Sqoop can try to figure out the appropriate connector on its own. Sometimes this is easy, such as when the JDBC URL to access the data starts with something like jdbc:mysql://… (see the sketch after these bullets).
- A tool for bi-directional data transfer between Hadoop and relational databases using JDBC.
- Optimized drivers for specific database vendors are available.
- Command line tool
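To make the driver/connector distinction and the URL-based inference concrete, here is a minimal sketch of the JDBC plumbing underneath: DriverManager asks each registered driver whether it accepts a given URL, so the jdbc:mysql: prefix alone identifies the vendor. Sqoop does its own parsing of the connect string, but the idea is the same. The URL is a placeholder, and the sketch assumes a vendor driver jar (e.g. MySQL Connector/J) is on the classpath.

```java
import java.sql.Driver;
import java.sql.DriverManager;
import java.util.Enumeration;

public class ConnectorLookupSketch {
    public static void main(String[] args) throws Exception {
        // A vendor-specific connect string; host and database are placeholders.
        String url = "jdbc:mysql://db.example.com:3306/sales";

        // Ask every registered JDBC driver whether it handles this URL.
        // Only a driver whose scheme matches (here, jdbc:mysql:) says yes.
        Enumeration<Driver> drivers = DriverManager.getDrivers();
        while (drivers.hasMoreElements()) {
            Driver d = drivers.nextElement();
            System.out.printf("%s accepts %s: %b%n",
                    d.getClass().getName(), url, d.acceptsURL(url));
        }
    }
}
```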
Flume and FlumeNG (Next Generation)
- Enables real-time streaming of data into HDFS and HBase.
- The use case for Flume is streaming data, such as continual input from web server logs.
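One way to push events into a Flume NG agent is its Java client SDK. A minimal sketch follows, assuming an agent with an Avro source listening at flume-agent.example.com:41414 (hypothetical host and port). In practice, web server logs are more often picked up by a source configured on the agent itself; the SDK just shows the event path in a few lines.

```java
import java.nio.charset.StandardCharsets;

import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class FlumeClientSketch {
    public static void main(String[] args) throws EventDeliveryException {
        // Hostname and port are assumptions; they must match the agent's
        // configured Avro source.
        RpcClient client = RpcClientFactory.getDefaultInstance(
                "flume-agent.example.com", 41414);
        try {
            // Each record becomes a Flume event; the agent's sink decides
            // whether it lands in HDFS, HBase, or elsewhere.
            Event event = EventBuilder.withBody(
                    "GET /index.html 200", StandardCharsets.UTF_8);
            client.append(event);
        } finally {
            client.close();
        }
    }
}
```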