Sqoop is a tool for efficient and large loads/extracts between RDMS and Hadoop.
This ecosystem has enough made up words that it’s important to get the commonplace industry standard words correct — “JDBC Driver” and “JDBC Connector”.
- Driver is a JDBC driver.
- Connector could be generic or vendor specific
- Sqoop’s Generic JDBC connector is always available as part of the standard distribution.
- Also includes connectors for MySQL, PostgreSQL, Oracle, MS SQL, IBM DB2, and Neteza. However, the DB vendors (or someone else) might have customized/optimized connectors.
- If the programmer doesn’t select a connector, or if the data source is not known until runtime, Sqoop can try to figure out what the appropriate connector is. Sometimes this is easy, such as if the url to access the data is something like jdbc::myslq//…
This article from Cloudera offers up use cases (such as customer sentiment) and a tutorial for using Apache Flume for near-real-time indexing (as emails arrive on your mail server) or MapReduce (actually MapReduceIndexerTool) for batch indexing of email archives. The two methods can be combined if you decide to do real-time, but later decide to add another MIME header field into the index.
Cloudera Search is based on Apache Solr (which contains components like Apache Lucene, SolrCloud, Apache Tika, and Solr Cell).
The email (including the MIME header) is parsed (with the help of Cloudera Morphlines), then uses Flume to push the messages into HDFS, as Solr intercepts and indexes the contents of the email fields.
Searching and viewing the results can be done using the Solr GUI or Hue’s search application.
Posted in apache, cloudera, Flume, HDFS, Hue, MapReduce, Solr, Tika, tutorial, Use Case
Tagged cloudera.com, github.com, lucene.apache.org, tika.apache.org
Within the classic MapReduce is the Job Tracker component. Yarn splits Job Tracker into two further components: Resource Manager (aka RM) (allocating cpu, ram, etc) and Node Manager (aka NM) (which operates at the level of a single node/machine). The Application Manager (aka AsM) negotiates resources from the Resource Manger and with the Node Manager to execute tasks. Job Tracker is already an ancient architecture — five years old!!
Yarn is sometimes referred to as MapReduce 2.0 or MRv2.
Resource Manager supports hierarchical application queues to guarantee allocation ratios of cluster resources. However, it does not enable recovery from application or hardware failures. It does not monitor. It only schedules. Scheduling methods include FIFO (default) and Capacity. Fair is not currently supported.
ZookKeeper monitors Resource Manager in order to switch to a secondary if Resource Manager itself fails. In a failover scenario, running applications are restarted and the queue continues. Preservation of state within currently running applications is handled by checkpoints stored by the Application Master within HDFS.
Rather than having specific containers to execute Map jobs and Reduce jobs, Yarn enables containers for more generic jobs, which enables developers to write other applications that run on the cluster.
It’s unclear whether Yarn will make the system run faster or slower. Generalization and modularization usually comes at a cost. However, Yarn allows for more complete utilization of CPU and RAM resources so in theory can squeeze every last bit of capacity out of a cluster, whereas the fixed size containers in MapReduce 1.0 could have left some resources idle. Yarn does not mange I/O which is typically a bigger bottleneck than RAM. There’s also no management of network bandwidth in Yarn. (Note to self, got to figure this out: I saw another article that says that Yarn does manage cpu, disk and network, yet didn’t mention RAM).
Another benefit of a more modularized architecture is that it makes the system easier to maintain. Any updates to MapReduce 1.0 requires the replacement of a pretty big chunk of software. Being able to run multiple versions of MapReduce within a cluster of thousands of nodes is important. Significant downtime would otherwise be required for upgrades.
I’m summarizing this article. For specifics (such as how to configure split machines across racks to better configure the network switches) see the article. None of this content is operating system or hardware vendor specific, but generally the discussions assume Linux.
Goal is to minimize data movement and process on the same machine that stores the data. Therefore each machine in the cluster needs appropriate CPU and disk. Problem is that when building the cluster the nature of the queries and the resulting bottlenecks may not yet be known. If a business is building its first Hadoop cluster, it may not yet fully understand the types of business problems that will eventually be solved by it. That’s in contrast to a business deploying it’s Nth Oracle server.
Types of bottlenecks:
- IO: reading from disk (or a network location)
- data import/export
- data transformation
- CPU: processing the Map query
- text mining
- natural language processing
Other issues, since a cluster could eventually scale to hundreds or thousands off machines
The Cloudera Manager can provide realtime statistics about how a currently running MapReduce job impacts the CPU, disk, and network load.
Roles of the components within a Hadoop cluster:
- Name Node (and Standby Name Node): coordinating data storage on the cluster
- Job Tracker: coordinating data processing
- Task Tracker
- Data Node
Data Node and Task Tracker
- The vast majority of the machines in a cluster will only peform the roles of Data Node and Task Tracker, which should not be run on the same nodes as Name and Job.
- Other components (such as HBase) should only be run on the Data Nodes if they operate on data. You want to keep data local as much as possible. HBase needs about 16 GB Heap to avoid garbarge collection timeouts. Impala will consume up to 80% of available RAM.
- Assumed to be lower performance machines than the Name Node and Job Tracker
Name Node and Job Tracker
- Standby Name Node should (obviously) not be on the same machine as the Name Node.
- Name Node (and Standby Name Node) and Job Tracker should be enterprise class machines (redundant power supplies, enterprise class raid’ed disks)
- Name Node should have RAM in proportion to number of data blocks in the cluster. 1GB RAM for every 1 million blocks in HDFS. With 100 Data Node cluster, 64 GB RAM is fine. Since the machine’s tasks will be disk intensive, you’ll want enough RAM to minimize virtual memory swapping to disk.
- 4 – 6 TB of disk, not raid’ed (JBOD configuration)
- 2 CPUs (at least quad code). Recommend more CPUs and/or cores as opposed to faster CPU speed, since in a large cluster the higher speed will draw more power and generate more heat, yet not scale as well as if there were simply more CPUs or better yet nodes.
Cloudera has defined four standard configurations
- Light Processing (I’m not sure what the use case is for this. Prototype? Sandbox?)
- Balanced Compute (recommeded for your 1st cluster, since it’s not likely you’ll properly identify which configuration is best suited for your use case)
- Storage Heavy
- Compute Heavy
Posted in cloudera, DataNode, hardware, HBase, Impala, JobTracker, Linux, MapReduce, NameNode, TaskTracking, tutorial, UNIX
I’d been playing with the sandbox from Cloudera, but just discovered that there’s also one available from HortonWorks. Runs on Oracle VirtualBox (recommended), VMWare Fusion or Player, and Microsoft Hyper-V.
by Mile Olson (Cloudera CEO) titled “HADOOP: Scalable, Flexible Data Storage and Analysis”
Hadoop is an open source Apache project, but a lot of the contributions come from Cloudera.
The Cloudera Distribution of Hadoop (CDH) appears to be the defacto standard, although other vendors such as IBM have their own. Cloudera provides a downloadable VM with a fully configured single node of Hadoop. I was able to get this up an running on my own MacBook Pro running Oracle Virtual Box in about 15 minutes.
Cloudera claims that they have more customers and more experience thatn any other Hadoop vendor.