I just read a very funny (and informative) article on InfoWorld about clueless “cloud experts”. It translates easily into any tech vertical, and it reminded me of many people who don’t understand Big Data.
- I’ve built Big Data applications years ago.
I have a good friend (who I hope never reads this) who insists that he built a Big Data application in 1992 using Apple HyperCard, with both the executable and the data distributed on one CD-ROM. Of course that was “a lot” of data in 1992. So one question, if we want to be pedantic: if you don’t use Hadoop, can it be a Big Data application?
- Big Data has no privacy. Isn’t that what the NSA proved?
This misconception is the exact opposite of the truth. The NSA uses Accumulo, a sorted key/value store that runs on top of Hadoop and enforces cell-level access controls, and siphons data from all sorts of systems all over the planet. Sure, it probably pulls from some Hadoop systems, but for the NSA to collect so much data, doesn’t it make sense that the vast majority must come from ordinary non-Hadoop systems?
- Big Data is the answer for everything.
I know a guy who suggested using Hadoop (running the Teradata distribution no less!) to store data feeds that we weren’t ready to run ETL on yet. Wouldn’t a simple file share be a lot easier?
A research paper from Cornell University discusses scheduling Hadoop jobs based on an analysis of available network bandwidth. Typically a Hadoop cluster considers only server node availability when scheduling, so it can place a task on a node that is free but sits behind a congested link. The paper assumes Software Defined Networking (SDN), a new front in virtualization technology that is critical for dynamic scaling of clouds.
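To make the difference concrete, here is a minimal sketch of the two scheduling policies. It is not the algorithm from the paper: the `Node` class, the function names, and the bandwidth figures are all hypothetical, and a real scheduler would get bandwidth readings from the SDN controller rather than from a hard-coded list.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_slots: int        # task slots currently available on this server
    bandwidth_mbps: float  # assumed available bandwidth on the path to the input data


def pick_node_slots_only(nodes):
    """Baseline policy: take the first node that has a free slot."""
    for node in nodes:
        if node.free_slots > 0:
            return node
    return None


def pick_node_bandwidth_aware(nodes, input_mb):
    """Among nodes with a free slot, pick the lowest estimated transfer time."""
    candidates = [n for n in nodes if n.free_slots > 0 and n.bandwidth_mbps > 0]
    if not candidates:
        return None
    # Megabytes * 8 = megabits; dividing by Mbps gives seconds.
    return min(candidates, key=lambda n: input_mb * 8 / n.bandwidth_mbps)


# Hypothetical cluster: node-c has the best link but no free slot.
nodes = [
    Node("node-a", free_slots=2, bandwidth_mbps=100.0),
    Node("node-b", free_slots=1, bandwidth_mbps=900.0),
    Node("node-c", free_slots=0, bandwidth_mbps=1000.0),
]

baseline_choice = pick_node_slots_only(nodes)
aware_choice = pick_node_bandwidth_aware(nodes, input_mb=512)
```

With these made-up numbers the baseline settles for node-a because it is simply the first free node, while the bandwidth-aware policy prefers node-b, which would pull the same input roughly nine times faster.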
Normally we’d like to think of Hadoop running on hundreds of racks of commodity hardware, but that doesn’t mean that we should forget all of the reasons why we love virtualization.
This case study explains the how and why, and provides benchmarks from an experiment running Hadoop on VMware. Of course the experiment was successful, as the study was published by VMware.
The moral of the story is that just because Hadoop can run on commodity hardware doesn’t mean that it has to, or that it’s the best way to deploy.
Xplenty offers Hadoop as a Service for Amazon Web Services in all AWS global regions. This HaaS offering promises a “coding free design environment”, in addition, of course, to the hardware-free environment that AWS already provides.
- Hive is a SQL-like layer on top of Hadoop
- Use it when you have some sort of structure to your data.
- You can use JDBC and ODBC drivers to interface with your traditional systems. However, it’s not high performance.
- Originally built by (and still used by) Facebook to bring traditional database concepts into Hadoop in order to perform analytics. Also used by Netflix to run daily summaries.
- Pig is sometimes compared to Hive, in that they are both “languages” that are layered on top of Hadoop. However, Pig is more analogous to a procedural language to write applications, while Hive is targeted at traditional DB programmers moving over to Hadoop.
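The difference in style between the two can be sketched without a Hadoop cluster. The example below is only an analogy: it uses Python's built-in `sqlite3` module as a stand-in for a SQL-like layer such as Hive, and plain Python steps as a stand-in for a Pig-style data flow. The table, column names, and data are made up, and neither half is HiveQL or Pig Latin.

```python
import sqlite3
from collections import defaultdict

# Hypothetical page-view records: (user_name, page).
rows = [("alice", "home"), ("bob", "home"), ("alice", "search"), ("alice", "home")]

# Declarative, Hive-like style: describe the result and let the engine plan it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (user_name TEXT, page TEXT)")
conn.executemany("INSERT INTO page_views VALUES (?, ?)", rows)
sql_result = conn.execute(
    "SELECT user_name, COUNT(*) FROM page_views "
    "GROUP BY user_name ORDER BY user_name"
).fetchall()
conn.close()

# Step-by-step, Pig-like style: load, group, then aggregate explicitly.
groups = defaultdict(list)
for user_name, page in rows:
    groups[user_name].append(page)
flow_result = sorted((user_name, len(pages)) for user_name, pages in groups.items())
```

Both halves produce the same per-user counts. The first will feel familiar to a traditional DB programmer, while the second spells out each transformation in order, which is closer to how a procedural data-flow script reads.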
At its core, what the NSA is doing is finding anti-patterns. Crunching through huge sets of uninteresting data is the only way to find the interesting data.
Also, the Department of Defense sees the success that the NSA is having with Hadoop technologies, and is considering using them (most likely Accumulo) to store large amounts of unstructured and schemaless data.
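The idea of crunching through uninteresting data to surface the interesting bits can be shown with a toy example. This is only a sketch of one simple approach (flagging event types that are rare relative to everything else), not a description of what the NSA actually runs; the function name, threshold, and event data are invented for illustration.

```python
from collections import Counter


def rare_events(events, max_share=0.01):
    """Return event types whose share of all events is at most max_share."""
    counts = Counter(events)
    total = sum(counts.values())
    return sorted(e for e, c in counts.items() if c / total <= max_share)


# Hypothetical log: almost everything is routine, a handful of events are not.
events = ["login"] * 600 + ["page_view"] * 395 + ["bulk_export"] * 5

flagged = rare_events(events)
```

The point is that the rare event only stands out because the routine events were counted too. At real scale, that counting step is what a Hadoop job would distribute across the cluster.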
In 2013 Cloudera acquired a company called Myrrix, which has morphed into a project (not yet a product) called Oryx. The system still uses MapReduce, which is not optimal. Before it becomes a product it’ll be rewritten using Spark.
Oryx will enable construction of machine learning models that can process data in real time. Possible use cases are spam filters and recommendation engines (which seems to be its sweet spot).
This competes with Apache Mahout, which processes in batch mode only.
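The batch versus real-time distinction can be sketched with a toy recommender. Nothing below uses the Oryx or Mahout APIs: `CoOccurrenceModel`, `train_batch`, and the sample baskets are hypothetical, and the co-occurrence counting is a deliberately simple stand-in for a real machine learning model.

```python
from collections import defaultdict
from itertools import combinations


class CoOccurrenceModel:
    """Toy recommender: counts how often two items appear in the same basket."""

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))

    def update(self, basket):
        """Fold a single basket into the model (the real-time path)."""
        for a, b in combinations(sorted(set(basket)), 2):
            self.counts[a][b] += 1
            self.counts[b][a] += 1

    def recommend(self, item, k=1):
        """Return the k items seen most often with `item`, ties broken by name."""
        neighbours = self.counts.get(item, {})
        ranked = sorted(neighbours.items(), key=lambda kv: (-kv[1], kv[0]))
        return [name for name, _ in ranked[:k]]


def train_batch(baskets):
    """Build a model from the full history in one pass (the batch path)."""
    model = CoOccurrenceModel()
    for basket in baskets:
        model.update(basket)
    return model


history = [["hadoop", "hive"], ["hadoop", "pig"], ["hadoop", "hive", "spark"]]

# Batch mode: the model reflects only what was in the last training run.
batch_model = train_batch(history)

# Real-time mode: same starting point, but new events are folded in as they arrive.
online_model = train_batch(history)
online_model.update(["hadoop", "spark"])
online_model.update(["hadoop", "spark"])

batch_pick = batch_model.recommend("hadoop")
online_pick = online_model.recommend("hadoop")
```

After the two new baskets arrive, the incrementally updated model already recommends spark alongside hadoop, while the batch model keeps recommending hive until the next full retraining run.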