I just read a very funny (and informative) article on InfoWorld about clueless “cloud experts”. It translates easily to any tech vertical, and it made me recall so many examples of people who don’t understand Big Data.
- I built Big Data applications years ago.
I have a good friend (who I hope never reads this) who insists that he built a Big Data application in 1992 using Apple HyperCard, with both the executable and the data distributed on one CD-ROM. Of course, that was “a lot” of data in 1992. So one question, if we want to be pedantic: if you don’t use Hadoop, can it be a Big Data application?
- Big Data has no privacy. Isn’t that what the NSA proved?
This misconception is the exact opposite of the truth. The NSA uses Accumulo, a very secure data store built on top of Hadoop, and siphons data from all sorts of systems all over the planet. Sure, it probably pulls from some Hadoop systems, but for the NSA to get so much data, doesn’t it make sense that the vast majority must be coming from ordinary non-Hadoop systems?
- Big Data is the answer for everything.
I know a guy who suggested using Hadoop (running the Teradata distribution no less!) to store data feeds that we’re not ready to run ETL on yet. Wouldn’t a simple fileshare be a lot easier?
At its core, what the NSA is doing is finding anomalies, the records that break from the normal pattern. Crunching through huge sets of uninteresting data is the only way to find the interesting data.
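To make the idea of sifting interesting records out of uninteresting ones concrete, here is a minimal sketch in Python. It is not how the NSA or any Hadoop job actually does it. It assumes we already have one count per key (the keys and counts below are made up) and flags keys whose count sits far from the median, using the median absolute deviation. The function name `find_outliers` and the 3.5 threshold are my own choices for illustration.

```python
from statistics import median


def find_outliers(counts, threshold=3.5):
    """Return the entries of `counts` that sit far from the typical value.

    Uses a modified z-score based on the median absolute deviation (MAD),
    so one extreme value cannot hide itself by inflating the average.
    """
    values = list(counts.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # At least half the values are identical: anything different stands out.
        return {key: v for key, v in counts.items() if v != med}
    return {
        key: v
        for key, v in counts.items()
        if 0.6745 * abs(v - med) / mad > threshold
    }


# Hypothetical per-key event counts; almost all of them are "boring".
EVENT_COUNTS = {"a": 10, "b": 12, "c": 11, "d": 9, "e": 10, "f": 250}
```

Calling `find_outliers(EVENT_COUNTS)` returns only the `"f"` entry. The point is that the boring majority is what defines "normal", which is why you have to crunch through all of it.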
Also, the Department of Defense sees the success that the NSA is having with Hadoop technologies, and is considering using them (most likely Accumulo) to store large amounts of unstructured and schemaless data.
Sqrrl is powered by Apache Accumulo, a low-latency NoSQL database that was originally developed for the NSA in 2008 and uses Hadoop as its file system.
- Support for both role based and attribute based security controls
- Encryption at rest and in motion
- Can use multiple keys
- Trust boundaries limit the admin’s access to data
- Encryption costs only about 10% in performance
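The attribute-based controls in the list above are easier to picture with a toy example. The sketch below is plain Python and does not use Accumulo's or Sqrrl's actual API. It assumes each cell carries a label expression built from attribute names, `&`, `|`, and parentheses, and that a reader sees a cell only if their set of authorizations satisfies the label. The names `is_visible`, `scan`, and `CELLS` are hypothetical, and the sample rows are made up. One simplification: this toy lets `&` bind tighter than `|` when parentheses are missing, which a real system may not allow.

```python
import re


def is_visible(expression, authorizations):
    """Evaluate a label such as 'admin&(finance|audit)' against a set of
    authorization strings. An empty label means the cell is visible to all."""
    tokens = re.findall(r"[A-Za-z0-9_]+|[&|()]", expression)
    pos = 0

    def parse_or():
        nonlocal pos
        result = parse_and()
        while pos < len(tokens) and tokens[pos] == "|":
            pos += 1
            rhs = parse_and()  # always parse the right side to consume tokens
            result = result or rhs
        return result

    def parse_and():
        nonlocal pos
        result = parse_term()
        while pos < len(tokens) and tokens[pos] == "&":
            pos += 1
            rhs = parse_term()
            result = result and rhs
        return result

    def parse_term():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        token = tokens[pos]
        pos += 1
        if token == "(":
            result = parse_or()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return result
        if token in {"&", "|", ")"}:
            raise ValueError("unexpected token: " + token)
        return token in authorizations

    if not tokens:
        return True
    result = parse_or()
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens")
    return result


def scan(cells, authorizations):
    """Return (row, column, value) for every cell the reader may see."""
    return [
        (row, column, value)
        for row, column, label, value in cells
        if is_visible(label, authorizations)
    ]


# Made-up sample data: (row, column, label, value).
CELLS = [
    ("row1", "name", "", "Alice"),
    ("row1", "salary", "hr&finance", 90000),
    ("row1", "clearance", "admin|(hr&audit)", "secret"),
]
```

With this data, `scan(CELLS, {"hr", "finance"})` returns the name and salary cells but not the clearance cell, while `scan(CELLS, {"admin"})` returns the name and clearance cells. The filtering happens per cell rather than per table, which is what makes this kind of control finer-grained than ordinary role-based permissions.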
Most of the discussions about NSA data collection are devoid of technical facts. The media just likes to throw around the word “metadata” as if that means nothing to those of us who work all day with nothing other than metadata.
Here’s an article that doesn’t talk down to us, but explains how simple it is to replicate the HDFS nodes from Yahoo and Google data centers.
The problem seems to be that Yahoo and Google encrypt data in motion, but not data at rest. Would Accumulo solve the encryption problem for data at rest? Perhaps, but Accumulo was originally developed for the NSA, which can likely break the encryption using the processing power of huge Hadoop clusters.