An interesting article with examples from large corporations' presentations on how they use Hadoop. Most presentations at this conference were about standalone big data.
HSBC created a 360-degree view of the customer, but it was for “agile reporting”, not the traditional sort that would be used in a call center or fed from a data warehouse. There was, however, no plan for reconciling Hadoop and the data warehouse; they ran in parallel as standalone systems.
Many presentations avoided core enterprise concerns such as governance. Some seemed “proud” to bypass it, as if somehow exempt from an inflexible model.
1. Big Data Exploration
I don’t agree with the author’s category. He admits that this is a “one size fits all” category. It almost seems as if he had four use cases and decided to make it five by adding that you can search, visualize, and understand data from multiple sources to help decision making. Haven’t we been doing this all along, with whatever database tools we’ve had?
2. Enhanced 360 degree view of the customer
From my own experience, I had a project where we did this for a call center. The key was that we ran real-time queries to generate the 360-degree view when the call center agent took the call from the customer. The problem was that in order to produce the view in only a couple of seconds, we were very limited in what sort of data we had access to, and how we could analyze it. The Big Data perspective on 360-degree views assumes that the Hadoop repository retains a persistent copy of the data, something that many organizations don’t want. The data will also likely not be real time. However, having a copy of the data, and the time to crunch it in batch mode, gives a deeper insight into the customer. Perhaps what’s needed is a hybrid of real-time and batch, sort of like what Twitter is doing with Storm.
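The hybrid idea above can be sketched as a minimal "lambda"-style merge: a batch job precomputes a deep view over the full history (the Hadoop side), and the real-time lookup folds in only the events that arrived since the last batch run. All names here (`build_batch_view`, `customer_360`, the event fields) are illustrative, not any real product's API.

```python
# Hypothetical sketch of a hybrid batch + real-time customer view.
# Nightly batch pass builds the deep view; the call-center lookup
# merges in today's events so the agent sees a current picture.

def build_batch_view(all_events):
    """Batch pass over the full event history (what Hadoop would do overnight)."""
    view = {}
    for e in all_events:
        cust = view.setdefault(e["customer_id"], {"orders": 0, "complaints": 0})
        cust[e["type"] + "s"] = cust.get(e["type"] + "s", 0) + 1
    return view

def customer_360(customer_id, batch_view, recent_events):
    """Real-time merge: start from the precomputed view, fold in recent events."""
    merged = dict(batch_view.get(customer_id, {"orders": 0, "complaints": 0}))
    for e in recent_events:
        if e["customer_id"] == customer_id:
            merged[e["type"] + "s"] = merged.get(e["type"] + "s", 0) + 1
    return merged

history = [
    {"customer_id": "c1", "type": "order"},
    {"customer_id": "c1", "type": "complaint"},
]
today = [{"customer_id": "c1", "type": "order"}]

batch_view = build_batch_view(history)
print(customer_360("c1", batch_view, today))  # {'orders': 2, 'complaints': 1}
```

The point of the split is that the expensive aggregation never runs at call time; only the small tail of recent events does.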
3. Security/Intelligence Extension
Searching for past occurrences of fraud, or creating a predictive model of possible future occurrences, is very much a batch operation, and Hadoop works well here since the scope of the analysis is limited only by the depth of the data and the duration of the operations upon it.
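As a toy illustration of this kind of full-history batch scan, the sketch below flags transactions that sit far outside a customer's historical spending pattern. The fields, threshold, and data are invented; a real pipeline would run this pass over years of records, which is exactly what Hadoop makes cheap.

```python
# Illustrative batch fraud scan: flag transactions whose amount deviates
# from the customer's historical mean by more than z standard deviations.

from statistics import mean, stdev

def flag_outliers(history, txns, z=3.0):
    """Return (customer, amount) pairs that look anomalous vs. history."""
    flagged = []
    for cust, amount in txns:
        past = history.get(cust, [])
        if len(past) >= 2:  # need at least two points for a stdev
            mu, sigma = mean(past), stdev(past)
            if sigma > 0 and abs(amount - mu) > z * sigma:
                flagged.append((cust, amount))
    return flagged

history = {"c1": [20.0, 25.0, 19.0, 22.0, 24.0]}
new_txns = [("c1", 21.0), ("c1", 900.0)]

print(flag_outliers(history, new_txns))  # [('c1', 900.0)]
```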
4. Operations Analysis
I think the author’s example of the “internet of things” might be a stretch, but commingling and analyzing unstructured and/or semi-structured server and application logs is a perfect use case for Hadoop. This is especially true if the log data streams in, so that the results of your analysis are updated as each batch cycle completes.
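The log-analysis use case follows the classic Hadoop Streaming pattern: a mapper emits (key, 1) pairs from semi-structured log lines, and a reducer sums them after the shuffle. The sketch below drives both functions locally on invented sample lines; on a cluster the same two functions would read stdin and write stdout. The log format here is an assumption.

```python
# Map -> shuffle -> reduce over semi-structured log lines, simulated locally.
# Counts ERROR lines per hour; assumes lines like "2014-03-01 13:05:22 ERROR ...".

from itertools import groupby

def mapper(line):
    """Emit (hour, 1) for every ERROR line."""
    parts = line.split()
    if len(parts) >= 3 and parts[2] == "ERROR":
        hour = parts[1].split(":")[0]
        yield (hour, 1)

def reducer(key, values):
    """Sum the counts for one key, as Hadoop does after the shuffle phase."""
    return (key, sum(values))

logs = [
    "2014-03-01 13:05:22 ERROR db timeout",
    "2014-03-01 13:17:40 INFO request ok",
    "2014-03-01 13:59:01 ERROR db timeout",
    "2014-03-01 14:02:13 ERROR cache miss",
]

# Simulate map -> shuffle (sort by key) -> reduce.
pairs = sorted(kv for line in logs for kv in mapper(line))
result = [reducer(k, (v for _, v in grp)) for k, grp in groupby(pairs, key=lambda kv: kv[0])]
print(result)  # [('13', 2), ('14', 1)]
```

Because the reduce step is associative, each new batch of incoming log data can be folded into the running totals as it arrives, which is what keeps the analysis current as each batch cycle completes.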
5. Data Warehouse Augmentation
Some data can be pre-processed in Hadoop before loading into a traditional data warehouse. Other data can be analyzed without being loaded into a data warehouse at all, where it might just clutter up other queries. Hadoop lets you dump everything in and sort it out later; data warehouses are intended to be kept tidy.
In 2001 the META Group (later acquired by Gartner) defined “Big Data” using the “3 Vs”:
- Volume: amount of data
  - By one estimate in 2013, 90% of all digital data has been created since 2011
  - The Square Kilometer Array Telescope (http://www.skatelescope.org/) will generate approx 1 exabyte of data per day
  - IDC defines Big Data projects as having at least 100 terabytes, which is naïve since this would preclude the vast majority of organizations from justifying a Big Data project.
- Velocity: frequency of data in and out
  - Not just daily batch uploads
  - Google stores metadata about the searches that people perform – approx 2.5 million per minute
- Variety: range of data types and sources
  - Most data created is unstructured. Some is semi-structured (spreadsheets?). Very little is structured (forms).
  - An example of “variety” is a bakery which has many kinds of bread
Another article defines an additional 4 Vs:
- Veracity: processes to ensure that the data is correct. This seems intuitive for all databases, but the scale of Big Data also scales the impact of data inconsistencies.
- Variability: data that changes meaning based on context (of other data, or of time). This requires far more complexity of analysis than can be done using a traditional SQL relational database system.
  - An example of “variability” is a bakery in which the sourdough bread is just a little bit different every few days on an unpredictable schedule. The variety is the same, but attributes of it are variable. If one could predict on which days to expect specific tastes, then it would simply be variety.
- Visualization: this becomes difficult only because of the previous five Vs. With simple data, the difficulty of visualization scales linearly. With Big Data, the difficulty of visualization is a factor of each of the 7 Vs.
- Value: the ability for analysis of the data to be deemed worth a lot of money. I don’t agree that this should be a V at all, since it is dependent upon the other 6 Vs. Can’t simple data also have great value? Just because data is Big doesn’t imply that it is possible to extract value.
This is in contrast to IBM’s definition of Small Data, which seems to be simply “not the 3 Vs”:
- Low Volume
  - Big Data projects are viable even if the volume is only a few gigabytes. In that case you have only a few nodes in your cluster. There’s no shame in having a Big Data infrastructure that isn’t thousands of nodes.
- Batch Velocity
  - Sometimes the source can generate data only in batches, not in real time, so this should not disqualify the project if the criteria for the other Vs are met.
- Structured Data
  - Web server log files are often an input into a Big Data system yet are not structured. At best they are semi-structured.
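A quick sketch of why web server logs count as only semi-structured: a common-log-format line has a loose shape that a regex can usually capture, but fields like the request string need further parsing, and malformed lines are routine. The sample line and field names below are invented for illustration.

```python
# Parsing an Apache/NGINX "common log format" line with a regex: the line has
# a predictable outline but no schema, so every field comes back as a string
# and non-matching lines must be tolerated.

import re

COMMON_LOG = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-)'
)

line = '203.0.113.7 - - [01/Mar/2014:13:05:22 +0000] "GET /index.html HTTP/1.1" 200 5120'

m = COMMON_LOG.match(line)
record = m.groupdict() if m else None  # None for malformed lines
print(record["status"], record["request"])  # 200 GET /index.html HTTP/1.1
```

The gap between "a regex mostly works" and "a schema guarantees types" is precisely the gap between semi-structured and structured data.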