In 2001, META Group (later acquired by Gartner) defined “Big Data” using the “3 Vs”:
- Volume
- Amount of data
- By one estimate in 2013, 90% of all digital data had been created since 2011.
- The Square Kilometre Array telescope (http://www.skatelescope.org/) is expected to generate approximately 1 exabyte of data per day.
- IDC defines Big Data projects as those with at least 100 terabytes of data. This threshold is naïve, since it would prevent the vast majority of organizations from justifying a Big Data project.
- Velocity
- Frequency of data in and out
- Not just daily batch uploads
- Google stores metadata about the searches that people perform, at approximately 2.5 million searches per minute.
- Variety
- Range of data types
- Range of data sources
- Most data created is unstructured. Some is semi-structured (spreadsheets?). Very little is structured (forms).
- An example of “variety” is a bakery that sells many kinds of bread.
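To put the volume and velocity figures quoted above on a common scale, here is a minimal Python sketch of the unit conversions. The constants simply restate the estimates from the list (1 exabyte per day, 2.5 million searches per minute), with an exabyte assumed to be 10^18 bytes. The function names are made up for illustration.

```python
# Estimates quoted in the list above, not measurements.
SKA_BYTES_PER_DAY = 10**18        # 1 exabyte, assuming the decimal (SI) definition
SEARCHES_PER_MINUTE = 2_500_000   # approximate search rate

SECONDS_PER_DAY = 24 * 60 * 60
MINUTES_PER_DAY = 24 * 60


def bytes_per_second(bytes_per_day: int) -> float:
    """Convert a daily data volume into a sustained per-second rate."""
    return bytes_per_day / SECONDS_PER_DAY


def events_per_day(events_per_minute: int) -> int:
    """Convert a per-minute event rate into a daily count."""
    return events_per_minute * MINUTES_PER_DAY


if __name__ == "__main__":
    terabytes_per_second = bytes_per_second(SKA_BYTES_PER_DAY) / 10**12
    print(f"{terabytes_per_second:.2f} TB/s")                        # 11.57 TB/s
    print(f"{events_per_day(SEARCHES_PER_MINUTE):,} searches/day")   # 3,600,000,000 searches/day
```

Seen this way, one exabyte per day is a sustained rate of roughly 11.57 terabytes per second, and 2.5 million searches per minute adds up to 3.6 billion per day.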
Another article defines four additional Vs:
- Veracity
- Processes to ensure that the data is correct. This seems intuitive for any database, but the scale of Big Data also scales the impact of data inconsistencies.
- Data that changes meaning based on context (of other data, or of time). This requires far more complex analysis than a traditional SQL relational database system can provide.
- Variability
- An example of “variability” is a bakery in which the sourdough bread is just a little bit different every few days on an unpredictable schedule. The variety is the same but attributes of it are variable. If one could predict on which days one could expect specific tastes, then it would simply be variety.
- Visualization
- Visualization becomes difficult only because of the previous five Vs. With simple data, the difficulty of visualization scales linearly. With Big Data, the difficulty of visualization is a factor of each of the 7 Vs.
- Value
- The ability to assign a high monetary worth to analysis of the data. I don’t agree that this should be a V at all, since it depends on, or is created by, any of the other 6 Vs. Can’t simple data also have great value? Just because data is Big doesn’t imply that value can be extracted from it.
This is in contrast to IBM’s definition of Small Data, which seems to be simply “not 3 Vs”:
- Low Volume
- Big Data projects are viable even if the volume is only a few gigabytes. In this case you have only a few nodes in your cluster. There’s no shame in having a Big Data infrastructure that isn’t thousands of nodes.
- Batch Velocity
- Sometimes the source can generate data only in batches, not in real time. This should not disqualify the project if the criteria for the other Vs are met.
- Structured Data
- Web server log files are often an input to a Big Data system, yet they are not structured. At best they are semi-structured.
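To illustrate what “semi-structured” means for a web server log, here is a rough Python sketch that parses one line in the widely used Common Log Format. The sample line, the field names, and the `parse_log_line` helper are all made up for illustration. Real logs vary by server configuration, so treat the pattern as an assumption rather than a general-purpose parser.

```python
import re
from typing import Optional

# Hypothetical sample line in Common Log Format; not taken from a real server.
SAMPLE_LINE = (
    '192.0.2.10 - alice [10/Oct/2013:13:55:36 -0700] '
    '"GET /bread/sourdough.html HTTP/1.1" 200 2326'
)

# The fields appear in a fixed order, but nothing in the line itself
# declares a schema; the pattern encodes that order by convention.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_log_line(line: str) -> Optional[dict]:
    """Return the named fields of a log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


if __name__ == "__main__":
    print(parse_log_line(SAMPLE_LINE))
    print(parse_log_line("a line in some other format"))  # None
```

The line has a recognizable layout, but the request field is free text and lines that deviate from the convention simply fail to parse. That is why such files sit between structured and unstructured data.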