Big Data is a topic I find very exciting. Data provides knowledge; it provides insight; it highlights trends; it lets us glimpse the future; it helps us understand the past. Data allows us to learn from our mistakes. It allows us to make expert decisions at difficult times. It can keep loved ones alive, whether through mission-critical air traffic control systems or cutting-edge medical research.
However, data can only enable us to do these things if it contains the correct fields, or collection of fields, which influenced any given scenario. As humans we cannot consistently and accurately predict all of the pertinent information to collect to enable us to retrospectively dissect a given scenario. What’s more, we don’t always know what we want to find out until the opportunity for tailoring the data collection process has long since passed. We need mechanisms which allow us to gather huge amounts of data which can be stored, managed and analysed in volumes we have never considered before. This is the realm of Big Data.
Over a series of posts I’m going to explore Big Data. I will consider: what is Big Data; what concepts and technologies enable Big Data; what problems can it solve which current methods cannot; what skills are required to leverage its power; and how might Big Data change the IT landscape, both from a consumer’s and from an IT professional’s perspective. Along the way I will also explore what is being done with Big Data in the world right now.
For this, the first post in the series, I will take a high level look at what Big Data is.
What is Big Data?
There are a number of competing definitions of Big Data; however, at the time of writing this post, there doesn’t seem to be one universally accepted definition. Some argue that Big Data is:
- Working with data sizes ranging from hundreds of terabytes to petabytes or more
- Working with data which is too big to be handled by traditional tooling/technologies
- Any computing which involves Hadoop (the MapReduce programming model at the core of Hadoop is a key feature of the Big Data landscape, but not all implementations of the MapReduce model target Big Data as their core business – e.g. RavenDB)
It is also worth noting that many consider that data volume in itself is not the sole indicator of Big Data. Other pertinent factors are the velocity and variety of the data which, together with volume, are collectively referred to as the three V’s.
Competing definitions aside, it is fair to say that you cannot claim to work with Big Data without diving into distributed/parallel processing and the associated concepts and technologies. Specifically, Big Data wouldn’t be achievable in its current guise if it wasn’t for two seminal papers:
- The Google File System, published by Google in 2003, describing a fault-tolerant, scalable, cluster-based file system
- MapReduce: Simplified Data Processing on Large Clusters, published by Google in 2004, which presents the MapReduce programming model
I will explore these models in a subsequent post covering the concepts and technologies enabling Big Data.
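To give a flavour of the MapReduce model ahead of that post, the classic illustration is a word count. The sketch below is a minimal, single-process Python illustration of the model’s map, shuffle and reduce phases – it is not Hadoop’s actual API, and real frameworks distribute these phases across a cluster:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key (word)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["big data is big", "data provides insight"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts["big"])   # 2
print(counts["data"])  # 2
```

The appeal of the model is that the map and reduce functions are independent per key, so a framework like Hadoop can scatter them across thousands of machines without the programmer writing any distribution logic.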
Why is Big Data Important?
In his lecture “The Unreasonable Effectiveness of Data”, Peter Norvig, Director of Research at Google, highlights that you gain much better insight from running relatively simple algorithms on large datasets than from running complex algorithms on smaller datasets. Simply put, greater volumes of data can provide much better insights.
In addition to this, ongoing advances in computing have resulted in ever cheaper computing power and storage. Simultaneously, our cloud-based technologies and architectures are maturing. This has increased our capability to store and process much larger volumes and complexities of data, and thanks to the cloud this capability is no longer available only to large companies with huge budgets.
If you combine the insights in Mr Norvig’s lecture with the technological advances in storage and processing of recent years, we are now in a position where Big Data is not only a real possibility, but is actually being used successfully in the market by many companies. Some areas which may benefit from this are:
- Medicine: being able to carry out intensive statistical analysis on volumes and breadths of data previously not possible may result in breakthroughs in a wide range of specialities
- Stock market predictions: a “Mood of the Nation” type study in Oct 2010 analysed emotive words in millions of Tweets to predict the DJIA several days in advance, and was reported to be 87.6% accurate (read Twitter mood predicts the stock market). Big Data would undoubtedly increase the opportunity for these types of studies
- Web 3.0: Search engine capabilities, social networking advances and perhaps even the rise of the semantic web? Big data would undoubtedly be required to handle the huge volumes of metadata required to create a semantic web – if that would in fact be part of 3.0?
One famous, and somewhat controversial, application of Big Data is by the NSA. They were reported by USA Today to have collected hundreds of billions of records of telephone calls made by U.S. citizens. The data is believed to include detailed call information intended to be used for traffic analysis and social network analysis: undoubtedly a requirement for the application of Big Data tools and techniques. This activity was claimed by Bloomberg News to have begun seven months before the September 11 attacks. Perhaps such an application of Big Data will someday thwart atrocities such as 9/11 – although it does, in this context, raise the question of civil liberties.
Is Big Data Relevant Now?
We highlighted a couple of real-life uses of Big Data in the previous section, i.e. stock market predictions from social network data and the NSA’s call database, but what are other key organisations and figures in the industry predicting about Big Data? Microsoft are currently working with Hortonworks (formed by the key architects and committers from Yahoo!’s Hadoop software engineering team) to bring Hadoop to Azure. This demonstrates Microsoft’s belief in, and commitment to, Big Data and the Hadoop project. Similar offerings of confidence come from Amazon Web Services, IBM and Greenplum, and of course Google and Yahoo! have always been at the heart of Big Data. An impressive list of products that include Hadoop can be found here.
Some other interesting quotes on Big Data are:
“IDC expects the Big Data technology and services market to grow at a 39.4% compound annual growth rate through 2015.” – Dan Vesset, program vice president for IDC’s Business Analytics Solutions.
“Many organizations are becoming overwhelmed with the volumes of unstructured information – audio, video, graphics, social media messages – that falls outside the purview of their ‘traditional databases’,” – Joe McKendrick, analyst at Unisphere Research
“Deloitte predicts that in 2012, ‘big data’ will likely experience accelerating growth and market penetration.” – Deloitte.com, Billions and billions: big data becomes a big deal
I think it is fair to say that there is a clear case for Big Data, and the technologies driving it are progressing at an incredible rate. There are real-life offerings of Big Data analytics in the commercial world – just look at the case studies on Big Data analytics company Sumerian’s website here, and you will find that this really is an area which is growing quickly.
I think that Big Data offers some extremely exciting opportunities for computing. It will enable the IT industry as a whole to provide a much greater service to all the industries which call on it: both directly through data analytics and indirectly through pushing at the frontiers of computing. For example, Big Data may play a pivotal role in the emergence of Web 3.0 by enabling a much greater web of metadata to support a semantic web.
I am looking forward to writing further posts diving deeper into this interesting and extremely promising area…
Happy coding! 😉