Big Data is not just Hadoop. There are several other ways to harness Big Data, and many different kinds of technologies for making sense of it. Big Data, then, is not a technology at all; it is simply a use case, and there are several of them out there. The central question is this: how effectively can we make business decisions using Big Data? There are three A's of Big Data: Analytics, Attribution and Algorithms. While Analytics deals with the 'what' of data (correlation), Attribution focuses on the 'why' (causation). To do both, we need Algorithms, or simply programs. We have been doing analytics for the past 20-30 years; attribution is what now catches people's fancy. We could be content with analytics, since it gives us actionable insights into data, but being human, we want to know the history behind a pattern. Sometimes that knowledge helps; most of the time it doesn't. Nor do you need only maths and stats professionals. What you need is a broad range of people who understand domains (businesses) and can communicate well with their teams, besides holding a PhD in applied maths and statistics.
Big Data is a revolution in itself, and we are at the tip of the iceberg. Imagine being overwhelmed by the data thrown at you from everywhere. To figure it out, you need to know where to look and then ask the right questions. The answers can be found in the haystack; what matters is asking the right questions. The data received is so huge that it is no longer feasible to transport it over a network. Instead, you move your programs to the location of the data, so the compute typically happens on the same machine that holds the data. The results of the compute can then be transmitted over the network, since they are considerably smaller than the data itself. Like an expert system that learns the rules for processing data, Big Data programs can also learn. There are essentially two ways in which they do so: supervised and unsupervised. The former learns from examples labelled by human experts, while the latter infers patterns from the data without labels. Both modes of learning are cornerstones of the evolving field called Machine Learning.
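To make the "move compute to the data" idea concrete, here is a minimal sketch in plain Python (an assumed illustration, not Hadoop's actual API): each data partition is counted where it lives, and only the small per-partition summaries cross the "network" to be combined.

```python
from collections import Counter

# Hypothetical data partitions, each imagined as living on a different node.
partitions = [
    ["big", "data", "is", "not", "just", "hadoop"],
    ["big", "data", "needs", "algorithms"],
]

def local_count(words):
    """Runs where the data lives; returns a small summary, not the raw data."""
    return Counter(words)

# Only these compact Counters would travel over the network.
partial_results = [local_count(p) for p in partitions]

# Combining the small summaries is cheap compared to shipping the raw data.
total = Counter()
for partial in partial_results:
    total.update(partial)

print(total["big"])   # 2
print(total["data"])  # 2
```

This is essentially the word-count shape of a MapReduce job: the `local_count` step is the map, and the final combination is the reduce.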
All of this may seem interesting, but let's talk about why Big Data is not just Hadoop. For this, we have to understand both Big Data and Hadoop. First of all, Big Data is a loose term for data that has volume, velocity and variety. Volume may be of the order of petabytes and above; velocity may be of the order of GB/s; and variety entails storing information that is structured, semi-structured and unstructured. Hadoop (more precisely HDFS, Hadoop's storage layer) is a good solution for storing all this streaming data. It is basically a 'schema-on-read' filestore, so you can store data in its as-is format and apply whatever schema you want while reading it. This, you will recall, is the biggest stumbling block of RDBMSs: they have a strict schema. Here's a comparison between Hadoop and RDBMSs.
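A toy example (an assumption for illustration, not HDFS code) shows what 'schema-on-read' means in practice: heterogeneous records are stored verbatim, and the reader projects them onto whatever schema it needs at read time.

```python
import json

# Heterogeneous records stored as-is, the way a filestore would keep them.
raw_lines = [
    '{"user": "alice", "clicks": 3}',
    '{"user": "bob", "clicks": 7, "country": "IN"}',  # extra field is fine
]

def read_with_schema(lines, fields):
    """Apply a schema only while reading: project each record onto `fields`."""
    for line in lines:
        record = json.loads(line)
        yield {f: record.get(f) for f in fields}

# The same stored data, read through one chosen schema.
for row in read_with_schema(raw_lines, ["user", "clicks"]):
    print(row)
```

A schema-on-write RDBMS would have rejected the second record (or forced a table alteration) because of the extra `country` column; here the mismatch is simply resolved at read time.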
As you can see above, Hadoop is well suited for OLAP-style applications with lots of queries, but it is not suited for OLTP applications with lots of Inserts, Updates and Deletes. Also, Hadoop can store data down to any level of granularity, while RDBMSs are not good at storing many different kinds of data.
For Big Data to work, it has to work with a combination of structured and unstructured data, and it has to serve both OLAP and OLTP purposes. Hence what we will see in the future is a combination of Hadoop and traditional systems. There is one more class of solutions called NoSQL ('Not Only SQL' databases). They are like Hadoop in the sense that we can do fast OLAP operations on them (typically column-oriented databases). We have 4 kinds of NoSQL databases. They are
- Key Value
- Document Oriented
- Column Family
- Graph
All of them have their pros and cons. The market has many vendors providing multiple solutions to address various use cases, and the market for Hadoop itself is very fragmented. While HDFS (Hadoop's storage) has not changed, the layers above it, such as HBase, Pig and Hive, vary dramatically from vendor to vendor. This remains an area of concern. Because most of these offerings are open source, vendors add features as if each were their own product. And we can't blame them: competition often brings out the best among vendors, and what excels becomes a standard.
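The four NoSQL data models can be sketched with plain Python structures (toy in-memory assumptions, not real database APIs) to show how differently each one shapes the data:

```python
# Key-Value: an opaque value looked up by key (Redis-style).
kv = {"session:42": "alice"}

# Document Oriented: the value is a structured, queryable document (MongoDB-style).
docs = {"user:1": {"name": "alice", "tags": ["admin", "dev"]}}

# Column Family: sparse columns addressed by (row key, family:qualifier) (HBase-style).
columns = {
    ("user:1", "info:name"): "alice",
    ("user:1", "info:city"): "Pune",
}

# Graph: nodes plus explicit edges between them (Neo4j-style).
graph = {"alice": ["bob"], "bob": []}

print(kv["session:42"])                  # alice
print(docs["user:1"]["tags"][0])         # admin
print(columns[("user:1", "info:city")])  # Pune
print(graph["alice"])                    # ['bob']
```

The identifiers and layouts here are made up for illustration; the point is only that each model trades schema flexibility against query power in a different way, which is why use cases map to different stores.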
The future belongs to a combination of these solutions addressing various use cases. While we will see the rise of Hadoop, it cannot be the single answer to all problems. Hence, if I may use the term, polyglot solutions (many technologies co-existing with each other) will be the key to the future of Big Data.