For many years now, Hadoop has remained the undisputed champion helping enterprises tame their biggest mammoth: Big Data. However, time and again even a champion needs to prove its worth and its place. For organizations today, the decision to call it quits on Hadoop is a hard one, and rightly so, because there is a lot at stake. So here is the one question a typical organization needs to answer: is Hadoop slow for your big data? The answer is:
No:
No, if what your organization faces is truly huge data, then Hadoop can still probably help. Huge data here means insurmountable data that you have to break down, curate, analyze, and make sense of. This is the first question enterprises need to answer: is your data really big? Figure that out to begin with. If your data runs to megabytes or even a couple of gigabytes, it is not that big, and simple tools and scripting languages like Python can help you sort it out.
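To make the small-data case concrete, here is a minimal sketch of that kind of plain-Python processing. The file name sales.csv and its region and revenue columns are hypothetical; the point is only that a streaming aggregation over a few gigabytes needs nothing heavier than the standard library.

```python
import csv
from collections import defaultdict

# Minimal sketch: sum revenue per region from a hypothetical sales.csv.
# The file is streamed line by line, so memory use stays small even
# when the input runs to a few gigabytes.
totals = defaultdict(float)

with open("sales.csv", newline="") as f:
    for row in csv.DictReader(f):
        totals[row["region"]] += float(row["revenue"])

for region, revenue in sorted(totals.items()):
    print(f"{region}\t{revenue:.2f}")
```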
If your data is truly big and comes measured in terabytes and petabytes, then Hadoop is definitely not slow for your requirements. For example, if you have to deal with six terabytes of data that keeps adding up every fortnight, and you need something that can run long jobs across hundreds or thousands of tables, then Hadoop does the job just perfectly.
Yes:
If the volume of data you are processing is relatively small, then Hadoop is too slow for you to benefit from it. Hadoop is easily scalable and HDFS is known for high throughput, but HDFS works with large blocks of data (typically 128 MB), which makes it ideal for large datasets. Similarly, MapReduce splits work into map and reduce tasks and takes control of distributing the processing across the cluster. All of these are benefits only when your processing is truly large scale. Coding and running a MapReduce job is complicated compared to the simple scripts that handle small data with ease. HDFS is fundamentally not designed for small data: it is optimized for large, sequential reads rather than the random access that small files tend to involve, and every small file still carries its own metadata overhead on the NameNode. On top of that, MapReduce adds job startup and scheduling overhead that only pays off when the data being processed is large.
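For contrast, here is a minimal sketch of the same per-region aggregation written as a Hadoop Streaming job in Python. The two scripts, the CSV layout, and the names mapper.py and reducer.py are assumptions for illustration; a real job would still have to be packaged and submitted to the cluster with the hadoop-streaming jar, which is exactly the extra machinery that small data does not justify.

```python
# mapper.py -- a minimal Hadoop Streaming mapper (assumed "region,revenue" input).
# Reads raw lines from stdin and emits "region<TAB>revenue" pairs.
import sys

for line in sys.stdin:
    fields = line.strip().split(",")
    if len(fields) >= 2:
        print(f"{fields[0]}\t{fields[1]}")
```

```python
# reducer.py -- a minimal Hadoop Streaming reducer.
# Hadoop delivers the mapper output sorted by key, so rows for the same
# region arrive together and can be summed with one running total.
import sys

current_region, total = None, 0.0
for line in sys.stdin:
    region, revenue = line.rstrip("\n").split("\t", 1)
    if region != current_region and current_region is not None:
        print(f"{current_region}\t{total:.2f}")
        total = 0.0
    current_region = region
    total += float(revenue)

if current_region is not None:
    print(f"{current_region}\t{total:.2f}")
```

Compared with the single script earlier, the same result now spans two programs plus a job submission, and the cluster incurs scheduling and startup overhead before any data is touched.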
So in short, Hadoop has its lines drawn and follows them with utmost dedication, without worrying about the time elapsed; that only works when the data processed in that time is huge enough for the overhead to become negligible. If the size of the data is relatively small, Hadoop is not the best fit.
Abhishek is a former Happiest Mind and this content was created and published during his tenure.