Big Data Analytics with Hadoop 3

Distributed computing using Apache Hadoop

We are surrounded by devices such as smart refrigerators, smart watches, phones, tablets, laptops, airport kiosks, and ATMs dispensing cash, with the help of which we are now able to do things that were unimaginable just a few years ago. We are so used to applications such as Instagram, Snapchat, Gmail, Facebook, Twitter, and Pinterest that it is next to impossible to go a day without access to them. Today, cloud computing has introduced us to the following concepts:

  • Infrastructure as a Service (IaaS)
  • Platform as a Service (PaaS)
  • Software as a Service (SaaS)

Behind the scenes is the world of highly scalable distributed computing, which makes it possible to store and process Petabytes (PB) of data:

  • 1 EB = 1024 PB (50 million Blu-ray movies)
  • 1 PB = 1024 TB (50,000 Blu-ray movies)
  • 1 TB = 1024 GB (50 Blu-ray movies)

The average size of one Blu-ray disc for a movie is ~ 20 GB.
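As a quick sanity check of the Blu-ray figures in the preceding list, the following short Java snippet (a back-of-the-envelope sketch, assuming the ~20 GB average size quoted above and binary, 1024-based units) reproduces those counts:

import java.util.Locale;

// Back-of-the-envelope check of the Blu-ray figures above, assuming
// an average movie size of ~20 GB and binary (1024-based) units.
public class StorageScale {
    public static void main(String[] args) {
        final double MOVIE_GB = 20.0;              // assumed average Blu-ray movie size
        final double TB_IN_GB = 1024.0;            // 1 TB = 1024 GB
        final double PB_IN_GB = 1024.0 * TB_IN_GB; // 1 PB = 1024 TB
        final double EB_IN_GB = 1024.0 * PB_IN_GB; // 1 EB = 1024 PB

        System.out.printf(Locale.US, "1 TB ~ %,.0f movies%n", TB_IN_GB / MOVIE_GB); // ~51
        System.out.printf(Locale.US, "1 PB ~ %,.0f movies%n", PB_IN_GB / MOVIE_GB); // ~52,429
        System.out.printf(Locale.US, "1 EB ~ %,.0f movies%n", EB_IN_GB / MOVIE_GB); // ~53,687,091
    }
}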

Now, the paradigm of distributed computing is not genuinely new; it has been pursued in some shape or form for decades, primarily at research facilities as well as by a few commercial product companies. Massively parallel processing (MPP) is a paradigm that was in use decades ago in areas such as oceanography, earthquake monitoring, and space exploration. Several companies, such as Teradata, also implemented MPP platforms and provided commercial products and applications.

Eventually, tech companies such as Google and Amazon, among others, pushed the niche area of scalable distributed computing to a new stage of evolution, which also eventually led to the creation of Apache Spark at the University of California, Berkeley. Google published papers on MapReduce and the Google File System (GFS), which made the principles of distributed computing available to everyone. Of course, due credit needs to be given to Doug Cutting, who made this possible by implementing the concepts described in the Google white papers and introducing the world to Hadoop.

The Apache Hadoop framework is an open source software framework written in Java. The two main areas the framework provides are storage and processing. For storage, it uses the Hadoop Distributed File System (HDFS), which is based on the GFS paper published in October 2003. For processing, or computing, it depends on MapReduce, which is based on the Google paper on MapReduce published in December 2004. The MapReduce framework has evolved from V1 (based on JobTracker and TaskTracker) to V2 (based on YARN).
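To make the storage and processing split concrete, the following is a minimal word-count job written against the MapReduce V2 (YARN-era) Java API. It is a generic sketch of the classic example rather than code tied to any later chapter: the mapper emits a (word, 1) pair for every token it reads from the input stored in HDFS, and the reducer sums the counts for each word; the input and output paths are assumed to be passed on the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Classic word-count job: the mapper emits (word, 1) pairs, the reducer sums them.
public class WordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);   // emit (word, 1) for every token
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();            // add up all counts for this word
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // combiner pre-aggregates on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input path
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Assuming the class is packaged into a JAR named, for example, wordcount.jar, it would typically be submitted to a cluster with something like hadoop jar wordcount.jar WordCount /input /output, where /input and /output are HDFS paths.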