Hadoop is a clustering and big data processing system. It's an open-source, Java-based technology originally developed at Yahoo that now belongs to the Apache Software Foundation.
It's a reality that all of us are more connected, so we produce more data, and we will keep on that path. Hadoop arises from the need to store and process large amounts of structured and unstructured data (around terabytes to petabytes) and from the need for reliability over that data. Companies like Facebook produce around 500 TB of data daily, and most of it is unstructured data that can't be stored or processed in traditional ways. So, how could a company like Facebook do such a thing while keeping reliability over that huge amount of data, at low cost, with high-speed availability and high fault tolerance? Well, Hadoop arises as an answer to these Big Data needs. But how?
To understand Hadoop we need to understand two fundamental things about it:
- How does Hadoop store data?
- How does Hadoop process data?
How does Hadoop store data?
Storing data is accomplished by HDFS, which stands for Hadoop Distributed File System, a file system that allows applications to run across multiple servers (nodes). Let's understand some concepts and how it works by reviewing the architecture:
There are some fundamental components of HDFS, which are:
- Node(s): Computer or server.
- Rack(s): A collection of nodes, around 30 to 40 physically stored together and connected to the same network switch.
- Hadoop cluster: A collection of racks.
- Blocks: Data broken into smaller pieces.
- Datanodes: Manage the storage attached to the nodes on which they run (one Datanode per node; these are the slaves)
- Namenode: A server that regulates access to files by clients and executes file system namespace operations (the master, acting like a coordinator)
So, how does this work?
A Hadoop cluster is composed of nodes and racks. Nodes are computers or servers across the network (you can have as many as you need), and racks are groups of nodes; all together they form a Hadoop cluster. When data gets into the Hadoop cluster, it is broken into pieces called blocks, and these blocks and copies of them are distributed across the cluster in a process called replication. HDFS has a master/slave architecture, and the Namenode is the master: the coordinator of the cluster that regulates access to data by clients. It also knows where the data is located and what's happening throughout the cluster, and it is the Namenode that makes all decisions regarding replication, such as the number of replicas per block. This information handled by the Namenode is the metadata of the files stored in HDFS, for example:
Metadata (Name, Replicas, …):
/home/foo/data, 3, …
*The data stored in /home/foo/data will be replicated 3 times, meaning onto three nodes across the network.
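To make the idea concrete, here is a toy sketch in Python of how a file is split into fixed-size blocks and how Namenode-style metadata records the replication factor. The names and block size are illustrative only; this is not the real HDFS API (which is Java), just a simulation of the concept.

```python
# Toy sketch of HDFS block splitting (illustrative, not the real API).
# 64 MB was the default block size in older Hadoop versions.
BLOCK_SIZE = 64 * 1024 * 1024

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks a file of file_size bytes becomes."""
    blocks = []
    remaining = file_size
    while remaining > 0:
        blocks.append(min(block_size, remaining))
        remaining -= block_size
    return blocks

# Namenode-style metadata: path -> (list of block sizes, replication factor)
metadata = {"/home/foo/data": (split_into_blocks(300 * 1024 * 1024), 3)}

blocks, replicas = metadata["/home/foo/data"]
print(len(blocks), replicas)  # a 300 MB file becomes 5 blocks, each replicated 3x
```

Note how the last block is smaller than the rest: HDFS blocks only occupy the space they actually need.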
So, the Namenode is the boss of the cluster, but what exactly is the function of the Datanode?
Datanodes are the ones responsible for serving read and write requests from the file system's clients, and they also perform block creation, deletion, and replication upon instruction from the Namenode. This makes the Datanodes like the employees who execute all the storage actions over the data on each node.
The relation between the Namenode and the Datanodes is a close one: the Datanodes periodically send heartbeats and block reports to the Namenode, kind of a boss/employee relationship. A heartbeat sent from a Datanode is a sign of proper functioning.
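The heartbeat mechanism can be sketched with a small simulation. Everything here (class names, the 30-second timeout) is hypothetical and simplified; real Hadoop waits considerably longer before declaring a Datanode dead, but the logic is the same: no heartbeat within the timeout means the node's blocks must be re-replicated elsewhere.

```python
# Toy heartbeat tracking (illustrative names, not the real Hadoop API).
HEARTBEAT_TIMEOUT = 30  # seconds (illustrative; real Hadoop waits far longer)

class Namenode:
    def __init__(self):
        self.last_heartbeat = {}  # datanode id -> time of last heartbeat

    def receive_heartbeat(self, datanode_id, now):
        self.last_heartbeat[datanode_id] = now

    def dead_datanodes(self, now):
        # Any Datanode silent for longer than the timeout is considered dead
        return [dn for dn, t in self.last_heartbeat.items()
                if now - t > HEARTBEAT_TIMEOUT]

nn = Namenode()
nn.receive_heartbeat("dn1", now=0)
nn.receive_heartbeat("dn2", now=0)
nn.receive_heartbeat("dn1", now=40)   # dn2 has gone silent
print(nn.dead_datanodes(now=45))      # ['dn2'] -> its blocks get re-replicated
```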
Replication (the core of how Hadoop stores data)
We talked about replication, and the image below shows a very understandable example of it:
As I mentioned before, how many times a file will be replicated is set in the file metadata handled by the Namenode. So in the image above, you can see a file, "File 1", which in order to be stored in the Hadoop cluster is broken into pieces called "blocks" (in this case 4 blocks), and those are then replicated across the nodes in the cluster. This is how data is stored in a Hadoop cluster, and replication is also what gives Hadoop its high fault tolerance.
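A simple round-robin placement sketch shows why this gives fault tolerance: each block's copies land on distinct nodes, so losing any single node loses no data. This is a hypothetical simplification; Hadoop's real placement policy is rack-aware (it spreads replicas across racks, not just nodes).

```python
# Toy replica placement: each block's copies go to distinct nodes,
# round-robin. Illustrative only; not Hadoop's real rack-aware policy.
def place_replicas(blocks, nodes, replication=3):
    placement = {}
    for i, block in enumerate(blocks):
        # Pick `replication` distinct nodes, starting at a rotating offset
        chosen = [nodes[(i + r) % len(nodes)] for r in range(replication)]
        placement[block] = chosen
    return placement

nodes = ["node1", "node2", "node3", "node4"]
plan = place_replicas(["B1", "B2", "B3", "B4"], nodes, replication=3)
print(plan["B1"])  # ['node1', 'node2', 'node3']
# Every block lives on 3 different nodes: any one node can fail safely.
```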
How does Hadoop process data?
Data processing in Hadoop is accomplished by a transformation process called MapReduce, where applications are divided into self-contained units of work. A MapReduce program is known as a 'job', and it has two stages, Map and Reduce, which are executed as 'tasks': the self-contained units of work.
When an application submits a job to a specific node (the master node), it is handled by a program called the JobTracker, which communicates with the Namenode to locate the data across the cluster. The job is broken into tasks (map and reduce tasks) for each node, and the JobTracker schedules these tasks on the nodes where the data is stored, instead of sending data across the network for processing.

The Map phase starts when the MapReduce job splits the input data into chunks that are processed in parallel by map tasks, each of which reads its chunk and emits key/value pairs ('tuples') as output. Once this is done, the Reduce phase starts: the reduce tasks take as input the output from the map tasks and combine those tuples into smaller sets of tuples, providing this way an answer to the user's or application's processing request.

The tasks are monitored by TaskTrackers, a set of programs/agents that run continuously on each node; in case of task failure, they report back to the JobTracker, which reschedules the task onto another node in the cluster.
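The classic example is counting words. Real Hadoop jobs are usually written in Java against the MapReduce API; the plain-Python sketch below only mirrors the data flow of the two phases, including the shuffle/sort step that groups the map output by key before reducing.

```python
# Word count as a minimal map/reduce data-flow sketch (not the Hadoop API).
from itertools import groupby
from operator import itemgetter

def map_task(chunk):
    """Map phase: emit a (key, value) tuple per word in the chunk."""
    return [(word, 1) for word in chunk.split()]

def reduce_task(key, values):
    """Reduce phase: combine all values for one key into a smaller result."""
    return (key, sum(values))

chunks = ["big data big", "data big cluster"]   # input split into chunks
mapped = [kv for chunk in chunks for kv in map_task(chunk)]

# Shuffle/sort: group the tuples by key so each reduce task sees one key
mapped.sort(key=itemgetter(0))
result = dict(reduce_task(k, [v for _, v in group])
              for k, group in groupby(mapped, key=itemgetter(0)))
print(result)  # {'big': 3, 'cluster': 1, 'data': 2}
```

In a real cluster each `map_task` call would run on the node holding that chunk, which is exactly the data-locality idea described above.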
Understanding how Hadoop works allows us to see the benefits of using it. Let’s review some of them:
- Fault Tolerance: Because of its replication capabilities
- Scalability: Simply by adding new nodes
- Low cost: An open-source framework that runs on commodity hardware; the only node (computer/server) worth investing in is the master node hosting the Namenode, because it is a single point of failure.
- High-speed processing: MapReduce tasks run on each node, which means processing is performed where the data lives, and only the processed results travel across the network.
- Flexibility: You can process structured and unstructured data
The list of products where Hadoop is implemented is large, but I'll mention only one: InfoSphere BigInsights, an IBM product, which is Hadoop-based software designed to understand and analyze massive volumes of unstructured data as easily as smaller volumes of data. This allows companies to derive business value from complex and unstructured information. But how? InfoSphere BigInsights provides the scenarios below:
- Predictive modeling: Refers to uncovering patterns to help make business decisions.
- Customer sentiment insight: Helps to derive customer sentiment from social media messages, product forums, online reviews, blogs, and other customer-driven content.
- Research and business development: Collect and analyze information that is critical to implementing future projects, and turning project visions into reality.
The scenarios above are the result of massive big data analysis, which produces valuable content that companies can keep making business from. What could be called unvalued data becomes, after processing, truly valuable data that keeps growing the business.
What you won’t accomplish by using Hadoop?
Hadoop it’s not a Database and it won’t replace a Database. Hadoop is a batch analytics processing system. Taking that into account, you have to remember that Hadoop stores data in files, not indexes like databases do, so everytime you want to process something you are running a MapReduce job which will go through all the data and this will take time compared to how Databases work with realtime applications.
I hope this brief blog about Hadoop has helped you understand the basics, but if you want to keep digging into it, I recommend visiting this blog: Introduction to Apache Hadoop. So, when speaking about Big Data (terabytes and petabytes of unstructured data, with fault tolerance and scalability needs), we need to think of Hadoop as a solution. We will see more of this, because we are not going away from the internet; we are getting more and more connected, and the more time we spend there, the more data we produce and the more tools like Hadoop will emerge. There are actually already a bunch of other technologies flying around Hadoop, and I'll talk about some of them in upcoming posts. Thanks for reading.