What is Apache Storm?
Storm is a free and open source distributed realtime computation system that makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing. It was created by Nathan Marz and his team at BackType and written predominantly in the Clojure programming language. The project was open sourced after BackType was acquired by Twitter in September 2011.
What is Storm used for?
One of the goals of analytics products is to help companies understand their historical and real-time impact on social media. There is a variety of options out there, and Storm is one of them. Storm exists because of real-time data coming from highly dynamic sources, so it works over streaming data that is expected to be continuous. One use case I can mention is actually from the company that open sourced the project: Twitter, a company whose users generate around 140 million tweets per day.
One way Twitter uses Storm is in the generation of trend information. Twitter extracts emerging trends from the firehose of tweets and maintains them at the local and national levels; when a story begins to emerge, Twitter’s trending-topics algorithm identifies the topic in real time. This real-time algorithm is implemented within Storm as a continuous analysis of Twitter data. But that’s not the only use case of Storm at Twitter. Storm powers a wide variety of Twitter systems, with applications ranging from discovery, realtime analytics, personalization and search to revenue optimization and many more. Storm integrates with the rest of Twitter’s infrastructure, including database systems (Cassandra, Memcached, etc.), the messaging infrastructure, Mesos, and the monitoring/alerting systems. Storm’s isolation scheduler makes it easy to use the same cluster both for production applications and for in-development applications, and it provides a sane way to do capacity planning.
- Simple API
- Storm has a simple, easy-to-use API.
- Fast and scalable
- Benchmarked at processing one million 100-byte messages per second per node. Storm topologies are inherently parallel and run across a cluster of machines; different parts of a topology can be scaled individually by tuning their parallelism, which lets Storm process very high throughputs of messages with very low latency.
- Fault tolerant
- When workers die, Storm will automatically restart them. If a node dies, the worker will be restarted on another node. The Storm daemons, Nimbus and the Supervisors, are designed to be stateless and fail-fast, so if they die, they restart as if nothing had happened.
- Reliable – Guarantees data processing
- Storm guarantees every tuple will be fully processed. One of Storm’s core mechanisms is the ability to track the lineage of a tuple as it makes its way through the topology in an extremely efficient way.
- Use with any language
- Storm was designed from the ground up to be usable with any programming language. At the core of Storm is a Thrift definition for defining and submitting topologies. Since Thrift can be used in any language, topologies can be defined and submitted from any language. Similarly, spouts and bolts can be defined in any language.
- Easy to deploy and operate
- Storm clusters are easy to deploy, requiring a minimum of setup and configuration to get up and running. Additionally, Storm is easy to operate once deployed. Storm has been designed to be extremely robust – the cluster will just keep on running, month after month.
A Storm cluster is similar to a Hadoop cluster, but instead of running “MapReduce” jobs you run “topologies”. One key difference between them is that a MapReduce job eventually finishes, whereas a topology processes messages forever, or until you kill it.
Storm manages two kinds of nodes:
- Nimbus node (master node similar to Hadoop job tracker)
- The central component of Apache Storm; its main job is to distribute code around the cluster, assign tasks to machines, and monitor for failures.
- Supervisor nodes
- Supervisors follow instructions given by Nimbus and delegate the tasks to worker processes. A supervisor governs its worker processes to complete the tasks. Each worker process executes a subset of a topology; a running topology consists of many worker processes spread across many machines.
A worker process executes tasks related to a specific topology, but it does not execute them itself; instead it spawns ‘executors’ and asks them to perform the tasks. A worker process can have as many executors as needed. An executor is a single thread spawned by a worker process, and it runs one or more tasks for a specific spout or bolt. Finally, the actual data processing is done by the ‘tasks’, each of which is an instance of a spout or a bolt.
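The worker → executor → task hierarchy can be sketched in plain Python. This is only a conceptual model, not Storm’s actual API: the names `Task`, `Executor` and `worker_process` are illustrative, and a real worker is a JVM process whose executors are threads managed by Storm itself.

```python
import threading

# Conceptual model of Storm's worker / executor / task hierarchy:
# a worker process spawns executor threads, and each executor runs
# one or more task instances of the same spout or bolt.

class Task:
    """One instance of a spout's or bolt's processing logic."""
    def __init__(self, task_id):
        self.task_id = task_id
        self.processed = []

    def process(self, tup):
        self.processed.append(tup)

class Executor(threading.Thread):
    """A single thread spawned by a worker process."""
    def __init__(self, tasks, tuples):
        super().__init__()
        self.tasks = tasks      # the tasks this executor is responsible for
        self.tuples = tuples    # the tuples routed to this executor

    def run(self):
        # Round-robin this executor's tuples across its tasks.
        for i, tup in enumerate(self.tuples):
            self.tasks[i % len(self.tasks)].process(tup)

def worker_process(num_executors, tasks_per_executor, tuples):
    """A worker executes its subset of the topology via executors."""
    executors = []
    for e in range(num_executors):
        tasks = [Task(f"task-{e}-{i}") for i in range(tasks_per_executor)]
        # Partition the incoming tuples among the executors.
        executors.append(Executor(tasks, tuples[e::num_executors]))
    for ex in executors:
        ex.start()
    for ex in executors:
        ex.join()
    return executors

executors = worker_process(num_executors=2, tasks_per_executor=2,
                           tuples=["a", "b", "c", "d"])
total = sum(len(task.processed) for ex in executors for task in ex.tasks)
print(total)  # -> 4: every tuple was processed by exactly one task
```

The point of the sketch is only the containment: worker contains executors (threads), executors run tasks, and tasks do the actual processing.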
ZooKeeper cluster – All coordination between Nimbus and the Supervisors happens here. The Nimbus and Supervisor daemons are fail-fast and stateless; all state is kept in ZooKeeper or on local disk.
Some important concepts:
- Tuple – An ordered list of elements.
- Stream – An unbounded sequence of tuples.
- Spout – A source of streams in a computation.
- Bolt – A logical processing unit; spouts pass input streams to bolts, which process them and produce new output streams.
- Topology – The overall computation: a network of spouts and bolts that runs forever (think of it as roughly analogous to a MapReduce job in Hadoop).
So, summarizing, and taking into account the Twitter use case regarding trend information: a spout will read users’ tweets using the Twitter Streaming API and output them as a stream of tuples. Each tuple will have a Twitter username and a tweet as comma-separated values. The tuples are forwarded to a bolt, which splits each tweet into individual words, counts the words, and persists the information into a configured data source; the final result is obtained by querying that data source.
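That flow can be sketched in Python as a plain-function simulation. This is not Storm code — a real topology would implement Storm’s spout/bolt interfaces and read the actual Twitter Streaming API — and the usernames and tweets below are made up for illustration.

```python
from collections import Counter

def spout():
    """Simulated spout: emits (username, tweet) tuples.
    In a real topology this would read the Twitter Streaming API."""
    tweets = [
        ("alice", "storm is fast"),
        ("bob", "storm is reliable"),
    ]
    for tup in tweets:
        yield tup

def split_bolt(stream):
    """Bolt: splits each incoming tweet into individual words."""
    for _user, tweet in stream:
        for word in tweet.split():
            yield word

def count_bolt(words):
    """Bolt: counts the words. A real bolt would persist the counts
    to a configured data source (e.g. Cassandra) for later querying."""
    return Counter(words)

# Wire spout -> split bolt -> count bolt, like a tiny topology.
counts = count_bolt(split_bolt(spout()))
print(counts["storm"])  # -> 2
print(counts["is"])     # -> 2
```

The real difference from this sketch is that a Storm topology never terminates: the spout keeps emitting and the bolts keep updating their counts forever.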
Storm, Spark or Hadoop?
You might have read my older posts around BigData and IoT, more precisely about Hadoop and Spark. You might know by now that when we talk about BigData we quickly think of Hadoop, because it is considered the king of BigData. However, Hadoop just started the whole revolution around BigData, and a whole ecosystem was developed and launched after it. Spark and Storm are part of that ecosystem, and even though this post is all about Storm, I also want to mention Spark, because as part of the Hadoop ecosystem all three of them are analytics tools and BigData processing systems.
Knowing that all of them were designed for similar purposes, we might then ask: ‘How would I choose between them to perform my own BigData processing?’. Choosing among them is all about the specific task we want to accomplish, and what makes them different from each other is the way they work and their core design. So, taking that into account, even though they have overlapping capabilities, they will bring us different results depending on how we are going to use them and what kind of results we are looking for, because they each have distinctive features and different roles to play.
First of all, we have to know that from the beginning they were designed for different types of processing:
Storm is a complex event processor (CEP) and from the beginning has been called the Hadoop of real-time processing, because it was designed to do the opposite of Hadoop. Storm, as a complete stream processing engine that also supports micro-batching, is the best option when you want to do streaming: even though Spark can do streaming too, Storm is faster at it, because that is what Storm was designed for. How? Well, basically, Spark streams events in small batches collected over a short time window before it processes them, whereas Storm processes the events one at a time. Thus, Spark has a latency of a few seconds, whereas Storm processes an event with just millisecond latency.
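The latency difference can be illustrated with a toy model (the numbers below are illustrative only, not benchmarks): under micro-batching, an event waits for the current batch window to close before it is processed, while one-at-a-time processing handles each event as soon as it arrives.

```python
# Toy model: per-event latency under one-event-at-a-time processing
# (Storm style) vs. micro-batching (Spark Streaming style).
# All times are in milliseconds and purely illustrative.

def one_at_a_time_latency(events, process_ms=5):
    """Each event is processed immediately on arrival."""
    return {e: process_ms for e in events}

def micro_batch_latency(events, arrival_ms, window_ms=1000, process_ms=5):
    """Each event waits until its batch window closes, then the whole
    batch is processed together."""
    latencies = {}
    for e, t in zip(events, arrival_ms):
        window_close = ((t // window_ms) + 1) * window_ms
        latencies[e] = (window_close - t) + process_ms
    return latencies

events = ["e1", "e2", "e3"]
arrivals = [100, 400, 900]  # arrival times within a 1-second window

storm_like = one_at_a_time_latency(events)
spark_like = micro_batch_latency(events, arrivals)

print(storm_like)  # {'e1': 5, 'e2': 5, 'e3': 5}
print(spark_like)  # {'e1': 905, 'e2': 605, 'e3': 105}
```

In the model, every micro-batched event pays the remaining window time on top of the processing cost, which is why batch-window size dominates latency in that style, while the one-at-a-time style pays only the processing cost itself.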
Spark can do batch processing (like Hadoop) and real-time processing (streaming, like Storm), but its main strength comes from the fact that it can do more than process plain data: it offers MLlib (a machine learning library) and GraphX (graph processing). It is basically an all-in-one platform, and when implementing it you maintain only one platform that offers multiple types of processing.
Finally, Hadoop is designed for and focused on batch processing, and even though Spark can do the same at a faster speed, Spark does well when the data fits into memory, while Hadoop does well when the data does not fit into memory; besides, Hadoop is also a distributed storage system.
When not to use Storm?
Storm is not useful when a project requires an interactive mode for data exploration through API calls; in that case, Spark should be used. And of course Storm was not designed for batch processing: although it can do it, it is not as efficient at it as Spark or Hadoop can be. But it is one of the best, if not the best, at streaming, because unlike Spark it brings the concept of streaming one event at a time.
Storm has not been out there for long, but it has made a pretty good first impression, and because of that it is growing fast and being adopted day by day by a growing number of companies like Twitter, Yahoo, Spotify, etc. for stream processing purposes. But when you review the use cases, you will find that Storm is not exactly working alone. Alongside it there is a bunch of other technologies complementing what Storm does, because remember, stream processing is not the only need out there, and Storm can’t do everything by itself: for starters, it doesn’t have a distributed storage system. It is purely a computation system, and it has one purpose, which is being the best at streaming.
So, for me, that has been the major challenge while researching BigData analytics processing tools: defining how they are different, and how and when to implement one instead of another, or even together. So far I have been researching Storm, Spark and Hadoop, and all three of them are important and have something that makes them unique, to the degree that all three have large community support. So I can say, by now, that each has a specific role to play, and they might play a role together in an enterprise, each one focusing on specific issues. I can summarize by telling you:
- Hadoop has a batch processing model and a distributed storage system. It supports batch processing, processes one job at a time, and lately is being used as a data warehouse.
- Spark has a batch processing model and no distributed storage system. It supports batch processing and streaming (not in the strictest sense, because its streaming is in reality micro-batching) and offers a machine learning library (MLlib) and graph processing (GraphX), all in the same platform.
- Storm has a streaming processing model and no distributed storage system. It supports streaming (better than Spark) and micro-batch processing, and offers sub-second latency without data loss.
Finally: Storm, yet another BigData processing tool? Well, yes, but as you saw through this blog, it is not just another one; it is one of the best for streaming, if not the best so far. Storm is part of the Hadoop ecosystem, so it will be solving and offering what others in the same ecosystem can’t. There is no way of telling you which one is better, because they are designed to address specific issues, and they have different capabilities that, yes, might seem similar, but when you look at the deeper aspects of each one, and review exactly the needs of your project or company, you will find the option that suits you best, if not that all of them are suitable for you. But, hey, they are open source, so go ahead and mix them up as you wish.