What is Spark?
Spark is an open-source cluster computing framework originally developed at UC Berkeley (2009) and later donated to the Apache Software Foundation (2010). It is a big data analytics engine created after Hadoop that improves on it and adds more capabilities. Spark comes as an alternative to Hadoop MapReduce, making it easier to build and run fast, sophisticated big data applications.
What is Spark used for?
Spark has been adopted by enterprises like Yahoo, Baidu, and Tencent, and it has been massively deployed, processing multiple petabytes of data on clusters of around 8,000 nodes. It has become one of the largest open-source communities in big data, with over 750 contributors from 200+ organizations.
One of Yahoo's uses of Spark is personalizing news pages for web visitors. For news personalization, the company uses ML algorithms running on Spark to figure out what individual users are interested in, and also to categorize news stories as they arise to figure out what types of users would be interested in reading them. When doing personalization, you need to react quickly to what the user is doing and to events happening in the outside world, and you need to learn about users as they click around to figure out that they're interested in a topic. To do this, Yahoo wrote a Spark ML (machine learning) algorithm in 120 lines of Scala (previously, its ML algorithm for news personalization was written in 15,000 lines of C++). With just 30 minutes of training on a large, hundred-million-record dataset, the Scala ML algorithm was ready for business.
- Spark runs programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
- Ease of use
- Applications can be written in Java, Scala, Python and R.
- Combines SQL, streaming, and complex analytics (machine learning and graph algorithms), all provided by Spark and able to run in a single workflow.
- Runs everywhere
- Spark can run on Hadoop, on Mesos, in standalone mode, or in the cloud, and it can access diverse data sources like HDFS, Cassandra, HBase, and S3.
Spark’s RDD alternative to MapReduce
To understand Spark, we have to understand the RDD concept and its two types of operations (Transformations and Actions).
- RDDs (Resilient Distributed Datasets). RDDs represent the data coming into the system as objects on which computations can be performed. RDDs are resilient because they track their lineage: whenever there is a failure in the system, lost data can be recomputed from the prior operations recorded in that lineage.
- Transformations. Transformations are operations run on RDDs that produce new resilient RDDs. Examples of transformations are operations such as map, filter, join, and union, each of which creates another RDD.
- Actions. These are operations like reduce, count, and first that return an answer after a computation runs on an RDD.
Something interesting about Spark is that transformations are "lazy," meaning that when a transformation is applied to an RDD, it is not computed right away. Instead, the dataset and the operation to perform are "remembered"; the computation happens only when an action is called, and then the result is returned to the driver program. This design enables Spark to run more efficiently.
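The transformation/action split described above can be sketched in a few lines of plain Python (this is a toy illustration of the lazy-evaluation idea, not Spark's actual API):

```python
# Toy illustration (plain Python, not Spark itself) of lazy transformations.
# Transformations only record work in a "lineage"; the action triggers it.
class ToyRDD:
    def __init__(self, data, ops=None):
        self.data = data
        self.ops = ops or []          # recorded transformations (the "lineage")

    def map(self, f):                 # transformation: returns a new ToyRDD, computes nothing
        return ToyRDD(self.data, self.ops + [("map", f)])

    def filter(self, p):              # transformation: also just recorded
        return ToyRDD(self.data, self.ops + [("filter", p)])

    def count(self):                  # action: only now does the pipeline run
        items = self.data
        for kind, f in self.ops:
            if kind == "map":
                items = [f(x) for x in items]
            else:
                items = [x for x in items if f(x)]
        return len(items)

rdd = ToyRDD(range(10)).map(lambda x: x * 2).filter(lambda x: x > 10)
# Nothing has been computed yet; calling the action runs the whole chain.
print(rdd.count())  # 4  (the surviving values are 12, 14, 16, 18)
```

Because the lineage is just recorded operations over the source data, a lost result can always be recomputed by replaying it, which is exactly why RDDs are "resilient."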
Spark is also based on another key concept: a DAG (directed acyclic graph) execution engine, which eliminates the rigid multi-stage MapReduce execution model and offers significant performance improvements.
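The DAG idea can be illustrated with a small toy example (plain Python with hypothetical stage names, not Spark's scheduler): stages are nodes, dependencies are edges, and each stage runs only after the stages it depends on.

```python
# Toy sketch of DAG-based stage scheduling using Python's stdlib toposort.
# The stage names below are hypothetical, for illustration only.
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical job: two inputs are read, joined, then aggregated.
dag = {
    "read_a": set(),
    "read_b": set(),
    "join": {"read_a", "read_b"},      # join depends on both reads
    "aggregate": {"join"},             # aggregation depends on the join
}
order = list(TopologicalSorter(dag).static_order())
print(order)  # e.g. ['read_a', 'read_b', 'join', 'aggregate']
```

Seeing the whole graph at once is what lets a DAG engine pipeline independent stages and avoid writing intermediate results to disk between every step, unlike the fixed map-then-reduce sequence.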
Spark core libraries
Spark has its own ecosystem, just like Hadoop, and this ecosystem is what makes Spark a very attractive alternative to Hadoop. Its libraries are a major part of that appeal: they turn Spark into a unified engine that extends developer capabilities and whose components can be combined to create complex workflows.
- Spark SQL – A module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine.
- Spark Streaming – Provides the ability to process and analyze not only batch data, but also streams of new data in real time. Spark Streaming enables powerful interactive and analytical applications across both streaming and historical data, while inheriting Spark's ease of use and fault-tolerance characteristics.
- MLlib (machine learning) – A scalable machine learning library that delivers both high-quality algorithms and blazing speed.
- GraphX – A graph computation engine built on top of Spark that enables users to interactively build, transform, and reason about graph-structured data at scale.
Spark & Hadoop differences
Hadoop is an ecosystem, and Spark is part of that ecosystem along with Cassandra and others. The Spark benefits described above are the differences between Hadoop and Spark, and also the weak points of Hadoop MapReduce that Spark addresses. So we could say that Spark fits into the Hadoop open-source community, building on top of the Hadoop Distributed File System (HDFS). However, Spark provides an easier-to-use alternative to Hadoop MapReduce and offers performance up to 10 times faster than previous-generation systems like Hadoop MapReduce for certain applications.
Spark solves problems similar to those Hadoop MapReduce does, but with a fast in-memory approach and a clean, functional-style API. With its ability to integrate with Hadoop and its built-in tools for interactive query analysis (Shark), large-scale graph processing and analysis (Bagel), and real-time analysis (Spark Streaming), it can be used interactively to quickly process and query big datasets. So, while Hadoop can be used for batch processing jobs, Spark can be used for both batch processing and real-time applications, while also combining SQL, streaming, and complex analytics. To make programming faster, Spark provides clean, concise APIs in Scala, Java, and Python. You can also use Spark interactively from the Scala and Python shells to rapidly query big datasets. Spark is also the engine behind Shark, a fully Apache Hive-compatible data warehousing system that can run 100x faster than Hive. Spark does not come with its own distributed storage system; therefore, even if you run it standalone (which would not be an effective big data deployment), you will need a distributed storage system like HDFS, Cassandra, etc.
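The "clean functional style" mentioned above is the familiar map/reduce shape. A minimal word count in plain Python mirrors how such a job is expressed (this is a conceptual sketch, not the Spark API itself):

```python
# Word count in a functional map/reduce style, mirroring the shape of a
# Spark job. Plain Python stdlib only; the sample lines are made up.
from collections import Counter
from functools import reduce

lines = ["spark is fast", "spark is easy", "hadoop is batch"]

# "map" phase: split every line into words (flatMap in Spark terms)
words = [w for line in lines for w in line.split()]

# "reduce" phase: merge per-word counts into one result
counts = reduce(lambda acc, w: acc + Counter([w]), words, Counter())

print(counts["spark"])  # 2
print(counts["is"])     # 3
```

In Spark the same shape runs distributed across a cluster, with each phase operating on partitions of an RDD instead of a local list.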
Spark or Hadoop?
When we talk about a framework for big data analytics processing, we might think of Hadoop, and rightly so, because Hadoop was one of the first big data processing technologies. Hadoop represented a new paradigm in big data, and we might say it started a revolution in big data technologies. Of course, even though Hadoop was a huge contribution and solved real big data problems, it had its own limitations. One of them was that, precisely because it was such a breakthrough, people came to see MapReduce as "the" way to process big data, which was a mistake. Apache Spark began as an alternative to MapReduce and an improvement on Hadoop.
So, even though Hadoop and Spark are both big data frameworks, they serve different purposes. Hadoop, as a distributed data infrastructure, distributes data across multiple nodes on commodity hardware and keeps track of it. Spark, on the other hand, is a data processing tool that operates on top of those distributed data collections; it does not provide distributed storage itself; remember that Spark needs a distributed storage system. So, far from being rivals, Spark and Hadoop are designed to work together. Spark was actually designed to work with Hadoop's distributed storage system (HDFS) and to improve on MapReduce technology; it improves ease of use, speed, and versatility, but it does not replace Hadoop, it enhances it.
Of course, you can use them separately. Hadoop, with its distributed storage system and MapReduce paradigm, does not need Spark to get processing done. And you can also use Spark without Hadoop, because even though Spark needs a distributed storage system, it does not have to be HDFS, although Spark was designed for it. Still, they tend to work better together.
When not to use Spark?
It's possible that you don't need Spark's speed; in that case, MapReduce processing can be fine for you if your data processing is more static and you can wait for batch-processing mode. Also, Spark trades network and disk I/O for RAM, and it uses a lot of it, so it needs high-end physical machines for effective results; you should factor that in too.
After going through all kinds of sources to understand Spark, my personal conclusion is that, at least for now, Spark is not and might never be an enemy of Hadoop. Spark is an alternative that came after Hadoop and similar technologies, an addition to what technologies like Hadoop already offer. It is an on-top technology: versatile, complete, and ready to be implemented alongside other technologies out there.
Getting back to Yahoo's Spark implementation: they adopted Spark because they needed machine-learning algorithms for news personalization, and that is one of Spark's offerings. But it does not mean they stopped using Hadoop in their organization; in fact, Hadoop plays a central role at Yahoo alongside Spark. At the end of the day, it all comes down to the variables on which the decision to use Spark with Hadoop, another distributed storage system, or a cluster manager relies.
So, if you need streaming data, like from sensors (any IoT device), or applications that require multiple operations like machine-learning algorithms, then you will go for Spark, and it will be up to you to decide how to combine Spark with other technologies; a proof of concept would be a good way to settle this.
Finally, think of Spark as an extra for your current big data processing system, one that partners with technologies like Hadoop and fills the capability gaps that Hadoop and others may have for certain big data needs. In this context, partnering Spark with something like Hadoop makes it an "all-in-one big data" addition. That being said, it's not about switching technologies; it's about using what best fits your needs, and that might mean using them together.