
Cassandra, a brief exploration into it


What is Cassandra?

Cassandra is an open source Big Data technology; more specifically, it is a distributed NoSQL database. It started at Facebook and has since been developed mainly under the Apache Software Foundation. Its design draws on two papers: Amazon’s Dynamo paper, which describes a distributed database design, and Google’s Bigtable paper, which describes a system for storing huge amounts of data.

What is Cassandra used for?

Cassandra is used for real-time applications, meaning a continual input, processing and output of data. In other words, data must be processed within a very short time, in near “real time”. This differs from technologies like Hadoop (an analytics processing system), where processing is based on batches or jobs: data is collected, entered and processed, and the batch then produces a result at a scheduled time. An example use for Hadoop would be billing processes.

Cassandra is mainly used in Big Data projects that need massive scalability, high availability and fault tolerance. There are around 1500 companies already using Cassandra.
One of the big users of Cassandra is Netflix. With tens of millions of subscribers, they quickly reached the limits of their existing Oracle SQL database. They were using a single datacenter, which represented a single point of failure, and they were reaching the limits of traffic and capacity because people can now watch Netflix from phones, Wii devices, etc., especially over weekends when demand for resources is higher. So they decided to move everything to the cloud and replace their Oracle SQL database with Cassandra. As of March 2013, Netflix’s Cassandra deployment consisted of 50 clusters with over 750 nodes. Not bad.
To understand why companies like Netflix are using Cassandra, we need to understand how it works and what benefits it offers.

Cassandra Benefits

Because of its architecture and design, Cassandra is able to provide the following benefits:

  1. Fault tolerance
    • Data is automatically replicated to multiple nodes for fault tolerance. Replication across multiple datacenters is also supported, and failed nodes can be replaced with no downtime.
  2. Decentralized (No SPOF)
    • There is no single point of failure and there are no network bottlenecks. All nodes in the cluster are identical (peer-to-peer communication).
  3. High availability
    • If a node goes down the system continues to operate.
  4. Durability
    • Suitable for applications that can’t afford to lose data, even when an entire datacenter goes down (data can be spread across datacenters around the world).
  5. Scalability
    • Can handle huge amounts of data by just adding nodes (linear scalability, scale out). Virtual nodes make it easier to rebalance data when nodes are added.
  6. Tunable consistency
    • Reads can return the most recent write (strong consistency) when you need it. Through something called tunable consistency, you choose how much consistency you want for each operation.
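
As a rough illustration of tunable consistency: with N replicas of each row, a read that contacts R replicas is guaranteed to overlap a write acknowledged by W replicas whenever R + W > N. The Python sketch below checks that rule for a few combinations of consistency levels. The function names are made up for this post and are not part of any Cassandra API.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that form a majority (what QUORUM waits for)."""
    return replication_factor // 2 + 1


def is_strongly_consistent(read_replicas: int, write_replicas: int,
                           replication_factor: int) -> bool:
    """True when every read set must overlap every write set (R + W > N)."""
    return read_replicas + write_replicas > replication_factor


if __name__ == "__main__":
    n = 3  # assumed replication factor for the example
    q = quorum(n)
    for label, r, w in [("ONE/ONE", 1, 1),
                        ("QUORUM/QUORUM", q, q),
                        ("ONE/ALL", 1, n)]:
        print(label, is_strongly_consistent(r, w, n))
```

Lower levels such as ONE/ONE trade that guarantee for lower latency and better availability, which is the "eventual" end of the scale.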

Cassandra architecture (Basic and brief)

Cassandra is designed to handle big data workloads across multiple nodes with no single point of failure (distributed system). Its architecture is based on the understanding that system and hardware failures can and do occur. Cassandra addresses the problem of failures by employing a peer-to-peer distributed system across homogeneous nodes where data is distributed among all nodes in the cluster. Each node exchanges information across the cluster every second (gossiping). A sequentially written commit log on each node captures write activity to ensure data durability. Data is then indexed and written to an in-memory structure, called a memtable, which resembles a write-back cache. Once the memory structure is full, the data is written to disk in an SSTable data file. All writes are automatically partitioned and replicated throughout the cluster. Using a process called compaction, Cassandra periodically consolidates SSTables, discarding obsolete data and tombstones (an indicator that data was deleted).
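
To make the write path above a bit more concrete, here is a toy Python model of a single node: every write goes to a commit log and a memtable, a full memtable is flushed to an immutable SSTable, and compaction merges SSTables while dropping tombstones. This is only a sketch under simplifying assumptions (plain lists and dicts instead of files, a tiny memtable limit, tombstones dropped immediately); the class and method names are invented and this is not how Cassandra's storage engine is actually implemented.

```python
class TinyNode:
    """Toy model of one node's write path; not Cassandra's real storage engine."""

    TOMBSTONE = object()  # marker meaning "this key was deleted"

    def __init__(self, memtable_limit=2):
        self.commit_log = []   # append-only record of every write
        self.memtable = {}     # in-memory, latest value per key
        self.sstables = []     # immutable "files", oldest first
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))  # log first, for durability
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def delete(self, key):
        # A delete is just a write of a tombstone.
        self.write(key, self.TOMBSTONE)

    def flush(self):
        # Write the memtable out as a sorted, immutable table and start fresh.
        if self.memtable:
            ordered = sorted(self.memtable.items(), key=lambda kv: kv[0])
            self.sstables.append(dict(ordered))
            self.memtable = {}

    def read(self, key):
        # Newest data wins: memtable first, then SSTables from newest to oldest.
        for table in [self.memtable, *reversed(self.sstables)]:
            if key in table:
                value = table[key]
                return None if value is self.TOMBSTONE else value
        return None

    def compact(self):
        # Merge all SSTables (oldest first, so newer values overwrite older ones),
        # then discard tombstones and the obsolete values they shadow.
        merged = {}
        for table in self.sstables:
            merged.update(table)
        live = {k: v for k, v in merged.items() if v is not self.TOMBSTONE}
        self.sstables = [live] if live else []
```

For example, writing `a` and `b`, then overwriting `a` and deleting `b`, leaves two SSTables; after `compact()` only one remains, holding the latest value of `a`.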

Cassandra is a row-oriented database. Cassandra’s architecture allows any authorized user to connect to any node in any data center and access data using the CQL language. For ease of use, CQL uses a similar syntax to SQL. From the CQL perspective the database consists of tables, which are grouped into keyspaces. Typically, a cluster has one keyspace per application.

Client read or write requests can be sent to any node in the cluster. When a client connects to a node with a request, that node serves as the coordinator for that particular client operation. The coordinator acts as a proxy between the client application and the nodes that own the data being requested. The coordinator determines which nodes in the ring should get the request based on how the cluster is configured.

In Cassandra, data distribution and replication go together. Data is organized by table and identified by a primary key, which determines which node the data is stored on. Replicas are copies of rows. When data is first written, it is also referred to as a replica.

Cassandra’s architecture is often pictured as a token ring. Consistent hashing distributes data across the cluster in a way that minimizes reorganization when nodes are added or removed. It partitions data based on the partition key: each node in the cluster is responsible for a range of hash values, and Cassandra places each row on the node whose range contains the hash of the row’s partition key.
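
The idea can be sketched in a few lines of Python. In this simplified ring each node gets exactly one token, and a key belongs to the first node whose token is greater than or equal to the key's hash, wrapping around at the end. MD5 and the node names are assumptions chosen for illustration (Cassandra's default partitioner is Murmur3-based), and replication is left out entirely.

```python
import bisect
import hashlib


def token_for(key: str) -> int:
    """Hash a key to a position on the ring (MD5 here purely for illustration)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class Ring:
    """Each node owns the range from the previous node's token up to its own."""

    def __init__(self, nodes):
        self._entries = sorted((token_for(n), n) for n in nodes)
        self._tokens = [t for t, _ in self._entries]

    def owner(self, partition_key: str) -> str:
        # First node token >= key token; wrap to the start of the ring if none.
        i = bisect.bisect_left(self._tokens, token_for(partition_key))
        return self._entries[i % len(self._entries)][1]


if __name__ == "__main__":
    keys = [f"user-{i}" for i in range(1000)]
    before = Ring(["node-a", "node-b", "node-c"])
    after = Ring(["node-a", "node-b", "node-c", "node-d"])
    moved = [k for k in keys if before.owner(k) != after.owner(k)]
    # Only keys that now belong to the new node change owner.
    print(len(moved), "of", len(keys), "keys moved, all to node-d:",
          all(after.owner(k) == "node-d" for k in moved))
```

The point of the small experiment at the bottom is the "minimal reorganization" property: adding `node-d` only takes keys away from one neighbour, and every other key stays where it was.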

Cassandra allows many tokens per node, and Cassandra 1.2 introduced a new paradigm called virtual nodes (vnodes). Vnodes allow each node to own a large number of small partition ranges distributed throughout the cluster. Vnodes also use consistent hashing to distribute data, but using them doesn’t require manual token generation and assignment.

[Figure: a ring without vnodes (top) and a ring with vnodes (bottom)]

The top portion of the graphic shows a cluster without vnodes. In this paradigm, each node is assigned a single token that represents a location in the ring. Each node stores data determined by mapping the partition key to a token value within a range from the previous node to its assigned value. Each node also contains copies of each row from other nodes in the cluster.

The bottom portion of the graphic shows a ring with vnodes. Within a cluster, virtual nodes are randomly selected and non-contiguous. The placement of a row is determined by the hash of the partition key within many smaller partition ranges belonging to each node.
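
A small Python sketch of the same idea: each physical node contributes several tokens to the ring, so it ends up owning many small, non-contiguous ranges. Deriving tokens by hashing `"node#i"` with MD5 is an assumption made so the example is repeatable; it is not how Cassandra generates vnode tokens, and all names here are hypothetical.

```python
import bisect
import hashlib
from collections import Counter


def token_for(key: str) -> int:
    """Hash a string to a ring position (MD5 purely for illustration)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def build_ring(nodes, tokens_per_node):
    """Return sorted (token, node) pairs; each node contributes several tokens."""
    return sorted(
        (token_for(f"{node}#{i}"), node)
        for node in nodes
        for i in range(tokens_per_node)
    )


def owner(ring, partition_key):
    """Find the physical node owning the range that contains the key's token."""
    tokens = [t for t, _ in ring]  # rebuilt per call to keep the sketch short
    i = bisect.bisect_left(tokens, token_for(partition_key))
    return ring[i % len(ring)][1]


def load(ring, keys):
    """Count how many keys land on each physical node."""
    return Counter(owner(ring, k) for k in keys)


if __name__ == "__main__":
    nodes = ["node-a", "node-b", "node-c"]
    keys = [f"user-{i}" for i in range(3000)]
    print("1 token per node:  ", dict(load(build_ring(nodes, 1), keys)))
    print("64 tokens per node:", dict(load(build_ring(nodes, 64), keys)))
```

Running it with 1 token per node versus 64 tokens per node gives a feel for how many small ranges tend to even out the share of keys each node holds.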

Cassandra comparison

Of course, even though Cassandra sounds great, it is not the only option out there among distributed NoSQL databases. From what I’ve seen so far, though, it is one of the most attractive options on the market; no wonder there are around 1500 projects implementing Cassandra. The options differ in architectural details, so a proof of concept is a good idea before choosing one. Some other options are:

  • Apache HBase: Open source, non-relational, distributed database modeled after Google’s BigTable and written in Java. It is developed as part of Apache Software Foundation’s Apache Hadoop project and runs on top of HDFS (Hadoop Distributed File System), providing BigTable-like capabilities for Hadoop.
  • MongoDB: Cross-platform document-oriented database system that eschews the traditional table-based relational database structure in favor of JSON-like documents with dynamic schemas making the integration of data in certain types of applications easier and faster.
  • Couchbase: Distributed NoSQL document-oriented database that is optimized for interactive applications.

Below is a brief comparison of operations per second by number of nodes:

[Figure: comparison of operations per second by number of nodes]

When not to use Cassandra?

After considering Cassandra’s benefits, we might wonder why relational (SQL) databases are still being used and why not everything is migrated to NoSQL databases like Cassandra. The thing is that NoSQL databases arose to meet needs that have arrived over time in the form of Big Data, social media and the Internet of Things, but they won’t fit every need a company has, including many where relational databases fit well. For example, Cassandra is not a suitable option where complex or nested transactions are needed, as in banking. The reason is that Cassandra does not use RDBMS-style ACID transactions with rollback or locking mechanisms; instead it offers atomic, isolated and durable transactions with eventual/tunable consistency, which lets the user decide how strong or eventual each transaction’s consistency should be.

Conclusion

There is a lot more to say about Cassandra, but this post tries to quickly explain what it is and how it’s used. From my understanding so far, Cassandra is a very suitable option for Big Data where real-time applications are run, where downtime is not an option and where capacity must grow quickly. Getting back to why companies like Netflix have chosen Cassandra over Oracle, the question is easier to answer now that we have considered how it works and what benefits it offers. Cassandra has given many companies like Netflix the ability to grow their business further than traditional options allow, without putting it at risk, by providing high availability and fault tolerance at a low cost.
From now on I can think of Cassandra in terms of the following points:

  • Huge amounts of data that no longer fit into traditional technologies. With a traditional DB, if you need to scale you need a bigger box (scale up). With Cassandra, if you need to scale you add more boxes (scale out), and those boxes can be commodity hardware (low cost).
  • The generation of unstructured data. With traditional options you are limited in the kind of data you can save, while with Cassandra you can even save data coming from IoT devices. This makes Cassandra an IoT technology that can actually be installed on an IoT device like a Raspberry Pi.
  • No more single-box storage. With traditional DBs you have a single point of failure, the box where the system is running; with Cassandra, as a distributed system, your data is spread across datacenters around the world with no SPOF.
  • Scalable on demand: You can add nodes to the cluster to meet a demand for resources and decommission them when they are no longer needed. Vnodes make this easier, because each node owns many small ranges and no manual token assignment is required when the cluster changes.

The bullet points above summarize what Cassandra can do for a company in words that are easy to understand. There is no doubt that technologies like Cassandra are the result of modern needs like Big Data and IoT, and that it’s just a matter of what other needs we will face in the future before we see more awesome technologies knocking on our door.

