Introduction to Apache Cassandra Tutorial

2.1 Introduction to Cassandra

Hello and welcome to lesson 2 of the Apache Cassandra course offered by Simplilearn. This lesson will provide an introduction to Cassandra.

2.2 Course Map

Simplilearn’s Apache Cassandra™ course is divided into eight lessons, as listed on the screen. ‘Introduction to Cassandra’ is the second lesson.

2.3 Objectives

By the end of this lesson, you will be able to describe Cassandra and its features, describe when Cassandra is used, demonstrate how to work with the Cassandra command line interface, list the advantages and limitations of Cassandra, and demonstrate how to install VMware Player.

2.4 Introducing Cassandra

Cassandra is a NoSQL database that is highly scalable and big data ready. You learned in lesson 1 that a NoSQL database is one that can work with denormalized data. Cassandra is a distributed database that is highly fault tolerant, with no single point of failure. Further, it is a high-performance database.

2.5 Behind the Name

Cassandra derives its name from Greek mythology. Cassandra was the daughter of King Priam of Troy and his wife, Hecuba. Cassandra had the power to predict the future, but she was also cursed so that nobody would believe her predictions. True to its name, the Cassandra database holds a lot of promise for the future but also suffers from many limitations.

2.6 History of Cassandra

Cassandra was initially developed at Facebook and released as open source in 2008. Its design combines ideas from the BigTable data store used by Google and the Dynamo data store used by Amazon. It was developed by Avinash Lakshman, one of the authors of Amazon Dynamo, and Prashant Malik to solve the inbox search problem of Facebook. It has evolved a lot since its inception. It started with the concept of column families and super column families but later evolved as a key-value store. You can still get messages from Cassandra about column families. Cassandra entered the Apache Incubator in 2009 and became a top-level Apache open-source project in 2010. Version 1.0 was released in 2011, and version 2.0, the current version at the time of this course, was released in 2013.

2.7 Main Features of Cassandra

The following are the main features of Cassandra. Cassandra is a key-value database: data is stored in tables and columns, and every table has a primary key. Cassandra has a limited SQL-like interface, the Cassandra Query Language (CQL). It also provides very fast reads and writes. A sample Cassandra query is provided on screen: SELECT ticker, value FROM stocks WHERE ticker = 'xyz' ORDER BY ticker. Observe that the syntax is similar to that of SQL in a relational database management system.

2.8 When is Cassandra Used

Cassandra is used to store a huge amount of information very quickly. For example, when you are processing telecom switch data or stock market data, a huge volume of data is generated every minute. Cassandra is also useful when you want full indexed search to get the data quickly and the data needs to be sorted in a pre-determined order. Full-indexed search is search performed using a key. Another case where Cassandra is useful is when you expect an upsurge in data size. Cassandra enables scaling by adding more nodes as the data grows. You can also use Cassandra when you want a highly fault-tolerant cluster with no single point of failure. It is also preferred when you need high performance for both data read and write.

2.9 Simple Cassandra Program

An example of a simple Cassandra program is provided on screen. In this example, you create a table, insert a few records, and fetch data from the table. You can see that the Cassandra syntax is very similar to the standard SQL syntax. The example creates the table called stocks with two columns, inserts three rows into this table, and selects data from the table for a particular key. Observe that Cassandra uses the primary key of the table to fetch the data. The data is also stored in the primary key order.
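The on-screen program is not reproduced in this transcript, so the following is only a sketch of what such a program might look like when driven from Python. It assumes the DataStax Python driver (the cassandra-driver package) and a node reachable at 127.0.0.1. The keyspace name demo, the column types, and the three sample rows are hypothetical; the lesson names only the table stocks and its key.

```python
# Sketch only: the keyspace "demo", the column types, and the sample rows
# are assumptions; the lesson names only the table "stocks".
STATEMENTS = [
    # A keyspace must exist before a table can be created in it.
    "CREATE KEYSPACE IF NOT EXISTS demo WITH replication = "
    "{'class': 'SimpleStrategy', 'replication_factor': 1}",
    # A table with two columns; ticker is the primary key.
    "CREATE TABLE IF NOT EXISTS demo.stocks (ticker text PRIMARY KEY, value int)",
    # Three rows, one per ticker.
    "INSERT INTO demo.stocks (ticker, value) VALUES ('ABC', 100)",
    "INSERT INTO demo.stocks (ticker, value) VALUES ('PQR', 250)",
    "INSERT INTO demo.stocks (ticker, value) VALUES ('XYZ', 75)",
]

# Fetch by primary key, the access path the lesson describes.
QUERY = "SELECT ticker, value FROM demo.stocks WHERE ticker = 'XYZ'"


def run_example(host="127.0.0.1"):
    """Run the statements against a Cassandra node and return the matching rows."""
    # Imported here so this module still loads when the driver is not installed.
    from cassandra.cluster import Cluster

    cluster = Cluster([host])
    session = cluster.connect()
    try:
        for statement in STATEMENTS:
            session.execute(statement)
        # Rows come back as named tuples by default, one attribute per column.
        return [(row.ticker, row.value) for row in session.execute(QUERY)]
    finally:
        cluster.shutdown()
```

Calling run_example() requires a running Cassandra node; the same statements can also be typed into cqlsh, each followed by a semicolon.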

2.10 Cassandra Command Line Interface

Cassandra provides a command line interface, cqlsh, that is similar to the Linux shell. It can be invoked with cqlsh if that is in your path, or with bin/cqlsh from the Cassandra base directory. You can use cqlsh -h to get help and the list of arguments for the command. Before starting cqlsh, set the CQLSH_HOST environment variable to the address of one of the hosts where Cassandra is running. Within cqlsh, you can access the history of commands with the up arrow. Most commands have elaborate help available, which you can access using the help command within cqlsh. You can exit the shell using the exit command. Note that you need to terminate each command with a semicolon. The image shows how to run cqlsh and also depicts the sample output from cqlsh.
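The invocation described above can also be scripted. The following is a minimal sketch in Python, assuming a cqlsh release that supports the -e (--execute) option for running a single statement non-interactively. The helper names build_cqlsh_command, build_cqlsh_env, and run_cqlsh are hypothetical and are not part of Cassandra.

```python
import os
import subprocess


def build_cqlsh_command(statement, cqlsh_path="cqlsh"):
    """Return the argument list that runs one CQL statement through cqlsh."""
    # CQL statements end with a semicolon; add one if the caller left it off.
    if not statement.rstrip().endswith(";"):
        statement = statement.rstrip() + ";"
    # Assumption: -e makes cqlsh execute the statement and exit.
    return [cqlsh_path, "-e", statement]


def build_cqlsh_env(host, base_env=None):
    """Return a copy of the environment with CQLSH_HOST set to the given node."""
    env = dict(os.environ if base_env is None else base_env)
    env["CQLSH_HOST"] = host
    return env


def run_cqlsh(statement, host="127.0.0.1", cqlsh_path="cqlsh"):
    """Run the statement and return the standard output of cqlsh as text."""
    result = subprocess.run(
        build_cqlsh_command(statement, cqlsh_path),
        env=build_cqlsh_env(host),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Use cqlsh_path="bin/cqlsh" when running from the Cassandra base directory instead of relying on the path.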

2.11 Advantages of Cassandra

Cassandra has many advantages for processing big data. It is highly fault tolerant with no single point of failure. This means that if any node in the cluster fails, other nodes will take over and complete the work. Every node in the cluster is identical, as there are no masters or slaves in Cassandra. Therefore, no single machine can become the bottleneck in the system. Further, you can add a machine to the cluster or remove a machine from the cluster at any time without downtime. Cassandra also provides very fast data writes, allowing real-time processing of big data. Cassandra outperforms many other NoSQL databases in many performance benchmarks.
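To make the idea of identical nodes with no single point of failure concrete, here is a toy model in Python. It is a simplification under stated assumptions, not Cassandra's actual implementation: the class ToyRing, the MD5-based token function, and the replication factor of 2 are all hypothetical choices made for illustration.

```python
import hashlib
from bisect import bisect_right


def token(key):
    """Map a string to a ring position (toy hash, not Cassandra's partitioner)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class ToyRing:
    def __init__(self, nodes, replication_factor=2):
        self.replication_factor = replication_factor
        # Every node gets its ring position the same way; no node is a master.
        self.ring = sorted((token(node), node) for node in nodes)
        self.down = set()

    def replicas(self, key):
        """Return the nodes holding the key: the first node clockwise, then its successors."""
        tokens = [t for t, _ in self.ring]
        start = bisect_right(tokens, token(key)) % len(self.ring)
        count = min(self.replication_factor, len(self.ring))
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(count)]

    def read_from(self, key):
        """Return the first live replica for the key, or None if every replica is down."""
        for node in self.replicas(key):
            if node not in self.down:
                return node
        return None
```

With three nodes and two replicas per key, marking one replica as down (ring.down.add(...)) still leaves read_from returning the other replica, which is the sense in which another node takes over.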

2.12 Limitations of Cassandra

Cassandra is not a general-purpose database due to some limitations. First, in the version covered in this course, it doesn’t provide aggregation of data with group by, sum, min, or max as relational databases do. Any aggregation has to be pre-computed and stored. Second, there are no joins of tables, so data has to be denormalized before it is stored in Cassandra. Third, it doesn’t support arbitrary search clauses or conditions. Only keys or indexes can be used for search. We will talk more about this restriction later in the course. Lastly, there is no sorting provided on non-key fields.
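The statement that any aggregation has to be pre-computed and stored can be illustrated with a small in-memory sketch in Python. The class StockStore and its two dictionaries are hypothetical stand-ins for a raw data table and a summary table; no Cassandra API is used here.

```python
class StockStore:
    """Toy in-memory stand-in for two tables: raw values and a pre-computed summary."""

    def __init__(self):
        self.ticks = {}    # (ticker, sequence number) -> value
        self.summary = {}  # ticker -> {"count", "sum", "min", "max"}

    def insert(self, ticker, value):
        # Write the raw row, keyed by ticker and arrival order.
        stats = self.summary.get(ticker)
        sequence = 0 if stats is None else stats["count"]
        self.ticks[(ticker, sequence)] = value
        # Update the aggregate at write time, because none is computed at read time.
        if stats is None:
            self.summary[ticker] = {"count": 1, "sum": value, "min": value, "max": value}
        else:
            stats["count"] += 1
            stats["sum"] += value
            stats["min"] = min(stats["min"], value)
            stats["max"] = max(stats["max"], value)

    def read_summary(self, ticker):
        # A single lookup by key, with no scan over the raw rows.
        return self.summary.get(ticker)
```

The design trades extra work on every write for a read that is a plain key lookup, which matches the key-based access pattern described in this lesson.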

2.13 VMware

In a later lesson, you will learn to install Cassandra on an Ubuntu Linux system. However, if you need to work on an operating system other than Linux, you can use the software provided by VMware. This software allows running one operating system on another using a virtual machine. This is facilitated by VMware Player. For non-commercial use, VMware Player can be downloaded and used free of cost from the VMware website.

2.14 Simplilearn Virtual Machine

Simplilearn has created a virtual machine for VMware Player. This machine, known as Hadoop Pseudo Server, comes with a preinstalled Ubuntu 12.04 LTS operating system and Hadoop setup. It can be opened with VMware Player and can be used for installing Cassandra. Hadoop Pseudo Server can be downloaded from the given link.

2.15 PuTTY

PuTTY is a popular, free tool for connecting to Linux systems from Windows through a remote terminal. It overcomes some of the limitations of the VM. For example, it allows moving the mouse pointer with ease, scrolling in the window, and copying and pasting text. PuTTY can be downloaded from the given link.

2.16 WinSCP

WinSCP is a popular tool for copying files between Windows and Linux. The name stands for Windows Secure Copy. It can be used to copy files from the local Windows system to the Ubuntu VM running in VMware Player. WinSCP can be downloaded from the given link.

2.19 Quiz

A few questions will be presented in the following screens. Select the correct option and click submit to see the feedback.

2.20 Summary

Let us summarize the topics covered in this lesson. Cassandra is a key-value NoSQL data store. Cassandra is a highly fault-tolerant database. Cassandra was developed at Facebook, released as open source in 2008, and became a top-level Apache project in 2010. Cassandra supports tables, columns, and simple SQL-like statements. Cassandra provides fast reads and writes, thus supporting real-time data processing. Cassandra does not support aggregates or joins. If you need to work on an operating system other than Linux, you can use the software provided by VMware. PuTTY is a popular, free tool for connecting to Linux systems from Windows through a remote terminal. WinSCP is a popular tool for copying files between Windows and Linux.

2.21 Conclusion

This concludes the lesson ‘Introduction to Cassandra.’ In the next lesson, we will introduce Cassandra Architecture.
