Apache Cassandra - Overview of Big Data and NoSQL Database Tutorial

Welcome to the first lesson, ‘Overview of Big Data and NoSQL Database’, of the Apache Cassandra Certification Course. This lesson provides an overview of big data and NoSQL databases.

Let us start with the objectives of this lesson.

Objectives

After completing this lesson on Overview of Big Data and NoSQL Database, you will be able to:

  • Describe the 3 Vs of big data.

  • Discuss some use cases for big data.

  • Explain Apache Hadoop and the concept of NoSQL.

  • Describe various types of NoSQL databases.

In the next section, we will discuss the 3 Vs of big data.


The 3 Vs of Big Data

Big Data has three main characteristics: volume, velocity, and variety.

These can be described as follows:

Volume

It denotes the massive scale of data, ranging from terabytes to zettabytes.

Velocity

It accounts for the streaming of data and the movement of large volumes of data at high speed.

Variety

It refers to managing the complexity of data in different structures, ranging from relational data to logs and raw text.

There are other Vs of big data as well; however, they are not as popular. These are veracity, visualization, and value.

Veracity

It refers to the truthfulness of data.

Visualization

It refers to the presentation of data in a graphical format.

Value

It refers to the value an organization derives from using big data.

Let us explore the first characteristic, volume, in the next section.

Volume

The term “volume” refers to data volume, which is the size of digital data. The impact of the Internet and social media has resulted in the explosion of digital data. Data has grown from gigabytes to terabytes, petabytes, exabytes, and zettabytes.

[Image: Internet data growth]

As illustrated in the image, in 2008, the total data on the Internet was eight exabytes. It exploded to 150 exabytes by 2011. It reached 670 exabytes in 2013. It is expected to exceed seven zettabytes in the next 10 years.

In the next section, let us see some terms used to represent data sizes.

Data Sizes: Terms

The following table lists the terms used to represent data sizes, along with the power of two and a description for each. It is recommended that you spend some time going through the table for better understanding.

Term      | Size (Power of 2) | Size Description
Kilobyte  | 2^10              | 1024 bytes
Megabyte  | 2^20              | 1024 KB
Gigabyte  | 2^30              | 1024 MB
Terabyte  | 2^40              | 1024 GB
Petabyte  | 2^50              | 1024 TB
Exabyte   | 2^60              | 1024 PB
Zettabyte | 2^70              | 1024 EB
Yottabyte | 2^80              | 1024 ZB

The newer terms added to describe big data sizes are exabyte, zettabyte, and yottabyte. Typically, the term big data refers to data sizes of terabytes or more.
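As a quick check of the powers of two in the table, here is a small Python sketch that prints each unit's size in bytes:

```python
# Each unit is 1024 (2^10) times the previous one, so the unit at
# "power of 2" p contains 2**p bytes.
units = ["Kilobyte", "Megabyte", "Gigabyte", "Terabyte",
         "Petabyte", "Exabyte", "Zettabyte", "Yottabyte"]

for i, unit in enumerate(units, start=1):
    power = 10 * i                          # 10, 20, 30, ... 80
    print(f"1 {unit} = 2^{power} = {2 ** power:,} bytes")
```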

Next, let us learn about velocity.

Velocity

The term “velocity” refers to the speed at which data is generated and grows. People use devices that create data constantly. Data is created from different sources, such as desktops, laptops, mobile phones, tablets, and sensors. As the global customer base grows, and with it the volume of transactions and customer interactions, the data created within an organization grows alongside external data.

[Image: Contributors to data growth]

As illustrated in the image, there are many contributors to this data growth, such as the web, online billing systems, ERP implementations, machine data, network elements, and social media. As an organization's revenue grows, its data typically grows with it.

Next, let us learn about variety.

Variety

The term “variety” refers to different types of data. Data includes text, images, audio, video, XML, and HTML.

There are three types of data, illustrated with a short example after this list:

  • Structured data - where the data is represented in a tabular format. For example, rows in MySQL databases.

  • Semi-structured data - where the data carries some structure, such as tags, but does not follow a rigid, formal data model. For example, XML files.

  • Unstructured data - where there is no predefined data model, and any structure is interpreted at read time. For example, plain text files.
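The following Python sketch shows the same hypothetical employee record in the three forms described above; the record and field values are made up for illustration.

```python
# Hypothetical employee record shown in three forms.

# Structured: fixed columns, like a row in a relational table.
structured_row = ("E101", "Asha Rao", "+91-9800000000")

# Semi-structured: tagged fields, but no rigid schema (XML here).
semi_structured = """
<employee id="E101">
  <name>Asha Rao</name>
  <phone>+91-9800000000</phone>
</employee>
"""

# Unstructured: free text; any structure must be inferred when reading.
unstructured = "Asha Rao (employee E101) can be reached at +91-9800000000."

print(structured_row)
print(semi_structured)
print(unstructured)
```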

Let us talk about Data Evolution in the next section.

Data Evolution

Digital data has evolved over 30 years, starting with unstructured data.

Initially, the data was created as plain text documents.

Next, files and spreadsheets were created, which increased the usage of digital computers.

The introduction of relational databases revolutionized structured data, as many organizations used them to create large amounts of structured data.

Next, data storage expanded to data warehouses and Storage Area Networks or SANs to handle large volumes of structured data.

Then, the concept of Metadata was introduced to describe structured and semi-structured data.

Let us discuss the features of Big Data in the next section.

Features of Big Data

Big data has the following features:

  • It is extremely fragmented due to the variety of data.

  • It does not provide decisions directly; however, it can be used to make decisions.

  • Big data does not include unstructured data only. It also includes structured data that extends and complements unstructured data.

  • Big data is not a substitute for structured data.

  • Since much of the information on the Internet is publicly available, it can be misused by antisocial elements.

  • Big data is generally wide; each record may contain hundreds of fields.

  • It is also dynamic, as gigabytes and terabytes of data are created every day.

  • Big data can be both internal, generated within the organization, and external, generated outside it, such as on social media.

Let us discuss the big data use cases in the next section.


Big Data: Use Cases

Every industry has some use for big data. Some of the big data use cases are as follows:

  • In the retail sector, big data is used extensively for affinity detection and market analysis.

  • Credit card companies can detect fraudulent purchases quickly so that they can alert customers.

  • While giving loans, banks examine the private and public data of a customer to minimize risk.

  • In medical diagnostics, doctors can diagnose a patient’s illness based on data about the symptoms rather than on intuition alone.

  • Digital marketers need to process huge amounts of customer data to find effective marketing channels. Insurance companies use big data to minimize insurance risks.

  • An individual’s driving data can be captured automatically and sent to the insurance companies to calculate the premium for risky drivers.

  • Manufacturing units and oil rigs have sensors that generate gigabits of data every day, which are analyzed to reduce the risk of equipment failures.

  • Advertisers use demographic data to identify the target audience.

  • Terabytes and petabytes of data are analyzed in the field of Genetics to design new models.

  • Power grids analyze large amounts of historical and weather forecasting data to forecast power consumption.

In the next section, let us talk about Big Data Analytics.

Big Data Analytics

With the advent of big data analytics, you can use complete datasets instead of sample data to conduct an analysis.

[Image: Big data analytics]

As represented in the image, in the traditional analytics method, analysts take a representative data sample to perform analysis and draw conclusions.

Using big data analytics, the entire dataset can be used.

Big data analytics helps you find associations in data, predict future outcomes, and perform prescriptive analysis.

The outcome of prescriptive analysis will be a definitive answer, not a probable answer.

Further, using big data for analysis also helps in making data-driven decisions instead of decisions based on intuition.

It also helps organizations increase their safety standards, reduce maintenance costs, and prevent failures.

Let us compare traditional technology and big data technology in the next section.

Traditional Technology vs. Big Data Technology

Traditional technology can be compared with big data technology in the following ways:

Traditional Technology | Big Data Technology
Has a limit on scalability | Highly scalable
Uses highly parallel processors on a single machine | Uses distributed processing across multiple machines
Processors may be distributed, but the data stays in one place | Data is distributed across multiple machines
Depends on expensive high-end hardware, costing more than $40,000 per terabyte | Leverages commodity hardware, which may cost less than $5,000 per terabyte
Uses centralized storage technologies, such as SAN | Uses distributed data with data redundancy
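Taking the per-terabyte hardware cost figures in the table at face value, a quick back-of-the-envelope comparison for a hypothetical 100 TB deployment looks like this:

```python
# Per-terabyte hardware cost figures quoted in the comparison above.
traditional_cost_per_tb = 40_000   # USD, high-end hardware
commodity_cost_per_tb = 5_000      # USD, commodity hardware

data_size_tb = 100                 # hypothetical deployment size

traditional_total = data_size_tb * traditional_cost_per_tb
commodity_total = data_size_tb * commodity_cost_per_tb

print(f"Traditional hardware: ${traditional_total:,}")  # $4,000,000
print(f"Commodity hardware:   ${commodity_total:,}")    # $500,000
print(f"Cost ratio:           {traditional_total // commodity_total}x")  # 8x
```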

Let us discuss Apache Hadoop in the next section.

Apache Hadoop

Apache Hadoop is the most popular framework for big data processing. Hadoop has two core components:

  • Hadoop Distributed File System or HDFS

  • MapReduce

Hadoop uses HDFS to distribute the data to multiple machines. It uses MapReduce to distribute the process to multiple machines. Further, Hadoop distributes the processing action to where the data is, instead of moving data towards the processing.

The concept of distribution performed by HDFS and MapReduce is illustrated in the following image.

First, HDFS divides the data into multiple sets, such as Data 1, Data 2, and Data 3.

Next, these datasets are distributed to multiple machines, such as CPU 1, CPU 2, and CPU 3, by HDFS, and MapReduce assigns a processing task to each of those machines.

In the end, the processing is done by the CPUs of each machine on the data assigned to that machine.

Let us talk about HDFS in the next section.

HDFS

HDFS is the storage component of Hadoop. The characteristics of HDFS are:

  • It stores each file as a sequence of blocks.

  • The default block size is 64 megabytes. This is much larger than the typical block size on Windows, which is 1 KB or 4 KB.

  • HDFS is a Write-Once-Read-Many or WORM file system.

  • Blocks are replicated across nodes in the cluster.

  • HDFS provides three default replication copies.

The following image illustrates this concept with an example.

Suppose you store a 320 MB file into HDFS. It will be divided into five blocks, each of size 64 megabytes, as 64 multiplied by five is 320. If there are five nodes in the cluster, each block is replicated to make three copies, to result in a total of 15 blocks. These blocks are further distributed to the five nodes so that no two replicas of the same block are on the same node.
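The block and replica counts in this example can be reproduced with a couple of lines of Python; the figures below are the ones used in the example above.

```python
import math

# Figures from the example above.
file_size_mb = 320
block_size_mb = 64          # HDFS default block size in this lesson
replication_factor = 3      # HDFS default number of replicas

num_blocks = math.ceil(file_size_mb / block_size_mb)
total_stored_blocks = num_blocks * replication_factor

print(f"Blocks for the file:        {num_blocks}")           # 5
print(f"Blocks stored cluster-wide: {total_stored_blocks}")  # 15
```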

Let us talk about MapReduce in the next section.

MapReduce

MapReduce is the processing framework of Hadoop. It provides highly fault-tolerant distributed processing of the data distributed by HDFS.

MapReduce consists of two types of tasks:

  1. Mappers - Mappers are tasks that run in parallel on different nodes of the cluster to process the data blocks. In programming terms, mappers read input records and emit intermediate key-value pairs.

  2. Reducers - After completion of the map tasks, results are gathered and aggregated by the reduce tasks of MapReduce. Reduce tasks consolidate and summarize the results.

Each mapper preferably runs on the machine that holds its assigned data block; this preference is called data locality and follows the principle of taking the processing to the data.
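As a conceptual illustration (plain Python, not actual Hadoop code), here is a tiny word-count job: the map phase emits (word, 1) key-value pairs from each data block, and the reduce phase sums the values for each key.

```python
from collections import defaultdict

def map_phase(block):
    """Mapper: emit a (key, value) pair for every word in a data block."""
    return [(word.lower(), 1) for word in block.split()]

def reduce_phase(pairs):
    """Reducer: consolidate and summarize the values for each key."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

# In Hadoop, each block would be processed by a mapper on the node
# that stores it; here the blocks are just two short strings.
blocks = ["big data needs big storage", "data drives decisions"]

intermediate = []
for block in blocks:
    intermediate.extend(map_phase(block))

print(reduce_phase(intermediate))
# {'big': 2, 'data': 2, 'needs': 1, 'storage': 1, 'drives': 1, 'decisions': 1}
```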

Let us discuss the NoSQL Databases in the next section.

NoSQL Databases

NoSQL is the common term used for all databases that do not follow the traditional Relational Database Management System or RDBMS principles.

In NoSQL databases:

  • the overhead of the ACID principles is reduced. ACID stands for Atomicity, Consistency, Isolation, and Durability. This is a set of properties that guarantee the reliable processing of database transactions. These principles are guaranteed by most RDBMS.

  • the process of normalization is not mandatory. With big data, it is difficult to follow strict RDBMS principles such as normalization, and databases are often denormalized.

  • relational databases, because of their transactional requirements, are not able to handle terabytes and petabytes of data. A NoSQL database is used to overcome these limitations of transactional databases, as the sketch after this list illustrates.
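Here is a small, purely illustrative Python sketch of what denormalization means in practice; the order and customer data are made up.

```python
# Hypothetical illustration of denormalization.

# Normalized (RDBMS style): customer details live in one table and are
# referenced by ID from the orders table, so reads require a join.
customers = {"C1": {"name": "Asha Rao", "city": "Pune"}}
orders_normalized = [
    {"order_id": "O1", "customer_id": "C1", "amount": 250},
    {"order_id": "O2", "customer_id": "C1", "amount": 120},
]

# Denormalized (typical NoSQL style): customer details are duplicated in
# every order so each record can be read on its own, without a join.
orders_denormalized = [
    {"order_id": "O1", "customer_name": "Asha Rao", "city": "Pune", "amount": 250},
    {"order_id": "O2", "customer_name": "Asha Rao", "city": "Pune", "amount": 120},
]

# Reading an order no longer requires looking up the customers table.
print(orders_denormalized[0]["city"])   # Pune
```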

Let us discuss Brewer’s CAP principle in the next section.

Brewer’s CAP Principle

Eric Brewer, a computer scientist, proposed the Consistency, Availability, and Partition Tolerance or CAP principle in 1999. CAP is the basis for many NoSQL databases.

In a distributed system:

  • Consistency means that all nodes in the cluster view the same data at the same time.

  • Availability means that a response is guaranteed for every request received; the response indicates whether the request succeeded or failed.

  • Partition tolerance means the system continues to operate despite arbitrary message loss or failure of part of the system.

Further, Brewer stated that it is not possible to guarantee all three aspects simultaneously in a distributed system. Therefore, most NoSQL databases compromise on one of the three aspects to provide better performance.

In the next section, let us focus on the types of NoSQL databases.


Approaches to NoSQL Databases: Types

There are four main types of NoSQL approaches:

  • Graph Database

  • Document Database

  • Key-Value Stores

  • Column Stores

Let us discuss each of these one-by-one.

First is the Graph Database.

A graph database represents data as nodes and edges. Some of its features are as follows:

  • Graph databases can handle millions of nodes and edges.

  • They can perform efficient depth-first and breadth-first searches on the graph data, as well as other graph search and traversal algorithms.

Neo4J and FlockDB are common graph databases. The following image depicts a graph with four nodes in a Neo4J database.
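To illustrate the kind of traversal a graph database optimizes, here is a breadth-first search over a tiny made-up graph stored as a plain Python adjacency list (this is not Neo4J code):

```python
from collections import deque

# A tiny graph stored as an adjacency list; node names are made up.
graph = {
    "Alice": ["Bob", "Carol"],
    "Bob": ["Dave"],
    "Carol": ["Dave"],
    "Dave": [],
}

def bfs(start):
    """Breadth-first traversal: visit neighbors level by level."""
    visited, queue = [], deque([start])
    while queue:
        node = queue.popleft()
        if node not in visited:
            visited.append(node)
            queue.extend(graph[node])
    return visited

print(bfs("Alice"))   # ['Alice', 'Bob', 'Carol', 'Dave']
```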

Second is the Document Database.

A document database helps in storing and processing a huge number of documents. In a document database, you can store millions of documents and query them by their fields. For example, you can store employee details and their resumes as documents, and search for a potential employee using fields such as the phone number.

MongoDB and CouchDB are popular document databases.

The following image depicts the fields of a document stored in MongoDB.
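The following sketch mimics that phone-number lookup with plain Python dictionaries standing in for documents; in MongoDB the documents would live in a collection and be queried with its query language, and the names and values here are invented.

```python
# Hypothetical documents, represented as Python dictionaries.
employees = [
    {"name": "Asha Rao", "phone": "+91-9800000000",
     "resume": "10 years of Java and Cassandra experience..."},
    {"name": "Ravi Kumar", "phone": "+91-9811111111",
     "resume": "Data engineer with a Hadoop background..."},
]

def find_by_phone(phone):
    """Look up documents by one of their fields."""
    return [doc for doc in employees if doc["phone"] == phone]

print(find_by_phone("+91-9800000000")[0]["name"])   # Asha Rao
```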

Third are the Key-Value Stores.

These store the data in key and value format, where each piece of data is identified by a key and has an associated value. Key-value stores can hold billions of records efficiently and provide fast writes as well as fast searches of data based on keys.

Cassandra and Redis are popular key-value stores.

The following image depicts the keys and values stored in a Cassandra database.

You will learn more about key-value stores when we discuss Cassandra in the upcoming lessons.
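As a conceptual illustration of the model (not Cassandra's or Redis's actual API), here is a minimal in-memory key-value store:

```python
# A conceptual in-memory key-value store: every record is addressed by
# a key, and reads and writes go straight to that key.
class KeyValueStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        """Write (or overwrite) the value stored under a key."""
        self._data[key] = value

    def get(self, key):
        """Fast lookup by key; returns None if the key is absent."""
        return self._data.get(key)

store = KeyValueStore()
store.put("user:1001", {"name": "Asha Rao", "city": "Pune"})
print(store.get("user:1001"))   # {'name': 'Asha Rao', 'city': 'Pune'}
```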

Last are the Column Stores.

These are also called column-oriented databases. Column Stores organize data in groups of columns and are efficient in data storage and retrieval based on keys.

Some of their features are:

  • HBase, a part of the Hadoop ecosystem, runs on top of HDFS to store and process terabytes and petabytes of data efficiently.

  • Column stores normally maintain a version along with each data value. HBase and Hypertable are the most common column stores.

The following image illustrates how HBase stores the data.

Data is organized by column families, and each column family can have one or more columns. For each column, along with the data value, a version number indicating the time of data update is also stored. The column-based data is stored along with the key.
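The nested structure below is a simplified Python sketch of that layout: row key, then column family, then column, with each value paired with a version timestamp. It is only illustrative, not HBase's actual storage format, and the row and column names are made up.

```python
import time

# Row key -> column family -> column -> (value, version timestamp).
row_key = "employee#1001"
now = int(time.time())

table = {
    row_key: {
        "personal": {                      # column family "personal"
            "name": ("Asha Rao", now),
            "city": ("Pune", now),
        },
        "work": {                          # column family "work"
            "title": ("Data Engineer", now),
        },
    }
}

value, version = table[row_key]["personal"]["name"]
print(value, version)   # Asha Rao <timestamp>
```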

Note that NoSQL databases cannot replace general purpose databases. Although they provide better performance and scalability, they compromise on some aspects like ease of use and full SQL query support.

Summary

Let us summarize what we have learned in this lesson.

  • Big data is mainly characterized by variety, velocity, and volume.

  • With the advent of big data analytics, complete datasets can be used to conduct an analysis instead of sample data.

  • Apache Hadoop is the most popular framework for big data processing. It has two core components: HDFS and MapReduce.

  • HDFS is the storage component of Hadoop and MapReduce is the processing framework of Hadoop.

  • NoSQL is the common term used for all the databases that do not follow traditional RDBMS principles. It is based on Brewer’s CAP principle.

  • Brewer stated that it is not possible to guarantee the Consistency, Availability, and Partition Tolerance aspects simultaneously in a distributed system.

Conclusion

This concludes the lesson on the Overview of Big Data and NoSQL. The next lesson will focus on Cassandra Architecture.
