Apache Impala - Data Storage and File Format Tutorial

Welcome to the third lesson of the Impala Training Course. This lesson provides an introduction to data storage and file format considerations in Impala. Let us discuss the objectives of this lesson.

Objectives

After completing this lesson, you will be able to:

  • Describe partitioning of Impala tables

  • Explain the benefits of partitioning

  • Describe how file format can affect performance in Impala

  • List the various file formats that are supported in Impala

Let us begin by understanding how tables are partitioned.

Partitioning Tables

By default, all the data files of an Impala table reside in a single HDFS directory. Partitioning physically divides the data into multiple HDFS sub-directories based on the values of one or more columns, known as the partitioning columns. It is a key concept in how Impala stores data.
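To make the directory layout concrete, here is a minimal Python sketch of how partition values can map to sub-directories. It assumes the Hive-style key=value naming that Impala uses for partition directories. The table directory and the year and month columns are hypothetical, chosen only for illustration.

```python
from posixpath import join


def partition_path(table_dir, partition_values):
    """Build the sub-directory for one partition: one key=value level per column."""
    levels = [f"{column}={value}" for column, value in partition_values]
    return join(table_dir, *levels)


# Hypothetical table directory, partitioned by the columns year and month.
TABLE_DIR = "/user/hive/warehouse/sales"

paths = [
    partition_path(TABLE_DIR, [("year", 2023), ("month", 1)]),
    partition_path(TABLE_DIR, [("year", 2023), ("month", 2)]),
]
for path in paths:
    print(path)
```

Each distinct combination of partitioning column values gets its own sub-directory, and the data files for those rows are stored inside it.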

Partitioning is appropriate for:

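  • tables that contain so much data that reading the entire data set is time-consuming.

  • tables that are regularly queried with conditions on the partitioning columns.

In both cases, a query that filters on the partitioning columns can skip the sub-directories that do not match, so Impala reads less data.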
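The benefit of partitioning can be illustrated with a small Python simulation. This is only a sketch of the idea, not Impala's actual query planner. The partitions, file names, and the files_to_scan helper are all hypothetical.

```python
# Hypothetical partitions keyed by (year, month), each holding some data files.
partitions = {
    (2022, 12): ["a.parq", "b.parq"],
    (2023, 1): ["c.parq"],
    (2023, 2): ["d.parq", "e.parq", "f.parq"],
}


def files_to_scan(partitions, year=None, month=None):
    """Return only the files in partitions that satisfy the filter."""
    selected = []
    for (p_year, p_month), files in partitions.items():
        if year is not None and p_year != year:
            continue  # whole partition skipped without reading its files
        if month is not None and p_month != month:
            continue
        selected.extend(files)
    return selected


# No condition on the partitioning columns: every file must be read.
print(len(files_to_scan(partitions)))
# Condition on both partitioning columns: only one partition is read.
print(files_to_scan(partitions, year=2023, month=1))
```

A query with no condition on the partitioning columns touches every file, while a query restricted to one year and month touches only the files of that partition.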

Let us next look at the SQL statements for partitioned tables.

SQL Statements for Partitioned Tables

In terms of Impala SQL syntax, partitioning affects three statements:

  • CREATE TABLE: Include the PARTITIONED BY clause to specify the names and data types of the partitioning columns. These columns are declared only in the PARTITIONED BY clause and are not repeated in the main column list of the table. Choose partitioning columns with reasonable cardinality, so that the table is divided into a manageable number of partitions. You define the partitioning when you create the table and can add individual partitions later with ALTER TABLE.

  • ALTER TABLE: Add or drop partitions to work with different portions of a huge data set. You can also designate the HDFS directory that holds the data files for a specific partition. When data is partitioned by date values, you can "age out" outdated or irrelevant data by dropping the older partitions.

  • INSERT: When inserting data into a partitioned table, use the PARTITION clause to specify the partitioning columns and, optionally, their values, which determine the partition that receives the data.
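The three statements above can be sketched as follows. This is a minimal Python example that only builds the SQL text. It does not connect to Impala. The sales and staging tables, their columns, the HDFS path, and the helper functions are hypothetical, and the helpers handle only integer partition values.

```python
# Partitioning columns appear only in PARTITIONED BY, not in the main column list.
CREATE_TABLE = (
    "CREATE TABLE sales (id BIGINT, amount DOUBLE) "
    "PARTITIONED BY (year INT, month INT) "
    "STORED AS PARQUET"
)


def partition_spec(values):
    """Render {'year': 2023, 'month': 1} as 'year=2023, month=1' (integers only)."""
    return ", ".join(f"{column}={value}" for column, value in values.items())


def add_partition(table, values, location=None):
    """ALTER TABLE ... ADD PARTITION, optionally pointing at an HDFS directory."""
    statement = f"ALTER TABLE {table} ADD PARTITION ({partition_spec(values)})"
    if location is not None:
        statement += f" LOCATION '{location}'"
    return statement


def drop_partition(table, values):
    """ALTER TABLE ... DROP PARTITION, for example to age out old data."""
    return f"ALTER TABLE {table} DROP PARTITION ({partition_spec(values)})"


def insert_into_partition(table, values, select_sql):
    """INSERT into one specific partition; the SELECT omits the partition columns."""
    return f"INSERT INTO {table} PARTITION ({partition_spec(values)}) {select_sql}"


print(CREATE_TABLE)
print(add_partition("sales", {"year": 2023, "month": 1}, "/data/sales/2023/01"))
print(drop_partition("sales", {"year": 2019, "month": 1}))
print(insert_into_partition("sales", {"year": 2023, "month": 1},
                            "SELECT id, amount FROM staging"))
```

The generated strings could then be run through impala-shell or any other client. How they are submitted is outside the scope of this sketch.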

Let us next discuss how file format affects performance in Impala.

File Format and Performance Considerations

Impala supports several file formats used in Apache Hadoop. The file format used in an Impala table has a significant impact on its performance.

For example, some file formats support compression, which affects the size of the data on disk and, in turn, the amount of I/O and CPU resources needed to deserialize it.

These I/O and CPU costs can limit query performance, since querying typically begins with moving and decompressing data.

To reduce this impact, data is often compressed. Compression means fewer bytes are transferred from disk to memory, which reduces the data transfer time, at the cost of the CPU work needed to decompress the content. File formats in Impala can also be structured, in which case they may include metadata and built-in compression.
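The size reduction that compression offers can be illustrated with the codecs available in the Python standard library. This is a rough sketch on hypothetical sample rows, not a measurement of Impala itself. Snappy and LZO are not part of the standard library, so only gzip, zlib (deflate), and bzip2 are compared here.

```python
import bz2
import gzip
import zlib

# Hypothetical sample: repetitive delimited text, similar to rows of a text table.
rows = [
    f"{i},store_{i % 10},2023-01-{(i % 28) + 1:02d},{i % 100}.00"
    for i in range(10_000)
]
raw = "\n".join(rows).encode("utf-8")

codecs = {
    "gzip": (gzip.compress, gzip.decompress),
    "zlib (deflate)": (zlib.compress, zlib.decompress),
    "bzip2": (bz2.compress, bz2.decompress),
}

sizes = {"uncompressed": len(raw)}
for name, (compress, decompress) in codecs.items():
    packed = compress(raw)
    # Decompression must restore the original bytes exactly.
    assert decompress(packed) == raw
    sizes[name] = len(packed)

for name, size in sizes.items():
    print(f"{name:15s} {size:8d} bytes ({size / len(raw):.0%} of original)")
```

Fewer bytes on disk means less data to move into memory, but every query then pays the CPU cost of decompression. The right balance depends on the data and the workload.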

Let us next discuss the various file formats that are supported in Impala.

Choosing File Type and Compression Technique

Impala supports the Parquet, Text, Avro, RCFile, and SequenceFile file formats. All of these except Text are structured formats, which can include metadata and built-in compression. Impala also supports compression codecs such as Snappy, gzip, deflate, bzip2, and LZO. The following table summarizes these points:
 

File Type      Format         Compression Codecs
Parquet        Structured     Snappy, gzip; currently Snappy by default
Text           Unstructured   LZO, gzip, bzip2, Snappy
Avro           Structured     Snappy, gzip, deflate, bzip2
RCFile         Structured     Snappy, gzip, deflate, bzip2
SequenceFile   Structured     Snappy, gzip, deflate, bzip2

 


Summary

Let us summarize the topics covered in this lesson:

  • Partitioning is a technique to physically divide the data in an HDFS directory into multiple HDFS sub-directories.

  • In terms of Impala SQL syntax, partitioning affects three statements: CREATE TABLE, ALTER TABLE, and INSERT.

  • The file format used in an Impala table has a significant impact on its performance.

  • File types and formats supported in Impala include Parquet, Text, Avro, RCFile, and SequenceFile.

Conclusion

This concludes the lesson on Data Storage and File Format. The next lesson will focus on working with Impala.
