Demystifying different compression codecs in big data

Ankur Ranjan
4 min read · Mar 22, 2022

When working with big data file formats like Parquet, ORC, and Avro, you will often come across different compression codecs such as Snappy, LZO, gzip, and bzip2. In this article, we will try to understand some of these compression codecs and discuss the fundamental differences between them.

Before anything else, let's understand why we compress big data files at all. File compression brings two major benefits:

  • It reduces the space needed to store files.
  • It speeds up data transfer across the network, or to or from disk. Hence, it reduces the I/O cost.

When dealing with large volumes of data, both of these savings can be significant. Compression and decompression do come with a cost in terms of the time taken to compress and decompress the data, but when we compare that cost with the I/O savings, the extra time is usually negligible.

When dealing with any big data or distributed system, it is very important to select the right compression codec. But the first question that comes to mind is: what does "codec" actually mean?

Codec is short for compressor/decompressor. It refers to the software, hardware, or combination of the two that applies compression and decompression algorithms to data. A codec has two components: an encoder to compress the data and a decoder to decompress it.
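To make the encoder/decoder idea concrete, here is a minimal sketch using Python's built-in gzip codec (the sample payload is made up):

```python
import gzip

# Hypothetical payload; in practice this would be your raw file contents.
original = b"event_id,user_id,action\n1,42,click\n2,43,view\n" * 1000

# Encoder: compress the raw bytes.
compressed = gzip.compress(original)

# Decoder: recover the original bytes.
restored = gzip.decompress(compressed)

assert restored == original
print(f"original={len(original)} bytes, compressed={len(compressed)} bytes")
```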

The type of codec that we can use depends on the data and file type we are trying to compress. It also depends on whether we need our compressed file to be splittable. Splittable files can be processed in parallel by different processors.

Which compression format should we use?

Which compression format we should use depends on our application, and the choice largely comes down to a trade-off between two factors: storage and speed.

Some compression codecs are more optimized for storage, while others are more optimized for speed. Do we want to maximize the speed of the application, or are we more concerned about keeping storage costs down? There is a trade-off between the two: if we want a higher compression ratio, we have to spend more time compressing, whereas if we care more about speed, we generally accept a lower compression ratio.

In general, one should try different strategies for the application and benchmark them with representative datasets to find the best approach (a small benchmarking sketch follows the list below). Two parameters are especially helpful when comparing compression codecs:

  • Compression ratio — how much smaller the compressed data is compared to the original, i.e. how effective the compression is from source to destination.
  • Throughput (compression speed and decompression speed) — how quickly the algorithm can compress and decompress the data. Throughput is usually measured in MB/s.
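To give a feel for how such a benchmark might look, here is a minimal sketch using the codecs that ship with Python's standard library (gzip, bzip2, and LZMA stand in for the Hadoop-specific codecs, and the sample payload is made up):

```python
import bz2
import gzip
import lzma
import time

# Hypothetical, repetitive payload; replace with a representative dataset.
data = b"2022-03-22T10:15:00 INFO user=1234 action=click page=/home\n" * 100_000

codecs = {
    "gzip": (gzip.compress, gzip.decompress),
    "bzip2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}

mb = len(data) / (1024 * 1024)
for name, (compress, decompress) in codecs.items():
    start = time.perf_counter()
    compressed = compress(data)
    compress_secs = time.perf_counter() - start

    start = time.perf_counter()
    decompress(compressed)
    decompress_secs = time.perf_counter() - start

    print(f"{name:6s} ratio={len(data) / len(compressed):5.1f}x "
          f"compress={mb / compress_secs:6.1f} MB/s "
          f"decompress={mb / decompress_secs:6.1f} MB/s")
```

The exact numbers will vary with the machine and, more importantly, with the data itself, which is why benchmarking on representative datasets matters.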

For large files, one should not use a compression format that does not support splitting of the whole file, since this loses data locality and makes MapReduce applications, or any distributed computing engine like Spark, very inefficient.

Different compression codecs for big data files

There are many compression codecs available for big data files. Our goal is not to cover every codec, but to understand the basic reasoning behind choosing among the available options. So let's discuss some of the codecs that are commonly used for compressing big data.

I have created an illustration for these codecs. Let's look at the picture below once, and then we will discuss each of the compression codecs individually.

Snappy

  • Snappy is a very fast compression codec; however, its compression ratio is not very high.
  • In most projects and distributed systems, it is the default choice of compression codec because it gives a good balance between compression and speed.
  • In other words, it is optimized for speed rather than storage.
  • Snappy is not inherently splittable, but it is mostly used with container-based file formats like Parquet, Avro, and ORC, which themselves take care of splittability (a short sketch follows this list).
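As an illustration, here is a minimal sketch using pyarrow (assuming it is installed; the file and column names are made up) that writes a Snappy-compressed Parquet file. Because Parquet compresses each page inside a row group independently, the resulting file can still be split across tasks:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical table; in a real pipeline this would come from your data source.
table = pa.table({
    "user_id": [1, 2, 3],
    "event": ["click", "view", "click"],
})

# Snappy is the default Parquet codec in pyarrow, but we set it explicitly here.
pq.write_table(table, "events.snappy.parquet", compression="snappy")

print(pq.read_table("events.snappy.parquet").to_pydict())
```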

LZO

  • LZO is also optimized for speed like Snappy, but unlike Snappy, it can be split in Hadoop (provided the compressed files are indexed with an LZO indexer).
  • Like Snappy, it is more optimized for speed than storage.

Gzip

  • Gzip is more optimized for storage.
  • In terms of processing speed, it is slow.
  • Gzip is also not inherently splittable, so one should use it with container-based file formats like Parquet or ORC (a Spark sketch follows this list).
  • It uses the DEFLATE algorithm for compression.
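For example, here is a minimal PySpark sketch (assuming a Spark environment is available; the path and column names are made up) showing the usual ways to select gzip when writing Parquet:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gzip-parquet-demo").getOrCreate()

df = spark.createDataFrame(
    [(1, "click"), (2, "view")],
    ["user_id", "event"],
)

# Option 1: set the session-wide Parquet codec.
spark.conf.set("spark.sql.parquet.compression.codec", "gzip")

# Option 2: set the codec for this particular write.
df.write.mode("overwrite").parquet("/tmp/events_gzip", compression="gzip")
```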

Bzip2

  • It is highly optimized for storage, but it is slow in terms of speed.
  • It is inherently splittable.

To conclude all of the above points, we can divide these compression codecs into two broad categories. If we are working with a big data system that holds cold data, i.e. data that is not accessed very often, and we want to save on storage cost, we can opt for a codec like gzip, which is better suited for storage than for speed. Whereas, if we are dealing with hot data, such as a real-time or batch big data pipeline where the data is accessed frequently, we should opt for a codec like Snappy, which gives a better balance between speed and storage.

I have also recorded a YouTube video covering the above content. One can follow this link.

Originally published at https://www.linkedin.com.


Ankur Ranjan

Data Engineer III @Walmart | Contributor of The Big Data Show