File Formats in Big Data World — Part 1

Ankur Ranjan
7 min read · Sep 12, 2022

One of the most fundamental decisions in the Data Engineering world is choosing the proper file format for each zone of a Big Data pipeline. The right choice helps the team fetch data faster and lowers the cost of the project. Given that I/O is expensive and that the storage layer is the entry point for any query execution, understanding the intricacies of your storage format is important for optimizing your workload, particularly in a Big Data pipeline.

Now the question is: why do we need different file formats? There are mainly three reasons, and they are as follows.

  • Storage Cost
  • Faster Processing
  • Input/Output Cost

With the increasing volume of data, managing cost becomes a crucial task for a Data Engineer. Over time, many file formats have evolved in the big data world, but in my view most of them have tried to follow the same basic design principles, which are as follows.

  • Faster read times
  • Faster write times (this is not that big a deal in the Big Data world)
  • Advanced compression support
  • Schema evolution support
  • Splittability support

Achieving all of the above in a single file format is almost impossible. Some formats are designed to make reads faster, while others are built for write-heavy jobs. Some support schema evolution very well, some support it only partially, and some don’t support it at all.

But the good thing about Data Engineering is that, most of the time, we don’t need all of the above at once. A data pipeline consists of different zones, such as the landing zone, raw zone, cleaned zone, and curated/consumption zone, and each of them demands different capabilities.

Let’s try to understand it by looking at a typical Data Pipeline example illustration.

Let’s try to understand the different zones first.

  • Raw zone — In this zone, we typically look for file formats that support schema evolution either partially or fully. Data lands in this storage area from the ingestion layer. This is a transient area where data is ingested from sources as-is. Typically, data engineering personas interact with the data stored in this zone.
  • Cleaned zone — After preliminary quality checks, the data from the raw zone is moved to the cleaned zone for permanent storage. Here, data is stored in its original format. Having all data from all sources permanently stored in the cleaned zone provides the ability to “replay” downstream data processing in case of errors or data loss in downstream storage zones. Typically, data engineering and data science personas interact with the data stored in this zone. Workloads here are more read-heavy.
  • Curated zone — This zone hosts data that is in the most consumption-ready state and conforms to organizational standards and data models. Datasets in the curated zone are typically partitioned, catalogued, and stored in formats that support performant and cost-effective access by the consumption layer. The processing layer creates datasets in the curated zone after cleaning, normalizing, standardizing, and enriching data from the raw zone. All personas across organizations use the data stored in this zone to drive business decisions.

So, by looking through the different zones, we can see that out of those five capabilities, only a few are needed to do our work in each zone. We should understand that some file formats are designed for very specific use cases, while others are general purpose.
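To make this concrete, here is a minimal PySpark sketch of how data might flow between zones. The bucket paths, the JSON source format and the order_date partition column are hypothetical choices for illustration, not a prescription.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("zone-example").getOrCreate()

# Raw zone: data lands as-is from the ingestion layer (JSON here, purely as an example).
raw_df = spark.read.json("s3://my-datalake/raw/orders/")

# Preliminary quality checks before promoting data to the cleaned zone (kept in the original format).
cleaned_df = raw_df.dropDuplicates(["order_id"]).filter("order_id IS NOT NULL")
cleaned_df.write.mode("overwrite").json("s3://my-datalake/cleaned/orders/")

# Curated zone: consumption-ready, partitioned, columnar data for analytics.
cleaned_df.write.mode("overwrite").partitionBy("order_date").parquet("s3://my-datalake/curated/orders/")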

In the Big Data world, files are generally divided into two categories: row-oriented and column-oriented formats.

Let’s try to understand the basic difference between these two with a simple illustration.

The above illustration shows a high-level picture of how data is stored at the disk level. One can see that the data of a column in the column-oriented format stays together, which makes it very easy to fetch only a few columns out of many when using this format.

Let me try to make it even simpler.

Row-oriented storage is suitable for situations where the entire row of data needs to be processed together, whereas a column-oriented format makes it possible to skip unneeded columns when reading data, which suits situations where only a few columns out of many are read. When querying columnar storage, you can skip over the non-relevant data very quickly. As a result, aggregation queries are less time-consuming compared to row-oriented databases.

Most column-oriented file formats are good for reading, whereas row-oriented formats support writing better. Compression is also a little better for column-oriented formats, which makes them more suitable for analytical and storage purposes.
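A quick way to feel the read-side benefit of a columnar format is to select only a couple of columns from a Parquet file. A minimal PyArrow sketch, where the file name and column names are hypothetical:

import pyarrow.parquet as pq

# Only the requested column chunks are read from disk; the other columns are skipped entirely.
table = pq.read_table("sales.parquet", columns=["product_id", "amount"])

# A typical analytical aggregation over just those two columns.
print(table.to_pandas().groupby("product_id")["amount"].sum())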

Now let’s go deeper and try to understand the basic features of Parquet, ORC and Avro. This will help us understand the concepts better, and then we will be able to visualize the differences between these file formats.

Apache Parquet

Let’s start with a very general-purpose column-oriented file format.

Now the first question that comes to mind is: what is Apache Parquet?

Let me make it simple for you with a short intro, and then we will deep dive into its internals.

Apache Parquet is an open-source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.
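To get a first feel for the format, here is a tiny sketch that writes and reads a Parquet file with pandas (which delegates to a Parquet engine such as PyArrow under the hood); the file name and columns are arbitrary examples:

import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "country": ["IN", "US", "IN"],
    "amount": [120.5, 99.0, 42.0],
})

# Write a columnar Parquet file.
df.to_parquet("users.parquet", index=False)

# Read it back, selecting only the columns we care about.
print(pd.read_parquet("users.parquet", columns=["user_id", "amount"]))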

Some of the characteristics of this file format are as follows.

  • Column-based format — files are organized by column rather than by row, which saves storage space and speeds up analytics queries.
  • Supports complex data types — it supports advanced nested data structures like Array, Struct, etc. We can store deeply nested values in a column and it will work like a charm.
  • Very good predicate pushdown support — if you apply a filter on column A to only return records with value V, predicate pushdown will make Parquet read only the blocks that may contain value V. We will talk about it in more detail later in the article, and there is a short sketch after this list.
  • Advanced compression support — by default, it comes with the Snappy compression codec, which gives a good balance between compression ratio and speed.
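Here is the sketch promised above: column pruning plus predicate pushdown using PyArrow’s read filters (the file name is hypothetical, and the column A and value "V" mirror the example in the bullet; Spark’s DataFrame reader applies the same idea automatically when you filter on a column):

import pyarrow.parquet as pq

# Only columns A and B are read, and row groups whose min/max statistics
# show they cannot contain the value "V" in column A are skipped.
table = pq.read_table(
    "events.parquet",
    columns=["A", "B"],
    filters=[("A", "=", "V")],
)
print(table.num_rows)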

Now let’s dive deeper into the specifics of the Parquet format: representation on disk, physical data organization (row groups, column chunks and pages) and encoding schemes.

Let me share a little secret with you 😃

Parquet is referred to as a columnar format in many books, but internally it is more like a hybrid format, i.e. a combination of row and columnar layouts 🧐🥸

No way, what are you saying, Ankur? Are you kidding?

I am not kidding. It is actually a hybrid file format. ORC also follows this layout.

Instead of wondering and panicking, let’s see what hybrid file formats mean.

In a hybrid columnar layout, columns still follow each other, but they are split into chunks. A hybrid approach enables both horizontal and vertical partitioning, and hence fast filtering and scanning of data. It keeps the advantages of homogeneous data and the ability to apply specific encoding and compression techniques efficiently.

Here A<Int>, B<Int> and C<Int> are three different columns.

This is actually what Parquet does:

  • It follows a columnar representation, but at the same time it splits the data into row groups (128 MB by default) and column chunks.
  • Each column chunk is further split into multiple pages (1 MB by default) containing, among other things, metadata (min, max, count) and encoded values (see the sketch just after this list).
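You can inspect this physical layout yourself through the file metadata exposed by PyArrow; a small sketch with a hypothetical file name:

import pyarrow.parquet as pq

meta = pq.ParquetFile("events.parquet").metadata
print("row groups:", meta.num_row_groups)

for i in range(meta.num_row_groups):
    rg = meta.row_group(i)
    # Each row group holds one column chunk per column.
    print(f"row group {i}: {rg.num_rows} rows, "
          f"{rg.total_byte_size} bytes, {rg.num_columns} column chunks")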

Let’s now look at one illustration of Apache Parquet data organization.

As I have said, it has both horizontal and vertical partitioning. The horizontal partition is called a row group; Parquet can have multiple row groups in a single file, with a default size of 128 MB. Inside a row group, the data is partitioned vertically, and the values of each column lie together in a columnar fashion. This is called a column chunk. Within the column chunk, the actual data is stored in data pages.

Data pages carry metadata such as min, max and count. This allows Apache Parquet to provide predicate pushdown support, which means it will scan only those pages whose min and max metadata fall within the filters applied while querying the data. One thing to note here is that metadata is also stored at the row-group level, in the file footer.

So when a filter is applied, the reader first looks at the header, which contains the magic number PAR1 identifying the file as a Parquet file, and then goes to the footer, where the file metadata lives. Using that metadata it identifies which row groups it has to look at. Within a row group, it uses the metadata of the column chunks and data pages to scan only the required pages.
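A tiny sketch to verify the PAR1 magic bytes at both ends of a file and to peek at the footer statistics that drive this pruning (the file name is hypothetical, and min/max statistics may be absent for some columns):

import pyarrow.parquet as pq

path = "events.parquet"

# A Parquet file starts and ends with the 4-byte magic number PAR1;
# the footer metadata sits just before the trailing PAR1.
with open(path, "rb") as f:
    header = f.read(4)
    f.seek(-4, 2)          # jump to the last 4 bytes
    trailer = f.read(4)
print(header, trailer)     # b'PAR1' b'PAR1'

# Row-group level statistics stored in the footer, used for pruning.
stats = pq.ParquetFile(path).metadata.row_group(0).column(0).statistics
if stats is not None and stats.has_min_max:
    print(stats.min, stats.max, stats.null_count)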

Isn’t this methodology brilliant🤩?

Let’s look into how Parquet stores data.

When writing a Parquet file, most implementations will use dictionary encoding to compress a column until the dictionary itself reaches a certain size threshold, usually around 1 megabyte. At this point, the column writer will “fall back” to PLAIN encoding where values are written end-to-end in “data pages” and then usually compressed with Snappy or Gzip. See the following rough diagram:
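PyArrow exposes knobs for this behaviour when writing; a hedged sketch below, where the exact parameter names (dictionary_pagesize_limit, data_page_size) are available in recent PyArrow versions and the 1 MB figures simply mirror the commonly cited defaults:

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"fruit": ["apple", "orange"] * 500_000})

pq.write_table(
    table,
    "fruit.parquet",
    use_dictionary=True,                 # try dictionary encoding first
    dictionary_pagesize_limit=1 << 20,   # fall back to PLAIN once the dictionary reaches ~1 MB
    data_page_size=1 << 20,              # target size of each data page
    compression="snappy",                # pages are then compressed with Snappy
)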

These concepts are a little hard to understand, but let me try to explain them in simpler terms.

Let’s look at one example of dictionary encoding in the parquet files.

Let’s suppose our column has the following data, which contains many duplicate values.

['apple', 'orange', 'apple', NULL, 'orange', 'orange']

Parquet will store this in dictionary-encoded form:

dictionary: ['apple', 'orange'] indices: [0, 1, 0, NULL, 1, 1]

One can see that this representation saves storage, allowing more data to fit in the same space.
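You can check which encodings were actually chosen from the column-chunk metadata; a minimal sketch mirroring the example above:

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"fruit": ["apple", "orange", "apple", None, "orange", "orange"]})
pq.write_table(table, "fruit_dict.parquet")

col_meta = pq.ParquetFile("fruit_dict.parquet").metadata.row_group(0).column(0)
# Typically prints something like ('PLAIN', 'RLE', 'RLE_DICTIONARY'),
# depending on the PyArrow version and its defaults.
print(col_meta.encodings)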

Parquet also uses bit packing, which means it will use the smallest number of bits needed to store the data. For example, let’s suppose our column has data like

[0, 2, 6, 7, 9, 0, 1]

So internally Parquet can store these values using just a few bits each instead of a full 32-bit integer, which saves storage space. It uses a run-length encoding / bit-packing hybrid to choose the right bit width.
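The core idea fits in a couple of lines: the number of bits needed per value depends only on the largest value in the run (a simplified sketch that ignores the details of Parquet’s actual RLE/bit-packing hybrid encoding):

# For [0, 2, 6, 7, 9, 0, 1] the maximum is 9, which fits in 4 bits,
# so each value needs only 4 bits instead of a full 32-bit integer.
values = [0, 2, 6, 7, 9, 0, 1]
bit_width = max(values).bit_length()
print(f"{bit_width} bits per value vs 32 bits for a plain int32")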

Apache Parquet uses Snappy compression by default when used with computing engines like Apache Spark. This ensures the right balance between compression ratio and speed.
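In Spark you can see or override the codec through a configuration or a write option; a short sketch where the output path is hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# "snappy" is the usual default; gzip or zstd trade write speed for a better ratio.
spark.conf.set("spark.sql.parquet.compression.codec", "snappy")

df = spark.range(1_000_000)
df.write.mode("overwrite").option("compression", "snappy").parquet("/tmp/numbers_parquet")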

I think we now have enough reasons why Apache Parquet is one of the best file formats in the analytical world, where we apply filters and aggregations and select only some columns out of many.

I hope you have enjoyed this article and learned something from this. Let’s connect in the next article and learn more about ORC and Avro file formats.

You can also connect with me on my YouTube channel, The Big Data Show, where I try to help aspiring Data Engineers through tutorials, podcasts, mock interviews, etc.

#dataengineering #parquet #bigdata #apachespark

Originally published at https://www.linkedin.com.


Ankur Ranjan

Data Engineer III @Walmart | Contributor of The Big Data Show