Solve Small File Problem using Apache Hudi

Ankur Ranjan
5 min read · Aug 25, 2023

One of the biggest pains for Data Engineers is the small file problem.

Let me tell you a short story and explain how one efficient tool solves this problem.

A few days ago, while upserting data into my Apache Hudi table, I was watching the pipeline results and noticed that my small files were being compacted and converted into larger files.

I was in awe for a few minutes, until I discovered the magic Apache Hudi was working behind the scenes.

One of the best features Apache Hudi provides is its ability to overcome the dreaded small file problem.

For those unfamiliar with Apache Hudi, here's a brief definition and overview of where it is used.

Apache Hudi is an open table format that can be used to efficiently build a Data Lakehouse. It supports building a Data Lakehouse in many ways, but in my opinion the three most impactful are the following.

  1. Efficient ingestion: support for mutability, with row-level updates and deletes.
  2. Efficient read/write performance: support for multiple index types to make your writes and reads faster, support for MOR (Merge on Read) tables, and an improved file layout and timeline.
  3. Concurrency control and ACID guarantees.

To give better context: Hudi is not just an open table format; it has many other features that amaze me every day.

But this post is not about what Apache Hudi is or where to use it. It is about one of its features that I have fallen in love with recently: its ability to deal with the small file problem.

One design decision in Hudi was to avoid creating small files by always writing properly sized files. There are two ways in which Hudi solves this issue.

๐—”๐˜‚๐˜๐—ผ ๐—ฆ๐—ถ๐˜‡๐—ฒ ๐—ฑ๐˜‚๐—ฟ๐—ถ๐—ป๐—ด ๐—ถ๐—ป๐—ด๐—ฒ๐˜€๐˜๐—ถ๐—ผ๐—ป: For COW

Automatically managing file sizes during ingestion adds a little write latency, but it ensures efficient read queries immediately after a write is committed. If file sizes are not managed at write time, queries will be slow until a resize/cleanup is completed.

There are two important parameters that Apache Hudi uses during this process.

  1. ๐ก๐จ๐จ๐๐ข๐ž.๐ฉ๐š๐ซ๐ช๐ฎ๐ž๐ญ.๐ฆ๐š๐ฑ.๐Ÿ๐ข๐ฅ๐ž.๐ฌ๐ข๐ณ๐ž:Target size in bytes for parquet files produced by Hudi write phases.
  2. ๐ก๐จ๐จ๐๐ข๐ž.๐ฉ๐š๐ซ๐ช๐ฎ๐ž๐ญ.๐ฌ๐ฆ๐š๐ฅ๐ฅ.๐Ÿ๐ข๐ฅ๐ž.๐ฅ๐ข๐ฆ๐ข๐ญ:During upsert operation, Hudi opportunistically expands existing small files on storage, instead of writing new files, to keep the number of files to an optimum.

The second config sets the size limit below which a storage file becomes a candidate to be treated as a small file. By default, any file of 100 MB or less is treated as a small file.

Let's try to understand this with an example of a COW (Copy on Write) Hudi table.

So let's suppose our configuration is set to the following.

๐ก๐จ๐จ๐๐ข๐ž.๐ฉ๐š๐ซ๐ช๐ฎ๐ž๐ญ.๐ฆ๐š๐ฑ.๐Ÿ๐ข๐ฅ๐ž.๐ฌ๐ข๐ณ๐ž: 120 MB

๐ก๐จ๐จ๐๐ข๐ž.๐ฉ๐š๐ซ๐ช๐ฎ๐ž๐ญ.๐ฌ๐ฆ๐š๐ฅ๐ฅ.๐Ÿ๐ข๐ฅ๐ž.๐ฅ๐ข๐ฆ๐ข๐ญ: 100 MB

Now let's see, step by step, how Apache Hudi solves the small file problem for a COW table.

Sizing Up Small Files with Hudi

  1. File Size Controls: Think of it like adjusting the size of a picture! Hudi lets you set the maximum size for base/Parquet files (hoodie.parquet.max.file.size). You can also decide when a file counts as "small" based on a soft limit (hoodie.parquet.small.file.limit).
  2. Smart Start: When you first write to a Hudi table, estimating the size of each record is like fitting puzzle pieces. You are aiming to pack your data records neatly into a Parquet file so they are arranged well and take up less storage space.
  3. Memory Lane: As you keep writing, Hudi remembers the average record size from previous commits. This helps it write and organize data better.
  4. Writing Magic: Imagine Hudi as a clever organizer. It adds more records to small files as you write, kind of like filling up a box, with the goal of reaching the maximum size you set. For example, if an existing file is 40 MB, small.file.limit is 100 MB and max.file.size is 120 MB, then during the next set of inserts Hudi will try to add roughly 80 MB of data so the file reaches 120 MB.
  5. The Perfect Fit: Let's say you set a compactionSmallFileSize of 100 MB and a limitFileSize of 120 MB. Hudi looks at files smaller than 100 MB and adds data until they are a comfy 120 MB.

Hudi helps your data find the right fit, just like sorting puzzle pieces to complete a picture!
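To make the mechanics of steps 4 and 5 concrete, here is a simplified sketch of the idea in plain Python. This is not Hudi's actual code; the function and its inputs are assumptions, meant only to show how incoming inserts could be routed to existing small files before any new file is created.

```python
# Illustrative sketch only: route incoming records to small files first,
# then spill the rest into new, properly sized files.

MAX_FILE_SIZE = 120 * 1024 * 1024      # hoodie.parquet.max.file.size (bytes)
SMALL_FILE_LIMIT = 100 * 1024 * 1024   # hoodie.parquet.small.file.limit (bytes)

def plan_inserts(existing_file_sizes, num_records, avg_record_size):
    """Return a list of (target, record_count) pairs for an incoming batch."""
    remaining = num_records
    plan = []

    # 1. Top up files below the small-file limit, up to the max file size.
    for file_id, size in enumerate(existing_file_sizes):
        if remaining == 0:
            break
        if size <= SMALL_FILE_LIMIT:
            headroom_bytes = MAX_FILE_SIZE - size
            fits = min(remaining, headroom_bytes // avg_record_size)
            if fits > 0:
                plan.append((f"file_{file_id}", fits))
                remaining -= fits

    # 2. Whatever is left goes into new files sized close to the maximum.
    per_new_file = MAX_FILE_SIZE // avg_record_size
    while remaining > 0:
        batch = min(remaining, per_new_file)
        plan.append(("new_file", batch))
        remaining -= batch
    return plan

# Example matching step 4: one existing 40 MB file, ~1 KB records.
print(plan_inserts([40 * 1024 * 1024], num_records=100_000, avg_record_size=1024))
# -> [('file_0', 81920), ('new_file', 18080)]  i.e. ~80 MB goes to the small file.
```

The real implementation, of course, uses the average record size observed in previous commits (step 3) rather than a fixed estimate.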

Let's understand with an illustration.

  • Suppose the first time you write to Hudi using Spark, it creates 5 files.
  • During the second insert, Hudi identifies File 1, File 2 and File 3 as small files because we have set hoodie.parquet.small.file.limit: 100 MB, so incoming inserts are assigned to them first. For example, Hudi will try to assign roughly 90 MB more to File 1, 100 MB more to File 2, and 40 MB more to File 3. It does this because we have set hoodie.parquet.max.file.size: 120 MB and want to fill each file to capacity. During the next run of inserts, it will again try to top up any remaining small file first, for example adding 60 MB of data to the last file. (A small worked example follows after this list.)
  • This file-sizing algorithm keeps running on every write to keep your files at an optimal size, ensuring that you do not end up with a small file problem in your cloud storage.
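Plugging some numbers into the illustration: the starting file sizes below are my own assumptions, chosen only because they are consistent with the 90 MB, 100 MB and 40 MB top-ups mentioned in the example.

```python
# Back-of-the-envelope arithmetic for the illustration above.
# Starting sizes are assumed; they simply make the 90/100/40 MB top-ups work out.
MAX_FILE_SIZE_MB = 120      # hoodie.parquet.max.file.size
SMALL_FILE_LIMIT_MB = 100   # hoodie.parquet.small.file.limit

files_mb = {"File 1": 30, "File 2": 20, "File 3": 80, "File 4": 110, "File 5": 115}

for name, size_mb in files_mb.items():
    if size_mb <= SMALL_FILE_LIMIT_MB:
        headroom = MAX_FILE_SIZE_MB - size_mb
        print(f"{name}: {size_mb} MB -> small file, can absorb ~{headroom} MB of inserts")
    else:
        print(f"{name}: {size_mb} MB -> above the small-file limit, left as is")
```

With these assumed sizes, only Files 1, 2 and 3 fall under the 100 MB limit, which is why the incoming inserts are routed to them first.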

NOTE: Small files will be auto-sized only if there is no PENDING compaction or associated log file for that particular file slice. For example, if you had a log file and a compaction C1 was scheduled to convert that log file to parquet, no more inserts can go into that log file.

This algorithm might seem a little involved :) but trust me, Apache Hudi is magical when it comes to handling the small file problem.

We have discussed a lot of concepts in this article, and it is getting long. I hope some readers who understand the file-sizing algorithm for the Apache Hudi MOR (Merge On Read) table will post their explanation in the comments. Feel free to subscribe to my YouTube channel, The Big Data Show; I might upload a more detailed discussion of these concepts in the coming days.

And thank you for the most precious gift you can give a writer: your time.

Originally published at https://www.linkedin.com.


Ankur Ranjan

Data Engineer III @Walmart | Contributor of The Big Data Show