Solving the Small File Problem with Apache Hudi
One of the biggest pains for Data Engineers is the small file problem.
Let me tell you a short story and explain how one efficient tool solves this problem.
A few days ago, while UPSERTING data into my Apache Hudi table, I was watching the pipeline results and noticed that my small files were being compacted into larger files.
I sat in awe for a few minutes, until I dug in and discovered the magical capabilities of Apache Hudi.
One of the best features Apache Hudi provides is its ability to overcome the dreaded small file problem.
For those unfamiliar with Apache Hudi, here's a brief definition and overview of its usage.
Apache Hudi is an open table format that can be used to build an efficient Data Lakehouse. It provides many capabilities for building a Data Lakehouse, but in my view the most impactful are the following three.
- Efficient ingestion: support for mutability, with row-level updates and deletes.
- Efficient read/write performance: support for multiple index types to make writes and reads faster, Merge-on-Read (MOR) tables, and an improved file layout and timeline.
- Concurrency control and ACID guarantees.
To give better context: I feel Hudi is not just an open table format; it has many other features that amaze me every day.
But this post is not about what Apache Hudi is and where to use it. It is about one of its features that I have fallen in love with recently, i.e. its ability to deal with the small file problem.
One design decision in Hudi was to avoid creating small files by always writing properly sized files. There are two ways in which Hudi solves this issue.
Auto-size during ingestion: for COW tables
Automatically managing file sizes during ingestion may cause slight latency, but ensures efficient read queries immediately after a write is committed. Failing to manage file sizes during writing will result in slow queries until a resize cleanup is completed.
There are two important parameters that Apache Hudi uses during this process.
- hoodie.parquet.max.file.size: Target size in bytes for Parquet files produced by Hudi write phases.
- hoodie.parquet.small.file.limit: During upsert operations, Hudi opportunistically expands existing small files on storage, instead of writing new files, to keep the number of files optimal.
This config sets the file size limit below which a storage file becomes a candidate to be selected as a small file. By default, any file <= 100 MB is treated as a small file.
Let's try to understand this with an example for a COW (Copy on Write) Hudi table.
So let's suppose our configuration is set to the following.
hoodie.parquet.max.file.size: 120 MB
hoodie.parquet.small.file.limit: 100 MB
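To make this concrete, here is a minimal PySpark sketch of how these two options might be passed when writing a COW table. It assumes a Spark session with the Hudi Spark bundle on the classpath; the table name, fields, and path are hypothetical placeholders.

```python
from pyspark.sql import SparkSession

# Assumes the Hudi Spark bundle is available, e.g. via
#   spark-submit --packages org.apache.hudi:hudi-spark3-bundle_2.12:<version>
spark = SparkSession.builder.appName("hudi-file-sizing-demo").getOrCreate()

# Hypothetical sample data; in practice this would be your incoming batch.
df = spark.createDataFrame(
    [(1, "order-1", "2024-01-01 10:00:00"), (2, "order-2", "2024-01-01 11:00:00")],
    ["order_id", "order_name", "updated_at"],
)

hudi_options = {
    "hoodie.table.name": "orders_cow",                       # hypothetical table name
    "hoodie.datasource.write.recordkey.field": "order_id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    "hoodie.datasource.write.operation": "upsert",
    # File sizing: target ~120 MB base files; treat anything <= 100 MB as "small".
    "hoodie.parquet.max.file.size": str(120 * 1024 * 1024),
    "hoodie.parquet.small.file.limit": str(100 * 1024 * 1024),
}

(df.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("/tmp/lakehouse/orders_cow"))                      # hypothetical path
```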
Now let's see, step by step, how Apache Hudi solves the small file problem for a COW table.
Sizing Up Small Files with Hudi
- File Size Controls: Think of it like adjusting the size of a picture! Hudi lets you set the maximum size for base/Parquet files (hoodie.parquet.max.file.size). You can also decide when a file counts as "small" based on a soft limit (hoodie.parquet.small.file.limit).
- Smart Start: When you first start a Hudi table, estimating the size of each record is like fitting puzzle pieces. You're aiming to fit your data records neatly into a Parquet file, so they're arranged well and take up less storage space.
- Memory Lane: As you keep writing, Hudi remembers the average record size from previous commits. This helps it write and organize data better.
- Writing Magic: Imagine Hudi as a clever organizer. It adds more records to small files as you write, kind of like filling up a box, aiming for the maximum size you set. For example, if a file is 40 MB, small.file.limit is 100 MB, and max.file.size is 120 MB, then during the next set of inserts Hudi will try to add roughly 80 MB of data to bring that file to 120 MB.
- The Perfect Fit: Let's say you set a compactionSmallFileSize of 100 MB and a limitFileSize of 120 MB. Hudi looks at files smaller than 100 MB and adds data until they reach a comfy 120 MB. A toy sketch of this packing logic follows below.
Hudi helps your data find the right fit, just like sorting puzzle pieces to complete a picture!
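For readers who like to see the logic spelled out, here is a tiny, self-contained Python sketch of the packing heuristic described above. This is not Hudi's actual implementation, only a toy model of the assignment rule: files at or below the small-file limit are topped up toward the max file size before any new files are created.

```python
# Toy model of the file-sizing heuristic -- NOT Hudi's real code, only an
# illustration of how incoming bytes could be packed into existing small files.
MAX_FILE_SIZE = 120 * 1024 * 1024      # hoodie.parquet.max.file.size
SMALL_FILE_LIMIT = 100 * 1024 * 1024   # hoodie.parquet.small.file.limit


def assign_inserts(existing_file_sizes, incoming_bytes):
    """Return (top-ups per small file, bytes left over for brand-new files)."""
    assignments = {}
    for name, size in existing_file_sizes.items():
        if incoming_bytes <= 0:
            break
        if size <= SMALL_FILE_LIMIT:                    # candidate small file
            top_up = min(MAX_FILE_SIZE - size, incoming_bytes)
            assignments[name] = top_up
            incoming_bytes -= top_up
    return assignments, incoming_bytes                  # remainder -> new files


# Example: three small files and 150 MB of incoming inserts.
files = {"file1": 30 * 1024**2, "file2": 20 * 1024**2, "file3": 80 * 1024**2}
print(assign_inserts(files, 150 * 1024**2))
# file1 is topped up by 90 MB, file2 by the remaining 60 MB, file3 untouched.
```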
Let's walk through an illustration.
- Let's suppose that the first time you wrote to Hudi using Spark, it created 5 files.
- During the second insert, Hudi identifies File 1, File 2, and File 3 as small files because we have set hoodie.parquet.small.file.limit: 100 MB, so incoming inserts are assigned to them first. For example, Hudi will try to assign roughly 90 MB more to File 1, 100 MB to File 2, and 40 MB to File 3. It does this because we have set hoodie.parquet.max.file.size: 120 MB and want to fill each file to capacity. During the next run of inserts, it will again try to add about 60 MB of data to the last file first, bringing it up to the 120 MB target.
- These file-sizing algorithms keep running on every write to maintain optimally sized files, ensuring you don't end up with a small file problem in your cloud storage. A quick sanity check is sketched below.
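If you want to verify this behaviour on your own table, one simple check (reusing the hypothetical path from the earlier snippet) is to snapshot-read the table and count records per base file using Hudi's standard `_hoodie_file_name` metadata column. Across commits, the number of distinct files should stay roughly flat while records per file grow.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("hudi-file-sizing-check").getOrCreate()

# Snapshot query of the (hypothetical) table written earlier.
snapshot = spark.read.format("hudi").load("/tmp/lakehouse/orders_cow")

# Count records per base file via the _hoodie_file_name metadata column.
(snapshot.groupBy("_hoodie_file_name")
    .agg(F.count("*").alias("records"))
    .orderBy(F.desc("records"))
    .show(truncate=False))
```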
NOTE: Small files will be auto-sized only if there is no PENDING compaction or associated log file for that particular file slice. For example, if you had a log file and a compaction C1 was scheduled to convert that log file to parquet, no more inserts can go into that log file.
This algorithm might seem a little involved :) but trust me, Apache Hudi is magical when it comes to handling the small file problem.
We have discussed a lot of concepts in this article, and it is getting long. I hope readers who understand the file-sizing algorithm for the Apache Hudi MOR (Merge On Read) table will post their explanation in the comments. Feel free to subscribe to my YouTube channel, The Big Data Show; I might upload a more detailed discussion of these concepts in the coming days.
More so, thank you for the most precious gift to me as a writer, i.e. your time.
Originally published at https://www.linkedin.com.