Accelerating Query Performance with Iceberg Puffin

By:
Gaurav Thalpat

In the last decade, cloud computing has changed how IT infrastructure and applications are deployed and managed. Availability and scalability are far less of a concern; developers can now focus on their code without worrying about the underlying infrastructure. However, cost and performance remain the top challenges for organizations dealing with big data and analytics use cases.

The data community has always strived to develop innovative solutions to optimize resource utilization, reduce costs, and improve overall performance. Leveraging cloud object storage and open-source technologies to support AI and analytics workloads is a step in this direction. At the forefront is Apache Iceberg, an open table format that brings data warehouse-like capabilities to data lakes. The Iceberg community has also adopted Puffin, a file format that stores statistics to improve query performance.

This blog introduces the Apache Iceberg and Puffin formats and explains how Puffin can help improve query performance. Let’s get started.

Apache Iceberg Overview

Let’s first start with the basics.

As a data practitioner, you have likely dealt with file formats like CSV, JSON, or XML in your projects. Data engineers will also be familiar with Apache Parquet, Apache ORC, and Apache Avro, file formats widely adopted for implementing data lakes and lakehouses. Data files created using these formats are stored in cloud object storage, such as Amazon S3.

Open table formats provide a metadata layer on top of these data files. This metadata layer helps organize and maintain the data files to support features like ACID transactions, time travel, and schema evolution. Apache Iceberg, Apache Hudi, and the Linux Foundation’s Delta Lake are the leading open table formats adopted across industries.

Apache Iceberg is one of the leading open table formats, with strong community support and broad vendor adoption. It helps implement a data lakehouse architecture that supports use cases ranging from BI to AI. Iceberg also offers unique features like hidden partitioning and partition layout evolution, and it supports the Parquet, ORC, and Avro file formats for storing data.
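To make hidden partitioning concrete, here is a minimal sketch in Spark SQL. It assumes a Spark session configured with an Iceberg catalog named demo and uses a hypothetical events table; adjust the names for your environment.

```python
# A minimal sketch of hidden partitioning, assuming a Spark session
# already configured with an Iceberg catalog named "demo".
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-demo").getOrCreate()

# Partition by day(event_ts) without maintaining a separate date column;
# Iceberg derives the partition value from event_ts automatically.
spark.sql("""
    CREATE TABLE demo.db.events (
        event_id BIGINT,
        user_id  BIGINT,
        event_ts TIMESTAMP
    )
    USING iceberg
    PARTITIONED BY (days(event_ts))
""")

# Readers filter on the raw column and never reference a partition column;
# Iceberg still prunes partitions behind the scenes.
spark.sql(
    "SELECT count(*) FROM demo.db.events "
    "WHERE event_ts >= TIMESTAMP '2024-01-01 00:00:00'"
).show()
```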

Here is a simplified view of the Apache Iceberg layout based on the table spec documentation.

As shown in the above diagram, Iceberg has two main parts, the metadata layer and the data layer, along with a catalog that maps each table to the location of its current metadata file.

The metadata layer consists of the metadata file, manifest lists, and manifest files, as described below:

  • The metadata file tracks the table schema, partition spec, and snapshots. A snapshot is a point-in-time state of the data stored in an Iceberg table.
  • The manifest list records the manifest files that make up a snapshot, along with summary information such as partition value ranges.
  • The manifest files point to the data files that hold the actual data and store column-level stats, like the minimum and maximum values per column.

The data layer consists of the actual data files, created using open file formats like Apache Parquet.

The manifest list and manifest files enable data skipping, a technique query engines use to minimize the volume of data scanned when retrieving data from tables. By comparing query predicates against the stored partition summaries and column statistics, the engine skips reading unneeded manifest and data files, reducing I/O and improving overall query performance.
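You can inspect this machinery yourself: Iceberg exposes metadata tables that surface the snapshots, manifests, and per-file column stats an engine consults for skipping. A short sketch, again assuming the hypothetical demo.db.events table and Spark session from above:

```python
# Inspect the metadata an engine uses for data skipping.

# Snapshots: point-in-time states of the table.
spark.sql("SELECT snapshot_id, committed_at, operation "
          "FROM demo.db.events.snapshots").show()

# Manifests tracked by the current snapshot's manifest list.
spark.sql("SELECT path, added_data_files_count "
          "FROM demo.db.events.manifests").show()

# Data files with the column-level stats (row counts, min/max bounds)
# that get compared against query predicates to skip whole files.
spark.sql(
    "SELECT file_path, record_count, lower_bounds, upper_bounds "
    "FROM demo.db.events.files"
).show(truncate=False)
```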

Beyond the partition and column-level statistics stored in manifest files, query engines can leverage additional stats to improve query performance. The query optimizer, a critical component within a query engine, uses these stats to decide on the most efficient way to execute a query.

Let’s discuss this in a bit more detail.

How Query Optimizers Use Statistics

SQL processing engines use optimizers to analyze different execution plans and determine the most efficient one. For example, when you join three tables, the optimizer decides which two tables to join first and then joins that intermediate result with the third table. It picks the optimal plan based on the available statistics.

An optimizer that compares plans by their estimated cost is known as a cost-based optimizer (CBO). While there are other approaches, such as the rule-based optimizer (RBO), most compute engines predominantly use a CBO because it tends to find cheaper execution plans. The CBO leverages table statistics to estimate query cost: for each candidate execution plan, it estimates the compute, memory, and I/O the plan would consume and the associated cost, then chooses the cheapest plan to execute the query.

One of the essential stats that optimizers leverage is the approximate number of distinct values (NDV) for each column; it is among the most valuable inputs the CBO uses for decision-making.
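To make this concrete, below is a toy sketch of the textbook estimates a CBO can derive from NDV. The formulas (which assume uniformly distributed values) and all the numbers are illustrative, not any particular engine’s implementation.

```python
# Toy CBO arithmetic built on NDV, assuming uniform value distributions.

orders_rows, orders_ndv_cust = 10_000_000, 500_000  # orders.customer_id
customers_rows, customers_ndv = 500_000, 500_000    # customers.id

# Equality-predicate selectivity is roughly 1 / NDV of the filtered column,
# e.g. WHERE customer_id = 42 on orders:
est_filter_rows = orders_rows / orders_ndv_cust
print(f"estimated rows for customer_id = 42: {est_filter_rows:.0f}")  # ~20

# Join cardinality estimate: |R| * |S| / max(NDV(R.key), NDV(S.key)).
est_join_rows = orders_rows * customers_rows / max(orders_ndv_cust, customers_ndv)
print(f"estimated join rows: {est_join_rows:,.0f}")  # ~10,000,000

# The CBO compares such estimates across candidate plans to pick a join
# order; stale or missing NDVs lead directly to bad plans.
```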

But how are NDV and the CBO related to Iceberg Puffin? Let’s try to understand this.

Understanding Iceberg Puffin

The Iceberg community adopted the Puffin format to improve the performance of queries that retrieve data from Iceberg tables. Puffin is a file format for storing statistics and indexes for Iceberg tables and columns, additional stats like NDV that cannot be stored directly in the manifest files. Query engines can leverage both the manifest files and the Puffin files to determine the optimal execution path.

The below diagram shows the different files created as part of Iceberg tables and how they are used for data skipping and performance optimization:

Files created as part of the Iceberg open table format, along with the Puffin file

As shown in the above diagram, Puffin stores stats that differ from the regular column-level stats, like the min and max values kept in manifest files. Puffin stores stats like the distinct value count per column, which optimizers can use to decide on the best execution plan. Because these stats are expensive to compute and bulky to store, the Puffin format was introduced to keep them in a separate file, enabling multiple engines to read that file and access the stats without recalculating them.

Puffin stores the statistics and index payloads (known as blobs) along with the metadata details required to interpret these blobs. For calculating stats like NDV, Puffin relies on an algorithm known as the Theta Sketch.
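For illustration, here is a simplified sketch of the footer metadata a Puffin file carries for one NDV blob, based on the Puffin spec; all the values are made up.

```python
# Simplified Puffin footer metadata for a single NDV blob (illustrative values).
blob_metadata = {
    "type": "apache-datasketches-theta-v1",  # standard blob type for NDV sketches
    "fields": [1],                 # Iceberg field IDs the sketch covers
    "snapshot-id": 63413,          # snapshot the stats were computed for
    "sequence-number": 3,
    "offset": 4,                   # where the blob's bytes start in the file
    "length": 6442,
    "properties": {"ndv": "4952"}  # the estimate optimizers actually read
}
```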

Theta Sketch is part of the Apache DataSketches library, which provides various algorithms (known as sketches) for operations like distinct counting and finding frequently occurring items. Theta Sketch estimates distinct counts: computing exact distinct counts with 100% accuracy is difficult for large volumes of data, so Theta Sketch instead produces approximate values by hashing and sampling the dataset. These approximate distinct counts (NDVs) are stored in Puffin and leveraged by optimizers to retrieve data efficiently from Iceberg tables.
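Apache DataSketches ships bindings for several languages. A quick sketch of estimating a distinct count in Python, assuming the datasketches package is installed (pip install datasketches):

```python
# Estimate a distinct count with a Theta sketch.
import random
from datasketches import update_theta_sketch

sk = update_theta_sketch()
for _ in range(1_000_000):
    sk.update(random.randint(0, 99_999))  # ~100,000 true distinct values

# The estimate is approximate but comes with probabilistic error bounds.
print(f"estimate: {sk.get_estimate():,.0f}")
print(f"2-std-dev bounds: [{sk.get_lower_bound(2):,.0f}, {sk.get_upper_bound(2):,.0f}]")
```

Theta sketches can also be merged, which lets engines maintain NDVs incrementally instead of rescanning an entire table.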

Iceberg Puffin Benefits

With the help of NDV stats, Puffin files can offer multiple benefits. The key ones are summarized below:

Query optimization

NDV is an important parameter used by optimizers during query execution. It helps optimize ‘group by’, ‘distinct’, and ‘count’ queries, and it can inform the join order to optimize the performance of join queries.

Resource optimization

With the help of stats, the optimizer can select the most efficient path for executing a query. During join queries, stats help the engine identify the smaller table and broadcast it to all nodes, optimizing the utilization of available resources and reducing I/O and data shuffling across nodes.
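As a sketch of the idea in Spark, the decision a stats-aware optimizer makes automatically can be written as an explicit hint; the demo.db.orders and demo.db.customers tables here are hypothetical.

```python
# What the optimizer does with good stats, expressed as an explicit hint:
# broadcast the small dimension table instead of shuffling the large one.
from pyspark.sql.functions import broadcast

orders = spark.table("demo.db.orders")        # large fact table
customers = spark.table("demo.db.customers")  # small dimension table

joined = orders.join(broadcast(customers), orders.customer_id == customers.id)

# With accurate row-count/NDV stats the CBO chooses a broadcast hash join
# on its own; inspect the chosen strategy in the physical plan:
joined.explain()
```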

Engine interoperability

Puffin provides interoperability across query engines for accessing the stats: a Puffin file created by one engine (like Trino) can be used by other compute engines for query optimization.
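For example, Trino’s Iceberg connector can persist NDV sketches to Puffin files when you run ANALYZE, and engines that later read the same table can reuse them. A minimal sketch using the trino Python client, with hypothetical connection details:

```python
# Collect stats with Trino; for Iceberg tables the connector stores NDV
# Theta sketches in a Puffin file (pip install trino).
import trino

conn = trino.dbapi.connect(
    host="localhost", port=8080, user="admin",  # hypothetical cluster
    catalog="iceberg", schema="db",
)
cur = conn.cursor()

cur.execute("ANALYZE events")         # computes and persists the stats
cur.fetchall()
cur.execute("SHOW STATS FOR events")  # distinct_values_count is now populated
for row in cur.fetchall():
    print(row)
```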

While these are the benefits of using Puffin with Iceberg today, other table formats might also adopt Puffin to manage their stats in the future, and we may see Puffin support additional sketches or indexes to further improve query performance.

Conclusion

In this article, we discussed open table formats, focusing on Apache Iceberg and how it helps accelerate query performance by storing stats like NDV in a dedicated file format known as Puffin. These stats help CBOs select the most efficient path for executing a query.

Amazon Athena, Amazon Redshift Spectrum, and Trino are some of the query engines that leverage Puffin stats when retrieving data from Iceberg tables. You can dive deeper into this topic by exploring Apache Iceberg and Puffin and how to leverage them for your own use cases.
