Apache Polaris: Data Catalog for Open Data Platforms

By:
Gaurav Thalpat

Open table formats (OTFs) like Apache Iceberg, Apache Hudi, and Delta Lake introduced warehouse-like features within a data lake. Data platforms built using these OTFs leverage object storage to store data and different compute engines to process the data. Along with storage and compute, we also need data catalogs to explore and discover data.

Data catalogs enable organizations to manage metadata and implement sound governance and security processes. Technical catalogs (which deal mainly with the technical aspects of metadata) support low-level metadata management and access control, while enterprise-grade product catalogs add advanced features like data lineage, classification, and a business glossary. Multiple open-source and commercial catalogs are available and widely adopted across organizations.

In this article, we will discuss Apache Polaris, an open-source technical catalog for Apache Iceberg. You will learn what Apache Polaris is and how you can leverage its cross-engine interoperability features.

But first, let's understand why we need a catalog.

Why do we need a catalog?

In any data ecosystem, the metadata is as important as the data itself. Metadata enables users to discover data and leverage it for their analysis. Without robust metadata management, data lakes can quickly turn into data swamps: dumping grounds for data that is never used for business purposes. Data catalogs help you organize and maintain metadata and use it according to your business requirements.

A data catalog is a single window through which users can access all tables, views, reports, and ML models. It also provides enhanced features like access control management, data lineage, and data classification, enabling organizations to implement robust data governance and security processes. Some enterprise catalogs also provide features to add business context to the technical metadata.

Why does Apache Iceberg need a catalog?

The points discussed in the previous section apply to standard enterprise-grade data catalogs. Apache Iceberg, however, depends on a technical catalog in a stronger sense: the catalog is one of the fundamental components of the Iceberg table specification. Let's discuss this in more detail.

Apache Iceberg is one of the leading open table formats for implementing a data lakehouse architecture. It offers ACID support, time travel, schema evolution, and hidden partitioning features. The diagram below shows the three layers of Apache Iceberg based on the table specification:

As the diagram indicates, the catalog holds the pointer to the latest metadata file for each Iceberg table. Iceberg creates a new metadata file for every write transaction. The metadata file holds the snapshot information and other details, like table schema and partition details. After every transaction, the catalog swaps the metadata file pointer to the latest metadata file. When a client executes a query against an Iceberg table, it reads the catalog to find the latest metadata file and then reads the required manifest and data files.

Iceberg also relies on this catalog-based approach to handle concurrent writes. When two writers commit to the same table at the same time, the catalog's atomic swap of the metadata file pointer ensures that only one commit succeeds; the other writer must re-read the latest metadata and retry. This is how Iceberg preserves ACID compliance.
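The catalog's role in conflict resolution can be illustrated with a minimal sketch. The class below is a toy stand-in for a catalog, not any real Iceberg API: it stores one metadata-file pointer per table and only swaps it if the writer saw the current version (optimistic concurrency).

```python
class ToyCatalog:
    """Toy catalog: maps a table name to its current metadata file path."""

    def __init__(self):
        self._pointers = {}

    def current_metadata(self, table):
        """Return the current metadata-file pointer for a table (or None)."""
        return self._pointers.get(table)

    def commit(self, table, expected, new_metadata):
        """Swap the pointer only if `expected` is still current.
        Returns True on success, False if another writer committed first."""
        if self._pointers.get(table) != expected:
            return False  # conflict: caller must re-read and retry
        self._pointers[table] = new_metadata
        return True


catalog = ToyCatalog()
catalog.commit("orders", None, "metadata/v1.json")  # initial commit

# Two writers both read v1, then race to commit.
base = catalog.current_metadata("orders")
ok_a = catalog.commit("orders", base, "metadata/v2-a.json")  # first writer wins
ok_b = catalog.commit("orders", base, "metadata/v2-b.json")  # conflict detected

print(ok_a, ok_b, catalog.current_metadata("orders"))
# → True False metadata/v2-a.json
```

The losing writer is not blocked forever; it simply re-reads the new pointer and retries its commit on top of the latest metadata.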

Iceberg Catalog Options

Iceberg needs a catalog for table-related operations like creating tables, dropping tables, or checking whether a table exists. There are multiple options for implementing such a cataloging solution for Iceberg tables. Iceberg integrates well with the catalogs listed below:

  • Hadoop Catalog (File-system based)
  • Hive Catalog (uses Hive metastore)
  • AWS Glue Data Catalog (managed AWS service)
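To make the options concrete, here is a sketch of the Spark session properties used to wire up each catalog type, assuming Iceberg's Spark runtime is on the classpath; the catalog name `my_catalog`, bucket, and host names are placeholders.

```properties
# Hadoop catalog: metadata pointer tracked as files under the warehouse path
spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.my_catalog.type=hadoop
spark.sql.catalog.my_catalog.warehouse=s3://my-bucket/warehouse

# Hive catalog: pointer stored in the Hive metastore
spark.sql.catalog.my_catalog.type=hive
spark.sql.catalog.my_catalog.uri=thrift://metastore-host:9083

# AWS Glue Data Catalog: pointer stored in the managed Glue service
spark.sql.catalog.my_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog
```

In practice you would set only one of the three variants for a given catalog name.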

While these solutions can handle the basic cataloging operations, they have some key limitations: the Hadoop catalog relies on file-system semantics that object stores do not fully support, the Hive metastore adds operational overhead, and the Glue Data Catalog ties you to AWS.

To overcome these limitations, the Iceberg community introduced the REST catalog OpenAPI specification, which defines a REST interface for Iceberg catalogs.

Iceberg REST Catalog API

Iceberg’s REST API-based catalog approach has recently gained interest from the data community, mainly because a REST API offers more flexibility and simplifies deployment. The REST API acts as an interface between the catalog and the clients (query engines) accessing the catalog’s tables.

REST API implementations over HTTP are simple to deploy and manage. They do not depend on any specific cloud service, which helps you implement a cloud-agnostic, vendor-neutral solution.
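A few of the routes defined by the Iceberg REST catalog OpenAPI specification can be sketched as simple path builders. The helper below only constructs URLs (no network calls); the base URL and prefix are placeholders, and a single-level namespace is used for brevity (the spec joins multi-level namespaces with the 0x1F unit separator when URL-encoded).

```python
def rest_catalog_routes(base_url, prefix, namespace, table):
    """Build a few endpoint paths from the Iceberg REST catalog spec."""
    return {
        # catalog configuration handed to clients at startup
        "config": f"{base_url}/v1/config",
        # list namespaces in the catalog
        "namespaces": f"{base_url}/v1/{prefix}/namespaces",
        # list tables within a namespace
        "tables": f"{base_url}/v1/{prefix}/namespaces/{namespace}/tables",
        # load a table's current metadata (the latest metadata-file contents)
        "table": f"{base_url}/v1/{prefix}/namespaces/{namespace}/tables/{table}",
    }


routes = rest_catalog_routes(
    "http://localhost:8181/api/catalog", "demo", "sales", "orders"
)
print(routes["table"])
# → http://localhost:8181/api/catalog/v1/demo/namespaces/sales/tables/orders
```

Because every compliant catalog exposes these same routes, a query engine written against the spec can talk to any of the implementations listed below without engine-specific glue code.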

Multiple vendors have created catalogs that can integrate with Iceberg by following the REST API specification. Some of these are listed below:

  • Apache Polaris - based on Iceberg’s REST API specification
  • Apache Gravitino - an open-source catalog built upon Iceberg’s REST API specification
  • Unity Catalog - provides a read-only implementation of Iceberg’s REST catalog API
  • Nessie - a catalog that offers git-like features and now supports Iceberg’s REST catalog API specification

Apache Polaris

Apache Polaris is an open-source catalog that implements Iceberg’s REST API specification. It was open-sourced by Snowflake in 2024 and is now incubating at The Apache Software Foundation (ASF).

The Polaris catalog enables you to manage your Iceberg tables with more flexibility, security, and interoperability. The key challenges it addresses are:

  • Interoperability across engines on a single copy of data
  • Access control across multiple engines
  • Vendor lock-in and dependency on specific cloud platforms

As shown in the above diagram, Polaris provides interoperability across popular open-source engines like Apache Spark, Trino, and Flink. Other query engines from AWS, Azure, GCP, and Dremio can also query data using the Polaris catalog.

The Polaris catalog consists of three layers with the hierarchy as below:

Catalog >> Namespace >> Tables

  • Within a single Polaris instance, you can create multiple ‘catalog’ resources (for example, one catalog per business unit). A catalog can be internal (managed by Polaris) or external (managed by another Iceberg catalog provider, like Snowflake or the AWS Glue Data Catalog).
  • A namespace logically groups multiple Iceberg tables within a catalog; a catalog can contain multiple namespaces.
  • Iceberg tables are the granular components that query engines use to query data.

Access policies can be applied at the catalog, namespace, and table levels.
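The hierarchy and its level-based policies can be sketched in plain Python. This is a toy model, not the Polaris API: a privilege granted at the catalog or namespace level applies to everything beneath it, so a check walks from the table upward.

```python
# Toy model of the Catalog >> Namespace >> Tables hierarchy.
# Grants can attach at any level, keyed by the path tuple to that level.
grants = {
    ("sales",): {"analyst": {"read"}},                  # catalog-level grant
    ("sales", "emea"): {"etl": {"read", "write"}},      # namespace-level grant
    ("sales", "emea", "orders"): {"audit": {"read"}},   # table-level grant
}


def allowed(role, action, catalog, namespace, table):
    """True if `role` holds `action` at the table, its namespace, or its catalog."""
    for scope in [(catalog, namespace, table), (catalog, namespace), (catalog,)]:
        if action in grants.get(scope, {}).get(role, set()):
            return True
    return False


print(allowed("analyst", "read", "sales", "emea", "orders"))  # True (catalog-level)
print(allowed("etl", "write", "sales", "emea", "orders"))     # True (namespace-level)
print(allowed("audit", "write", "sales", "emea", "orders"))   # False (read only)
```

The walk-upward lookup is what makes coarse grants convenient: granting read at the catalog level covers every namespace and table under it without per-table policies.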

Note: Apache Polaris is an independent project that is part of ASF. However, if you are using Snowflake, you can also explore Snowflake Open Catalog, a managed service for Apache Polaris. It helps to implement a secure, centralized catalog that the REST-compatible query engines can access. Snowflake Open Catalog is generally available, and you can consider it for your use cases.

Benefits of Apache Polaris

Polaris has several benefits compared to earlier catalogs like the Hive Metastore (HMS) or the Hadoop catalog. Along with REST API support, it offers the following key benefits:

Interoperability

Polaris provides interoperability across different REST-compatible query engines. Different engines, like Trino, Snowflake, and Spark, can access Iceberg tables through the Polaris catalog. For example, you can create and load an Iceberg table using Snowflake and read this data using the Spark engine.
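As a sketch of the read side of that workflow, the Spark properties below connect Spark to a Polaris catalog over the Iceberg REST protocol; the host, catalog name, and credentials are placeholders for your own deployment.

```properties
# Spark reading Iceberg tables through a Polaris catalog (REST)
spark.sql.catalog.polaris=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.polaris.type=rest
spark.sql.catalog.polaris.uri=https://<polaris-host>/api/catalog
spark.sql.catalog.polaris.warehouse=my_catalog
spark.sql.catalog.polaris.credential=<client-id>:<client-secret>
spark.sql.catalog.polaris.header.X-Iceberg-Access-Delegation=vended-credentials
```

Once configured, a query like `SELECT * FROM polaris.sales.orders` resolves the table through Polaris, regardless of which engine wrote the data.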

Role-based Access Control

Polaris enables enterprise-level security through a central access control mechanism. You can implement role-based access control (RBAC) policies in the Polaris catalog, and any engine that reads data through Polaris is restricted by these central access policies.

No Vendor Lock-in

You can install Polaris on your own infrastructure. There is no restriction to use a specific platform or cloud service.

Apache Polaris is a relatively new catalog. It is still in its early days, but we may see interesting features in the future, such as fine-grained access controls, catalog federation, and support for non-Iceberg tables. These would make it a strong contender for becoming the default choice for implementing open data platforms.

Summary

Apache Iceberg strongly depends on its catalog to ensure ACID compliance and handle concurrent writes. The catalog ensures that it points to the latest metadata file when reading data from Iceberg tables. Iceberg can use multiple catalogs, such as the HMS or Hadoop catalog, but these have specific limitations.

To address these limitations, Iceberg introduced the REST catalog OpenAPI specification. Catalogs implemented against this specification provide flexibility and interoperability across query engines. Apache Polaris is a relatively new catalog, in incubation at the ASF, that implements Iceberg’s REST API specification and supports additional features like RBAC.

If you are looking for an open-source technical catalog for your data lakehouse, you should explore Apache Polaris. If you are using Snowflake, you can also check out Snowflake Open Catalog, a managed service for Apache Polaris.

References

  1. https://other-docs.snowflake.com/en/opencatalog/overview
  2. https://www.dremio.com/blog/the-evolution-of-apache-iceberg-catalogs/#the-purpose-of-iceberg-catalogs
  3. https://www.snowflake.com/en/blog/introducing-polaris-catalog/
  4. https://blog.det.life/a-brief-history-of-rest-catalogs-for-apache-iceberg-39823ea13198
  5. https://medium.com/@kywe665/unity-catalog-vs-apache-polaris-522b69a4d7df
  6. https://docs.databricks.com/aws/en/external-access/iceberg
  7. https://gravitino.apache.org/docs/0.6.1-incubating/iceberg-rest-service/
  8. https://www.dremio.com/blog/use-nessie-with-iceberg-rest-catalog/
  9. https://polaris.apache.org/in-dev/unreleased/overview/