Apache Iceberg Architecture
Apache Iceberg represents every table as a tree of metadata files sitting in object storage alongside the actual data. That tree is the key to how Iceberg delivers ACID guarantees, fast queries, and multi-engine access without a central coordinator.
The Metadata Tree
Table Metadata JSON
Every Iceberg table has a current metadata JSON file recording the full
schema (with column IDs, not just names), all partition specs, all sort
orders, and a list of all snapshots. The catalog holds a pointer to the
current version. When you run ALTER TABLE ... ADD COLUMN,
Iceberg writes a new metadata JSON with the updated schema and swaps the
catalog pointer. No data files are rewritten.
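As a hedged sketch of that file's shape (the field names follow the Iceberg v2 table spec, but the UUID, paths, IDs, and values here are made up for illustration, and real files carry more fields, such as sort orders and the snapshot log):

```python
# Trimmed sketch of a table-metadata JSON file, as a Python dict.
# Note that BOTH schemas are kept: ALTER TABLE ... ADD COLUMN wrote
# schema 1 and bumped current-schema-id; schema 0 was not rewritten.
metadata = {
    "format-version": 2,
    "table-uuid": "9c12d441-03fe-4693-9a96-a0705ddf69c1",  # illustrative
    "location": "s3://warehouse/db/orders",                # illustrative
    "current-schema-id": 1,
    "schemas": [
        {"schema-id": 0, "fields": [
            {"id": 1, "name": "order_id", "type": "long", "required": True},
        ]},
        {"schema-id": 1, "fields": [
            {"id": 1, "name": "order_id", "type": "long", "required": True},
            {"id": 2, "name": "order_date", "type": "date", "required": False},
        ]},
    ],
    "current-snapshot-id": 3051729675574597004,
    "snapshots": [
        {"snapshot-id": 3051729675574597004,
         "timestamp-ms": 1700000000000,
         "manifest-list": "s3://warehouse/db/orders/metadata/snap-3051729675574597004.avro"},
    ],
}

# Resolving the current schema is a lookup, not a scan of data files.
current = next(s for s in metadata["schemas"]
               if s["schema-id"] == metadata["current-schema-id"])
print([f["name"] for f in current["fields"]])  # -> ['order_id', 'order_date']
```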
Snapshots
A snapshot represents the table state at a specific committed transaction. Each snapshot has a unique ID, a parent snapshot ID, a timestamp, a summary of what changed, and a pointer to the manifest list. Snapshots are immutable. Once written, they never change. Time travel and consistent reads are direct consequences of this.
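A minimal sketch of how immutable snapshots make time travel a simple lookup (the `Snapshot` class and all values here are illustrative, not PyIceberg's API):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)  # frozen: immutable once written, like a committed snapshot
class Snapshot:
    snapshot_id: int
    parent_id: Optional[int]   # forms the history chain
    timestamp_ms: int
    manifest_list: str         # path to this snapshot's manifest-list file

def snapshot_as_of(snapshots: list[Snapshot], ts_ms: int) -> Snapshot:
    """Time travel: the latest snapshot committed at or before ts_ms."""
    eligible = [s for s in snapshots if s.timestamp_ms <= ts_ms]
    return max(eligible, key=lambda s: s.timestamp_ms)

history = [
    Snapshot(1, None, 1_000, "snap-1.avro"),
    Snapshot(2, 1,    2_000, "snap-2.avro"),
    Snapshot(3, 2,    3_000, "snap-3.avro"),
]
print(snapshot_as_of(history, 2_500).snapshot_id)  # -> 2
```

Because no snapshot is ever mutated, a reader that resolved snapshot 2 sees a consistent table even while a writer commits snapshot 3.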
Manifest List
An Avro file in which each record describes one manifest file and summarizes that manifest's partition value ranges. This summary enables manifest-level partition pruning: the query planner reads only the manifest list to decide which manifests can be skipped entirely, before opening any manifest file.
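A sketch of that pruning step, under the simplifying assumption of a single date partition column (paths and ranges are made up; ISO date strings compare correctly as strings):

```python
def overlaps(lo, hi, q_lo, q_hi):
    """Closed ranges [lo, hi] and [q_lo, q_hi] intersect."""
    return lo <= q_hi and q_lo <= hi

# Each manifest-list entry summarizes the partition range one manifest covers.
manifest_list = [
    {"path": "m1.avro", "date_min": "2024-01-01", "date_max": "2024-03-31"},
    {"path": "m2.avro", "date_min": "2024-04-01", "date_max": "2024-06-30"},
    {"path": "m3.avro", "date_min": "2024-07-01", "date_max": "2024-09-30"},
]

def prune_manifests(entries, q_lo, q_hi):
    """Keep only manifests whose partition range can match the filter."""
    return [e["path"] for e in entries
            if overlaps(e["date_min"], e["date_max"], q_lo, q_hi)]

# WHERE order_date BETWEEN '2024-05-01' AND '2024-05-31' touches one manifest.
print(prune_manifests(manifest_list, "2024-05-01", "2024-05-31"))  # -> ['m2.avro']
```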
Manifest Files
Each manifest file is Avro. Every record describes one data file or delete
file and includes the file's path, format, partition values, record count,
file size, and per-column statistics (min value, max value, null count).
These statistics enable file-level data skipping: if a file's max order_date is earlier than the query's filter, the engine skips that file without opening
it.
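The min/max skip test can be sketched over hypothetical manifest entries (field names and values here are illustrative, not the Avro schema's):

```python
# Manifest entries for two data files, with per-column statistics.
entries = [
    {"path": "a.parquet", "record_count": 10_000,
     "stats": {"order_date": {"min": "2024-04-02", "max": "2024-04-28", "null_count": 0}}},
    {"path": "b.parquet", "record_count": 12_500,
     "stats": {"order_date": {"min": "2024-05-03", "max": "2024-06-20", "null_count": 3}}},
]

def can_skip(entry, column, q_lo, q_hi):
    """True if the file cannot possibly contain rows in [q_lo, q_hi]."""
    s = entry["stats"][column]
    return s["max"] < q_lo or s["min"] > q_hi

kept = [e["path"] for e in entries
        if not can_skip(e, "order_date", "2024-05-01", "2024-05-31")]
print(kept)  # -> ['b.parquet']  (a.parquet's max date precedes the filter)
```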
The Read Path: How Query Planning Works
1. Ask the catalog for the location of the table's current metadata JSON.
2. Read the metadata JSON and pick a snapshot: the current one, or an older one for time travel.
3. Read that snapshot's manifest list and drop manifests whose partition ranges cannot match the filter.
4. Read the surviving manifest files and drop data files using their partition values and per-column min/max statistics.
5. Hand the remaining list of data files to the execution engine.
Steps 1 through 5 are pure metadata operations: no Parquet data has been read yet. A query that filters on a well-partitioned date column may skip 99% of the data files. This is how Iceberg achieves warehouse-level query performance on raw object storage.
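The whole planning flow can be sketched end to end, with in-memory dicts standing in for the catalog and the object store (all paths and names are illustrative):

```python
# Toy "object store": path -> parsed content. A real store holds JSON/Avro bytes.
store = {
    "catalog:db.orders": "meta-2.json",            # the catalog pointer
    "meta-2.json": {"current-snapshot-id": 2,
                    "snapshots": {2: "snap-2.avro"}},
    "snap-2.avro": [                               # manifest list entries
        {"manifest": "m1.avro", "min": "2024-01-01", "max": "2024-03-31"},
        {"manifest": "m2.avro", "min": "2024-04-01", "max": "2024-06-30"},
    ],
    "m1.avro": [{"file": "a.parquet", "min": "2024-01-05", "max": "2024-02-10"}],
    "m2.avro": [{"file": "b.parquet", "min": "2024-04-02", "max": "2024-04-28"},
                {"file": "c.parquet", "min": "2024-05-03", "max": "2024-06-20"}],
}

def plan(table, q_lo, q_hi):
    meta = store[store[f"catalog:{table}"]]              # catalog + metadata JSON
    snap = meta["snapshots"][meta["current-snapshot-id"]]
    files = []
    for m in store[snap]:                                # prune whole manifests
        if m["max"] < q_lo or m["min"] > q_hi:
            continue
        for f in store[m["manifest"]]:                   # prune individual files
            if not (f["max"] < q_lo or f["min"] > q_hi):
                files.append(f["file"])
    return files  # only NOW would the engine open any Parquet file

print(plan("db.orders", "2024-05-01", "2024-05-31"))  # -> ['c.parquet']
```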
The Write Path: Committing a Snapshot
A writer never modifies existing files. It writes new data files, then a new manifest describing them, then a new manifest list, then a new metadata JSON containing the new snapshot. None of this is visible to readers until the final step: an atomic swap of the catalog pointer from the old metadata file to the new one. If the swap fails because another writer committed first, the writer re-resolves the table, checks for conflicts, and retries.
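A hedged sketch of a commit, with dicts standing in for the catalog and object store; `commit_append` and its file layout are illustrative, not Iceberg's actual code. The point is that every write produces only new, uniquely named objects, and the single mutation is the final pointer swap:

```python
import uuid

def commit_append(catalog, store, table, new_manifest_entries):
    """Sketch: a commit writes only NEW objects, then swaps one pointer."""
    base_path = catalog[table]                  # metadata the writer started from
    base = store[base_path]
    # 1. Data files (Parquet) were already written by the engine.
    # 2. Write a new manifest describing them.
    manifest = f"manifest-{uuid.uuid4()}.avro"
    store[manifest] = new_manifest_entries
    # 3. Write a new manifest list: the old manifests plus the new one.
    snap_list = f"snap-{uuid.uuid4()}.avro"
    store[snap_list] = store[base["manifest-list"]] + [manifest]
    # 4. Write a new metadata JSON recording the new snapshot.
    meta = f"meta-{uuid.uuid4()}.json"
    store[meta] = {"manifest-list": snap_list, "parent": base_path}
    # 5. Atomic compare-and-swap of the catalog pointer. If another writer
    #    committed meanwhile, fail so the caller can re-validate and retry.
    if catalog[table] != base_path:
        return False
    catalog[table] = meta
    return True

store = {"meta-0.json": {"manifest-list": "snap-0.avro", "parent": None},
         "snap-0.avro": []}
catalog = {"db.orders": "meta-0.json"}
assert commit_append(catalog, store, "db.orders", [{"file": "a.parquet"}])
```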
Concurrency
Iceberg uses optimistic concurrency control: writers take no locks. Each writer prepares its new metadata against the snapshot it started from and commits with a compare-and-swap of the catalog pointer; a writer that loses the race re-validates its changes against the new current snapshot and retries. Two appends to different partitions: no conflict, both succeed. Two writers overwriting the same partition: conflict, one retries. A compaction job and an append job: usually compatible, both succeed.
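The retry loop reduces to a compare-and-swap on the catalog pointer; a toy sketch (pointer values are illustrative strings):

```python
def cas(catalog, table, expected, new):
    """Compare-and-swap: the only 'atomic' primitive the commit needs."""
    if catalog[table] != expected:
        return False
    catalog[table] = new
    return True

catalog = {"db.orders": "meta-0"}

# Two writers both start from meta-0; optimistic, so neither takes a lock.
w1_base = w2_base = catalog["db.orders"]

assert cas(catalog, "db.orders", w1_base, "meta-1")       # writer 1 wins the race
assert not cas(catalog, "db.orders", w2_base, "meta-2a")  # writer 2's CAS fails

# Writer 2 re-reads the new base, checks its changes still apply cleanly
# (e.g. it appended to a partition writer 1 did not touch), rebases, retries.
w2_base = catalog["db.orders"]
assert cas(catalog, "db.orders", w2_base, "meta-2b")
print(catalog["db.orders"])  # -> meta-2b
```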
Schema Evolution via Column IDs
Every column has a permanent numeric field ID. Data files tag each column with its field ID, so readers resolve columns by ID rather than by name. Renaming, adding, or reordering columns therefore requires no data rewrites: old files still work because Iceberg maps field IDs to current names at read time.
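A sketch of ID-based resolution, assuming a hypothetical file layout keyed by field ID (not Parquet's actual on-disk format):

```python
# A file written BEFORE a rename: columns are keyed by field ID, not name.
old_file = {"columns": {1: [100, 101], 2: ["2024-01-01", "2024-01-02"]}}

# Current schema AFTER renaming order_date to placed_at; the IDs are unchanged.
schema = [{"id": 1, "name": "order_id"}, {"id": 2, "name": "placed_at"}]

def read_projected(file, schema, names):
    """Resolve requested names to field IDs, then fetch columns by ID."""
    name_to_id = {c["name"]: c["id"] for c in schema}
    return {n: file["columns"][name_to_id[n]] for n in names}

# The old file answers a query using the NEW column name, with no rewrite.
print(read_projected(old_file, schema, ["placed_at"]))
# -> {'placed_at': ['2024-01-01', '2024-01-02']}
```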
Snapshot Maintenance
Old snapshots reference data files and keep them from being garbage collected. The standard maintenance operations are: expire snapshots (drop snapshots past a retention window and delete files only they reference), remove orphan files (delete files in the table location that no metadata references), compact data files (rewrite many small files into fewer large ones), and rewrite manifests (regroup manifest entries so partition pruning stays effective).
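Why expiring a snapshot can delete some files but not others comes down to reachability; a sketch over illustrative snapshots:

```python
def reachable_files(snapshots, live_ids):
    """Union of data files referenced by the snapshots being kept."""
    return {f for s in snapshots if s["id"] in live_ids for f in s["files"]}

snapshots = [
    {"id": 1, "files": {"a.parquet", "b.parquet"}},  # old snapshot, to expire
    {"id": 2, "files": {"b.parquet", "c.parquet"}},  # current snapshot
]

live = reachable_files(snapshots, live_ids={2})
all_refs = reachable_files(snapshots, live_ids={1, 2})
deletable = all_refs - live
print(sorted(deletable))  # -> ['a.parquet']  (b.parquet survives: snapshot 2 still needs it)
```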