Apache Iceberg – The Definitive Guide

Name: Apache Iceberg – The Definitive Guide
Author: Tomer Shiran

Data Lakehouse Functionality, Performance, and Scalability on the Data Lake

Specifications

Paperback, 300 pages | English

O'Reilly | 1st edition, 2024

ISBN13: 9781098148621

Classification

Main category: Computers and information technology

Legal category: Computers and information technology

Summary

Traditional data architecture patterns are severely limited. To use these patterns, you have to ETL data into each tool, a cost-prohibitive process for making warehouse features available to all of your data. The lack of flexibility with these patterns requires you to lock into a set of proprietary tools and formats, which creates data silos and data drift. This practical book shows you a better way.

Apache Iceberg provides the capabilities, performance, scalability, and savings that fulfill the promise of an open data lakehouse. By following the lessons in this book, you'll be able to achieve interactive, batch, machine learning, and streaming analytics with this high-performance open source format. Authors Tomer Shiran, Jason Hughes, and Alex Merced from Dremio show you how to get started with Iceberg.

With this book, you'll learn:
- The architecture of Apache Iceberg tables
- What happens under the hood when you perform operations on Iceberg tables
- How to further optimize Iceberg tables for maximum performance
- How to use Iceberg with popular data engines such as Apache Spark, Apache Flink, and Dremio
- Why Apache Iceberg is a foundational technology for implementing an open data lakehouse

Specifications

ISBN13: 9781098148621

Keywords: Programming, Web programming, Apache

Language: English

Binding: paperback

Number of pages: 300

Publisher: O'Reilly

Edition: 1

Publication date: 29 February 2024

Main category: Computers and information technology

Table of Contents

Foreword by Gerrit Kazmaier
Foreword by Raghu Ramakrishnan
Foreword by Rick Sears
Preface
About This Book
Why We Wrote This Book
What You Will Find Inside
How to Use This Book
Feedback and Questions
Conventions Used in This Book
Using Code Examples
O’Reilly Online Learning
How to Contact Us
Acknowledgments

I. Fundamentals of Apache Iceberg
1. Introduction to Apache Iceberg
How Did We Get Here? A Brief History
Foundational Components of a System Designed for OLAP Workloads
Bringing It All Together
The Data Warehouse
A Brief History
Pros and Cons of a Data Warehouse
The Data Lake
A Brief History
Pros and Cons of a Data Lake
Should I Run Analytics on a Data Lake or a Data Warehouse?
The Data Lakehouse
What Is a Table Format?
Hive: The Original Table Format
Modern Data Lake Table Formats
What Is Apache Iceberg?
How Apache Iceberg Came to Be
The Apache Iceberg Architecture
Key Features of Apache Iceberg
Conclusion

2. The Architecture of Apache Iceberg
The Data Layer
Datafiles
Delete Files
The Metadata Layer
Manifest Files
Manifest Lists
Metadata Files
Puffin Files
The Catalog
Conclusion

3. Lifecycle of Write and Read Queries
Writing Queries in Apache Iceberg
Create the Table
Insert the Query
Merge Query
Reading Queries in Apache Iceberg
The SELECT Query
The Time-Travel Query
Conclusion

4. Optimizing the Performance of Iceberg Tables
Compaction
Hands-on with Compaction
Compaction Strategies
Automating Compaction
Sorting
Z-order
Partitioning
Hidden Partitioning
Partition Evolution
Other Partitioning Considerations
Copy-on-Write Versus Merge-on-Read
Copy-on-Write
Merge-on-Read
Configuring COW and MOR
Other Considerations
Metrics Collection
Rewriting Manifests
Optimizing Storage
Write Distribution Mode
Object Storage Considerations
Datafile Bloom Filters
Conclusion

5. Iceberg Catalogs
Requirements of an Iceberg Catalog
Catalog Comparison
The Hadoop Catalog
The Hive Catalog
The AWS Glue Catalog
The Nessie Catalog
The REST Catalog
The JDBC Catalog
Other Catalogs
Catalog Migration
Using the Apache Iceberg Catalog Migration CLI
Using an Engine
Conclusion

II. Hands-on with Apache Iceberg
6. Apache Spark
Configuration
Configuring Apache Iceberg and Spark
Configuring the Catalogs
Starting Spark with All the Configurations (AWS Glue Example)
Data Definition Language Operations
CREATE TABLE
ALTER TABLE
Alter a Table with Iceberg’s Spark SQL Extensions
DROP TABLE
Reading Data
The Select All Query
The Filter Rows Query
Aggregation Queries
Using Window Functions
Writing Data
INSERT INTO
MERGE INTO
INSERT OVERWRITE
DELETE FROM
UPDATE
Iceberg Table Maintenance Procedures
Expire Snapshots
Rewrite Datafiles
Rewrite Manifests
Remove Orphan Files
Conclusion

7. Dremio’s SQL Query Engine
Configuration
Data Definition Language Operations
CREATE TABLE
ALTER TABLE
DROP TABLE
Reading Data
Using the SELECT Query
Filtering Rows
Using Aggregated Queries
Using Window Functions
Writing Data
INSERT INTO
COPY INTO
MERGE INTO
DELETE
UPDATE
Iceberg Table Maintenance
Expire Snapshots
Rewrite Datafiles
Rewrite Manifests
Conclusion

8. AWS Glue
Configuration
Creating a Glue Database
Configuring the Glue ETL Job
Create a Table Using the Glue Data Catalog
Read the Table
Insert the Data
Conclusion

9. Apache Flink
Configuration
Prerequisites
Start the Flink Cluster and Flink SQL Client
Data Definition Language Operations
CREATE CATALOG
CREATE DATABASE
CREATE TABLE
ALTER TABLE
DROP TABLE
Reading Data
Flink SQL Batch Read
Flink SQL Streaming Read
Metadata Table
Writing Data
INSERT INTO
INSERT OVERWRITE
UPSERT
Flink DataFrame and Table API with Apache Iceberg Tables
Prerequisites
Configuring the Flink Job
Starting the Cluster and Building the Package
Running the Job
Conclusion

III. Apache Iceberg in Practice
10. Apache Iceberg in Production
Apache Iceberg Metadata Tables
The history Metadata Table
The metadata_log_entries Metadata Table
The snapshots Metadata Table
The files Metadata Table
The manifests Metadata Table
The partitions Metadata Table
The all_data_files Metadata Table
The all_manifests Metadata Table
The refs Metadata Table
The entries Metadata Table
Using the Metadata Tables in Conjunction
Isolation of Changes with Branches
Table Branching and Tagging
Catalog Branching and Tagging
Multitable Transactions
Rolling Back Changes
Rolling Back at the Table Level
Rolling Back at the Catalog Level
Conclusion

11. Streaming with Apache Iceberg
Streaming with Spark
Streaming into Iceberg with Spark
Streaming from Iceberg with Spark
Streaming with Flink
Streaming into Iceberg with Flink
Example of Streaming into Iceberg with Flink
Streaming with Kafka Connect
The Iceberg Kafka Sink
Streaming with AWS
Conclusion

12. Governance and Security
Securing Datafiles
Securing Files: Best Practices
Hadoop Distributed File System
Amazon Simple Storage Service
Azure Data Lake Storage
Google Cloud Storage
Securing and Governing at the Semantic Layer
Semantic Layer Best Practices
Dremio
Trino
Securing and Governing at the Catalog Level
Nessie
Tabular
AWS Glue and Lake Formation
Additional Security and Governance Considerations
Conclusion

13. Migrating to Apache Iceberg
Migration Considerations
Three-Step In-Place Migration Plan
Four-Phase Shadow Migration Plan
Migrating Hive Tables to Apache Iceberg
The Snapshot Procedure
The Migrate Procedure
Migrating Delta Lake to Apache Iceberg
Migrating Apache Hudi to Apache Iceberg
Migrating Individual Files to Apache Iceberg
Using the add_files Procedure
Migrating from Delta Lake or Apache Hudi Without Preserving History
Migrating from Anywhere by Rewriting Data
Migrating Data to a New Iceberg Table
Migrating Data into an Existing Iceberg Table
Conclusion

14. Real-World Use Cases of Apache Iceberg
Ensuring High-Quality Data with Write-Audit-Publish in Apache Iceberg
WAP Using Iceberg’s Branching Feature
Running BI Workloads on the Data Lake
Land the Raw Data into the Data Lake
Curate Virtual Data Marts/Data Products
Create a Reflection to Accelerate Our Dashboard
Connect Our View to Our BI Tool
Benefits of Running BI Workloads on the Data Lake
Implementing Change Data Capture with Apache Iceberg
Create Apache Iceberg Tables
Apply Updates from Operational Systems
Create the Change Log View to Capture Changes
Merge Changed Data in the Aggregated Table
Conclusion

Index
About the Authors