Harnessing Real-Time Data Analytics with Presto and Kafka at Uber

Chapter 1: Introduction to Data at Uber

Data plays a pivotal role at Uber. Whether it's optimizing driver routes or forecasting rider demand, our capacity to swiftly process and analyze massive datasets in real-time is essential. This article delves into how Uber leverages Presto and Apache Kafka to manage data analytics efficiently.

Section 1.1: Overview of Presto and Kafka

What is Presto?

Presto is a powerful open-source distributed SQL query engine that enables interactive querying across extensive datasets. Its versatility allows it to access data from various storage solutions, including databases and data lakes.

What is Apache Kafka?

Apache Kafka serves as a distributed streaming platform, facilitating the publication, subscription, storage, and processing of real-time data streams. At Uber, it acts as the backbone for our data streaming operations, managing trillions of messages on a daily basis.

Section 1.2: The Challenge of Real-Time Data Analysis

As Uber expanded, our data teams required an effective method for executing prompt, on-demand analyses of live data streams from Kafka. Traditional approaches often demanded complex configurations or failed to provide the low-latency performance necessary for troubleshooting and exploratory data analysis.

Chapter 2: Exploring Solutions

To address the needs for real-time data analysis, we evaluated several potential solutions:

Streaming Processing Engines: While tools like Apache Flink continuously process streams, they are not suited for quick, point-in-time queries.
Real-time OLAP Datastores: Solutions like Apache Pinot can deliver low-latency queries but typically demand substantial setup and resources.

Ultimately, we found that integrating Presto with Kafka offered an optimal combination of user-friendliness, versatility, and performance.

Section 2.1: The Solution - Integrating Presto and Kafka

By merging Presto with Kafka, we enabled real-time SQL queries directly on Kafka streams, which brought several advantages:

Immediate Access: Kafka topics can be queried immediately upon creation without any additional setup.
Advanced Query Capabilities: Presto’s functionality allows for comprehensive insights by joining data across multiple sources.

However, we faced challenges such as dynamic topic discovery, query limitations, and quota management, which we addressed through strategic enhancements.

Section 2.2: Overcoming Integration Challenges

Integrating these systems involved tackling various technical hurdles:

Dynamic Topic Discovery: We established a system for on-demand discovery of Kafka topics and schemas.
Query Restrictions: To maintain system performance, we implemented limits on the amount of data each query could retrieve from Kafka.
Quota Management: Leveraging Kafka's broker quotas allowed us to control data consumption and prevent system overloads.

Section 2.3: Dynamic Topic and Schema Discovery

We implemented a mechanism for on-demand discovery of cluster/topic and schema information. Kafka topic metadata and data schema are dynamically retrieved using KafkaMetadata at runtime. This approach supports multiple Kafka clusters within a single connector and includes a caching layer to minimize requests to the Kafka cluster.

Section 2.4: Query Filtering for Reliability

To enhance reliability, we introduced checks to enforce column filters that require either _timestamp or _partition_offset to be present in the filter constraints of Presto queries for Kafka. Queries lacking these filters are rejected, preventing excessive data pulls that could compromise system performance.

Section 2.5: Managing Quotas in the Kafka Cluster

Given Kafka's critical role at Uber, we aimed to prevent potential cluster degradation by regulating Presto's data consumption. This was achieved by assigning a static Kafka consumer client ID to all Presto workers, subjecting them to a unified quota pool that ensures stable consumption rates and averts overloads.

Chapter 3: The Impact of Our Integration

The fusion of Presto and Kafka has dramatically enhanced our capacity for ad-hoc data analysis. Engineers can now execute straightforward SQL queries to access real-time data in mere seconds, significantly boosting productivity and decision-making capabilities. Previously, data lookups could take several minutes; now, they are completed in just a few seconds.

For instance, verifying whether a specific order message is absent from a Kafka stream can now be accomplished with this simple SQL query:

SELECT * FROM kafka.cluster.order WHERE uuid= '0e43a4–5213–11ec';

This query yields results in seconds, enabling rapid troubleshooting and data exploration.

Section 3.1: Conclusion

The integration of Presto with Kafka has revolutionized the manner in which Uber conducts real-time data analysis. By facilitating swift, on-demand queries over live data streams, we have empowered our teams to make quicker, data-driven decisions. Looking ahead, we intend to share our advancements with the open-source community, allowing others to benefit from our innovations.

The video titled "Real Time Analytics at Uber with Presto-Pinot" provides an in-depth exploration of how Uber employs these technologies for effective data analysis.

ronwdavis.com