Declarative Data Pipelines with Hoptimator


Table 1: Partial listing of user onboarding experiences

The holes in the above table – the majority of spaces – represent gaps where self-service does not exist yet. In those cases, creating an end-to-end data pipeline involves writing custom code to bridge the gaps. For streaming data pipelines, this means writing stream processing jobs.

For example, to create an end-to-end data pipeline which brings data from Espresso into Pinot, we have self-service solutions for the Espresso→Brooklin hop and for the Kafka→Pinot hop, but not for the Brooklin→Kafka hop in between. A developer would need to write and operationalize a custom stream processing job to replicate their Brooklin datastream into a Kafka topic. A number of Samza and Beam jobs exist for such purposes.

The records streaming through these data pipelines often require transformation into a more convenient format. For example, Pinot ingestion by default expects records to be flat, with field names and types that are compatible with the Pinot table definition. It is unlikely that an Espresso table and a Pinot table happen to agree on these details. This sort of mismatch can occur between any pair of systems. Thus, data pipelines often involve some stream processing logic to transform records of one schema into another, filter out unnecessary records, or drop extra fields.
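
As a rough illustration of the kind of transformation involved (the table and field names below are invented for illustration, not an actual LinkedIn schema), a streaming SQL statement might flatten and coerce records like this:

```sql
-- Hypothetical flattening step: reshape a nested source record into the flat
-- schema a Pinot table expects. Tables, fields, and types are illustrative only.
INSERT INTO PinotMemberProfiles
SELECT
  p.memberId               AS memberId,
  p.name.firstName         AS firstName,   -- flatten nested fields
  p.name.lastName          AS lastName,
  CAST(p.score AS DOUBLE)  AS score        -- coerce to the sink's type
FROM EspressoMemberProfiles AS p
WHERE p.isActive;                          -- drop unnecessary records
```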

This means that data pipelines almost always require some form of stream processing in the middle. We have historically thought about these as two different technologies (e.g. Brooklin vs Samza), and have left it to developers to string them together. In order to provide an end-to-end data pipeline experience, we need a way to combine stream processing and data pipelines into a single concept.

Enter Flink

We’ve recently adopted Apache Flink at LinkedIn, and Flink SQL has changed the way we think about data pipelines and stream processing. Flink is often seen as a stream processing engine, and historically its APIs have reflected that. But since the introduction of the Table API and Flink SQL, Flink has evolved to support more general-purpose data pipelines.

This is in large part due to the Table API’s concept of Connectors, which are not unlike the connectors of Brooklin or Kafka Connect. Connectors are the glue between different systems, and thus are associated with data pipelines. To some extent, the Table API subsumes Brooklin’s use-cases by pulling Connectors into a converged stream processing platform.

This means we can express data pipelines and stream processing in the same language (SQL) and run them on the same runtime (Flink). End-to-end data pipelines that would normally span multiple systems and require custom code can be written as a bit of Flink SQL and deployed in one shot.
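
For example, a sketch of such a pipeline in Flink SQL might look like the following. The table names and connector options are placeholders, not a real configuration, and the Pinot sink connector is assumed to be available rather than guaranteed to exist:

```sql
-- Placeholder source definition; real deployments need additional connector
-- options (broker addresses, schema registry, credentials, etc.).
CREATE TABLE PageViewEvents (
  memberId BIGINT,
  pageKey  STRING,
  viewTime TIMESTAMP(3)
) WITH (
  'connector' = 'kafka',
  'topic'     = 'PageViewEvent',
  'format'    = 'avro'
);

-- Placeholder sink definition; assumes a Pinot sink connector is installed.
CREATE TABLE PageViewsForPinot (
  memberId BIGINT,
  pageKey  STRING,
  viewTime TIMESTAMP(3)
) WITH (
  'connector' = 'pinot'
);

-- The pipeline itself is a single INSERT: read, transform, write.
INSERT INTO PageViewsForPinot
SELECT memberId, pageKey, viewTime
FROM PageViewEvents
WHERE pageKey <> 'unknown';
```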

Toward Declarative Data Pipelines

From a user perspective, the ideal end-to-end experience is a single authoring language (e.g. Flink SQL) for a single runtime (e.g. Flink) on a single big cluster. Users want to deploy a data pipeline with kubectl apply -f my-pipeline.yaml. The reality, however, is considerably more complex. A single end-to-end data pipeline at LinkedIn may span multiple purpose-built data plane systems (e.g. Brooklin, Gobblin), run on multiple stream processing engines (e.g. Samza, Flink), and talk to multiple storage systems (e.g. Espresso, Venice). Each of these may require manual onboarding, custom code, imperative API calls, and so on.

Starting from the ideal user experience and working backwards, we can imagine a declarative model for end-to-end data pipelines. This would present data pipelines as a single construct, but would implement them by assembling the required components from the data plane and compute layer. If the data pipeline requires some stream processing, we could automatically provision a Flink job. If part of the data pipeline requires an approval or review process, we could automatically trigger the workflow.

Leaning into the Kubernetes ecosystem, it’s clear this would involve a sophisticated operator. This would take a custom resource spec (essentially, a YAML file) and turn it into various physical resources in the data plane. Ultimately, a single pipeline spec would result in new Flink jobs, Kafka topics, and so on.
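
A purely hypothetical spec of this kind (the API group, kind, and fields below are invented for illustration, not an actual CRD) might look something like:

```yaml
# Hypothetical pipeline custom resource -- not a real API.
apiVersion: example.com/v1alpha1
kind: Pipeline
metadata:
  name: espresso-to-pinot
spec:
  source:
    espresso:
      database: MemberDB
      table: MemberProfiles
  transform:
    flinkSql: |
      SELECT memberId, firstName, lastName FROM MemberProfiles
  sink:
    pinot:
      table: memberProfiles
```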

However, it’s not hard to imagine the proliferation of complex configuration that may result from such a model. It may be nice to have a single YAML file, but only insofar as that YAML file is itself simple.

To solve this problem, we started looking into expressing end-to-end data pipelines in SQL. We use streaming SQL extensively at LinkedIn, but existing SQL only expresses one “hop” of a data pipeline, e.g. from one Kafka topic to another. This has resulted in data pipelines that span hundreds of SQL statements. Ideally, an entire data pipeline could be codified as a single, high-level construct. What if an entire end-to-end, multi-hop data pipeline were just a SQL query?

Hoptimator: SQL-based Multi-hop Data Pipeline Orchestrator 

We’ve been building an experimental data pipeline orchestrator called Hoptimator. It’s essentially a sophisticated Kubernetes operator that constructs end-to-end, multi-hop data pipelines based on SQL queries. Hoptimator’s user experience is built around a high-level concept we call “subscriptions”: each subscription represents a materialized view. Given a subscription, Hoptimator automatically creates a data pipeline to materialize the corresponding view. This enables developers to create complex data pipelines with shocking simplicity:
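
For illustration, a subscription might look roughly like the following (the exact API version and field names are approximate):

```yaml
# Illustrative sketch of a subscription; field names are approximate.
apiVersion: hoptimator.linkedin.com/v1alpha1
kind: Subscription
metadata:
  name: member-page-views
spec:
  database: PINOT
  sql: SELECT memberId, pageKey, viewTime FROM KAFKA.PageViewEvents
```

Applying a resource like this (e.g. with kubectl apply) is, in effect, the single-file experience described above: the operator reconciles it into the underlying hops – Kafka topics, Flink jobs, and so on – needed to keep the view up to date.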


