Introduction

DataYoga is a framework for building and running streaming or batch data pipelines. The DataYoga CLI helps define and deploy data pipelines using a declarative markup language based on YAML files.

DataYoga Overview

DataYoga runs Transformation Jobs that read or react to incoming data. Each Transformation Job contains a series of Transformation Steps. Steps can produce data, transform data, or output data to external targets. For example, a Step can read data from a Kafka source, add a calculated field, or enrich data using the Twitter API. A Step can perform NLP analysis on text to identify sentiment using an AI model, or write data to a data lake on S3 in Parquet format.

Each Transformation Step uses a Processor Block that can perform a wide variety of actions.
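
As an illustration, a Transformation Job might be defined along these lines. This is a minimal sketch modeled on DataYoga's declarative style; the block names (files.read_csv, add_field, relational.write) and their fields are assumptions and may differ from the actual schema:

    # job.yaml -- a sketch of a Transformation Job (block names are assumptions)
    steps:
      # produce data: read records from a local CSV file
      - uses: files.read_csv
        with:
          file: employees.csv
      # transform data: add a computed field using a JMESPath expression
      - uses: add_field
        with:
          field: full_name
          language: jmespath
          expression: concat([fname, ' ', lname])
      # output data: write the records to a relational database table
      - uses: relational.write
        with:
          connection: hr
          table: emp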

DataYoga provides a collection of Blocks that can:

  • Read from and write to relational and non-relational databases
  • Read, write, and parse data from local storage and cloud storage
  • Perform transformations that modify structure, add computed fields, and rename or remove fields (see the sketch after this list)
  • Enrich data from external sources and APIs
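
For example, the structure-modifying transformations in the list above might look like the following sketch (rename_field and remove_field are assumed block names, shown only for illustration):

    steps:
      # rename a field to a clearer name (assumed block name and fields)
      - uses: rename_field
        with:
          from_field: fname
          to_field: first_name
      # drop a sensitive field before the data leaves the pipeline
      - uses: remove_field
        with:
          field: credit_card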

DataYoga provides a standalone stream processing engine, the DataYoga Runtime, which validates and runs Transformation Jobs. The Runtime provides:

  • Validation
  • Error handling
  • Metrics and observability
  • Credentials management
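
Credentials management, for instance, typically means that connection details live in a separate declarative file rather than inside the job itself. The sketch below assumes a connections.yaml defining a PostgreSQL connection named hr; the file name and fields are assumptions based on DataYoga's declarative style:

    # connections.yaml -- a sketch of declarative connection definitions
    # (file name and fields are assumptions)
    hr:
      type: postgresql
      host: localhost
      port: 5432
      database: hr
      user: postgres
      password: postgres

Steps such as relational.write in the earlier sketch can then refer to the connection by name (connection: hr), so jobs stay free of embedded secrets and the Runtime resolves credentials when a job is validated and run.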

The Runtime supports multiple processing strategies, including:

  • Stream processing
  • Parallelism
  • Buffering
  • Rate limiting

It supports async processing, multi-threading, and multi-processing to enable maximum throughput with a low footprint.
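
To make these strategies concrete, the sketch below shows how such tuning knobs could be expressed on a step. The keys buffer_size, rate_limit, and parallelism are purely hypothetical illustrations of the strategies listed above, not documented DataYoga configuration:

    steps:
      - uses: relational.write
        with:
          connection: hr
          table: emp
        # hypothetical tuning knobs, named here only for illustration:
        buffer_size: 500   # buffer up to 500 records per batch write
        rate_limit: 100    # cap throughput at 100 batches per second
        parallelism: 4     # process this step with 4 concurrent workers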