This article is divided into two parts: 1. Reasons for instability, and 2. How to build a process to achieve a robust data platform.
Let’s shed more light on the factors which are the root cause of instability in data infrastructure. Starting with: data pipelines.
In a data infrastructure where the analytics data warehouse is separate from the primary data source [the reasons for keeping these separate will be discussed in a separate post], data pipelines are inevitable. They are required to move data from the various data sources to the destination data warehouse / data mart.
Data pipelines are required irrespective of whether you go with an ELT strategy or ETL. These pipelines are either batch executions or continuously running jobs.
The data pipeline is the basic unit affecting stability in data infrastructure. Pipelines fail, get stuck, or get delayed, resulting in data failures.
Multiple modules work together to move data
For any data pipeline in execution, multiple modules are in play to make data movement possible, such as:
-Data onboarding: data pull connectors (e.g. pulls from RDBMS, APIs, NoSQL stores, files etc.) or data push services (e.g. REST APIs or webhooks which allow apps to push data)
-Data lake: to store all data across all sources (e.g. S3 / HDFS)
-Batch workflow manager / scheduler (something like Airflow / Oozie / AWS Glue)
-Complex data processing elements like continuously running stream processors and AI/ML containerised data processing nodes
These multiple modules (each built on distinct tech tools), working together in a distributed way across clusters of machines, increase the number of failure points.
These modules are distributed across instances (VMs / containers / physical machines).
E.g. separate VMs for Kafka, containerised data pull connectors, Spark / Hive / Drill / Presto based ETL deployed on virtual machines, AI/ML containers as part of data science ETL jobs, Airflow deployed on a VM to manage all DAG definitions, etc.
This increases the probability of infrastructure failure (most commonly disk full, memory / CPU bottlenecks, machine restarts etc.).
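To make the multi-module nature concrete, below is a minimal sketch, assuming Airflow 2.x, of a DAG that wires a few such modules together: a containerised pull connector, a copy into an S3 data lake, and a Spark job. The connector image, bucket, and script paths are hypothetical placeholders, not a prescribed implementation.

```python
# Sketch of a multi-module pipeline: pull connector -> data lake -> Spark ETL.
# Image names, bucket names and script paths are hypothetical placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="orders_daily_load",           # hypothetical pipeline name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    # Module 1: containerised data pull connector (RDBMS -> local file)
    pull_orders = BashOperator(
        task_id="pull_orders",
        bash_command=(
            "docker run --rm -v /tmp:/tmp acme/rdbms-pull-connector "
            "--table orders --date {{ ds }} --out /tmp/orders_{{ ds }}.csv"
        ),
    )

    # Module 2: land the extracted file in the S3 data lake
    load_to_lake = BashOperator(
        task_id="load_to_lake",
        bash_command="aws s3 cp /tmp/orders_{{ ds }}.csv s3://acme-data-lake/raw/orders/{{ ds }}/",
    )

    # Module 3: Spark-based transformation reading from the lake
    spark_etl = BashOperator(
        task_id="spark_etl",
        bash_command="spark-submit /opt/etl/orders_transform.py --date {{ ds }}",
    )

    pull_orders >> load_to_lake >> spark_etl
```

Each task here depends on a different runtime (Docker, the AWS CLI, a Spark cluster), which is exactly why every module is also a distinct failure point.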
Data pipelines keep on increasing, and so do the data points
As business and analytics grow, more and more data pipelines get deployed.
This increases the possibility of: pipeline failures, over-utilisation of resources by one pipeline affecting other pipelines, a pipeline taking longer to process and breaking data SLAs because of increased data scale, etc.
More failure points, higher failure probability
By design, data engineering is a consortium of multiple moving components (tools, infra, data pipelines, data sets etc.). The more components there are, the more failure points there are, and the more unstable the data platform becomes.
Multiple components are at the core of data engineering. You can reduce their number to some extent (e.g. going serverless in a few cases to avoid infra issues), but you cannot have a single component doing everything.
So, can a single tool do it all? Not always, and not for complex data needs.
The benefits of a multi-module hybrid system far outweigh those of a single tool. We will discuss these in another article.
Now, how do we build a process to achieve a robust data platform? A systematic approach works towards the following goals:
-Automate recovery as much as possible
-Capture every failure
-Proactively check for data availability, so that even if some failures go uncaptured, issues are still detected.
Set auto recovery
-Use cloud-managed infra components (e.g. Kinesis, RDS, Step Functions etc.) wherever possible, so that infra maintenance effort stays low. This might not be possible in all scenarios due to cost reasons.
-Whenever self-managed components are deployed (e.g. Airflow / Oozie / Kafka on VMs), make sure auto-restart is set for them, e.g. by running them as Linux services with restart-on-failure configured.
-Always set data pipeline nodes to auto-retry mode.
-Any workflow management tool (Airflow, Oozie, Step Functions or others) has a feature to retry after failure.
-One thing required here: data pipeline nodes should be idempotent, so that re-executions do not result in data duplication issues (see the sketch after this list).
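As an illustration, here is a minimal sketch, again assuming Airflow 2.x, of task-level auto-retry combined with an idempotent, overwrite-by-partition load, so that a retry or manual re-run replaces data instead of appending duplicates. The paths and table layout are hypothetical.

```python
# Sketch: auto-retry plus an idempotent (overwrite-by-partition) load.
# Warehouse paths and table layout are hypothetical placeholders.
import shutil
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

default_args = {
    "retries": 3,                          # auto-retry each task up to 3 times
    "retry_delay": timedelta(minutes=5),   # wait 5 minutes between retries
}


def load_partition(ds: str, **_):
    """Idempotent load: rewrite the entire partition for the run date, so a
    retry or re-run replaces the data rather than duplicating it."""
    partition_dir = f"/data/warehouse/orders/dt={ds}"   # hypothetical target
    shutil.rmtree(partition_dir, ignore_errors=True)    # drop any partial output first
    # ... extract the day's data and write it fresh into partition_dir ...


with DAG(
    dag_id="orders_idempotent_load",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    PythonOperator(task_id="load_partition", python_callable=load_partition)
```

The overwrite-by-partition pattern is what makes auto-retry safe: every execution produces the same final state for its date partition, no matter how many times it runs.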
Set failure alerts for data pipelines
– This ensures that all failures are reported.
– Sometimes data pipelines get stuck for long periods of time (due to various possible factors). In such cases, no failure alert gets generated.
– SLA alerts help here. Standard workflow management tools have a feature where, if the expected start / end of a workflow is missed, an alert is raised after waiting for a fixed time duration. A sketch of both failure and SLA alerts follows below.
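Below is a minimal sketch, assuming Airflow 2.x, of a failure callback combined with a per-task SLA, so that both hard failures and stuck or late runs raise an alert. The webhook URL is a hypothetical placeholder for whatever alerting channel you use.

```python
# Sketch: failure alerts plus SLA alerts for a DAG's tasks.
# The webhook URL is a hypothetical placeholder.
import json
import urllib.request
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

ALERT_WEBHOOK = "https://hooks.example.com/data-alerts"   # hypothetical endpoint


def notify(text: str):
    """Post an alert message to a chat / incident webhook."""
    req = urllib.request.Request(
        ALERT_WEBHOOK,
        data=json.dumps({"text": text}).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)


def on_failure(context):
    # Called by Airflow whenever a task instance fails
    notify(f"Task failed: {context['task_instance'].task_id} in {context['dag'].dag_id}")


def on_sla_miss(dag, task_list, blocking_task_list, slas, blocking_tis):
    # Called by Airflow when tasks miss their SLA (e.g. stuck or delayed runs)
    notify(f"SLA missed in {dag.dag_id}: {task_list}")


with DAG(
    dag_id="orders_daily_load_alerts",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    sla_miss_callback=on_sla_miss,                        # alerts for late / stuck runs
    default_args={"on_failure_callback": on_failure},     # alerts for outright failures
) as dag:
    BashOperator(
        task_id="spark_etl",
        bash_command="spark-submit /opt/etl/orders_transform.py --date {{ ds }}",
        sla=timedelta(hours=2),   # alert if the task has not finished 2h into the run
    )
```

The failure callback covers outright errors, while the SLA covers the "stuck for a long time" case described above.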
Set data availability check alerts
– This is an approach to observe data independently of the pipelines.
– Set alerts to check the availability of data at a specified time. E.g. if yesterday’s data is supposed to be populated by 8am, set a check at 8:30am to confirm that the data has actually been populated (take a count and confirm).
– This alert gives the final confirmation that data is available. A sketch of such a check follows below.
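A minimal sketch of such a check, assuming a Postgres-compatible warehouse reachable via psycopg2; the connection details, table name, and schedule are hypothetical. It could run at 8:30am via cron or as a scheduled workflow task.

```python
# Sketch: scheduled data-availability check (run e.g. at 8:30am by cron or a scheduler).
# Connection details and the table name are hypothetical placeholders.
from datetime import date, timedelta

import psycopg2  # assumes a Postgres-compatible analytics warehouse


def check_yesterdays_data() -> None:
    yesterday = date.today() - timedelta(days=1)
    conn = psycopg2.connect(
        host="warehouse.example.com", dbname="analytics",
        user="data_checker", password="***",
    )
    try:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT count(*) FROM fact_orders WHERE order_date = %s",
                (yesterday,),
            )
            row_count = cur.fetchone()[0]
    finally:
        conn.close()

    if row_count == 0:  # could also compare against an expected minimum count
        # Failing loudly here lets the regular failure-alert path pick it up.
        raise RuntimeError(f"No data found in fact_orders for {yesterday}")


if __name__ == "__main__":
    check_yesterdays_data()
```

If the check fails, the same failure-alerting path described above picks it up, which is what makes this an independent safety net on top of pipeline alerts.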
With the above in place, you can be confident that every failure in populating data is captured.
Process to handle alerts
After setting up failure and data check alerts, a new problem arises: too many alerts, which starts resulting in alerts getting ignored.
So, setting alerts is only half the work done. It is equally important to have an alert handling process in place.
A few pointers on this below:
1. Have an Alert Catalog
a. Each alert should have a catalog entry with key points such as:
i. The recovery process for it
ii. The frequency of occurrence
(A sketch of such a catalog entry as a simple data structure is given after this list.)
2. Have a dedicated person / people for handling alerts (Dedicated support responsibility)
a. There are two ways to go about it: give partial support ownership to existing data team members, or build a separate support team with full-time responsibility for handling alerts
b. If recovery processes are documented, then the support team can handle alerts and the core data team is freed to work on other important data backlog
3. How should alerts be handled?
a. Any person handling an alert should refer to the catalog to get the recovery process
b. If no recovery process is mentioned for an alert, the alert is directed to the core data team, who handle it and define the recovery process
4. This way, the system moves towards:
a. A complete catalog of all possible alerts and their recovery processes
b. An independent support system where any support person quickly becomes productive at handling alerts
5. Pick up infrastructure tasks to improve robustness
a. From the alert catalog, prioritise alerts and build automations that provide a permanent fix for them.
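As mentioned in point 1 above, the catalog does not need a heavy tool; a small structured record per alert is enough. Here is a minimal sketch in Python, with illustrative field names and an invented example entry:

```python
# Sketch: an alert catalog entry as a simple data structure.
# Field names and the example values are illustrative, not prescriptive.
from dataclasses import dataclass
from typing import List


@dataclass
class AlertCatalogEntry:
    alert_name: str                       # identifier used in the alert message
    description: str                      # what the alert means
    recovery_steps: List[str]             # the documented recovery process
    frequency: str                        # e.g. "rare", "weekly", "daily"
    owner: str                            # team to escalate to if recovery is undefined
    automation_candidate: bool = False    # flag for point 5: worth a permanent fix?


catalog = [
    AlertCatalogEntry(
        alert_name="orders_daily_load.spark_etl.failure",
        description="Spark ETL task for the daily orders load failed after retries.",
        recovery_steps=[
            "Check the Spark driver logs for the failed date.",
            "If the cluster ran out of memory, rerun with larger executors.",
            "Clear the task to re-run; the load is idempotent per date partition.",
        ],
        frequency="weekly",
        owner="core-data-team",
        automation_candidate=True,
    ),
]
```

Keeping the catalog in a structured form like this also makes point 5 easier: entries flagged as automation candidates become the prioritised backlog for permanent fixes.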
Whatever tools you choose to build your data infrastructure with, the nature of data pipelines means there are going to be multiple failure points.
The goal should be to capture failures and set up auto-recovery for them. The following approach helps:
-Set auto recovery wherever possible (for all known failure points)
-Know about failures early: capture every failure, and capture data unavailability proactively.
-Build an alert handling process that builds up a recovery process repository and naturally leads to automated recovery, resulting in a very stable data platform.
We at dataeaze help our customers move towards completely robust data platforms by setting up the above process, resulting in analytics the business team can rely on!