About the robustness challenge
Once data engineering is automated and the ETL or ELT pipelines are built, the bigger challenge is making that automated data flow robust. In simpler words: ‘whatever data is supposed to be available is available at the expected time (SLA)’.
Data unavailability is unpleasant for every member of a data team. It erodes business trust in the data platform, leads to sleepless nights spent fixing issues and backfilling data, pulls the team into a cycle of ad hoc support tasks, and derails the entire analytics roadmap.
Infra failures as a primary reason
One of the key reasons for ‘not-so-robust’ data platforms is infrastructure and service failures: a full disk, an out-of-memory error, a CPU or server hang, or an important service without HA (e.g. Airflow) going down. If and when these failures go unnoticed, they lead to data unavailability.
Monitoring and alerting can ensure that no such failure goes unnoticed, and a recovery process can restore robustness, but both carry their own maintenance cost in people, effort, and time.
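To make that maintenance cost concrete, here is a minimal sketch of just one of the checks a self-managed platform has to own: a disk-usage alert. The function name `disk_usage_alert` and the 90% threshold are illustrative assumptions, not part of any particular monitoring tool. A real setup would also need scheduling, alert routing, and a recovery step.

```python
import shutil


def disk_usage_alert(path="/", threshold=0.9):
    """Return an alert message if the disk holding `path` is at or above
    `threshold` (a fraction between 0 and 1), otherwise return None.

    Hypothetical helper for illustration; the threshold is an assumption.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    used_fraction = usage.used / usage.total
    if used_fraction >= threshold:
        return f"ALERT: {path} is {used_fraction:.0%} full"
    return None


if __name__ == "__main__":
    message = disk_usage_alert("/")
    print(message or "disk usage is below the threshold")
```

Every such check (memory, CPU, service health) has to be written, scheduled, and kept working by someone on the team, which is the cost the paragraph above refers to.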
Data Engineering platform on serverless cloud
Running the data infra on a serverless cloud reduces infra failures, resulting in much more stable data automation.
Data engineering on the serverless cloud:
The tools described in the next section (S3, Lambda, Step Functions, ECS / EMR, CloudWatch), working together, provide a very powerful data engineering stack.
The stack combines serverless pay-per-use pricing, the robustness of serverless, the scalability of the cloud, and the freedom to customize, so any data use case can be built with any available tool or library.
How does it help with improved robustness?
S3: Provides storage and retrieval with no servers, disks, or throughput bottlenecks to manage. It offers secure connectivity from anywhere and freedom from infra issues.
Lambda: Executes the connectors with no infrastructure to allocate.
Step Functions: Coordinates workflows. A workflow tool is the single point through which all jobs are triggered, so running it as a serverless service avoids a single point of failure.
ECS / EMR through Step Functions, monitored through CloudWatch: Compute is brought up on demand, executes the job, and is then terminated. Monitored and controlled execution results in robust workflow management.
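As an illustration of how these pieces can fit together, the sketch below builds an Amazon States Language definition as a plain Python dict: a Lambda connector step followed by an on-demand ECS (Fargate) step, each with retries and a catch-all failure state. The state names and the three ARNs are hypothetical placeholders, and this is only one possible layout, not the framework mentioned later in this post.

```python
import json

# Hypothetical placeholders: replace with real ARNs from your own account.
CONNECTOR_LAMBDA_ARN = "arn:aws:lambda:REGION:ACCOUNT_ID:function:ingest-connector"
ECS_CLUSTER_ARN = "arn:aws:ecs:REGION:ACCOUNT_ID:cluster/data-jobs"
ECS_TASK_DEFINITION_ARN = "arn:aws:ecs:REGION:ACCOUNT_ID:task-definition/transform-job"


def build_pipeline_definition():
    """Return a two-step pipeline as an Amazon States Language dict."""
    # Retry a failed task up to 3 times with exponential backoff.
    retry_policy = [{
        "ErrorEquals": ["States.TaskFailed"],
        "IntervalSeconds": 30,
        "MaxAttempts": 3,
        "BackoffRate": 2.0,
    }]
    # Anything that still fails after the retries ends in an explicit Fail state.
    catch_all = [{"ErrorEquals": ["States.ALL"], "Next": "PipelineFailed"}]

    return {
        "Comment": "Sketch: Lambda connector followed by an on-demand ECS job",
        "StartAt": "RunConnector",
        "States": {
            "RunConnector": {
                "Type": "Task",
                "Resource": "arn:aws:states:::lambda:invoke",
                "Parameters": {
                    "FunctionName": CONNECTOR_LAMBDA_ARN,
                    "Payload.$": "$",  # pass the workflow input to the Lambda
                },
                "Retry": retry_policy,
                "Catch": catch_all,
                "Next": "RunTransform",
            },
            "RunTransform": {
                "Type": "Task",
                # ".sync" makes the workflow wait until the ECS task finishes;
                # the task is started on demand and stops when the job is done.
                "Resource": "arn:aws:states:::ecs:runTask.sync",
                "Parameters": {
                    "LaunchType": "FARGATE",
                    "Cluster": ECS_CLUSTER_ARN,
                    "TaskDefinition": ECS_TASK_DEFINITION_ARN,
                    # A real Fargate task also needs a NetworkConfiguration
                    # block (subnets, security groups); omitted in this sketch.
                },
                "Retry": retry_policy,
                "Catch": catch_all,
                "End": True,
            },
            "PipelineFailed": {
                "Type": "Fail",
                "Error": "PipelineFailed",
                "Cause": "A pipeline step failed after all retries",
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_pipeline_definition(), indent=2))
```

The printed JSON could be supplied as the definition when creating a state machine. Because each run and each state transition is recorded by the service, a failure ends in a visible `PipelineFailed` state rather than going unnoticed.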
To summarise, going ‘serverless on a cloud’ leads to much more stable infra, reduced maintenance effort, and a robust platform right from day 1. Less fatigue gives the data team time to focus on more important data tasks and, very importantly, gives data leaders peace of mind.
What is the negative side of this?
Making all these different serverless tools work together takes effort of its own. We at dataeaze have a framework that simplifies this and makes these tools work together seamlessly on AWS and Azure Cloud. We have enabled multiple robust data platforms on serverless cloud.
Note:
There are other primary benefits of going serverless such as overall resource and cost optimization, which we will discuss in a separate post.
Also, there are other aspects of achieving a more robust platform, such as better alerting, monitoring, and an auto-recovery framework, which we will also cover in a separate post.