There are many questions about data engineering that we come across, but this particular one has turned out to be an important architecture design question recently. Let’s try to shed some light on it for better understanding:
What is ELT and ETL?
ELT is Extract from source, Load to the data mart, and then Transform every time, per end use case.
ETL is Extract from source, Transform to bring the data into the format best suited for analytics and reporting, then Load to the data mart which serves end use cases.
In both cases, Load refers to loading into the data mart that serves end use cases, not into an intermediate raw data lake.
Is ELT better than ETL?
ELT makes the flow simpler, as you load data directly into the data mart. Any transforms for deriving KPIs or correlating data are done directly on the data mart, per use case / query.
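The ELT flow above can be sketched in a few lines. This is a minimal illustration, not any particular tool’s API: sqlite3 stands in for the data mart, the table and column names are made up, and the point is only that raw rows are loaded as-is and the KPI aggregation runs at query time, every time.

```python
import sqlite3

# ELT sketch: sqlite3 stands in for a real data mart / warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_events (user_id TEXT, amount REAL, country TEXT)")

# Extract + Load: raw records go into the mart untouched.
rows = [("u1", 10.0, "US"), ("u2", 5.0, "US"), ("u1", 7.5, "IN")]
conn.executemany("INSERT INTO raw_events VALUES (?, ?, ?)", rows)

# Transform at query time: the KPI is re-computed on every run,
# which is what ties response time to the mart's capabilities.
kpi = conn.execute(
    "SELECT country, SUM(amount) FROM raw_events GROUP BY country ORDER BY country"
).fetchall()
print(kpi)  # [('IN', 7.5), ('US', 15.0)]
```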
What are the Pros of Using ELT?
No pre-defined transforms: flow definition is easy and quick, which reduces development turnaround time.
No transforms to maintain: enhancements are also quick and easy, improving maintainability.
On the whole, it is about reduced development and maintenance effort, as well as reduced turnaround time for the desired reports.
What are the Cons of Using ELT?
As data is loaded into the data mart in its raw format, all the heavy lifting [correlation (across different data sets), enrichment, and aggregation (for KPI calculation / quality improvement)] must be done in the data mart. This makes important data guarantees dependent on the capabilities of the data mart:
Interactive response to analytics and reporting queries is the most desired feature. In ELT, it depends entirely on the data mart.
Supporting higher data volumes while keeping query response time guarantees intact becomes challenging and depends on the capability of the data mart.
Supporting complex transformations (like ML inference or custom complex KPIs) again depends on the processing capabilities the data mart provides.
This results in:
Selecting a costly data warehouse that provides the required processing capabilities and real-time response guarantees at large data scale,
OR compromising on response time guarantees for some complex analytics use cases,
OR, in a few cases, it may just not work.
Where does ETL help?
ETL architecture allows a raw data lake and a powerful transform layer before the Load to the data mart, which opens possibilities to address the following issues:
The transform layer provides freedom in data formats and data processing capabilities across structured / unstructured / semi-structured data.
It provides flexibility for complex data processing, allowing any transformation in between (like ML model inference / model training / complex KPIs such as uniques).
The architecture allows pre-aggregations and a data mart schema suited to use case needs, optimising the scale of data loaded into the data mart. This reduces the overall scale requirements of the reporting data marts.
Typically, raw data lakes are low-cost, highly scalable storage systems. This gives the freedom to collect and store all possible data at the lowest possible cost.
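The ETL shape described above can be sketched the same way. Again this is illustrative only (sqlite3 as a stand-in mart, made-up names): the pre-aggregation happens in the transform layer before Load, so the mart stores only use-case-ready KPI rows.

```python
import sqlite3
from collections import defaultdict

# Extract: raw records from the source / raw data lake.
raw = [("u1", 10.0, "US"), ("u2", 5.0, "US"), ("u1", 7.5, "IN")]

# Transform: pre-aggregate revenue per country *outside* the mart.
totals = defaultdict(float)
for _user, amount, country in raw:
    totals[country] += amount

# Load: only the aggregated rows reach the mart (2 rows, not 3 raw rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kpi_revenue (country TEXT, revenue REAL)")
conn.executemany("INSERT INTO kpi_revenue VALUES (?, ?)", sorted(totals.items()))
result = conn.execute("SELECT * FROM kpi_revenue ORDER BY country").fetchall()
print(result)  # [('IN', 7.5), ('US', 15.0)]
```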
ELT architecture enables reporting analytics with less effort and quick turnaround time.
ETL architecture is necessary to optimise cost and to retain full flexibility in data format and processing.
So the choice is not either-or; rather, strategically leverage both per business needs.
Data engineering architecture, ELT / ETL: which one to choose and when?
Smaller data scale and ad hoc analytics: load directly into the data mart for quick turnaround and flexible analytics.
For smaller data sets, or if you are in the initial phase of data engineering automation and expect quick turnaround time, ELT is a good choice: raw data is loaded directly into the reporting data mart.
Another variant of this is to have your analytics use cases served directly from primary data stores.
Who is it suitable for?
A data engineering stack with smaller data sizes.
A data mart for ad hoc data analysis, which should enable any custom analytics on all possible data.
For static reports / dashboards on larger data volumes, introduce pre-aggregations to optimise data mart cost. This improves the robustness of SLAs.
ELT works well for smaller volumes, but as data volume scales up, it impacts response time SLAs and also increases cost, as all raw data is loaded into the data mart as is.
For a fixed set of reports / dashboards, it is always possible to define a data mart schema containing KPIs at the required dimensions. ETL helps here with pre-aggregations, with the overall benefits of:
Reduced data volume
Reduced data mart cost
A more robust system with control over response time SLAs
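The volume reduction from pre-aggregation is easy to see with a toy example (the figures below are made up purely for illustration): per-event rows collapse to one KPI row per (date, country) combination, and only the collapsed rows are loaded into the mart.

```python
from collections import defaultdict

# Made-up raw event rows: (date, country, event_count).
raw_events = [
    ("2024-01-01", "US", 1), ("2024-01-01", "US", 1),
    ("2024-01-01", "IN", 1), ("2024-01-02", "US", 1),
    ("2024-01-02", "US", 1), ("2024-01-02", "US", 1),
]

# Pre-aggregate to the dimensions the fixed reports need.
agg = defaultdict(int)
for day, country, n in raw_events:
    agg[(day, country)] += n

print(len(raw_events), "raw rows ->", len(agg), "pre-aggregated rows")
# 6 raw rows -> 3 pre-aggregated rows
```

On real traffic, where millions of events share the same dimension values, the same collapse shrinks the load by orders of magnitude rather than by half.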
For ad hoc analytics, it is always good to have ELT into a data lake, as it keeps all fields available for analytics.
So, a mature data stack should have elements of both ELT and ETL.
ELT: connect the ad-hoc / BI layer on raw data directly, whether in the data mart (for medium / small volumes) or on the raw data lake (for larger volumes). E.g., raw data loaded directly into Snowflake and analysed through BI tools, or BI tools connected to data stored in S3 / Hive on HDFS.
ETL: have advanced transform capabilities that allow pre-aggregation, combination, and enrichment of data, so that use-case-specific data is prepared and loaded into the data mart. This reduces overall data mart size, ensures interactive response time guarantees, ensures overall cost optimisation, and enables complex analytics that would otherwise not be possible with ELT.