There are many questions about data engineering that we come across, but this particular question has turned out to be an important architecture design question recently. Let’s try to shed some light on it for a better understanding:
ELT is Extract from the source, Load into the data mart, and then Transform at query time, for every end use case.
ETL is Extract from the source, Transform to bring the data into the format best suited for analytics and reporting, then Load into the data mart which serves the end use cases.
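To make the two flows concrete, here is a minimal sketch using an in-memory SQLite database as a stand-in data mart. The table names, column names, and sample rows are all illustrative assumptions, not part of any real system:

```python
import sqlite3

# Hypothetical source rows (day, region, amount); names are illustrative.
SOURCE_ROWS = [
    ("2024-01-01", "EU", 120.0),
    ("2024-01-01", "US", 300.0),
    ("2024-01-02", "EU", 80.0),
]

def elt(mart: sqlite3.Connection) -> None:
    """ELT: load raw rows as-is; transform later, at query time."""
    mart.execute("CREATE TABLE raw_sales (day TEXT, region TEXT, amount REAL)")
    mart.executemany("INSERT INTO raw_sales VALUES (?, ?, ?)", SOURCE_ROWS)

def etl(mart: sqlite3.Connection) -> None:
    """ETL: transform first (aggregate per region), then load the result."""
    totals: dict[str, float] = {}
    for _day, region, amount in SOURCE_ROWS:
        totals[region] = totals.get(region, 0.0) + amount
    mart.execute("CREATE TABLE sales_kpi (region TEXT, total REAL)")
    mart.executemany("INSERT INTO sales_kpi VALUES (?, ?)", totals.items())

elt_mart = sqlite3.connect(":memory:")
elt(elt_mart)
# In ELT, the aggregation runs inside the mart, for every query.
eu_total = elt_mart.execute(
    "SELECT SUM(amount) FROM raw_sales WHERE region = 'EU'"
).fetchone()[0]

etl_mart = sqlite3.connect(":memory:")
etl(etl_mart)
# In ETL, the mart already holds the pre-computed KPI.
eu_kpi = etl_mart.execute(
    "SELECT total FROM sales_kpi WHERE region = 'EU'"
).fetchone()[0]
print(eu_total, eu_kpi)  # both 200.0
```

Both approaches answer the same question; the difference is where and when the aggregation work happens.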
In both cases, Load refers to loading into the data mart that serves the end use cases, not into an intermediate raw data lake.
ELT makes the flow simpler, as data is loaded directly into the data mart. Any transforms for deriving KPIs or correlating data sets are done directly on the data mart, on a per-use-case/per-query basis.
On the whole, this means reduced development and maintenance effort as well as a reduced turnaround time for the desired reports.
As data is loaded into the data mart in its raw format, all the heavy lifting (correlation across different data sets, enrichment, and aggregation for KPI calculation and quality improvement) needs to be done at the data mart. This makes important data guarantees, such as holding response time SLAs and controlling cost as raw volume grows, dependent on the capabilities of the data mart.
ETL architecture adds a raw data lake and a powerful transform layer before the Load to the data mart, which opens up possibilities to address these issues.
ELT architecture enables reporting analytics with less effort and a quick turnaround time.
ETL architecture is necessary to optimise cost and to retain full flexibility over data format and processing.
So the choice is not either-or; rather, strategically leverage both per business needs.
Smaller data scale and ad hoc analytics: load directly into the data mart for quick turnaround and analytical flexibility.
For smaller data sets, or if you are in the initial phase of data engineering automation and need a quick turnaround, ELT is a good choice: raw data is loaded directly into the reporting data mart.
Another variant of this is to have your analytics use cases served directly from the primary data stores.
This fits a data engineering stack for smaller data sizes, and a data mart for ad hoc data analysis which should enable any custom analytics over all possible data.
ELT works well for smaller volumes, but as data volume scales up it impacts response time SLAs and increases cost, since all raw data is loaded into the data mart as-is.
For a fixed set of reports/dashboards, it is always possible to define a data mart schema which contains KPIs at the required dimensions. ETL helps here with pre-aggregation, with the overall benefit of a smaller, cheaper data mart and predictable query response times.
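As a rough sketch of why pre-aggregation shrinks the mart: if the reports only ever slice by day and region, a large raw event table collapses into one row per (day, region) cell. The data below is randomly generated and the dimension names are assumptions for illustration:

```python
import random
from collections import defaultdict
from datetime import date, timedelta

random.seed(0)

# Hypothetical raw events: one row per transaction (day, region, amount).
days = [date(2024, 1, 1) + timedelta(days=i) for i in range(30)]
regions = ["EU", "US", "APAC"]
raw = [
    (random.choice(days), random.choice(regions), random.uniform(1.0, 500.0))
    for _ in range(100_000)
]

# Transform step: pre-aggregate to the fixed report dimensions (day x region).
kpi = defaultdict(lambda: [0, 0.0])  # (day, region) -> [txn_count, revenue]
for day, region, amount in raw:
    cell = kpi[(day, region)]
    cell[0] += 1
    cell[1] += amount

# Load step would write only these rows to the mart.
mart_rows = [
    (d.isoformat(), r, count, round(revenue, 2))
    for (d, r), (count, revenue) in sorted(kpi.items())
]

# 100,000 raw rows shrink to at most 30 days x 3 regions = 90 mart rows.
print(len(raw), "->", len(mart_rows))
```

The mart then answers report queries from at most 90 rows instead of scanning 100,000, which is where the cost and response-time benefits come from.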
For ad hoc analytics, it is always good to have ELT into a data lake, as it keeps all fields available for analysis.
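A minimal sketch of that load step, assuming a file-based lake with a date-partitioned layout (the paths, partition scheme, and record fields are illustrative assumptions): records are written verbatim, so fields no current report uses remain available for future ad hoc questions.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical raw payloads; note the extra fields (device, campaign)
# that no current report needs but future analytics might.
records = [
    {"id": 1, "region": "EU", "amount": 120.0, "device": "ios"},
    {"id": 2, "region": "US", "amount": 300.0, "campaign": "spring"},
]

# Assumed lake layout: <lake>/<dataset>/dt=<partition-date>/part-*.jsonl
lake = Path(tempfile.mkdtemp()) / "lake" / "sales" / "dt=2024-01-01"
lake.mkdir(parents=True)

# ELT load step: write each record exactly as received, one JSON per line.
part = lake / "part-0000.jsonl"
with open(part, "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# Every original field survives the load and can be queried later.
reloaded = [json.loads(line) for line in part.read_text().splitlines()]
print(sorted(reloaded[1].keys()))  # ['amount', 'campaign', 'id', 'region']
```

Because nothing is projected away at load time, a new question about, say, device mix can be answered from the lake without re-extracting from the source.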