The new Cloud data pipelines: The principles of ELT and lots of SQL


For years, ETL tools like BODS, DataStage, Informatica, Ab Initio, and Talend dominated the data pipeline building landscape. 
This dominance reflected a window of opportunity:

  • The hardware constraints and limited performance of on-premises Data Warehouses argued for moving transformations onto dedicated third-party servers, hence the lesser success of ELT tools (Oracle Data Integrator, ODI, is one example).
  • Companies wanted to democratize the creation of data pipelines with business-oriented tools.


But cloud architectures have changed the game, particularly due to the power of cloud data warehouses.

Are we witnessing a massive, but SQL-boosted, revival of the principles that were at work in ELT?

If so, how can we be part of this dynamic?

 

Why are companies abandoning ETL logic in the Cloud?

Historical ETL technologies revolutionized the world of data, but their design relies on paradigms that now have limitations:

  • Optimizing ETL chains and parallelizing processes is neither simple nor intuitive, and depends essentially on the tool.
  • The workflows designed this way work well with relatively "static" systems. In the era of the "data-driven" enterprise, however, systems grow and change constantly.
  • Specialists in specific ETL tools are becoming rare (and the same is true for ELT), which is problematic for maintenance! Especially since everyone has been "tuning" the ETL tools, with countless nested features that create opacity.

Moving towards new data pipeline tools: combining power and SQL!

Low-cost power 

 

  • The transformations take place within the Data Warehouse as with the old ELTs, but this time in the Cloud: the native performance of Cloud DWHs (BigQuery, Snowflake, Redshift) scales almost without limit, which boosts the performance of these tools (Matillion, dbt, Airbyte, Fivetran, etc.), as in the sketch after this list.
  • These data pipeline tools are also more accessible in terms of cost: they often rely on open source modules (such as dbt).
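A minimal sketch of this ELT pattern, assuming illustrative table and column names: the data has already been loaded raw into the warehouse, and the transformation is plain SQL executed by the warehouse engine itself.

    -- ELT pattern sketch: transform already-loaded raw data inside the warehouse.
    -- Table and column names are illustrative assumptions.
    CREATE OR REPLACE TABLE analytics.customers_clean AS
    SELECT
      customer_id,
      TRIM(LOWER(email)) AS email,
      created_at
    FROM raw.customers
    WHERE email IS NOT NULL;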

SQL as a technical pivot

 

  • The SQL transformations encapsulated in these data pipeline tools are open, modularized, and designed for collaboration: versioning, reusable models, etc. (see the model sketch after this list).
  • SQL (or its dialects), which forms the technical basis of these data pipeline tools, is mastered by most IT teams, which in times of "talent" shortage is not insignificant!
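For instance, in a dbt-style project each transformation is a plain SQL file, versioned in Git and reusable by other models through ref(); all names below are illustrative.

    -- models/customer_lifetime_value.sql (dbt-style model; names are illustrative)
    -- ref() resolves to another versioned model, making dependencies explicit.
    SELECT
      c.customer_id,
      SUM(o.amount) AS lifetime_value
    FROM {{ ref('customers_clean') }} AS c
    LEFT JOIN {{ ref('orders') }} AS o
      ON o.customer_id = c.customer_id
    GROUP BY c.customer_id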

Is migrating ETL jobs to a Cloud data pipeline tool "challenging"?

 

For example, imagine a company (like some we know) that has accumulated more than 10,000 ETL jobs built over several years, with complex transformations: joins, aggregations, or even specific business rules...


Migrating ETL jobs to a Cloud data pipeline tool requires a complete overhaul of their execution, especially when adopting an SQL approach.

 

Automation is essential.

 

80% of the effort = regaining control over your existing system. How?

Perform a comprehensive and automated inventory of the source system with {openAudit}

 

  • Through an analysis of internal processes via physical data lineage, in the field, in the source database and in the ETL: analysis of views, nested views, and other scripts involved in feeding the flows.
  • Through an analysis of information usage, based on the database audit logs, covering both data consumption and injection (see the log-analysis sketch after this list).
  • By analyzing the impacts on the data visualization tools associated with the source database, to understand the related complexity (calculation rules) and enable true end-to-end data lineage.
  • Through an analysis of task scheduling, in order to link it to data lineage and data usage.
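To give a concrete idea of this kind of log analysis (a hand-rolled sketch, not {openAudit}'s internal method), here is how one could count, on BigQuery, which tables have actually been read over the last 90 days:

    -- Which tables were actually read in the last 90 days? (BigQuery example)
    SELECT
      ref.dataset_id AS dataset_id,
      ref.table_id   AS table_id,
      COUNT(*)       AS reads_last_90_days
    FROM `region-us`.INFORMATION_SCHEMA.JOBS,
         UNNEST(referenced_tables) AS ref
    WHERE job_type = 'QUERY'
      AND creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 90 DAY)
    GROUP BY 1, 2
    ORDER BY reads_last_90_days DESC;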

Optimize the scope of what needs to be migrated by cross-referencing usage with data lineage using {openAudit}

 

  • To reduce the migration effort by identifying living branches and dead branches to be decommissioned upstream (a simplified sketch follows this list).
  • To define a roadmap by business domain, by type of flow, etc.
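As a simplified sketch of the cross-referencing, assume two hypothetical tables produced by the previous steps: lineage_edges (one row per source-to-target dependency) and table_usage (read counts from the log analysis).

    -- Hypothetical sketch: tables that are never read and feed nothing downstream
    -- are candidate dead branches. lineage_edges(source_table, target_table) and
    -- table_usage(table_name, reads_last_90_days) are assumed inputs.
    SELECT u.table_name
    FROM table_usage AS u
    LEFT JOIN lineage_edges AS l
      ON l.source_table = u.table_name   -- does anything consume this table?
    WHERE u.reads_last_90_days = 0
      AND l.target_table IS NULL;        -- nothing downstream depends on it

A real analysis has to walk the lineage graph transitively; this sketch only checks the last hop.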
 

 

Migration: it absolutely has to involve SQL!

Converting ETL logic into SQL source code using {openAudit}

 

  • Sequencing of elementary transformations: {openAudit} detects the sources, targets, filters, and transformation logic of all ETL jobs and subjobs in the source, and isolates the transformations that will be carried over into the target tool.
  • Conversion to SQL: these ETL transformations are converted into SQL queries by {openAudit}, step by step (see the sketch after this list).
  • Control points: each SQL step corresponds to a step in the source job, which allows {openAudit} to introduce real control points (which do not exist in ETL).
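As a hypothetical illustration of the principle (all names are invented): an ETL job made of a filter, a lookup join, and an aggregation becomes a chain of elementary SQL steps, each materialized so that it can be reconciled with the corresponding step of the source job.

    -- Step 1: source filter (mirrors the ETL job's input filter).
    CREATE TABLE stg_orders_filtered AS
    SELECT * FROM raw_orders WHERE status <> 'CANCELLED';

    -- Step 2: lookup join (mirrors the ETL job's enrichment step).
    CREATE TABLE stg_orders_enriched AS
    SELECT o.*, c.segment
    FROM stg_orders_filtered AS o
    JOIN dim_customers AS c ON c.customer_id = o.customer_id;

    -- Step 3: aggregation (mirrors the ETL job's output).
    CREATE TABLE fct_orders_daily AS
    SELECT order_date, segment, SUM(amount) AS total_amount
    FROM stg_orders_enriched
    GROUP BY order_date, segment;

    -- Control point: each intermediate table can be checked (row counts,
    -- checksums) against the matching step of the source ETL job.
    SELECT COUNT(*) AS step1_rows FROM stg_orders_filtered;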

Conversion of ETL logic into the open models of a data pipeline tool with {openAudit}, plus tests:

 

  • SQL transformations are directly integrated into models of the target tool. The generated SQL can be adjusted by {openAudit} to fit the target syntax.
  • The models are then added to the project configuration, the dependencies between models are defined, etc.
  • The models are tested to verify their functionality.
  • The data can be tested to ensure its integrity (a minimal test is sketched below).
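For example, in a dbt-style project a data test is itself just a SQL file that must return zero rows to pass; the model and column names are illustrative.

    -- tests/assert_no_negative_amounts.sql (dbt-style "singular test").
    -- The test fails if this query returns any rows.
    SELECT *
    FROM {{ ref('fct_orders_daily') }}
    WHERE total_amount < 0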
 
