ETL or ELT Where to Go?

Well gone are the days when Developers fights for C vs C++ or C++ vs Java and even on compilers. As time changes we argue more on applications and what are available solution and how one is better than other available option.

Until recently transformation was supposed to be outside the data warehouses as they were not considered capable enough to handle complex and extensive business transformations and business mappings. Extract, Transform and Load or to be precise ETL was the only option explored and considered effective way to transform the data and load processed information into data warehouse.

In the ETL field multiple tools like IBM InfoSphere DataStage,
Ab Initio, Informatica Power Center and many others played a major role and established a new segment involving the data transfer between source systems and data warehouse. Everything was going fine for these products and they were competing on features, performance rather than real innovations for many years.

On a sidelines because of new vendors coming with advanced technologies combined with specific hardware a new term "push-down" came into existence. In simple terms Extract, Load and then Transform or ELT. It means push part of or all the Business Transformation to the database. Netezza changed the DW market completely. It forced people to think about innovations into current ETL model as well and opened another segment commonly known as ELT. Netezza consolidated analytic activity in the appliance, right where the data resides.

This change in the ETL segment started the debate shall I use ETL or ELT. What is better? Which tool is providing me a better options? Is DataStage providing Push Down or Informatica has better options and so on.

In this process we simply forgot the reason why we have DW. DW is meant to have a base foundation which provides me a Standard Data which I can use in my reports or analysis instead of start transforming it further. It is kind of deviating from concept of DW as then different people has different data as interpretation is different. Pushing Business Transformations or Business Logic to lower layer complicates Business intelligence (BI) as it depends on multiple resources or in-house 3rd party tools.

Netezza still lacks many features and in the time being ETL tools started some push down mechanisms. Another one which is capable of doing same is Teradata which is capable of doing large scale business transformation on the machine. Isn't IBM's Balanced Optimizer does something like that, re-arrange the queries and push-down them to DW and give a mix of ETL and ELT based on requirement, dynamically deciding what is best.

Isn't this ELT approach came much before Netezza as Before and After SQL in DataStage and similar in Infomatica allows user to execute Stored Procedures. If we are doing transformation via Stored Procedure it is nothing but ELT even though via ETL.

In ETL what we do is we do all kind of transformations before we load consumption ready data into the target. In ELT approach, we will do some transformation prior to the load, and then load it in staging tables or temporary tables. Post this execute steps to do jojns, sorting or even indexes and make data available for further consumption. It is always better to avoid any extra trip to DW but it should be optimal enough as if I am saving one trip but spending 4 times extra in doing some processing then it is not worth of saving.

ELT is definitely a possible solution when ETL becomes a bottleneck but can't be a complete solution. Push Down mechanism can be used in specific cases but not necessarily a blind go to this approach. Lets take a small example, I received customer issues with huge log files. Now it is not really possible to use some DB query and process it. Possible option is use some ETL Tool and split the file based on requirement and process it. On the other hand if I have to join to fields and get a specific outcome which further relies on aggregation of other 2 fields then definitely ELT mode works faster.

I can still achieve some amount of ELT even via DataStage push down approach but can't use Netezza or Teradata ELT capabilities to transform my log file consist of varied length data. ELT also has issue when data is coming from different sources as then I need to rely on completion before do a report. ETL vs ELT is dependant on want you need to do with the data. Person deciding should understand what DWH contains and how big the data is and what is time frame looking for processing as at end of the day every thing cost and we rely on ROI. We should explore best and use the same and if it requires combination so be it.

Please do comment and give feedback to take this discussion forward.

Disclaimer: "The postings on this site are my own and don't necessarily represent IBM's positions, strategies or opinions."