Old-fashioned ETL processing has been ghosted, and deservedly so. In this method, “E” stands for extracting the data, “T” for transforming it, and “L” for loading it. This labor-intensive and often time-consuming process resulted in either too many cooks in the kitchen (higher costs and occasional inconsistencies) or a lone wolf performing the entire ETL process (when was that deadline again?). So, what’s so bad about ETL? Well, in either of the scenarios above, custom code is written to work with a variety of legacy data systems, and countless transformations are performed just to get the data to merge and work together. The real kicker is that none of those billable hours produce any tangible or informative business intelligence. They just make the data clean enough for someone to actually analyze it and look for statistical insights.
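To picture what those billable hours look like, here is a minimal sketch of the kind of hand-rolled ETL code the old approach demands. The file names, column names, and pandas-based approach are illustrative assumptions, not anything from a specific vendor or tool:

```python
import pandas as pd

# Extract: pull raw exports from two hypothetical legacy systems.
crm = pd.read_csv("legacy_crm_export.csv")
erp = pd.read_csv("legacy_erp_export.csv")

# Transform: reconcile names, types, and formats by hand so the sets can merge.
crm = crm.rename(columns={"CustID": "customer_id", "Amt": "amount"})
erp = erp.rename(columns={"customer": "customer_id", "order_total": "amount"})
crm["amount"] = pd.to_numeric(crm["amount"], errors="coerce")
erp["amount"] = pd.to_numeric(erp["amount"], errors="coerce")
combined = pd.concat([crm, erp], ignore_index=True).dropna(subset=["customer_id"])

# Load: park the cleaned result in a staging area; the analysis happens elsewhere, later.
combined.to_csv("staging/combined_orders.csv", index=False)
```

All of that work, and still not a single insight: the analysis has to happen somewhere else, after the fact.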
The point is that, with modern agile data engineering services available, legacy ETL is dead in the water, especially when it comes to moving to the cloud. Enter modern data integration tools! Modern agile data engineering and DataOps have pulled a 180 in this field. Instead of treating “data preparation” and “data analytics and querying” as two separate realms, we now see the two merged into one. Essentially, the engine for merging and combining data sets is baked right into the distributed storage/compute cluster that performs the analytics and queries.
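As a concrete illustration, here is a minimal sketch of prep and analytics living in the same distributed engine, using PySpark as one representative example (the data sources and column names are assumptions made for the sake of the sketch):

```python
from pyspark.sql import SparkSession

# One distributed engine handles both the "prep" and the "analytics".
spark = SparkSession.builder.appName("prep_and_query").getOrCreate()

# Prep: read two raw sources and merge them right on the cluster.
orders = spark.read.csv("raw/orders.csv", header=True, inferSchema=True)
customers = spark.read.json("raw/customers.json")
sales = orders.join(customers, on="customer_id", how="inner")

# Analytics: query the merged data in place; there is no separate ETL hop.
sales.createOrReplaceTempView("sales")
spark.sql("""
    SELECT region, SUM(amount) AS total_revenue
    FROM sales
    GROUP BY region
    ORDER BY total_revenue DESC
""").show()
```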
The key difference lies in the underlying execution engines. Because agile data engineering platforms are independent of the underlying execution engine, the old headache of rewriting custom code every time a data pipeline moves from one platform to another no longer slows the business down. If you’re not convinced yet, let’s look at a side-by-side comparison of agile data engineering platforms and old-fashioned ETL.
Vendor One has a proprietary engine, and it’s their way or the highway. Wait, you have another vendor that collects things differently? Looks like you’ll need to clean and merge the datasets in-house before they can be analyzed. Not only are you paying each of these one-size-fits-all vendors, you’re also paying to combine the two data sets, which can be time-consuming and costly.
Improvements to a proprietary engine tend to arrive only when the company that owns it decides it needs them. If nothing is hurting on their end (meaning their profits), why would they change anything?
FULL. PLATFORM. INDEPENDENCE. Stop rebuilding the same data analysis tooling over and over for every new pipeline. Keep things efficient, adaptive, and portable.
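To make that independence concrete, here is a minimal sketch using Apache Beam as one example of an engine-agnostic framework (the framework choice, file names, and field layout are illustrative assumptions): the pipeline is defined once, and the execution engine is swapped by changing the runner.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# The same pipeline definition runs on different engines: swap "DirectRunner"
# for "SparkRunner", "FlinkRunner", or "DataflowRunner" without rewriting the logic.
options = PipelineOptions(runner="DirectRunner")

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("orders.csv", skip_header_lines=1)
        | "Parse" >> beam.Map(lambda line: line.split(","))
        | "KeyByCustomer" >> beam.Map(lambda f: (f[0], float(f[2])))
        | "SumPerCustomer" >> beam.CombinePerKey(sum)
        | "Format" >> beam.Map(lambda kv: f"{kv[0]},{kv[1]:.2f}")
        | "Write" >> beam.io.WriteToText("totals")
    )
```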
Effort, time, and money: these are the ingredients of advancement, and they now move at the speed of business as a whole rather than the speed of one business. Distributed computing engines are shared in the open-source community and adapted to countless use cases, so flexibility and adaptability are constantly growing.