This process developed in response to the challenge of legacy data collection and storage systems, most of which were housed locally on private servers and lacked compatibility (at least quick compatibility) with the other legacy programs a company might be running. Hence, the original ETL scheme involved a small group of people (or, in some cases, a lone individual) charged with migrating massive CSV files out of several legacy storage systems, performing ad hoc queries and transformations on each data set, and handwriting customized code against each initial data set to find a common variable on which to merge them.
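For a sense of what that hand-written merge code typically looked like, here is a minimal sketch in Python with pandas, assuming two CSV exports that happen to share a customer ID. The file names, column names, and cleanup steps are illustrative assumptions, not anything from a specific system.

```python
# A minimal sketch of the kind of hand-written merge code described above.
# File names and the shared key ("customer_id") are illustrative assumptions.
import pandas as pd

# Each CSV is an export pulled from a different legacy system.
crm = pd.read_csv("crm_export.csv")
billing = pd.read_csv("billing_export.csv")

# Ad hoc cleanup so the two exports agree on the join key's name and type.
billing = billing.rename(columns={"cust_id": "customer_id"})
crm["customer_id"] = crm["customer_id"].astype(str)
billing["customer_id"] = billing["customer_id"].astype(str)

# Merge on the one variable the data sets have in common, then write it back out.
merged = crm.merge(billing, on="customer_id", how="inner")
merged.to_csv("merged_output.csv", index=False)
```

Multiply this by every pair of systems and every slightly different export format, and the maintenance burden becomes clear.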
The worst part? These are heavy workloads with very little return on the cost. None of that time produced any actionable knowledge or game-changing intelligence; it was spent solely on getting data into a workable format.
Things have advanced a bit, with more modern data architectures built on Hadoop and Spark; however, the same primary challenges remain. For example, even though the architecture feels less ad hoc, you still need a migration strategy to extract data from legacy systems and load it into the data lake for transformation and analysis, whether the target is an on-site data center or a cloud platform.
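As a rough illustration of that extract-and-load step, here is a minimal PySpark sketch that pulls a table out of a legacy relational system over JDBC and lands it in the data lake as Parquet. The connection details, table name, and lake path are illustrative assumptions, and the appropriate JDBC driver would need to be available on the cluster.

```python
# A minimal PySpark sketch of extracting from a legacy system and loading into
# a data lake. URL, table, credentials, and lake path are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("legacy-extract").getOrCreate()

# Pull rows out of the legacy relational system over JDBC.
orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://legacy-host:5432/erp")
    .option("dbtable", "public.orders")
    .option("user", "etl_user")
    .option("password", "********")
    .load()
)

# Land the raw extract in the lake; the same write works whether the path points
# at on-site HDFS or cloud object storage.
orders.write.mode("overwrite").parquet("s3a://data-lake/raw/orders/")
```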
Enter modern data integration tools and cloud applications! Modern agile data engineering and ops have pulled a 180 in this field. Instead of treating data preparation and data analytics/querying as two separate realms, we have now seen the two merge into one. Essentially, the engine for merging and combining data sets is now melded right into the distributed storage/compute cluster that performs the analytics and queries. The key characteristic here is the underlying execution engine and infrastructure.
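To make that "one engine" idea concrete, here is a minimal PySpark sketch in which the merge step and the analytic query run in the same session on the same cluster, so the data never leaves the platform between preparation and analysis. The paths and column names are illustrative assumptions.

```python
# A minimal sketch: data preparation (the join) and analytics (the aggregate
# query) share one Spark session and one cluster. Paths and columns are
# illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("prep-and-analyze").getOrCreate()

customers = spark.read.parquet("s3a://data-lake/raw/customers/")
orders = spark.read.parquet("s3a://data-lake/raw/orders/")

# Data preparation: merge the two data sets on their shared key.
enriched = orders.join(customers, on="customer_id", how="inner")

# Analytics: an aggregate query over the freshly prepared data, with no export
# step in between.
revenue_by_region = (
    enriched.groupBy("region")
    .agg(F.sum("order_total").alias("total_revenue"))
    .orderBy(F.desc("total_revenue"))
)
revenue_by_region.show()
```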
Originally, if a data pipeline was handwritten to run well on a given big data platform, you would need to rewrite that application to maintain performance on a different big data platform. Herein lies the added value of agile data engineering platforms: they are independent of the underlying execution engine. To further understand this idea, let's compare new agile data engineering platforms with old data integration applications, starting with what the new platforms offer:
-Platform independence. No more re-coding data pipelines when moving from a private server on Monday to a cloud environment later that week; the new approach is quick, smooth, and portable (see the sketch after this list).
-The time and money invested in distributed computing engines are shared across the open-source community and adapted to countless use cases. Thus, advancement happens with shared risk at the speed of business rather than the speed of one business.
-The platform for data integration is also intended for analytics and business intelligence. Thus, overall, we are looking at a lot less time and energy spent moving data back and forth between platforms.
Compare this with the old data integration applications:
-Tied down to a single, specific data integration vendor, whose proprietary engine and service provider were the only resources available.
-Because the platforms were meant only for data merging and transformation, data had to be moved around a lot to get from architectural development to analytic needs, adding time and energy.
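To make the platform-independence point concrete, here is one minimal sketch using Apache Beam, purely as an illustrative example rather than as the specific platform discussed here: the pipeline is defined once, and the execution engine is chosen at launch time via a runner option, so moving from a local run to Spark or Flink requires no code changes. The file paths and the word-count logic are illustrative assumptions.

```python
# A minimal, engine-independent pipeline sketch using Apache Beam as an example.
# The same code runs on different engines depending on the --runner flag passed
# at launch time; paths and logic are illustrative assumptions.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def run(argv=None):
    # e.g. argv = ["--runner=SparkRunner"] retargets the pipeline to Spark;
    # the pipeline definition below does not change.
    options = PipelineOptions(argv)
    with beam.Pipeline(options=options) as p:
        (
            p
            | "Read" >> beam.io.ReadFromText("input/events.txt")
            | "ExtractWords" >> beam.FlatMap(lambda line: line.split())
            | "PairWithOne" >> beam.Map(lambda word: (word, 1))
            | "GroupAndSum" >> beam.CombinePerKey(sum)
            | "Format" >> beam.Map(lambda kv: f"{kv[0]},{kv[1]}")
            | "Write" >> beam.io.WriteToText("output/word_counts")
        )


if __name__ == "__main__":
    run()
```

For instance, `python pipeline.py --runner=DirectRunner` runs the job locally, while `python pipeline.py --runner=SparkRunner` hands the same code to a Spark cluster.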
The issues around legacy ETL continue to be speed, cost, and quality. "Cloud Lift and Optimize" solutions automate the migration and translation of complex application workloads, putting quality, trusted data into the hands of data scientists, BI groups, and the wider business when and where they need it, without waiting on time-consuming and expensive developer efforts.