For organizations that have adopted a ‘Cloud First’ strategy, one of the biggest challenges will be the migration of data from legacy databases and ETL systems. The migration of legacy data (including code and ETL pipelines) is essential to take advantage of cloud native capabilities and to optimize your cloud applications. Moreover, without the migration of legacy data, robust analytics cannot be performed.
When migrating legacy code, one of the most important considerations is to thoroughly understand the goals you wish to achieve by moving to the cloud and then determine the appropriate architecture. Other important considerations are migration planning, code translation, migration automation, effective system integration testing and, finally, developing a transformational roadmap.
With specific emphasis on migrating legacy ETL Pipelines, here is my best advice:
There are many options to choose from when determining your data architecture, including your cloud provider, cloud data platform and cloud database. These choices are key and will likely determine the course of your cloud migration for years to come. However, the key architectural decision is whether to migrate your ETLs ‘As-Is’ into the cloud via translation from source to target, or to go directly to a cloud target through ETL transformation. Essentially, organizations must decide how much risk they are willing to take in order to optimize cloud benefits, versus taking a more patient approach to migrating their data and ETLs and achieving more modest benefits.
Things to consider include the retirement of legacy infrastructure and licensing, which can be achieved much more quickly through a translation approach. Translation also reduces the time it takes to get your ETLs into the cloud, after which you can focus on specific business benefits. By contrast, going directly to a cloud target through ETL transformation (essentially rewriting your ETLs in cloud native languages) introduces long timelines and can, in some cases, take years to complete.
Once you have determined your architecture and approach, careful and detailed planning is key. Using ‘Crawling Tools’ to do deep dives on your ETL code is strongly recommended. If you are choosing an ETL translation approach, this is critical, as it enables you to determine which functions in your legacy ETL system are supported or not supported by the cloud target, as well as the dependencies between pipelines, jobs and database objects. This output is used in Wave Planning, which determines the order in which ETLs are migrated and tested. The usual testing approach is to migrate and unit test individual ETL jobs and validate them using synthetic data. The migration order becomes important once you enter the system integration phase of the migration.
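To make the Wave Planning idea concrete, here is a minimal sketch in Python of how a dependency map could be turned into migration waves. It assumes the crawling step has already produced a mapping from each job to the jobs it depends on; the job names and the plan_waves helper are hypothetical and only for illustration, not the output of any particular tool.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical crawler output: each ETL job mapped to the jobs it depends on.
dependencies = {
    "load_customers": set(),
    "load_orders": set(),
    "build_order_facts": {"load_customers", "load_orders"},
    "monthly_revenue_report": {"build_order_facts"},
}


def plan_waves(deps):
    """Group jobs into waves so every job lands after its dependencies."""
    sorter = TopologicalSorter(deps)
    try:
        sorter.prepare()
    except CycleError as exc:
        # Circular dependencies need a manual decision before planning.
        raise ValueError(f"circular dependency between jobs: {exc.args[1]}") from exc

    waves = []
    while sorter.is_active():
        # Every job returned here has all of its dependencies in earlier waves.
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = plan_waves(dependencies)
for number, jobs in enumerate(waves, start=1):
    print(f"Wave {number}: {', '.join(jobs)}")
```

Jobs in the same wave have no dependencies on each other, so they can be migrated and unit tested independently, while the wave order gives a sensible sequence for system integration testing.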
One of the key benefits of choosing an ‘As-Is’ translation approach is being able to use automated code translation. This is a highly effective process whereby legacy code is parsed down to its elements and reconstituted into cloud native ETL code such as Spark, Python and PySpark. While no legacy ETL system and cloud platform are exactly the same, automation can reach upwards of 90% coverage, leaving your technical teams to focus on true exceptions where, for instance, legacy ETL functions are simply not provided by the cloud ETL systems. Automated Code Translation significantly de-risks your ETL migration: it provides an extremely accurate baseline of migration, reduces migration timeframes and ensures end users experience very little impact on their BI tools and access to data.
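As a rough illustration of how coverage and exceptions might be estimated before translation starts, the sketch below weighs each legacy function by how often it is used and checks it against the set of functions the translation step can handle. The function names, usage counts and the coverage_report helper are all hypothetical; real figures would come from your own code inventory.

```python
# Hypothetical inventory: legacy function name -> number of usages found in the ETL code.
function_usage = {
    "DATE_FORMAT": 120,
    "LOOKUP": 45,
    "STRING_TRIM": 30,
    "CUSTOM_ENCRYPT": 5,
}

# Hypothetical set of legacy functions with an automated mapping to the cloud target.
supported_functions = {"DATE_FORMAT", "LOOKUP", "STRING_TRIM"}


def coverage_report(usage, supported):
    """Return usage-weighted automation coverage and the functions left as exceptions."""
    total = sum(usage.values())
    covered = sum(count for name, count in usage.items() if name in supported)
    exceptions = sorted(name for name in usage if name not in supported)
    coverage = covered / total if total else 0.0
    return coverage, exceptions


coverage, exceptions = coverage_report(function_usage, supported_functions)
print(f"Automated coverage: {coverage:.1%}")
print(f"Functions needing manual work: {exceptions}")
```

Weighting by usage rather than by distinct function count is a design choice: a rarely used unsupported function affects far fewer jobs than a common one, which helps the technical team prioritize the true exceptions.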
Because Automated Code Translation provides an accurate baseline of migration, your teams can focus on integrating ETL code into your cloud environments and making sure it provides the expected functions, data validation and performance. As you migrate your ETL pipelines into the cloud, make sure they run without error in your cloud ETL system and that identified dependencies are accounted for. Data record count validation and data quality checks are very important, as you need to make sure all data is accounted for. A parallel run between your legacy and target ETL systems is strongly recommended, using the same data for both. Some organizations will run in parallel at least long enough to cover major business events such as month, quarter or year ends. Parallel runs help to increase confidence in your ETL migration and will make your business users advocates of the cloud.
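A parallel-run check can be sketched along the following lines. This is a simplified illustration that assumes the outputs of the legacy and target jobs have been extracted as lists of row tuples; the sample rows and the compare_outputs helper are hypothetical, and on large tables the same comparison would normally be pushed down to the database or cluster rather than run in memory.

```python
import hashlib
from collections import Counter


def row_fingerprint(row):
    """Hash a row so that rows can be compared regardless of their order."""
    # Assumption: "<NULL>" never appears as a real value in the data.
    joined = "\x1f".join("<NULL>" if value is None else str(value) for value in row)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def compare_outputs(legacy_rows, target_rows):
    """Compare record counts and row contents between a legacy and a target run."""
    legacy = Counter(row_fingerprint(row) for row in legacy_rows)
    target = Counter(row_fingerprint(row) for row in target_rows)
    return {
        "legacy_count": sum(legacy.values()),
        "target_count": sum(target.values()),
        # Rows produced by the legacy job that the target job did not produce.
        "missing_in_target": sum((legacy - target).values()),
        # Rows produced by the target job that the legacy job did not produce.
        "unexpected_in_target": sum((target - legacy).values()),
    }


# Hypothetical outputs of the same job, run on the same input data in both systems.
legacy_rows = [(1, "a", 10.0), (2, "b", None), (3, "c", 5.5)]
target_rows = [(1, "a", 10.0), (3, "c", 5.5), (4, "d", 1.0)]

result = compare_outputs(legacy_rows, target_rows)
print(result)
```

Note that in this example the record counts match even though the contents differ, which is why a count check alone is not enough. In practice, values such as dates and decimals may need to be normalized before hashing, since two systems can format the same value differently.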
Once you have migrated your legacy ETLs to the cloud, work with the business to put together a Cloud Data Road Map based on business priorities. One of the key benefits of an ‘As-Is’ translation is that it allows you to plan your data transformation without the risks and pressures of a migration. Because translation modernizes your ETL code, you are well positioned to make the transformative changes required to implement a cloud native data platform, including data landing zones, universal data lakes and business hubs, bringing to bear all the cloud and end user BI tools available, all the while managing business priorities, cost and risk.
While this is not an exhaustive list of best practices, these are certainly good practices to adopt to help ensure the successful migration of your legacy ETL pipelines to the cloud.
About Next Pathway Inc.
Next Pathway Inc. is the Automated Cloud Migration company. Powered by the SHIFT™ Migration Suite, Next Pathway automates the end-to-end challenges companies experience when migrating applications to the cloud. For more information, please visit nextpathway.com.