Traditionally, ETL code is written using an editor to create SQL statements, often embedded within a procedural language like Java. Historically ETL has been the only way to get data from A to B. But writing ETL code leads to a lot of problems most notably, lack of trust in data due to lack of lineage, and metadata; high maintenance costs and security risks. When the norm was simple departmental data warehouses, manual ETL was tolerable. But as data lakes become enterprise-wide in size and complexity, manual coding breaks down as discussed in the earlier article.
Companies running into these ETL problems sometimes make the mistake of adopting GUI-based ETL tools to replace a standard editor. Several vendors offer these GUI tools but at their core they all basically use drag and drop instead of writing lines of code. So instead of writing something like SELECT NAME, ADDRESS, PHONE FROM CUSTOMER_FILE WHERE ZIPCODE ="90210", the user would drag the desired fields from a visual representation of the target table and drag ZIPCODE to the condition area, indicating it must equal “90210”.
There is one, and only one benefit to a GUI approach - it ensures the correct syntax for the command. That’s it. Period! Full stop! I don’t even think these GUI approaches are any faster or easier to learn, and if they are, it is marginal. SQL code is relatively easy, which is why there is so much of it (again see the previous article for why ETL is the root problem).
For the limited value of GUI approaches, there are serious downsides:
Any one of these 4 downsides should be enough to scare any IT leader away from these tools. Honestly you are better off coding straight ETL than using a GUI tool. They are just lipstick on a pig! Which brings us back to square one: ETL is the root of all data problems! GUI tools are a terrible choice but manually coding doesn't end well either. So what should you do?
At Next Pathway, we pioneered the "no ETL" approach to moving data and building data lakes that last.
Our tools use standard governance artifacts as input which enforces data quality by design. We automatically generate the necessary source code (Spark, Hive, Scala, etc) from these, which your company’s engineers own, manage in their code repository, and maintain over time. This keeps you the IT employees (and data stewards) in the company in control, with no vendor lock-in.