The client’s data processing platform, Databricks, was already in place and organized around the Medallion Architecture, which in their implementation was designed with four layers.
The core data ingestion jobs that executed on the Databricks platform followed the Extract, Load, and Transform (ELT) model. Jobs are created by chaining the execution of multiple tasks (notebooks, Python scripts or wheels, SQL, etc.) together. The client created jobs on the data platform to run their notebooks, which executed Python scripts. Each job was a series of steps that loaded data into each layer of the Medallion Architecture for the different data sources feeding the Data Lake supporting the client. One limitation we discovered within Databricks was the inability to migrate jobs from one workspace instance to another. Our team created a set of notebooks that overcame that shortcoming, limited the potential for transcription errors, and reduced the time needed to rebuild jobs in elevated environments (Dev, QA, Prod).
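As a rough illustration of the approach (not the client’s actual notebooks), the sketch below copies a job definition between workspaces using the Databricks Jobs REST API 2.1. The workspace URLs, tokens, and job ID are placeholders; a real migration would also remap cluster configurations, notebook paths, and permissions for the target environment.

```python
# Sketch: copying a job definition between Databricks workspaces via the Jobs API 2.1.
# Hosts, tokens, and the job_id are placeholders, not values from the engagement.
import requests

SOURCE_HOST = "https://source-workspace.cloud.databricks.com"
TARGET_HOST = "https://target-workspace.cloud.databricks.com"
SOURCE_TOKEN = "<source-pat>"
TARGET_TOKEN = "<target-pat>"

def copy_job(job_id: int) -> int:
    # Pull the full job settings (name, tasks, clusters, schedule) from the source workspace.
    resp = requests.get(
        f"{SOURCE_HOST}/api/2.1/jobs/get",
        headers={"Authorization": f"Bearer {SOURCE_TOKEN}"},
        params={"job_id": job_id},
    )
    resp.raise_for_status()
    settings = resp.json()["settings"]

    # Recreate the job in the target workspace from the same settings,
    # avoiding manual re-entry and the transcription errors that come with it.
    created = requests.post(
        f"{TARGET_HOST}/api/2.1/jobs/create",
        headers={"Authorization": f"Bearer {TARGET_TOKEN}"},
        json=settings,
    )
    created.raise_for_status()
    return created.json()["job_id"]

new_job_id = copy_job(job_id=123)
print(f"Created job {new_job_id} in the target workspace")
```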
The initial focus of our engagement was building out their Silver layer processing. While the data load tracked changes over time (i.e., Slowly Changing Dimensions), no transformations had yet been applied. Hylaine began by determining which data was meant to land in Silver. From there, we held discussions to lay the groundwork for the transformations that would occur. With this knowledge, we created the notebooks to deliver this client outcome. As the team built out the steps to produce the desired results, a series of standards evolved that reduced the time needed to deliver data to Silver. Initially, development meant coding each notebook from scratch; with the standards in place, we used a template as the base code. These economies of scale let us spend our time writing and testing the queries needed to produce the information agreed upon during the client information-gathering sessions.
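The client’s templates are their own, but a minimal sketch of the pattern might look like the following: a parameterized PySpark step that reads from a Bronze table, applies a source-specific transformation query, and merges the result into the corresponding Silver Delta table. The table names, merge key, and transformation SQL are illustrative placeholders, and the sketch assumes a Databricks notebook context where `spark` is already available.

```python
# Minimal sketch of a templated Bronze-to-Silver step (names and query are placeholders).
from delta.tables import DeltaTable

bronze_table = "bronze.customer"   # source table in the Bronze layer (hypothetical)
silver_table = "silver.customer"   # target Delta table in the Silver layer (hypothetical)
merge_key = "customer_id"          # business key agreed on for this source (hypothetical)

# Source-specific transformation query; in practice this is the piece written and
# tested per data source during the information-gathering sessions.
transform_sql = f"""
    SELECT customer_id,
           TRIM(first_name)              AS first_name,
           TRIM(last_name)               AS last_name,
           CAST(updated_at AS TIMESTAMP) AS updated_at
    FROM {bronze_table}
"""
source_df = spark.sql(transform_sql)

# Upsert into Silver so re-runs and changed records are handled consistently.
(
    DeltaTable.forName(spark, silver_table)
    .alias("t")
    .merge(source_df.alias("s"), f"t.{merge_key} = s.{merge_key}")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```

With a template like this, only the table names, key, and transformation query change per source, which is where the time savings described above came from.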
Over the 18-month collaborative effort with this home services client, Hylaine increased the amount of data available within their data lake by 170%. In the course of delivering a robust Data Lake, the lessons learned evolved into new coding standards and opportunities for shared code.