The client’s data processing platform, Databricks, was already in place and organized around the Medallion Architecture, which in their implementation was designed with four layers.
The core data ingestion jobs that executed on the Databricks platform followed the Extract, Load, and Transform (ELT) model. Jobs are created by chaining the execution of multiple tasks (notebooks, Python scripts or wheels, SQL, etc.) together. The client created jobs on the data platform to run their notebooks, which executed Python scripts. Each job was a series of steps that loaded data into each layer of the Medallion Architecture for the different data sources feeding the Data Lake supporting the client. One limitation we discovered within Databricks was the inability to migrate jobs from one workspace instance to another. Our team created a set of notebooks that overcame that shortcoming, limited the potential for transcription errors, and reduced the time needed to rebuild jobs in elevated environments (Dev, QA, Prod).
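As a rough illustration of the approach (not the client’s actual notebooks), the sketch below copies a job definition between workspaces using the Databricks Jobs REST API 2.1. The workspace URLs, tokens, and job ID are placeholders; a real migration would also remap cluster configurations, notebook paths, and permissions for the target environment.

```python
# Sketch: copying a job definition between Databricks workspaces via the Jobs API 2.1.
# Hosts, tokens, and the job_id are placeholders, not values from the engagement.
import requests

SOURCE_HOST = "https://source-workspace.cloud.databricks.com"
TARGET_HOST = "https://target-workspace.cloud.databricks.com"
SOURCE_TOKEN = "<source-pat>"
TARGET_TOKEN = "<target-pat>"

def copy_job(job_id: int) -> int:
    # Pull the full job settings (name, tasks, clusters, schedule) from the source workspace.
    resp = requests.get(
        f"{SOURCE_HOST}/api/2.1/jobs/get",
        headers={"Authorization": f"Bearer {SOURCE_TOKEN}"},
        params={"job_id": job_id},
    )
    resp.raise_for_status()
    settings = resp.json()["settings"]

    # Recreate the job in the target workspace from the same settings,
    # avoiding manual re-entry and the transcription errors that come with it.
    created = requests.post(
        f"{TARGET_HOST}/api/2.1/jobs/create",
        headers={"Authorization": f"Bearer {TARGET_TOKEN}"},
        json=settings,
    )
    created.raise_for_status()
    return created.json()["job_id"]

new_job_id = copy_job(job_id=123)
print(f"Created job {new_job_id} in the target workspace")
```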
The initial focus of our engagement was building out their Silver layer processing. While the data load tracked changes over time (i.e., Slowly Changing Dimensions), no transformations had yet been applied. Hylaine began by determining which data was meant to land in Silver. From there, we held discussions to lay the groundwork for the transformations that would occur. With this knowledge, we created the notebooks to deliver this client outcome. As the team built out the steps to produce the desired results, a series of standards evolved that reduced the time needed to deliver data to Silver. Initially, development meant coding each notebook from scratch; with the standards in place, we used a template as the base code. These economies of scale let us spend our time writing and testing the queries needed to produce the information agreed upon during the client information-gathering sessions.
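The client’s templates are their own, but a minimal sketch of the pattern might look like the following: a parameterized PySpark step that reads from a Bronze table, applies a source-specific transformation query, and merges the result into the corresponding Silver Delta table. The table names, merge key, and transformation SQL are illustrative placeholders, and the sketch assumes a Databricks notebook context where `spark` is already available.

```python
# Minimal sketch of a templated Bronze-to-Silver step (names and query are placeholders).
from delta.tables import DeltaTable

bronze_table = "bronze.customer"   # source table in the Bronze layer (hypothetical)
silver_table = "silver.customer"   # target Delta table in the Silver layer (hypothetical)
merge_key = "customer_id"          # business key agreed on for this source (hypothetical)

# Source-specific transformation query; in practice this is the piece written and
# tested per data source during the information-gathering sessions.
transform_sql = f"""
    SELECT customer_id,
           TRIM(first_name)              AS first_name,
           TRIM(last_name)               AS last_name,
           CAST(updated_at AS TIMESTAMP) AS updated_at
    FROM {bronze_table}
"""
source_df = spark.sql(transform_sql)

# Upsert into Silver so re-runs and changed records are handled consistently.
(
    DeltaTable.forName(spark, silver_table)
    .alias("t")
    .merge(source_df.alias("s"), f"t.{merge_key} = s.{merge_key}")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```

With a template like this, only the table names, key, and transformation query change per source, which is where the time savings described above came from.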
Over the 18-month collaborative effort with this home services client, Hylaine increased the amount of data available within their data lake by 170%. In the course of delivering a robust Data Lake, the lessons learned evolved into new coding standards and opportunities for shared code.