Data Quality Framework for Snowflake Data Pipeline

Data Quality Excellence with SSOT Implementation

In today’s rapidly evolving data-driven world, ensuring data quality has become a paramount concern for organizations across various industries. Data quality refers to the degree to which data meets requirements and expectations. It encompasses accuracy, completeness, consistency, timeliness, and reliability, ensuring suitability for analysis and decision-making.

This is where building an SSOT (single source of truth) becomes incredibly important for any organization to ensure reliability, auditability, and transparency. As part of our Digital Business Methodology, we extensively leverage platforms such as Snowflake to help our clients build SSOTs aligned to their business needs and goals.

While building out an SSOT solution on Snowflake, we realized that there are not many tools available for performing data quality checks, and the available ones were either incompatible, expensive, or unable to meet a complex set of requirements. This led us to develop a custom data quality framework within Snowflake using stored procedures.

Benefits derived from this framework:

  • All of our data quality check code is native to Snowflake and lives within it.
  • It utilizes Snowflake’s compute power to perform the checks, so there is no need to provision or maintain additional compute, and it scales with the requirements, complexity, and volume of data processed.
  • It can be flexibly extended to accommodate both business and technical needs.

Data Quality (DQ) Framework

We developed this framework as an independent piece that is easily pluggable into the data pipeline without disturbing any of the existing data mappings. The high-level flow is as follows (a sketch of how the framework plugs in between steps 2 and 3 appears after the diagram):

  1. Data ingestion from multiple sources to Landing¹
  2. Data quality checks on Landing data and cleansing if required
  3. Cleansed data load from Landing to Staging²
Data Quality Framework for Snowflake Data Pipeline
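As a rough illustration of how the framework can be plugged in between steps 2 and 3 without touching existing mappings, the snippet below chains Snowflake tasks around a hypothetical master procedure (SP_DQC_MASTER, described later). The task, warehouse, schema, and table names are purely illustrative and not part of the original implementation.

-- Hypothetical task chain: DQ checks run after the landing load and before the staging load.
-- TASK_LOAD_LANDING_CUSTOMER is assumed to already exist as the landing-load task.
CREATE OR REPLACE TASK TASK_DQ_CHECK_CUSTOMER
    WAREHOUSE = DQ_WH
    AFTER TASK_LOAD_LANDING_CUSTOMER
AS
    CALL SP_DQC_MASTER('LANDING', 'CUSTOMER');

-- If a suspension-triggering rule fails, the master procedure raises an error,
-- so this downstream task does not run and the load is effectively paused.
CREATE OR REPLACE TASK TASK_LOAD_STAGING_CUSTOMER
    WAREHOUSE = ETL_WH
    AFTER TASK_DQ_CHECK_CUSTOMER
AS
    INSERT INTO STAGING.CUSTOMER SELECT * FROM LANDING.CUSTOMER;  -- simplified staging load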

Features of the Data Quality (DQ) framework:

  • Perform data quality checks, such as uniqueness or completeness checks, on a column or a set of columns from a table.
  • Capture the results of the data quality rules executed, suitable for reporting or analysis.
  • Capture anomalous records that do not comply with a DQ rule in a separate table.
  • Ability to cleanse the exception data.
  • Perform business rule checks that might need data from multiple columns across multiple tables.
  • Ability to stop/resume the load when a DQ check fails.
  • Flexibility to add the framework at any point in the data load pipeline.
  • Extensible and configurable metadata to simplify modification or addition of DQ rules.
  • Extended checks, such as table existence and table structure validation, performed before data loading.

Below is the high-level design:

Data Quality high-level design

Metadata

This is the centralized location for all the configurations and rules that need to be enforced. The same table may need different data checks on different columns. The metadata configuration maintains all the data quality checks in one place, which is referenced whenever the Data Quality (DQ) validation runs. The structure of this table can vary as per requirement. We maintain details such as the table name with schema, the column name the rules must be executed for, the name of the stored procedure that contains the code for executing the rule, and a few flag indicators, such as whether to stop the load if the rule fails and whether exception data should be captured and cleansed.

Data Quality Metadata Configuration
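As a rough sketch, a metadata table along these lines could back the configuration shown above; the table, column, and flag names are assumptions for illustration, not the exact design.

-- Illustrative metadata/configuration table; names and flags are assumptions.
CREATE TABLE IF NOT EXISTS DQ_METADATA (
    RULE_ID            NUMBER   AUTOINCREMENT,
    SCHEMA_NAME        VARCHAR  NOT NULL,        -- schema of the table to be checked
    TABLE_NAME         VARCHAR  NOT NULL,        -- table the rule applies to
    COLUMN_NAME        VARCHAR,                  -- column(s) the rule is executed for
    RULE_PROC_NAME     VARCHAR  NOT NULL,        -- worker stored procedure implementing the rule
    STOP_LOAD_ON_FAIL  BOOLEAN  DEFAULT FALSE,   -- halt the pipeline when this rule fails
    CAPTURE_EXCEPTIONS BOOLEAN  DEFAULT TRUE,    -- copy failing records to an exception table
    CLEANSE_EXCEPTIONS BOOLEAN  DEFAULT FALSE,   -- remove failing records from the source table
    IS_ACTIVE          BOOLEAN  DEFAULT TRUE     -- allows rules to be switched off without deleting them
);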

Master Stored Procedure

The master stored procedure acts as the central control in the Snowflake data quality framework. It triggers data quality checks for specific tables using input parameters for schema and table names. These parameters link to metadata containing applicable rules.

Using the metadata, the master procedure coordinates the data quality check procedures, which access data from the designated table, perform the quality checks, and log the results accordingly.

While monitoring execution status, the master procedure halts the entire data quality check process if a suspension-triggering rule fails.

Depending on the setup, the master procedure can also initiate table clean-up, eliminating problematic records.
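A simplified sketch of such a master procedure is shown below. It assumes the illustrative DQ_METADATA table from the previous section, a DQ_RESULTS table (described later), and worker procedures that accept the rule ID, schema, table, and column as parameters; none of these names reflect the exact implementation.

-- Sketch of the master/controller procedure; object and column names are assumptions.
CREATE OR REPLACE PROCEDURE SP_DQC_MASTER(P_SCHEMA_NAME VARCHAR, P_TABLE_NAME VARCHAR)
RETURNS VARCHAR
LANGUAGE SQL
AS
$$
DECLARE
    dq_stop_exception EXCEPTION (-20001, 'A suspension-triggering DQ rule failed; load halted.');
    v_rule_id NUMBER;
    v_column  VARCHAR;
    v_proc    VARCHAR;
    v_stop    BOOLEAN;
    v_call    VARCHAR;
    v_status  VARCHAR;
    rs        RESULTSET;
BEGIN
    -- Fetch all active rules configured for the requested table
    rs := (SELECT RULE_ID, COLUMN_NAME, RULE_PROC_NAME, STOP_LOAD_ON_FAIL
           FROM DQ_METADATA
           WHERE SCHEMA_NAME = :P_SCHEMA_NAME
             AND TABLE_NAME  = :P_TABLE_NAME
             AND IS_ACTIVE);
    LET c_rules CURSOR FOR rs;

    FOR rule IN c_rules DO
        -- Copy row fields into scalar variables so they can be bound in SQL statements
        v_rule_id := rule.RULE_ID;
        v_column  := rule.COLUMN_NAME;
        v_proc    := rule.RULE_PROC_NAME;
        v_stop    := rule.STOP_LOAD_ON_FAIL;

        -- Dispatch to the worker procedure registered for this rule
        v_call := 'CALL ' || v_proc || '(' || v_rule_id || ', ''' || P_SCHEMA_NAME ||
                  ''', ''' || P_TABLE_NAME || ''', ''' || v_column || ''')';
        EXECUTE IMMEDIATE :v_call;

        -- Halt the whole process if a suspension-triggering rule has just failed
        IF (v_stop) THEN
            SELECT STATUS INTO :v_status
            FROM DQ_RESULTS
            WHERE RULE_ID = :v_rule_id
            ORDER BY EXECUTED_AT DESC
            LIMIT 1;
            IF (v_status = 'FAIL') THEN
                RAISE dq_stop_exception;
            END IF;
        END IF;
    END FOR;

    RETURN 'All DQ checks completed for ' || P_SCHEMA_NAME || '.' || P_TABLE_NAME;
END;
$$;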

Worker Stored Procedure

Each DQ check is executed by a worker, which under the hood is a stored procedure. The controller/master orchestrates these workers as per the rules defined in the metadata configuration.

Below is an example that checks completeness. Similar workers can be created for other DQ checks, namely uniqueness, subset, double entry, table structure checks, etc.

SP_DQC_IS_COMPLETE checks whether the column has NULL values and populates the results table accordingly. This stored procedure can also be called individually. It takes its inputs from the metadata table, passed in as parameters.
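A sketch of such a completeness worker is shown below, following the same assumed signature the master sketch dispatches to; the logic for capturing exception data and cleansing is omitted for brevity, and all object names are assumptions.

-- Sketch of the completeness-check worker; object and column names are assumptions.
CREATE OR REPLACE PROCEDURE SP_DQC_IS_COMPLETE(
    P_RULE_ID     NUMBER,
    P_SCHEMA_NAME VARCHAR,
    P_TABLE_NAME  VARCHAR,
    P_COLUMN_NAME VARCHAR
)
RETURNS VARCHAR
LANGUAGE SQL
AS
$$
DECLARE
    v_fail_count NUMBER  DEFAULT 0;
    v_status     VARCHAR DEFAULT 'PASS';
    v_message    VARCHAR;
    v_sql        VARCHAR;
    rs           RESULTSET;
BEGIN
    -- Count the records where the configured column is NULL
    v_sql := 'SELECT COUNT(*) AS FAIL_CNT FROM ' || P_SCHEMA_NAME || '.' || P_TABLE_NAME ||
             ' WHERE ' || P_COLUMN_NAME || ' IS NULL';
    rs := (EXECUTE IMMEDIATE :v_sql);
    LET c1 CURSOR FOR rs;
    FOR r IN c1 DO
        v_fail_count := r.FAIL_CNT;
    END FOR;

    -- The message is populated only when the check fails
    IF (v_fail_count > 0) THEN
        v_status  := 'FAIL';
        v_message := v_fail_count || ' record(s) have NULL ' || P_COLUMN_NAME;
    END IF;

    -- Log the outcome to the results table
    INSERT INTO DQ_RESULTS (RULE_ID, TABLE_NAME, STATUS, MESSAGE)
    VALUES (:P_RULE_ID, :P_TABLE_NAME, :v_status, :v_message);

    RETURN v_status;
END;
$$;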

Adding New Rule

The framework facilitates easy extensibility: adding a new rule is straightforward. A new stored procedure is created with the rule execution logic, and the same is configured in the metadata table. This is further simplified by a common template shared across all the worker stored procedures. The template incorporates preset parameters, along with the logic for logging data quality outcomes and capturing exception data in a dedicated table. By integrating the new data quality check logic into this template and making minor adjustments, the new rule check procedure becomes readily available; registering it is then a single metadata entry, as sketched below.
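For example, once a new worker procedure has been built from the template, say a hypothetical uniqueness check named SP_DQC_IS_UNIQUE, registering it against the illustrative DQ_METADATA table from earlier is just an insert; all names below are illustrative.

-- Hypothetical example: register a uniqueness rule implemented by a new worker procedure.
INSERT INTO DQ_METADATA
    (SCHEMA_NAME, TABLE_NAME, COLUMN_NAME, RULE_PROC_NAME, STOP_LOAD_ON_FAIL)
VALUES
    ('LANDING', 'CUSTOMER', 'CUSTOMER_ID', 'SP_DQC_IS_UNIQUE', TRUE);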

DQ Results 

This table captures the results of the rules executed, along with corresponding details. Details are logged irrespective of success or failure of the data quality check. The table structure can be modified as per requirement. We maintain details such as the rule ID (to link back to the metadata table), table name, status, and message. The message is populated only when there is a failure, and it contains the count of failed records. Along with this, we maintain a few audit columns, such as the date of execution and the batch ID of the data for which the rule was executed.

Data Quality Results
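A rough sketch of such a results table, reusing the assumed names from the earlier sketches, might look as follows; column names and types are illustrative, not the exact design.

-- Illustrative results table; names and types are assumptions.
CREATE TABLE IF NOT EXISTS DQ_RESULTS (
    RESULT_ID   NUMBER        AUTOINCREMENT,
    RULE_ID     NUMBER,                                      -- links back to DQ_METADATA
    TABLE_NAME  VARCHAR,                                     -- table the rule was executed against
    STATUS      VARCHAR,                                     -- e.g. 'PASS' or 'FAIL'
    MESSAGE     VARCHAR,                                     -- populated on failure, e.g. failed record count
    EXECUTED_AT TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP(),   -- audit: date of execution
    BATCH_ID    VARCHAR                                      -- audit: batch of data the rule ran against
);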

Conclusion

The DQ check framework developed for the Snowflake data pipeline addressed our core DQ needs without relying on any external solution. The framework was extensible, which enabled adding new validation rules as the project progressed. By leveraging Snowflake’s scalability and parallel processing capabilities, the framework efficiently handled large volumes and varieties of data. It integrated seamlessly with the data pipeline, enabling data quality checks at various stages. Customizable quality rules and automation reduced manual effort, providing a systematic approach to validating data.

¹ Snowflake schema to store raw data from the source
² Snowflake schema to store cleansed data after data quality checks and validations
Avinash Ananda Mulki & Selvakumari Sukumar

