Lab: ANSI mode and legacy Parquet schema drift
Hands-on lab to reproduce how spark.sql.ansi.enabled changes failure behavior with legacy Parquet type drift.
Prerequisites
- Basic Spark SQL
- Access to a Databricks workspace
- A sample Parquet dataset with type drift across partitions (e.g. amount stored as string in older partitions and as double in newer ones)
Goal
Reproduce the impact of spark.sql.ansi.enabled on type conversion failures when legacy Parquet partitions contain schema drift.
Setup
SET spark.sql.ansi.enabled = false;
CREATE OR REPLACE TEMP VIEW bronze_orders AS
SELECT *
FROM parquet.`dbfs:/mnt/lake/orders_legacy`;
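If you cannot reproduce the legacy layout on demand, the sketch below models the assumed drift in plain Python (made-up rows and partition names, not the actual contents of dbfs:/mnt/lake/orders_legacy): an older writer stored amount as strings, some malformed, while a newer writer stored doubles.

```python
# Hypothetical stand-in for the partitions the lab assumes under
# dbfs:/mnt/lake/orders_legacy: older partition wrote `amount` as strings
# (some malformed), newer partition drifted to double.
legacy_partitions = {
    "ingest_date=2023-06-01": [
        {"order_id": 1, "amount": "19.99"},
        {"order_id": 2, "amount": "N/A"},   # the value that will break strict casts
    ],
    "ingest_date=2024-06-01": [
        {"order_id": 3, "amount": 42.5},    # type drift: now a double
    ],
}

# What Spark exposes through the bronze_orders view: one flat row set.
bronze_orders = [row for rows in legacy_partitions.values() for row in rows]
print(len(bronze_orders))  # 3
```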
Step 1: Permissive mode
SELECT
order_id,
CAST(amount AS DECIMAL(18,2)) AS amount_num
FROM bronze_orders;
Observe that non-convertible values are silently replaced with NULL rather than raising an error.
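The permissive behavior can be modeled in plain Python (a sketch of the semantics, not Spark's implementation): the cast swallows the conversion error and yields NULL.

```python
from decimal import Decimal, InvalidOperation

def cast_permissive(value):
    # Models CAST(amount AS DECIMAL(18,2)) with spark.sql.ansi.enabled = false:
    # a value that cannot be converted becomes NULL (None) instead of failing.
    try:
        return Decimal(str(value)).quantize(Decimal("0.01"))
    except InvalidOperation:
        return None

print(cast_permissive("19.99"))  # 19.99
print(cast_permissive("N/A"))    # None -> NULL in the query result
```

Note that nothing in the query result distinguishes a genuinely NULL amount from one that was silently dropped by the cast, which is exactly why this mode can mask data-quality problems.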
Step 2: Strict ANSI mode
SET spark.sql.ansi.enabled = true;
SELECT
order_id,
CAST(amount AS DECIMAL(18,2)) AS amount_num
FROM bronze_orders;
Observe that the same query now fails with an explicit cast error instead of producing NULLs.
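The contrast can be sketched the same way (plain Python model, not Spark internals): the identical conversion now raises instead of returning NULL, aborting the query at the first bad value.

```python
from decimal import Decimal, InvalidOperation

class CastError(ValueError):
    """Stand-in for the runtime cast error Spark raises under ANSI mode."""

def cast_ansi(value):
    # Models CAST(amount AS DECIMAL(18,2)) with spark.sql.ansi.enabled = true:
    # a non-convertible value aborts the query instead of becoming NULL.
    try:
        return Decimal(str(value)).quantize(Decimal("0.01"))
    except InvalidOperation as exc:
        raise CastError(f"cannot cast {value!r} to DECIMAL(18,2)") from exc

print(cast_ansi("19.99"))  # 19.99
```

If you want per-value NULLs even while ANSI mode is on, Spark's TRY_CAST(amount AS DECIMAL(18,2)) keeps the lenient behavior explicitly, which is often the pragmatic middle ground for bronze-layer data.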
Validation
- Confirm the mismatch in row quality between the two runs: rows that silently became NULL in Step 1 are the same rows that make Step 2 fail.
- Count the invalid rows and define a threshold above which the pipeline should be blocked.
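The blocking check above can be sketched as follows (plain Python with hypothetical rows and a hypothetical 5% threshold; in the lab you would derive the counts from the NULLs produced by the permissive query):

```python
from decimal import Decimal, InvalidOperation

def is_convertible(value):
    # Mirrors the permissive cast: True if the value survives DECIMAL(18,2).
    try:
        Decimal(str(value)).quantize(Decimal("0.01"))
        return True
    except InvalidOperation:
        return False

def should_block(rows, max_invalid_ratio=0.05):
    # Block the pipeline when the invalid-row ratio exceeds the threshold.
    invalid = sum(1 for r in rows if not is_convertible(r["amount"]))
    return invalid / len(rows) > max_invalid_ratio

rows = [{"amount": "19.99"}, {"amount": "N/A"}, {"amount": 42.5}]
print(should_block(rows))  # 1 of 3 rows invalid (~33%) > 5% -> True
```

In Spark SQL the equivalent count can be expressed with TRY_CAST, e.g. counting rows where TRY_CAST(amount AS DECIMAL(18,2)) IS NULL while amount IS NOT NULL, divided by the total row count.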