AI-assisted · Human-reviewed
Why QUALIFY should be your first reflex for deduplication and temporal pricing
A practical SQL playbook for using QUALIFY to remove duplicate source exports and to pick the right product cost version at order date when computing margin.
Real project context
When the real rule is “rank candidate rows, then keep one,” QUALIFY is usually the clearest SQL you can write.
The pattern appears constantly in two places:
- source systems that emit duplicate exports after an internal change
- pricing tables where several product costs are valid over time and margin must use the cost that was active on the order date
In the lab below, both examples use synthetic but realistic B2B datasets:
- an online lottery operator with duplicated sales exports after a release changed the export timestamp logic
- a food delivery wholesaler selling to restaurants, where supplier prices change during the month
Decision rule: if the logic is "apply a window function, then keep only the winning rows," start with QUALIFY. If the intermediate ranked dataset must be reused several times, a CTE or temp view can make sense. If your SQL engine does not support QUALIFY, use an inline subquery before reaching for heavier patterns.
➡️ Lab: QUALIFY for deduplication and temporal pricing
Problem statement
The operational issue is rarely “there are duplicates.” The real issue is usually one of these:
- two records look identical from a business point of view, but differ on technical timestamps
- several product prices exist, and only one is correct for the order date
- a downstream metric looks valid because SQL still returns one result, even when the wrong record was chosen
That is where QUALIFY earns its place: rank rows with a window function, then keep the winner in the same query block.
Why common solutions fail
Two-CTE pattern: valid, but noisier than necessary
Many teams write this:
WITH ranked_exports AS (
SELECT
order_id,
customer_id,
gross_amount,
source_created_at,
source_exported_at,
ingestion_ts,
ROW_NUMBER() OVER (
PARTITION BY order_id
ORDER BY source_exported_at DESC, ingestion_ts DESC
) AS rn
FROM bronze_lottery_sales
),
deduped AS (
SELECT *
FROM ranked_exports
WHERE rn = 1
)
SELECT *
FROM deduped;
This works, but the second CTE adds ceremony without adding meaning.
It becomes even heavier when teams split the same logic across:
- one CTE to rank
- one CTE to filter
- one final CTE to rejoin the kept rows into another dataset
The result is not always wrong, but the query becomes longer and harder to review.
Inline subquery: acceptable, but still indirect
Some teams avoid extra named CTEs and push the ranking into an inline subquery:
SELECT *
FROM (
SELECT
order_id,
customer_id,
gross_amount,
source_exported_at,
ingestion_ts,
ROW_NUMBER() OVER (
PARTITION BY order_id
ORDER BY source_exported_at DESC, ingestion_ts DESC
) AS rn
FROM bronze_lottery_sales
) ranked
WHERE rn = 1;
This is shorter than the two-CTE version, but the same drawback remains: the ranking rule and the filter rule are separated across query layers.
Correlated subquery: compact, but fragile
Another common version:
SELECT s.*
FROM bronze_lottery_sales s
WHERE source_exported_at = (
SELECT MAX(source_exported_at)
FROM bronze_lottery_sales x
WHERE x.order_id = s.order_id
);
This pattern breaks as soon as:
- two rows share the same max timestamp
- the tie-breaker is actually ingestion_ts
- you need the "best" row based on more than one ordering column
It also obscures intent. The real rule is not “match the max timestamp.” The rule is “rank candidate rows and keep the first.”
Temporary table or temporary view: useful for debugging, expensive as a default
Temporary objects are not inherently bad. In notebooks, incident analysis, or long investigations, a TEMP VIEW or temporary table can make intermediate states easier to inspect:
CREATE OR REPLACE TEMP VIEW ranked_exports AS
SELECT
order_id,
customer_id,
source_exported_at,
ingestion_ts,
ROW_NUMBER() OVER (
PARTITION BY order_id
ORDER BY source_exported_at DESC, ingestion_ts DESC
) AS rn
FROM bronze_lottery_sales;
SELECT *
FROM ranked_exports
WHERE rn = 1;
That said, using temp objects as the default implementation has real costs:
- more lines of SQL to maintain
- object lifecycle to manage inside notebooks or jobs
- more opportunities for naming collisions
- harder portability when logic is copied between jobs, models, and ad hoc analysis
For production logic that is only “rank then keep,” QUALIFY is usually the cleaner expression. Temp views are more defensible when the intermediate ranked dataset will be reused several times in the same workflow.
Technique comparison
For this class of problem, the trade-off is usually:
- QUALIFY: shortest path when the logic is "window function, then filter on that result" in the same query
- inline subquery: workable, but adds one extra nesting level just to expose the window output
- two or more CTEs: clearer than deeply nested SQL, but often longer than necessary for simple row selection
- correlated subquery: can look compact, but gets fragile when ties or multiple ordering columns matter
- temp table or temp view: useful in debugging sessions or when an intermediate dataset is reused, but heavier for one-pass filtering
In practice, complexity tends to increase in this order:
QUALIFY < inline subquery < two-step CTE chain < temp object workflow < correlated subquery with tie-break exceptions
That is not a universal ranking for every query, but it matches the day-to-day warehouse cases where teams are selecting one “best” row per partition.
| Technique | Readability | Handles ties well | Good for debugging | Best use case |
|---|---|---|---|---|
| QUALIFY | High | Yes, when combined with RANK() / DENSE_RANK() / ROW_NUMBER() | Medium | One-pass filtering after a window function |
| Inline subquery | Medium | Yes, but less directly | Low | When QUALIFY is unavailable but the logic is still simple |
| Two-step CTE chain | Medium | Yes | Medium | When intermediate naming improves review or reuse slightly |
| Correlated subquery | Low for ranked logic | Weak once tie-breakers multiply | Low | Simple max/min lookups without ranking complexity |
| Temp table / temp view | Medium | Yes | High | Investigation, notebooks, or repeated reuse of the same intermediate dataset |
Implementation details
Case 1: duplicate exports from a sales source
In the synthetic lottery example, the sales platform shipped two exports for the same order after a new developer changed one internal batch timestamp. The business payload stayed the same, but the technical export metadata differed:
- source_created_at: when the order was first written
- source_exported_at: when the export batch was generated
- ingestion_ts: when the data platform received the file
If you reach for SELECT DISTINCT, the duplicates remain because the technical timestamps are different.
Use QUALIFY instead:
SELECT
order_id,
customer_id,
game_code,
gross_amount,
source_created_at,
source_exported_at,
ingestion_ts
FROM bronze_lottery_sales
QUALIFY ROW_NUMBER() OVER (
PARTITION BY order_id
ORDER BY source_exported_at DESC, ingestion_ts DESC
) = 1;
This keeps one row per business order and makes the tie-breaker explicit.
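The same behavior can be checked end to end with a small script. This is a minimal sketch using Python's standard-library sqlite3 on hypothetical sample rows mirroring the article's column names; SQLite has no QUALIFY (it requires SQLite 3.25+ for window functions), so the script uses the inline-subquery fallback described earlier, which is semantically equivalent:

```python
import sqlite3

# In-memory demo: order O-1 was exported twice with identical business
# payload but different technical timestamps, so DISTINCT cannot collapse it.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE bronze_lottery_sales (
    order_id TEXT, customer_id TEXT, gross_amount REAL,
    source_exported_at TEXT, ingestion_ts TEXT
);
INSERT INTO bronze_lottery_sales VALUES
    ('O-1', 'C-9', 25.0, '2024-05-01 10:00:00', '2024-05-01 10:05:00'),
    ('O-1', 'C-9', 25.0, '2024-05-01 11:30:00', '2024-05-01 11:35:00'),
    ('O-2', 'C-7', 10.0, '2024-05-01 10:00:00', '2024-05-01 10:05:00');
""")

# SELECT DISTINCT still sees 3 rows because the timestamps differ.
distinct_rows = con.execute(
    "SELECT COUNT(*) FROM (SELECT DISTINCT * FROM bronze_lottery_sales) d"
).fetchone()[0]

# Inline-subquery equivalent of the QUALIFY pattern: rank inside a
# derived table, then keep rn = 1 outside it.
deduped = con.execute("""
    SELECT order_id, gross_amount, source_exported_at
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY order_id
                   ORDER BY source_exported_at DESC, ingestion_ts DESC
               ) AS rn
        FROM bronze_lottery_sales
    ) ranked
    WHERE rn = 1
    ORDER BY order_id
""").fetchall()

print(distinct_rows)  # 3
print(deduped)        # one row per order, latest export kept
```

The kept O-1 row is the 11:30 export, exactly the "latest export wins, ingestion_ts breaks ties" rule made explicit.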
Case 2: pick the correct product cost at order date
Margin calculations fail quietly when the latest known supplier cost is used instead of the cost valid on the order date.
Typical bad shortcut:
SELECT
o.order_id,
o.product_id,
o.sale_price,
p.unit_cost,
o.sale_price - p.unit_cost AS margin
FROM b2b_orders o
JOIN latest_product_cost p
ON o.product_id = p.product_id;
This is wrong whenever latest_product_cost was updated after the order.
The correct logic is: for each order, keep the most recent supplier price where effective_date <= order_date.
SELECT
o.order_id,
o.order_date,
o.product_id,
o.sale_price,
p.unit_cost,
o.sale_price - p.unit_cost AS gross_margin
FROM b2b_orders o
JOIN supplier_price_history p
ON o.product_id = p.product_id
AND p.effective_date <= o.order_date
QUALIFY ROW_NUMBER() OVER (
PARTITION BY o.order_id
ORDER BY p.effective_date DESC, p.updated_at DESC
) = 1;
The logic stays auditable: join valid candidates, rank them, keep the winner.
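To see the margin difference concretely, here is a sketch on a hypothetical one-product dataset, again using stdlib sqlite3 with the inline-subquery stand-in for QUALIFY: the order predates the second supplier price, so the as-of logic must pick the 5.00 cost even though 6.00 is the latest row:

```python
import sqlite3

# One order placed 2024-03-10; the supplier cost changed 2024-03-15.
# A "latest cost" join would use 6.00; the correct as-of cost is 5.00.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE b2b_orders (
    order_id TEXT, order_date TEXT, product_id TEXT, sale_price REAL
);
CREATE TABLE supplier_price_history (
    product_id TEXT, effective_date TEXT, unit_cost REAL, updated_at TEXT
);
INSERT INTO b2b_orders VALUES ('O-100', '2024-03-10', 'P-1', 9.00);
INSERT INTO supplier_price_history VALUES
    ('P-1', '2024-03-01', 5.00, '2024-03-01'),
    ('P-1', '2024-03-15', 6.00, '2024-03-15');
""")

# Join only the candidates valid at order date, rank them, keep the winner.
row = con.execute("""
    SELECT order_id, unit_cost, sale_price - unit_cost AS gross_margin
    FROM (
        SELECT o.order_id, o.sale_price, p.unit_cost,
               ROW_NUMBER() OVER (
                   PARTITION BY o.order_id
                   ORDER BY p.effective_date DESC, p.updated_at DESC
               ) AS rn
        FROM b2b_orders o
        JOIN supplier_price_history p
          ON o.product_id = p.product_id
         AND p.effective_date <= o.order_date
    ) ranked
    WHERE rn = 1
""").fetchone()

print(row)  # ('O-100', 5.0, 4.0)
```

Joining the latest cost instead would report a margin of 3.0, overstating nothing here but understating margin; either direction, the result looks complete while being wrong.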
Code and lab walkthrough
The lab package includes:
- a downloadable notebook with synthetic data generation
- SQL examples for duplicate detection and cleanup
- a temporal pricing example showing the margin gap between a wrong join and the correct QUALIFY pattern
- a preferred-store example using RANK() and DENSE_RANK()
- a pipeline-log example using ROW_NUMBER() and LAG()
Notebook download: qualify-dedup-and-temporal-pricing.ipynb
One useful diagnostic query before deleting duplicates is to show groups that now have more than one technical row:
SELECT
order_id,
COUNT(*) AS row_count,
MIN(source_exported_at) AS first_export_ts,
MAX(source_exported_at) AS last_export_ts
FROM bronze_lottery_sales
GROUP BY order_id
HAVING COUNT(*) > 1
ORDER BY row_count DESC, last_export_ts DESC;
After that review, convert the logic into a controlled delete:
DELETE FROM bronze_lottery_sales
WHERE (order_id, source_exported_at, ingestion_ts) IN (
SELECT
order_id,
source_exported_at,
ingestion_ts
FROM bronze_lottery_sales
QUALIFY ROW_NUMBER() OVER (
PARTITION BY order_id
ORDER BY source_exported_at DESC, ingestion_ts DESC
) > 1
);
Only apply a destructive delete when the ranking rule has already been validated in a non-destructive select.
Other high-value QUALIFY patterns
Preferred store by customer visit count
Another common business need is to assign one preferred store per customer for CRM segmentation, local promotions, or territory ownership.
The base table often contains one row per visit. First aggregate visits by customer and store, then rank stores by visit volume:
WITH customer_store_visits AS (
SELECT
customer_id,
store_id,
COUNT(*) AS visit_count,
MAX(visit_date) AS last_visit_date
FROM customer_shop_visits
GROUP BY customer_id, store_id
)
SELECT
customer_id,
store_id,
visit_count,
last_visit_date,
DENSE_RANK() OVER (
PARTITION BY customer_id
ORDER BY visit_count DESC
) AS visit_rank
FROM customer_store_visits
QUALIFY visit_rank = 1;
Why DENSE_RANK() here:
- use DENSE_RANK() when tied stores should all remain visible as preferred stores
- use RANK() if you want the same tie behavior but also want rank gaps in downstream reporting
- use ROW_NUMBER() only if the business requires forcing a single winner, for example by adding last_visit_date DESC to the ordering
This is a good example of QUALIFY not for deduplication, but for “keep the best group after aggregation.”
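The tie behavior is easy to demonstrate. A sketch on a hypothetical pre-aggregated visits table, using stdlib sqlite3 with the inline-subquery stand-in for QUALIFY: customer C-1 visits stores S-A and S-B three times each, so DENSE_RANK keeps both while ROW_NUMBER silently drops one:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customer_store_visits (
    customer_id TEXT, store_id TEXT, visit_count INT
);
INSERT INTO customer_store_visits VALUES
    ('C-1', 'S-A', 3), ('C-1', 'S-B', 3), ('C-1', 'S-C', 1);
""")

# DENSE_RANK: both tied stores surface as rank 1.
dense = con.execute("""
    SELECT store_id FROM (
        SELECT store_id,
               DENSE_RANK() OVER (
                   PARTITION BY customer_id ORDER BY visit_count DESC
               ) AS visit_rank
        FROM customer_store_visits
    ) ranked
    WHERE visit_rank = 1 ORDER BY store_id
""").fetchall()

# ROW_NUMBER: one arbitrary winner, the tie is invisible downstream.
forced = con.execute("""
    SELECT store_id FROM (
        SELECT store_id,
               ROW_NUMBER() OVER (
                   PARTITION BY customer_id ORDER BY visit_count DESC
               ) AS visit_rank
        FROM customer_store_visits
    ) ranked
    WHERE visit_rank = 1
""").fetchall()

print(dense)        # [('S-A',), ('S-B',)]
print(len(forced))  # 1
```

Which of S-A or S-B wins under ROW_NUMBER is engine-dependent here, which is exactly why an explicit tie-breaker column belongs in the ORDER BY when a single winner is required.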
Pipeline tracking: latest status and retry transitions
Custom pipeline tracking tables usually contain several log rows for the same run_id: started, failed, retried, success.
To report the final state per run, keep the latest log event:
SELECT
run_id,
pipeline_name,
status,
event_ts,
attempt_no
FROM pipeline_run_log
QUALIFY ROW_NUMBER() OVER (
PARTITION BY run_id
ORDER BY event_ts DESC, attempt_no DESC
) = 1;
To isolate the transition where a retry recovered after a failure, compare each row to the previous status:
SELECT
run_id,
pipeline_name,
status,
event_ts,
LAG(status) OVER (
PARTITION BY run_id
ORDER BY event_ts, attempt_no
) AS previous_status
FROM pipeline_run_log
QUALIFY previous_status = 'failed'
AND status IN ('retried', 'success');
This pattern is useful when teams want to measure operational resilience:
- how many failed runs recovered without manual intervention
- which pipelines usually succeed on retry two or three
- where retry loops hide unstable upstream dependencies
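The recovery-transition query can be sketched on a minimal hypothetical run log, again with stdlib sqlite3 and an outer filter in place of QUALIFY: one run fails, retries, then succeeds, and LAG exposes the failed-to-retried transition:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE pipeline_run_log (
    run_id TEXT, status TEXT, event_ts TEXT, attempt_no INT
);
INSERT INTO pipeline_run_log VALUES
    ('R-1', 'started', '2024-05-01 08:00', 1),
    ('R-1', 'failed',  '2024-05-01 08:05', 1),
    ('R-1', 'retried', '2024-05-01 08:10', 2),
    ('R-1', 'success', '2024-05-01 08:15', 2);
""")

# Compare each event to the previous one per run, then keep only the
# rows where a failure was immediately followed by a retry or success.
recoveries = con.execute("""
    SELECT run_id, status, previous_status FROM (
        SELECT run_id, status, event_ts,
               LAG(status) OVER (
                   PARTITION BY run_id ORDER BY event_ts, attempt_no
               ) AS previous_status
        FROM pipeline_run_log
    ) ranked
    WHERE previous_status = 'failed'
      AND status IN ('retried', 'success')
""").fetchall()

print(recoveries)  # [('R-1', 'retried', 'failed')]
```

Counting such rows per pipeline gives the "recovered without manual intervention" metric directly from the log table.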
Measured results on the synthetic lab
On the included lab dataset:
- the duplicate export scenario starts with 12 rows for 8 real lottery orders
- SELECT DISTINCT still returns 12 rows because timestamps differ
- the QUALIFY ROW_NUMBER() pattern returns the expected 8 rows
- the pricing scenario shows multiple restaurant orders where using the latest supplier price overstates margin by 8% to 19%
- the preferred-store scenario exposes customers with tied favorite stores, which is exactly where DENSE_RANK() is safer than forcing one arbitrary winner
- the pipeline-log scenario isolates both the latest status per run and the runs that recovered immediately after a failure
The percentages are synthetic, but the failure mode is common in production because the wrong join still returns a complete-looking result set.
Lessons learned
- If the business rule is "pick one row after ranking," start with QUALIFY.
- SELECT DISTINCT is not a deduplication strategy when technical metadata differs.
- Correlated subqueries hide tie-breaker logic and are fragile for temporal rules.
- For cost history, "latest row" and "row valid at order date" are different questions.
- DENSE_RANK() is often the right choice when business users need all tied top candidates, not an arbitrary winner.
- LAG() plus QUALIFY is an effective way to extract operational state transitions from log tables.
- Temporary views are good debugging tools, but they should not be the default answer for one-pass ranking logic.
- Validate with a review query first, then delete or merge only after the kept row is explicit.
QUALIFY does not replace good modeling, but it does remove a large amount of avoidable SQL overhead in day-to-day warehouse work.