AI-assisted · Human-reviewed

Why QUALIFY should be your first reflex for deduplication and temporal pricing

A practical SQL playbook for using QUALIFY to remove duplicate source exports and to pick the right product cost version at order date when computing margin.

Real project context

When the real rule is “rank candidate rows, then keep one,” QUALIFY is usually the clearest SQL you can write.

The pattern appears constantly in two places:

  • source systems that emit duplicate exports after an internal change
  • pricing tables where several product costs are valid over time and margin must use the cost that was active on the order date

In the lab below, both examples use synthetic but realistic B2B datasets:

  • an online lottery operator with duplicated sales exports after a release changed the export timestamp logic
  • a food delivery wholesaler selling to restaurants, where supplier prices change during the month

Decision rule: if the logic is “apply a window function, then keep only the winning rows,” start with QUALIFY. If the intermediate ranked dataset must be reused several times, a CTE or temp view can make sense. If your SQL engine does not support QUALIFY, use an inline subquery before reaching for heavier patterns.

➡️ Lab: QUALIFY for deduplication and temporal pricing

Problem statement

The operational issue is rarely “there are duplicates.” The real issue is usually one of these:

  • two records look identical from a business point of view, but differ on technical timestamps
  • several product prices exist, and only one is correct for the order date
  • a downstream metric looks valid because SQL still returns one result, even when the wrong record was chosen

That is where QUALIFY earns its place: rank rows with a window function, then keep the winner in the same query block.

Why common solutions fail

Two-CTE pattern: valid, but noisier than necessary

Many teams write this:

WITH ranked_exports AS (
  SELECT
    order_id,
    customer_id,
    gross_amount,
    source_created_at,
    source_exported_at,
    ingestion_ts,
    ROW_NUMBER() OVER (
      PARTITION BY order_id
      ORDER BY source_exported_at DESC, ingestion_ts DESC
    ) AS rn
  FROM bronze_lottery_sales
),
deduped AS (
  SELECT *
  FROM ranked_exports
  WHERE rn = 1
)
SELECT *
FROM deduped;

This works, but the second CTE adds ceremony without adding meaning.

It becomes even heavier when teams split the same logic across:

  • one CTE to rank
  • one CTE to filter
  • one final CTE to rejoin the kept rows into another dataset

The result is not always wrong, but the query becomes longer and harder to review.

Inline subquery: acceptable, but still indirect

Some teams avoid extra named CTEs and push the ranking into an inline subquery:

SELECT *
FROM (
  SELECT
    order_id,
    customer_id,
    gross_amount,
    source_exported_at,
    ingestion_ts,
    ROW_NUMBER() OVER (
      PARTITION BY order_id
      ORDER BY source_exported_at DESC, ingestion_ts DESC
    ) AS rn
  FROM bronze_lottery_sales
) ranked
WHERE rn = 1;

This is shorter than the two-CTE version, but the same drawback remains: the ranking rule and the filter rule are separated across query layers.

Correlated subquery: compact, but fragile

Another common version:

SELECT s.*
FROM bronze_lottery_sales s
WHERE s.source_exported_at = (
  SELECT MAX(x.source_exported_at)
  FROM bronze_lottery_sales x
  WHERE x.order_id = s.order_id
);

This pattern breaks as soon as:

  • two rows share the same max timestamp
  • the tie-breaker is actually ingestion_ts
  • you need the “best” row based on more than one ordering column

It also obscures intent. The real rule is not “match the max timestamp.” The rule is “rank candidate rows and keep the first.”
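
A quick way to surface that fragility in real data (illustrative; it assumes the same bronze_lottery_sales table) is to count how many rows survive the max-timestamp filter per order. Any order returned here means the "dedup" silently kept more than one row:

SELECT
  s.order_id,
  COUNT(*) AS kept_rows
FROM bronze_lottery_sales s
WHERE s.source_exported_at = (
  SELECT MAX(x.source_exported_at)
  FROM bronze_lottery_sales x
  WHERE x.order_id = s.order_id
)
GROUP BY s.order_id
HAVING COUNT(*) > 1;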

Temporary table or temporary view: useful for debugging, expensive as a default

Temporary objects are not inherently bad. In notebooks, incident analysis, or long investigations, a TEMP VIEW or temporary table can make intermediate states easier to inspect:

CREATE OR REPLACE TEMP VIEW ranked_exports AS
SELECT
  order_id,
  customer_id,
  source_exported_at,
  ingestion_ts,
  ROW_NUMBER() OVER (
    PARTITION BY order_id
    ORDER BY source_exported_at DESC, ingestion_ts DESC
  ) AS rn
FROM bronze_lottery_sales;

SELECT *
FROM ranked_exports
WHERE rn = 1;

That said, using temp objects as the default implementation has real costs:

  • more lines of SQL to maintain
  • object lifecycle to manage inside notebooks or jobs
  • more opportunities for naming collisions
  • harder portability when logic is copied between jobs, models, and ad hoc analysis

For production logic that is only “rank then keep,” QUALIFY is usually the cleaner expression. Temp views are more defensible when the intermediate ranked dataset will be reused several times in the same workflow.

Technique comparison

For this class of problem, the trade-off is usually:

  • QUALIFY: shortest path when the logic is “window function, then filter on that result” in the same query
  • inline subquery: workable, but adds one extra nesting level just to expose the window output
  • two or more CTEs: clearer than deeply nested SQL, but often longer than necessary for simple row selection
  • correlated subquery: can look compact, but gets fragile when ties or multiple ordering columns matter
  • temp table or temp view: useful in debugging sessions or when an intermediate dataset is reused, but heavier for one-pass filtering

In practice, complexity tends to increase in this order:

QUALIFY < inline subquery < two-step CTE chain < temp object workflow < correlated subquery with tie-break exceptions

That is not a universal ranking for every query, but it matches the day-to-day warehouse cases where teams are selecting one “best” row per partition.

  • QUALIFY: readability high; handles ties well when combined with RANK() / DENSE_RANK() / ROW_NUMBER(); debugging medium; best use: one-pass filtering after a window function
  • Inline subquery: readability medium; handles ties, but less directly; debugging low; best use: when QUALIFY is unavailable but the logic is still simple
  • Two-step CTE chain: readability medium; handles ties well; debugging medium; best use: when intermediate naming improves review or reuse slightly
  • Correlated subquery: readability low for ranked logic; weak once tie-breakers multiply; debugging low; best use: simple max/min lookups without ranking complexity
  • Temp table / temp view: readability medium; handles ties well; debugging high; best use: investigation, notebooks, or repeated reuse of the same intermediate dataset

Implementation details

Case 1: duplicate exports from a sales source

In the synthetic lottery example, the sales platform shipped two exports for the same order after a new developer changed one internal batch timestamp. The business payload stayed the same, but the technical export metadata differed:

  • source_created_at: when the order was first written
  • source_exported_at: when the export batch was generated
  • ingestion_ts: when the data platform received the file

If you reach for SELECT DISTINCT, the duplicates remain because the technical timestamps are different.
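
A small sanity check (illustrative) confirms why: compare raw row counts against distinct business keys. When raw_rows exceeds distinct_orders even though every full row is unique, the duplicates differ only on technical columns, and DISTINCT cannot help.

SELECT
  COUNT(*) AS raw_rows,
  COUNT(DISTINCT order_id) AS distinct_orders
FROM bronze_lottery_sales;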

Use QUALIFY instead:

SELECT
  order_id,
  customer_id,
  game_code,
  gross_amount,
  source_created_at,
  source_exported_at,
  ingestion_ts
FROM bronze_lottery_sales
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY order_id
  ORDER BY source_exported_at DESC, ingestion_ts DESC
) = 1;

This keeps one row per business order and makes the tie-breaker explicit.

Case 2: pick the correct product cost at order date

Margin calculations fail quietly when the latest known supplier cost is used instead of the cost valid on the order date.

Typical bad shortcut:

SELECT
  o.order_id,
  o.product_id,
  o.sale_price,
  p.unit_cost,
  o.sale_price - p.unit_cost AS margin
FROM b2b_orders o
JOIN latest_product_cost p
  ON o.product_id = p.product_id;

This is wrong whenever the supplier cost changed after the order date: the join applies today's cost to a past sale.

The correct logic is: for each order, keep the most recent supplier price where effective_date <= order_date.

SELECT
  o.order_id,
  o.order_date,
  o.product_id,
  o.sale_price,
  p.unit_cost,
  o.sale_price - p.unit_cost AS gross_margin
FROM b2b_orders o
JOIN supplier_price_history p
  ON o.product_id = p.product_id
 AND p.effective_date <= o.order_date
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY o.order_id
  ORDER BY p.effective_date DESC, p.updated_at DESC
) = 1;

The logic stays auditable: join valid candidates, rank them, keep the winner.

Code and lab walkthrough

The lab package includes:

  • a downloadable notebook with synthetic data generation
  • SQL examples for duplicate detection and cleanup
  • a temporal pricing example showing the margin gap between a wrong join and the correct QUALIFY pattern
  • a preferred-store example using RANK() and DENSE_RANK()
  • a pipeline-log example using ROW_NUMBER() and LAG()

Notebook download: qualify-dedup-and-temporal-pricing.ipynb

One useful diagnostic query before deleting duplicates is to show groups that now have more than one technical row:

SELECT
  order_id,
  COUNT(*) AS row_count,
  MIN(source_exported_at) AS first_export_ts,
  MAX(source_exported_at) AS last_export_ts
FROM bronze_lottery_sales
GROUP BY order_id
HAVING COUNT(*) > 1
ORDER BY row_count DESC, last_export_ts DESC;

After that review, convert the logic into a controlled delete:

DELETE FROM bronze_lottery_sales
WHERE (order_id, source_exported_at, ingestion_ts) IN (
  SELECT
    order_id,
    source_exported_at,
    ingestion_ts
  FROM bronze_lottery_sales
  QUALIFY ROW_NUMBER() OVER (
    PARTITION BY order_id
    ORDER BY source_exported_at DESC, ingestion_ts DESC
  ) > 1
);

Only apply a destructive delete when the ranking rule has already been validated in a non-destructive select.
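
One way to run that review is to execute the delete's ranking rule as a standalone select first, so the exact rows that would be removed are visible before anything destructive happens:

SELECT
  order_id,
  source_exported_at,
  ingestion_ts
FROM bronze_lottery_sales
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY order_id
  ORDER BY source_exported_at DESC, ingestion_ts DESC
) > 1;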

Other high-value QUALIFY patterns

Preferred store by customer visit count

Another common business need is to assign one preferred store per customer for CRM segmentation, local promotions, or territory ownership.

The base table often contains one row per visit. First aggregate visits by customer and store, then rank stores by visit volume:

WITH customer_store_visits AS (
  SELECT
    customer_id,
    store_id,
    COUNT(*) AS visit_count,
    MAX(visit_date) AS last_visit_date
  FROM customer_shop_visits
  GROUP BY customer_id, store_id
)
SELECT
  customer_id,
  store_id,
  visit_count,
  last_visit_date,
  DENSE_RANK() OVER (
    PARTITION BY customer_id
    ORDER BY visit_count DESC
  ) AS visit_rank
FROM customer_store_visits
QUALIFY visit_rank = 1;

Why DENSE_RANK() here:

  • use DENSE_RANK() when tied stores should all remain visible as preferred stores
  • use RANK() if you want the same tie behavior but also want rank gaps in downstream reporting
  • use ROW_NUMBER() only if the business requires forcing a single winner, for example by adding last_visit_date DESC
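
If the business does require a single winner, the ROW_NUMBER() variant might look like this (a sketch reusing the same aggregation; last_visit_date DESC is one reasonable tie-breaker, not the only one):

WITH customer_store_visits AS (
  SELECT
    customer_id,
    store_id,
    COUNT(*) AS visit_count,
    MAX(visit_date) AS last_visit_date
  FROM customer_shop_visits
  GROUP BY customer_id, store_id
)
SELECT
  customer_id,
  store_id,
  visit_count,
  last_visit_date
FROM customer_store_visits
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY customer_id
  ORDER BY visit_count DESC, last_visit_date DESC
) = 1;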

This is a good example of QUALIFY not for deduplication, but for “keep the best group after aggregation.”

Pipeline tracking: latest status and retry transitions

Custom pipeline tracking tables usually contain several log rows for the same run_id:

  • started
  • failed
  • retried
  • success

To report the final state per run, keep the latest log event:

SELECT
  run_id,
  pipeline_name,
  status,
  event_ts,
  attempt_no
FROM pipeline_run_log
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY run_id
  ORDER BY event_ts DESC, attempt_no DESC
) = 1;

To isolate the transition where a retry recovered after a failure, compare each row to the previous status:

SELECT
  run_id,
  pipeline_name,
  status,
  event_ts,
  LAG(status) OVER (
    PARTITION BY run_id
    ORDER BY event_ts, attempt_no
  ) AS previous_status
FROM pipeline_run_log
QUALIFY previous_status = 'failed'
    AND status IN ('retried', 'success');

This pattern is useful when teams want to measure operational resilience:

  • how many failed runs recovered without manual intervention
  • which pipelines usually succeed on retry two or three
  • where retry loops hide unstable upstream dependencies
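
The first of those metrics can be sketched from the same LAG() output (illustrative; it assumes the pipeline_run_log columns used above):

SELECT
  pipeline_name,
  COUNT(DISTINCT run_id) AS recovered_runs
FROM (
  SELECT
    run_id,
    pipeline_name,
    status,
    event_ts,
    attempt_no,
    LAG(status) OVER (
      PARTITION BY run_id
      ORDER BY event_ts, attempt_no
    ) AS previous_status
  FROM pipeline_run_log
) t
WHERE previous_status = 'failed'
  AND status = 'success'
GROUP BY pipeline_name;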

Measured results on the synthetic lab

On the included lab dataset:

  • the duplicate export scenario starts with 12 rows for 8 real lottery orders
  • SELECT DISTINCT still returns 12 rows because timestamps differ
  • the QUALIFY ROW_NUMBER() pattern returns the expected 8 rows
  • the pricing scenario shows multiple restaurant orders where using the latest supplier price overstates margin by 8% to 19%
  • the preferred-store scenario exposes customers with tied favorite stores, which is exactly where DENSE_RANK() is safer than forcing one arbitrary winner
  • the pipeline-log scenario isolates both the latest status per run and the runs that recovered immediately after a failure

The percentages are synthetic, but the failure mode is common in production because the wrong join still returns a complete-looking result set.

Lessons learned

  • If the business rule is “pick one row after ranking,” start with QUALIFY.
  • SELECT DISTINCT is not a deduplication strategy when technical metadata differs.
  • Correlated subqueries hide tie-breaker logic and are fragile for temporal rules.
  • For cost history, “latest row” and “row valid at order date” are different questions.
  • DENSE_RANK() is often the right choice when business users need all tied top candidates, not an arbitrary winner.
  • LAG() plus QUALIFY is an effective way to extract operational state transitions from log tables.
  • Temporary views are good debugging tools, but they should not be the default answer for one-pass ranking logic.
  • Validate with a review query first, then delete or merge only after the kept row is explicit.

QUALIFY does not replace good modeling, but it does remove a large amount of avoidable SQL overhead in day-to-day warehouse work.

➡️ Lab: QUALIFY for deduplication and temporal pricing