When a district graduates to a region

This is the second post in two weeks about deduping districts. The first one was about the same name appearing twice within one region: two rows for Côtes du Rhône that should be one.

This one is about the same name appearing on both sides of the region/district boundary. When Brouilly stops being a district under Bourgogne and becomes a region in its own right, the old district row doesn't vanish on its own. scripts/remove-redundant-districts.js is the script that cleans up those leftover rows.

Why a district would graduate

Three scenarios where you'd promote a district:

The district is famous enough to stand alone. Bolgheri is technically a coastal DOC under Tuscany, but visitors searching for Bolgheri don't think of it as "in Tuscany" — they think of Sassicaia. Promoting it to a region gives it its own landing page.
The district has its own districts. Brouilly contains Côte de Brouilly and a handful of named climats. Districts can't have districts in our schema, so when sub-zones appear, the parent has to graduate.
Geographic accuracy. Some appellations span far enough that treating them as a district (a point on the map) misrepresents them. Champagne is the textbook case — it's an AOC, but it's also a region with sub-zones that need their own pins.

In all three cases, the workflow is: insert a new row in regions with the same name, leave the old row in districts for now, and re-link any wineries that should belong to the new region.
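To make the three steps concrete, here is a minimal sketch that runs against in-memory arrays rather than the real database. The promoteDistrict helper, the db object, and the region_id column on wineries are all assumptions made for illustration; nulling district_id on a moved winery is one possible choice, not necessarily what the real workflow does.

```javascript
// In-memory stand-ins for the regions, districts and wineries tables.
// Hypothetical: promoteDistrict, db, and wineries.region_id are illustrative only.
function promoteDistrict(db, districtId, wineryIds) {
  const district = db.districts.find(d => d.id === districtId)
  if (!district) throw new Error(`no district ${districtId}`)

  // Step 1: insert a region row with the same name as the district.
  const region = { id: db.nextRegionId++, name: district.name }
  db.regions.push(region)

  // Step 2: leave the old district row in place for now (nothing to do).

  // Step 3: re-link only the wineries the operator picked.
  for (const w of db.wineries) {
    if (wineryIds.includes(w.id)) {
      w.region_id = region.id
      w.district_id = null // assumption: moved wineries drop the old district link
    }
  }
  return region
}

const db = {
  nextRegionId: 3,
  regions: [{ id: 1, name: 'Bourgogne' }, { id: 2, name: 'Toscana' }],
  districts: [{ id: 10, name: 'Brouilly', region_id: 1 }],
  wineries: [
    { id: 100, region_id: 1, district_id: 10 },
    { id: 101, region_id: 1, district_id: 10 },
  ],
}

const region = promoteDistrict(db, 10, [100])
console.log(region.name, db.districts.length, db.wineries[1].district_id)
```

Winery 101 still points at district 10 afterwards, which is exactly the kind of leftover link that matters later on.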

The leftover district row is what this script cleans up.

The detection rule

The whole rule is one filter:

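// ds: all district rows; rByName: lookup from lowercased region name to region id
const redundant = ds.filter(d => {
  const matchedRegionId = rByName.get(d.name.toLowerCase())
  // Redundant only if a same-named region exists and it isn't this district's parent
  return matchedRegionId && matchedRegionId !== d.region_id
})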

For each district, look up whether a region exists with the same name (case-insensitive). If yes and that region isn't the district's parent, the district is redundant.

The matchedRegionId !== d.region_id clause skips the legitimate case of a district whose name matches its own parent region. That is rare, but a couple of edge cases exist: Soave shows up as both a region and a sub-district of itself in some data sources.
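Here is a self-contained sketch of the rule running on a few made-up rows, including one plausible way to build the name lookup. The sample data and the findRedundant wrapper are assumptions for illustration; the sketch also compares against undefined instead of relying on truthiness, which only matters if a region id could ever be 0.

```javascript
// Hypothetical sample rows; only id, name and region_id matter to the rule.
const regions = [
  { id: 1, name: 'Bourgogne' },
  { id: 2, name: 'Brouilly' },
  { id: 3, name: 'Soave' },
]
const districts = [
  { id: 10, name: 'brouilly', region_id: 1 }, // leftover after promotion
  { id: 11, name: 'Soave', region_id: 3 },    // same name as its own parent
  { id: 12, name: 'Chablis', region_id: 1 },  // no region by this name
]

// Lowercased region name -> region id, so the match is case-insensitive.
const rByName = new Map(regions.map(r => [r.name.toLowerCase(), r.id]))

function findRedundant(ds, byName) {
  return ds.filter(d => {
    const matchedRegionId = byName.get(d.name.toLowerCase())
    return matchedRegionId !== undefined && matchedRegionId !== d.region_id
  })
}

const redundant = findRedundant(districts, rByName)
console.log(redundant.map(d => d.id)) // logs [ 10 ]
```

Only the lowercase brouilly row is flagged: Soave is skipped by the parent check, and Chablis has no matching region at all.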

The safety guard

A redundant district is only safe to delete if nothing else points at it. The script checks wineries.district_id before any delete:

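const safe = []
for (const d of redundant) {
  // head: true asks for the count only, no rows
  const { count, error } = await s.from('wineries')
    .select('*', { count: 'exact', head: true })
    .eq('district_id', d.id)
  if (error || count === null) {
    // Fail closed: a failed lookup must never be read as "zero links"
    console.log(`  KEEP ${d.id} ${d.name} (count failed: ${error?.message ?? 'no count returned'})`)
  } else if (count > 0) {
    console.log(`  KEEP ${d.id} ${d.name} (${count} wineries linked)`)
  } else {
    safe.push(d)
  }
}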

If even one winery still links to the district row, we keep it. The operator then has a manual cleanup job: move those wineries to the new region (or to a new district under it), and re-run.
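One way to exercise the guard without a database is to pass the count lookup in as a function. The sketch below does that; partitionBySafety, countLinked, and the sample ids are hypothetical names, and keeping a district whenever the lookup fails is a choice made in this sketch, on the reasoning that an unknown link count should not be read as zero.

```javascript
// countLinked is a hypothetical stand-in for the wineries count query:
// it resolves to a number, or rejects if the lookup fails.
async function partitionBySafety(redundant, countLinked) {
  const safe = []
  const kept = []
  for (const d of redundant) {
    let count
    try {
      count = await countLinked(d.id)
    } catch (err) {
      // Unknown link count: keep the row rather than risk orphaning wineries.
      kept.push({ district: d, reason: `count failed: ${err.message}` })
      continue
    }
    if (typeof count !== 'number') {
      kept.push({ district: d, reason: 'count unavailable' })
    } else if (count > 0) {
      kept.push({ district: d, reason: `${count} wineries linked` })
    } else {
      safe.push(d)
    }
  }
  return { safe, kept }
}

// Made-up data: one linked district, one unlinked, one whose lookup fails.
const links = new Map([[2841, 12]])
const fakeCount = async id => {
  if (id === 9999) throw new Error('timeout')
  return links.get(id) ?? 0
}

const demo = partitionBySafety(
  [
    { id: 2841, name: 'Brouilly' },
    { id: 3000, name: 'Example' },
    { id: 9999, name: 'Flaky' },
  ],
  fakeCount
)
demo.then(({ safe, kept }) => {
  console.log(safe.map(d => d.id), kept.map(k => k.district.id))
})
```

Only district 3000 ends up in the safe list; the linked row and the row with the failed lookup are both kept for the operator to look at.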

This is the conservative choice. A more aggressive script would re-link the wineries automatically, the way dedup-districts.js does within a single region. We deliberately don't here because the semantics aren't symmetric: same-name in the same region is always a duplicate, but same-name across the region/district boundary means a deliberate restructure is happening, and the operator should decide what to do with the orphaned wineries.

What a typical run looks like

scanned 612 regions, 3,487 districts
17 redundant districts found
  KEEP 2841 Brouilly (12 wineries linked)
  KEEP 3102 Bolgheri (4 wineries linked)
15 safe to delete (no winery links)
deleted 15 redundant districts

The two kept rows in this example need attention before the next run: someone has to move those 16 wineries to the new region rows. Once that's done, the next pass finds the same two districts with no winery links and deletes them. Every pass after that sees zero redundant districts and exits silently.

The wider point

There are two dedup scripts because there are two kinds of duplication, and they want different policies:

| Script | Detects | Auto-relinks wineries? |
|---|---|---|
| dedup-districts.js | Same name, same region | Yes (always the older row wins) |
| remove-redundant-districts.js | District named like a region | No (operator decides) |

The temptation is to unify them. "It's all just name collisions, right?" It isn't — the two duplication patterns mean different things about the data, and asking the operator to think about them separately is part of what keeps the schema honest.
