Home / Blog / Côtes-du-Rhône, Côtes du Rhône, and the deduper that calls them equal

May 11, 2026· Wine World Map

Côtes-du-Rhône, Côtes du Rhône, and the deduper that calls them equal

After enough OSM imports, every wine region accumulates two or three copies of the same district. Côtes-du-Rhône and Côtes du Rhône look identical to a human reading a map. To Postgres they're two rows. Wineries get linked to whichever one happened to be created first, and when you click a district marker only half the producers show up.

scripts/dedup-districts.js is the script that cleans up after the importer. It runs in maybe ten seconds and routinely removes 40-80 duplicates from a database that has ~3,500 districts.

The whole script is one normalizer

The interesting line is this:

const norm = str => str.toLowerCase()
  .normalize('NFD').replace(/[̀-ͯ]/g, '')
  .replace(/[^a-z0-9]+/g, '-')
  .replace(/^-+|-+$/g, '')

Four passes. NFD decomposes accented characters into base + combining mark (so é becomes e + ◌́). The diacritic strip throws the combining marks away. Then any run of non-alphanumeric becomes a single hyphen. Then trim.

So all of these collapse to cotes-du-rhone:

| Input | Normalized | |---|---| | Côtes-du-Rhône | cotes-du-rhone | | Côtes du Rhône | cotes-du-rhone | | CÔTES DU RHÔNE | cotes-du-rhone | | Cotes du Rhone | cotes-du-rhone | | côtes_du_rhône | cotes-du-rhone |

Group by (region_id, normalized_name), anything with more than one row is a duplicate cluster.

The "keep the older one" rule

Within each cluster: sort by id, keep the lowest, drop the rest. Before dropping, move every winery whose district_id points at the drop target to point at the keeper instead.

g.sort((a, b) => a.id - b.id)
const keep = g[0]
const drop = g.slice(1)
for (const r of drop) {
  await s.from('wineries').update({ district_id: keep.id }).eq('district_id', r.id)
  await s.from('districts').delete().eq('id', r.id)
}

The "keep lowest id" rule is arbitrary but defensible: the older row is the one humans have had longer to enrich. If someone wrote a description for the original Médoc back in week 22, we don't want to nuke it just because a re-import created a sibling.

What this doesn't catch

A few things slip past:

  • Translations. Tuscany and Toscana aren't normalized to the same string. That has to be a manual rename (and a redirect — see the Toscana post).
  • Punctuation that carries meaning. Saint-Émilion and Saint Émilion Grand Cru both normalize to leading saint-emilion, but they're genuinely different appellations, so the prefix match isn't run.
  • Cyrillic and Greek. Στεφανία doesn't NFD-decompose to ASCII. Right now we don't have districts there, so it hasn't bitten yet.

Why this lives in a script and not a UNIQUE constraint

The obvious question: why not put a unique index on (region_id, slug) and let the database refuse duplicates at insert time? Because the importer's slug logic ran into edge cases — Wikidata-derived names, OSM name:en vs name, and the occasional district that genuinely has two valid spellings in two languages. A unique index turned INSERT … ON CONFLICT DO NOTHING into silent data loss.

So the rule is: imports are allowed to be sloppy, and a dedicated script reconciles. That's two-pass, but each pass is simple, and the dedup is auditable (--dry-run prints exactly what would be merged before anything is written).

#data#normalization#supabase