Home / Blog / Côtes-du-Rhône, Côtes du Rhône, and the deduper that calls them equal
May 11, 2026· Wine World Map
Côtes-du-Rhône, Côtes du Rhône, and the deduper that calls them equal
After enough OSM imports, every wine region accumulates two or three copies of the same district. Côtes-du-Rhône and Côtes du Rhône look identical to a human reading a map. To Postgres they're two rows. Wineries get linked to whichever one happened to be created first, and when you click a district marker only half the producers show up.
scripts/dedup-districts.js is the script that cleans up after the
importer. It runs in maybe ten seconds and routinely removes 40-80
duplicates from a database that has ~3,500 districts.
The whole script is one normalizer
The interesting line is this:
const norm = str => str.toLowerCase()
.normalize('NFD').replace(/[̀-ͯ]/g, '')
.replace(/[^a-z0-9]+/g, '-')
.replace(/^-+|-+$/g, '')
Four passes. NFD decomposes accented characters into base + combining
mark (so é becomes e + ◌́). The diacritic strip throws the
combining marks away. Then any run of non-alphanumeric becomes a single
hyphen. Then trim.
So all of these collapse to cotes-du-rhone:
| Input | Normalized | |---|---| | Côtes-du-Rhône | cotes-du-rhone | | Côtes du Rhône | cotes-du-rhone | | CÔTES DU RHÔNE | cotes-du-rhone | | Cotes du Rhone | cotes-du-rhone | | côtes_du_rhône | cotes-du-rhone |
Group by (region_id, normalized_name), anything with more than one
row is a duplicate cluster.
The "keep the older one" rule
Within each cluster: sort by id, keep the lowest, drop the rest.
Before dropping, move every winery whose district_id points at the
drop target to point at the keeper instead.
g.sort((a, b) => a.id - b.id)
const keep = g[0]
const drop = g.slice(1)
for (const r of drop) {
await s.from('wineries').update({ district_id: keep.id }).eq('district_id', r.id)
await s.from('districts').delete().eq('id', r.id)
}
The "keep lowest id" rule is arbitrary but defensible: the older row is the one humans have had longer to enrich. If someone wrote a description for the original Médoc back in week 22, we don't want to nuke it just because a re-import created a sibling.
What this doesn't catch
A few things slip past:
- Translations.
TuscanyandToscanaaren't normalized to the same string. That has to be a manual rename (and a redirect — see the Toscana post). - Punctuation that carries meaning.
Saint-ÉmilionandSaint Émilion Grand Cruboth normalize to leadingsaint-emilion, but they're genuinely different appellations, so the prefix match isn't run. - Cyrillic and Greek.
Στεφανίαdoesn't NFD-decompose to ASCII. Right now we don't have districts there, so it hasn't bitten yet.
Why this lives in a script and not a UNIQUE constraint
The obvious question: why not put a unique index on (region_id, slug) and let the database refuse duplicates at insert time? Because
the importer's slug logic ran into edge cases — Wikidata-derived
names, OSM name:en vs name, and the occasional district that
genuinely has two valid spellings in two languages. A unique index
turned INSERT … ON CONFLICT DO NOTHING into silent data loss.
So the rule is: imports are allowed to be sloppy, and a dedicated
script reconciles. That's two-pass, but each pass is simple, and the
dedup is auditable (--dry-run prints exactly what would be merged
before anything is written).
#data#normalization#supabase