Field-level redaction is not the dominant lever.
Every U.S. state publishes voter registration data, but states differ on which fields they release. Texas, taking the conservative position, withholds date of birth, race, party affiliation, and phone number. North Carolina publishes all four. The two states therefore sit at opposite ends of the U.S. disclosure spectrum, and this paper uses them as its two empirical anchors throughout.
Conventional wisdom holds that the more conservative regime substantially protects voter privacy. This paper tests that wisdom empirically. Using two voter files, Travis County, TX (n = 879,827) and Robeson County, NC (n = 63,435 active), the analysis formalizes a four-step methodology, proves three theoretical properties of a linkage ladder defined over quasi-identifier subsets, and demonstrates the attack empirically against the FEC contributions database.
The principal empirical conclusion is that even a voter file at the conservative end of the U.S. disclosure spectrum admits k=1 re-identification by trivial attacker knowledge sets at rates exceeding 75% (name alone) and 95% (name + ZIP), and that the file at the permissive end yields rates that are quantitatively similar. The substantive policy implication is that field-level redaction is an empirically secondary determinant of contemporary voter-file privacy outcomes, not the dominant lever the contemporary debate has treated it as. Access controls (rate limits, requester verification, audit logging, downstream-resale prohibitions) are the dominant lever.
All record-level analyses were performed in-memory; no individual record was written to disk in identifiable form, displayed, or exported. The voter files were lawfully obtained as public records under their respective state statutes (TX Election Code §13.004; NC Gen. Stat. §163-82.10). The research targets disclosure policy, not any individual voter.
Three numbers that change the policy frame.
Each statistic is reproducible from a single ~25-second Python script run against publicly-released voter files. Confidence intervals and per-ZIP variance are reported in the full paper.
Ten findings, two anchor regimes.
The full paper formalizes the linkage ladder, applies it to two empirical case files, runs four robustness analyses, and grounds seven threat-model case studies in vulnerable-class sizes computed directly from the data.
Both regimes admit > 85% re-identification by trivial inputs
Name + ZIP uniquely identifies 95.81% of Travis (TX) voters and 87.79% of Robeson (NC) voters. Name alone identifies 75.14% (TX) and 73.42% (NC). The disclosure-regime difference is observable but does not change the qualitative conclusion of high re-identifiability under low-knowledge adversaries.
Names dominate, not date of birth
In neither file is DOB the dominant identifier. Names plus any geographic narrowing produce uniqueness above 80% in both regimes. The classic Sweeney 87% triple is recovered against the NC file (at 93.27%) only after name initials are added on top of ZIP, gender, birth year, race, and party, meaning names are doing most of the identifying work, not the demographic triple.
Quasi-identifier substitution: registration date is a DOB proxy
Texas withholds DOB but publishes registration date at single-day resolution. ZIP + gender + exact registration date uniquely identifies 27.89% of Travis voters; generalizing the same field to year resolution drops the rate to 0.05%, a 558× reduction in identifying power with no loss of utility for any legitimate research or audit purpose.
Behavioral fingerprint: turnout patterns alone identify voters
Treating each voter’s 30-year participation pattern as a 105-bit binary string, 24.84% of all Travis voters have a unique turnout pattern. Among voters who participated in at least 20 elections, the rate is 98.42% (of 103,668 such voters). Long-term active voters are essentially fingerprinted by their participation history alone, without any name, address, or demographic data.
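As a minimal sketch of how such a fingerprint rate could be computed, assume each voter's history is a sequence of 0/1 participation flags, one per election. The helper names `encode` and `unique_pattern_share` are illustrative, the toy data is synthetic, and measuring uniqueness against the whole file (rather than only within the high-participation subset) is an assumption, not necessarily the paper's choice.

```python
from collections import Counter

def encode(voted_flags):
    """Pack a sequence of 0/1 participation flags into one integer bit string."""
    bits = 0
    for flag in voted_flags:
        bits = (bits << 1) | (1 if flag else 0)
    return bits

def unique_pattern_share(histories, min_votes=0):
    """Share of voters with at least `min_votes` elections voted whose
    pattern appears exactly once in the whole file (assumed definition)."""
    counts = Counter(histories)
    eligible = [h for h in histories if bin(h).count("1") >= min_votes]
    if not eligible:
        return 0.0
    return sum(1 for h in eligible if counts[h] == 1) / len(eligible)

# Synthetic four-election histories, not drawn from any voter file.
toy_histories = [
    encode([1, 0, 1, 1]),
    encode([1, 0, 1, 1]),  # collides with the first voter
    encode([0, 0, 0, 1]),
    encode([1, 1, 1, 1]),
]

share_all = unique_pattern_share(toy_histories)
share_frequent = unique_pattern_share(toy_histories, min_votes=4)
```

With 105 elections the same encoding simply yields a 105-bit integer, which Python integers hold without special handling.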
Inverse-attack symmetry: address + gender pins down 67% in both regimes
Address + gender uniquely identifies 67.51% (TX) and 67.71% (NC) of voters. The absence of DOB in TX does not mitigate the inverse attack, because the attack does not use DOB. A subset of 26.51% (TX) live alone at their address and are pinned deterministically.
Open-primary advantage: ~25 points fewer voters exposed
34.86% of Travis (TX) voters carry a clean partisan signal derivable from primary-ballot history; 59.78% of Robeson (NC) declared-party voters expose the same signal directly. Open primaries deliver a quantifiable workplace-screening privacy advantage of roughly 25 percentage points, a property of the underlying primary regime, not the file structure.
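One way to read "clean partisan signal" is sketched below, under a loudly stated assumption: a voter is treated as carrying a clean signal when they drew at least one primary ballot and every primary ballot they drew was for the same party. That definition, the function name `partisan_signal`, and the `"R"`/`"D"`/`None` encoding are illustrative and may differ from the paper's.

```python
def partisan_signal(primary_history):
    """Return the single party seen across a voter's primary ballots,
    or None if the voter never voted in a primary or mixed parties.

    primary_history: sequence of party codes, with None for primaries skipped.
    """
    parties = {p for p in primary_history if p is not None}
    return next(iter(parties)) if len(parties) == 1 else None

# Synthetic histories, not drawn from any voter file.
toy_voters = [
    ["R", None, "R"],   # consistent: clean signal
    ["D", "R", None],   # mixed: no clean signal
    [None, None, None], # never voted in a primary: no signal
]

signals = [partisan_signal(h) for h in toy_voters]
clean_share = sum(s is not None for s in signals) / len(toy_voters)
```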
FEC linkage: an empirical floor under the theoretical ceiling
An exact-match merge of 181 unique FEC contributors in ZIP 78704 against the Travis file yielded a 52.49% k=1 match rate and a 58.01% any-match rate, with no nickname normalization, suffix handling, or fuzzy matching. Standard linkage tooling routinely raises rates above 90%. The 52.49% is the empirical floor, not the ceiling.
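The two rates can be illustrated with a small aggregate-only sketch. It assumes both datasets have already been reduced to exact `(name, zip)` keys; the function name `match_rates` and the toy keys are hypothetical, and no normalization or fuzzy matching is attempted, mirroring the exact-match setting described above.

```python
from collections import Counter

def match_rates(contributor_keys, voter_keys):
    """Return (k1_rate, any_rate) for an exact-match merge.

    k1_rate:  share of unique contributors matching exactly one voter record
    any_rate: share of unique contributors matching one or more voter records
    """
    voter_counts = Counter(voter_keys)
    contributors = set(contributor_keys)  # de-duplicate repeat donors
    n = len(contributors)
    k1 = sum(1 for key in contributors if voter_counts[key] == 1)
    any_match = sum(1 for key in contributors if voter_counts[key] >= 1)
    return k1 / n, any_match / n

# Synthetic keys, not drawn from the FEC database or any voter file.
toy_voters = [
    ("ANA LEE", "00001"),
    ("BO KIM", "00001"),
    ("BO KIM", "00001"),  # two voters share this key
    ("CY DOE", "00001"),
]
toy_contributors = [
    ("ANA LEE", "00001"),
    ("ANA LEE", "00001"),  # repeat donation, counted once
    ("BO KIM", "00001"),
    ("DEE ROY", "00001"),  # no match in the voter file
]

k1_rate, any_rate = match_rates(toy_contributors, toy_voters)
```

Only the two aggregate rates leave the function, which keeps matched records out of any output.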
Robustness: point estimates survive every check
Wilson 95% CIs on headline statistics are tight (≤ 0.10 percentage points). Across 58 ZIPs, within-ZIP name uniqueness has a median of 97.12% and a worst case of 91.18%. Across four name-normalization variants, results differ by ≤ 0.276 points. Counter-intuitively, character-level error injection increases uniqueness (95.81% → 97.11% at 10% errors) because errors break up colliding name-and-ZIP groups.
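For readers who want to check the interval arithmetic, here is a minimal sketch of the standard Wilson score interval. The function name `wilson_interval` is illustrative, and the success count in the example is reconstructed by rounding the reported 95.81% against n = 879,827, so it approximates rather than reproduces the paper's exact input.

```python
import math

def wilson_interval(successes, n, z=1.959964):
    """Wilson score interval for a binomial proportion (z=1.959964 for 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Illustration only: count reconstructed from the rounded headline rate.
n_travis = 879_827
approx_unique = round(0.9581 * n_travis)
lo, hi = wilson_interval(approx_unique, n_travis)
```

Because the interval width shrinks roughly with the square root of n, the same proportion measured on a much smaller file would carry a visibly wider interval.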
Chained dossier: voter file to full profile in 30 to 90 minutes for under $30
Assembling a complete dossier on a named target through five chain stages (voter file, people-search, property records, court records, social media) takes 30 to 90 minutes on a total budget under $30. Legal-defense costs after a single privacy compromise routinely exceed $10,000. The asymmetry between attacker cost and defender cost is large.
Subgroup salience: vulnerable populations are large in absolute terms
Specific subgroup vulnerabilities on the Travis file: 320 deployed-military APO/FPO records (each uniquely identifiable), 79,649 (9.05%) recent registrants vulnerable to suppression mailings, 67,829 (7.71%) suspense-list voters at heightened identity-fraud risk, and 4,308 (0.49%) out-of-state mailers exposed to asset-hunting. These are classes of thousands of voters, not edge cases.
Four steps, three theoretical properties.
The framework is designed to apply to any voter file, or any public PII dataset where name and geographic identifiers are present. Three lemmas establish that the linkage ladder is monotone in attacker knowledge, intersective under disjoint quasi-identifiers, and bounded below by collision probability.
Linkage Map
Catalog every external dataset that shares quasi-identifiers with the voter file: FEC, property, court, broker, social, and breach corpora.
Linkage Ladder
For each plausible attacker knowledge set F, compute (u_F, ρ_F): the share unique under F and the share narrowed to groups of fewer than five.
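One rung of the ladder can be sketched as follows. This is an illustrative reading rather than the paper's script: the function name `ladder_rung`, the field names, and the choice to count unique records inside ρ_F (a group of size 1 is also smaller than five) are assumptions, and the records are synthetic.

```python
from collections import Counter

def ladder_rung(records, fields, k=5):
    """Return (u_F, rho_F) for the attacker knowledge set `fields`.

    u_F:   share of records whose value combination on `fields` is unique
    rho_F: share of records falling in groups of fewer than `k` records
    """
    keys = [tuple(r[f] for f in fields) for r in records]
    sizes = Counter(keys)
    n = len(keys)
    u = sum(1 for key in keys if sizes[key] == 1) / n
    rho = sum(1 for key in keys if sizes[key] < k) / n
    return u, rho

# Synthetic toy records, not drawn from any voter file.
toy = [
    {"name": "A SMITH", "zip": "00001", "gender": "F"},
    {"name": "A SMITH", "zip": "00002", "gender": "F"},
    {"name": "B JONES", "zip": "00001", "gender": "M"},
    {"name": "B JONES", "zip": "00001", "gender": "M"},
]

u_name, rho_name = ladder_rung(toy, ["name"])
u_name_zip, rho_name_zip = ladder_rung(toy, ["name", "zip"])
```

Adding `zip` to the knowledge set can only split existing groups, so the unique share on the second rung is never lower than on the first, which is the monotonicity the ladder relies on.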
Marginal Information Gain
Tabulate the attributes added at each chain stage, with rough time and dollar cost: voter file, people-search, property, court, social, breach, premium broker.
Harm Taxonomy
Classify each linkage by primary harm modality and instantiate with a threat-model case study grounded in a vulnerable-class size computed directly from the data.
All numeric claims in the paper are reproducible from a single ~25-second Python script that emits a results.json containing every reported value. The empirical FEC linkage uses an aggregate-only output script; no individual record from the merge is included in the paper.
Two narrow fixes, one structural shift.
Two specific recommendations follow directly from the data and are low-cost relative to their privacy gains. The broader structural shift is from field-level redaction toward access-control mechanisms.
Generalize Texas’s registration-date field to year resolution
This drops the relevant uniqueness rate from 27.89% to 0.05%, a 558-fold reduction, with no loss of utility for any legitimate research or audit purpose. The implementation is a one-line change at file-export time.
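A sketch of what such an export-time change might look like, assuming rows are dictionaries and the date is a `datetime.date`. The field names `reg_date` and `reg_year` and the function `export_row` are hypothetical, not taken from the Texas file layout.

```python
from datetime import date

def export_row(row):
    """Return a copy of `row` with the exact registration date replaced by its year."""
    out = dict(row)
    out["reg_year"] = out.pop("reg_date").year  # the one-line generalization
    return out

# Synthetic rows, not drawn from any voter file.
toy = [
    {"zip": "00001", "gender": "F", "reg_date": date(2020, 3, 1)},
    {"zip": "00001", "gender": "F", "reg_date": date(2020, 9, 15)},
]
exported = [export_row(r) for r in toy]

# Distinct quasi-identifier combinations before and after generalization.
distinct_before = len({(r["zip"], r["gender"], r["reg_date"]) for r in toy})
distinct_after = len({(r["zip"], r["gender"], r["reg_year"]) for r in exported})
```

In the toy data the two voters are distinguishable by exact date but fall into the same group once only the year is released.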
Auto-filter APO/FPO mailing codes for deployed military
320 Travis voters carry APO/FPO codes (AE/AP/AA), each uniquely identifiable; their families are exposed to targeting and coercion while the service member is overseas. Implementation cost: three string-equality checks at file-export time.
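The three checks might be sketched as below. The field name `mail_state` and both function names are hypothetical, and withholding the whole row is only one possible reading of "auto-filter"; blanking the mailing address instead would use the same predicate.

```python
def is_military_mailing(state_code):
    """True when the mailing state code is one of the three APO/FPO codes."""
    code = (state_code or "").strip().upper()
    return code == "AE" or code == "AP" or code == "AA"

def filter_export(rows):
    """Drop rows with an APO/FPO mailing code (assumed filtering policy)."""
    return [r for r in rows if not is_military_mailing(r.get("mail_state"))]

# Synthetic rows, not drawn from any voter file.
toy_rows = [
    {"id": 1, "mail_state": "TX"},
    {"id": 2, "mail_state": "AE"},
    {"id": 3, "mail_state": "ap "},  # tolerate case and stray whitespace
    {"id": 4, "mail_state": None},   # no mailing address on file
]

kept_ids = [r["id"] for r in filter_export(toy_rows)]
```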
Shift from field-level redaction to access controls
Rate limits on bulk file requests, requester-identity verification with audited stated-purpose declarations, prohibitions on commercial resale of voter-file data, audit logs of who has accessed which records, and tiered disclosure with full record access only via a credentialed request channel.
Get in touch.
Interested in voter privacy, re-identification risk, or the broader policy implications of public-records disclosure? I welcome citation requests, academic collaboration, conference invitations, and policy-oriented inquiries.
Contact Noah