This is a sample of the North Carolina Voter Registration dataset, available here (using the snapshot: VR_Snapshot_20181106). Since the original dataset was too large for our experiments, we applied a sampling technique to reduce its size without disturbing the ratio of duplicates in the dataset. The sampling technique is described below:
- Sampling: The sampling is done according to this template. We first identified all cluster sizes in the full dataset. To downsample without distorting the cluster-size ratios, we enter the target number of records under "# goal records". The "# final records" field then reports the final record counts that will be obtained. The actual code can be found in the mdedup_utils project.
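
The idea behind the sampling can be sketched in a few lines of Python. This is only an illustration and not the mdedup_utils code: the function name `downsample_clusters`, its parameters, and the rounding rule are assumptions. It groups duplicate clusters by size and keeps the same fraction of clusters from every size group, so the cluster-size ratios stay approximately unchanged.

```python
import random
from collections import defaultdict


def downsample_clusters(clusters, goal_records, seed=0):
    """Keep roughly goal_records records while preserving cluster-size ratios.

    clusters: list of clusters, each a list of record ids (hypothetical layout).
    goal_records: target number of records, as in "# goal records".
    """
    total_records = sum(len(cluster) for cluster in clusters)
    fraction = goal_records / total_records

    # Group clusters by their size (1 = non-duplicate, 2 = pair, ...).
    by_size = defaultdict(list)
    for cluster in clusters:
        by_size[len(cluster)].append(cluster)

    rng = random.Random(seed)
    sampled = []
    for size in sorted(by_size):
        group = by_size[size]
        # Assumption: simple rounding; the real template may round differently.
        keep = min(round(len(group) * fraction), len(group))
        sampled.extend(rng.sample(group, keep))
    return sampled


def count_records(clusters):
    """Number of records in a set of clusters, as in "# final records"."""
    return sum(len(cluster) for cluster in clusters)
```

Because the number of kept clusters is rounded per size group, the value returned by `count_records` can differ slightly from `goal_records`, which is why a separate final count is reported.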
Based on this sampling, we selected the following dataset, along with its duplicate pairs:
- A simple data preparation step (lower-casing and removal of special characters) has been applied. The dataset is available in tab-separated values (TSV) format (14,183 objects).
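
A data preparation step of this kind might look like the sketch below. The exact set of "special characters" is not specified here, so the sketch assumes that everything other than lower-case letters, digits, and whitespace is removed; the whitespace collapsing and the names `prepare_value` and `prepare_tsv_line` are likewise assumptions made for illustration.

```python
import re

# Assumption: a "special character" is anything that is not a-z, 0-9, or whitespace.
_SPECIAL = re.compile(r"[^a-z0-9\s]")


def prepare_value(value):
    """Lower-case a field value and strip special characters."""
    lowered = value.lower()
    cleaned = _SPECIAL.sub("", lowered)
    # Assumption: collapse the whitespace runs left behind by removed characters.
    return " ".join(cleaned.split())


def prepare_tsv_line(line):
    """Apply prepare_value to every tab-separated field of one line."""
    fields = line.rstrip("\n").split("\t")
    return "\t".join(prepare_value(field) for field in fields)
```

For example, `prepare_value("O'Neil-Smith, JR.")` yields `oneilsmith jr` under these assumptions.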