The dataset is provided as a ZIP archive containing one JSON file per name with:
- the person name ($.name)
- the HDP URL ($.hdp.url)
- the Wikipedia articles linked on the HDP ($.hdp.articles) including their URL ($.hdp.articles[*].url), title ($.hdp.articles[*].title), and text content ($.hdp.articles[*].text)
- the non-empty pages of the top-100 Google Web search results ($.pages) for the given name including their URL ($.pages[*].url) and text content ($.pages[*].text)
- the article URL actually assigned to each Web search result page ($.pages[*].alignment), if available
To prevent trivial alignments, the Web search results do not contain any page from the "wikipedia.org" domain.
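
The layout described above could be read along the following lines. This is only a minimal sketch, not an official loader: the helper names `iter_records` and `aligned_pages` are hypothetical, and it assumes that every per-name file ends in ".json" and that a page without an alignment either lacks the `alignment` key or sets it to null.

```python
import json
import zipfile


def iter_records(archive):
    """Yield one parsed JSON record per name from the ZIP archive."""
    with zipfile.ZipFile(archive) as zf:
        for member in zf.namelist():
            # Assumption: every per-name file carries a ".json" extension.
            if not member.endswith(".json"):
                continue
            with zf.open(member) as fh:
                yield json.load(fh)


def aligned_pages(record):
    """Return (page URL, assigned article URL) pairs for aligned pages."""
    pairs = []
    for page in record.get("pages", []):
        # Assumption: a missing alignment is an absent key or a null value.
        alignment = page.get("alignment")
        if alignment:
            pairs.append((page["url"], alignment))
    return pairs
```

Calling `iter_records` with the path to the downloaded archive yields one dictionary per name; passing each dictionary to `aligned_pages` then gives the search result pages for which an assigned article URL is available.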