From Freebase to Wikidata: The Great Migration
WWW, pp. 1419-1428, 2016.
Keywords: data strategy, Compound Value Types, collaborative knowledge base, IW3C2, web search
We have started to import labels that are in Freebase but not in Wikidata.
Collaborative knowledge bases that make their data freely available in a machine-readable form are central for the data strategy of many projects and organizations. The two major collaborative knowledge bases are Wikimedia's Wikidata and Google's Freebase. Due to the success of Wikidata, Google decided in 2014 to offer the content of Free...
- Large Web-based knowledge bases that make their data available under free licenses in a machine-readable form have become central for the data strategy of many projects and organizations.
- In order to provide the Wikidata community with references for the facts in Freebase, we have reused data from the Google Knowledge Vault.
- We decided to rely on crowdsourced human curation and created the Primary Sources Tool, a widget that displays Freebase statements which the contributor can curate and add to the currently shown Wikidata item.
- The challenge concerns the data mappings between Freebase topics and properties, and Wikidata items and properties.
- It maps 1.15 million Freebase topics to their corresponding Wikidata items.
- A second actively maintained mapping has been created by Samsung. It is based on the same idea, but matches a Freebase topic with a Wikidata item even if there is only a single shared Wikipedia link.
- In order to prepare the data for integration into the Primary Sources Tool, we have created a set of scripts which map the content of the last Freebase dump to Wikidata statements. The Primary Sources Tool was developed in parallel with the mapping scripts.
- We first created a small dataset of around 1 million statements based on a few select properties (e.g., /people/person/birth_place) and deployed both this first dataset and a basic version of the tool in order to gather initial user feedback.
- Early feedback allowed us to adapt the mapping scripts (Google/freebase-wikidata-converter) early and correct minor issues with the data in the back-end.
- If we attempted to encode Wikidata statements as if they were Freebase facts, i.e., by removing sources, representing statements with qualifiers using CVTs, and adding reverse properties, this would yield 110 million facts, i.e., an increase of 167% over the raw number of statements.
- Based on our property mappings, 0.52 million of these facts (i.e., 92%) are converted to Wikidata statements.
- In order to suggest interesting topics to add to Wikidata, we could rank the topics that are not yet mapped by the number of incoming links from already mapped topics and filter out less interesting types like ISBNs. Another area for improvement is to upload high-quality datasets using a bot, like the reviewed facts or some sets for external IDs, in order to speed up the integration of Freebase content into Wikidata.
- Concluding, in a fairly short amount of time, we have been able to provide the Wikidata community with more than 14 million new Wikidata statements using a customizable and generalizable approach, consisting of data preparation scripts and the Primary Sources Tool, which is well integrated into the Wikidata user interface.
- Of the 17.2 million Freebase labels for mapped topics, only 0.9 million, i.e., 5%, are missing from Wikidata.
- Leaving aside the unmapped topics, we have created a statement for more than 24% of the facts.
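
Two of the ideas in the highlights above can be illustrated with a small sketch: matching a Freebase topic to a Wikidata item when they share at least one Wikipedia link, and ranking unmapped topics by the number of incoming links from already mapped topics while filtering out less interesting types. This is not the authors' code. The function names, the data layout (plain dicts and sets), the rule of keeping only unambiguous matches, and the sample identifiers such as `/m/a` and `Q1` are all assumptions made for illustration.

```python
from collections import Counter


def match_by_shared_link(freebase_links, wikidata_links):
    """Map a Freebase topic to a Wikidata item if they share >= 1 Wikipedia link.

    Both arguments are dicts from an identifier to a set of Wikipedia links.
    Assumption: a topic is mapped only when exactly one item is a candidate.
    """
    # Invert the Wikidata side: Wikipedia link -> item id.
    link_to_item = {}
    for item, links in wikidata_links.items():
        for link in links:
            link_to_item[link] = item

    mapping = {}
    for topic, links in freebase_links.items():
        candidates = {link_to_item[l] for l in links if l in link_to_item}
        if len(candidates) == 1:  # skip ambiguous or missing matches
            mapping[topic] = next(iter(candidates))
    return mapping


def rank_unmapped_topics(edges, mapping, topic_types, excluded_types):
    """Rank unmapped topics by incoming links from already mapped topics.

    edges: iterable of (source_topic, target_topic) pairs.
    topic_types: dict from topic id to a set of type names.
    excluded_types: types considered uninteresting (e.g. ISBNs).
    """
    counts = Counter()
    for source, target in edges:
        if source in mapping and target not in mapping:
            counts[target] += 1
    # most_common() yields topics sorted by descending link count.
    return [
        (topic, n)
        for topic, n in counts.most_common()
        if not (topic_types.get(topic, set()) & excluded_types)
    ]


# Hypothetical sample data, for illustration only.
freebase_links = {
    "/m/a": {"en:Alpha"},
    "/m/b": {"en:Beta", "de:Beta"},
    "/m/c": {"en:Gamma"},
    "/m/d": {"en:Delta"},
}
wikidata_links = {"Q1": {"en:Alpha", "fr:Alpha"}, "Q2": {"de:Beta"}}
edges = [
    ("/m/a", "/m/c"), ("/m/b", "/m/c"), ("/m/a", "/m/d"),
    ("/m/a", "/m/isbn"), ("/m/b", "/m/isbn"), ("/m/a", "/m/isbn"),
    ("/m/c", "/m/d"),  # ignored: the source topic is not mapped
]
topic_types = {"/m/isbn": {"/book/isbn"}}

mapping = match_by_shared_link(freebase_links, wikidata_links)
ranked = rank_unmapped_topics(edges, mapping, topic_types, {"/book/isbn"})
print(mapping)
print(ranked)
```

In this sketch a single shared link is enough to produce a match, which mirrors the more permissive mapping described above. A stricter variant could require several shared links before accepting a candidate.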