Grants talk:Project/DBpedia/GlobalFactSyncRE/Timeline/Tasks

Documentation of the current prefusion-dump/MongoDB setup

Documentation of the current prefusion-dump/MongoDB setup is available at https://git.informatik.uni-leipzig.de/gfs/main/blob/master/global.dbpedia.org.md (by Marvin). Tina Schmeissner (talk) 13:14, 11 June 2019 (UTC)

Sebastian Hellmann commented here: https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(proposals)#A_new_use_for_Wikidata_external_IDs_in_Wikipedia_(template) Tina Schmeissner (talk) 13:21, 11 June 2019 (UTC)

Challenge

We want to announce a challenge to hopefully find a good intern to work on the project. A first draft can be found here: challenge. Any ideas for improvement are welcome. Tina Schmeissner (talk) 12:10, 14 June 2019 (UTC)

Initial version of references extraction from infoboxes

email from Krzysztof:

We have an initial version of references extraction from infoboxes. The project URL is https://git.informatik.uni-leipzig.de/kwecel/infoboxes-refs

So far the script extracts raw references, i.e. without further parsing: it simply outputs whatever is available between <ref> and </ref>. Please note that some references are named; for those we keep just the name, with the goal of processing them further during the extraction phase. This is also more convenient for a potential join with another table, in which each reference could be extracted once and used in many places.

The following columns can be found in the output.

1- Wikipedia_article: name/title of the Wikipedia article

2- Infobox_name: name of the infobox; the list of infoboxes is kept in a separate directory and was prepared by analysing which templates really are infoboxes

3- Parameter_name: raw property in DBpedia terms; identifies a row in an infobox

4- Reference_name: name of the reference, if provided; if not, the value "<noname_ref>" is used instead; names are unique only within a given article; sometimes a reference name is defined outside of the infobox

5- Reference_direct_code: raw code, as explained above; this is the main input for further development
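
To make the column layout concrete, here is a minimal sketch of how such rows could be produced from already-split infobox parameters. It is not the code from the infoboxes-refs repository: the function name `reference_rows`, the regular expressions and the sample article and infobox names are assumptions for illustration, and it assumes that a self-closing named reference (`<ref name="..." />`) yields an empty Reference_direct_code.

```python
import re

NONAME = "<noname_ref>"  # placeholder value described in the column list

# Matches both <ref ...>body</ref> and the self-closing form <ref ... />.
REF_RE = re.compile(
    r"<ref\b(?P<attrs>[^>]*?)(?:/>|>(?P<body>.*?)</ref>)",
    re.DOTALL | re.IGNORECASE,
)
# Reference name, quoted with " or ', or unquoted.
NAME_RE = re.compile(
    r"""name\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s/>]+))""",
    re.IGNORECASE,
)


def reference_rows(article, infobox, params):
    """Return one row per <ref> found in the infobox parameter values.

    `params` maps raw parameter names to their raw wikitext values
    (hypothetical input; splitting the infobox itself is out of scope).
    """
    rows = []
    for param_name, value in params.items():
        for match in REF_RE.finditer(value):
            found = NAME_RE.search(match.group("attrs"))
            name = next((g for g in found.groups() if g), "") if found else ""
            rows.append({
                "Wikipedia_article": article,
                "Infobox_name": infobox,
                "Parameter_name": param_name,
                "Reference_name": name or NONAME,
                # Raw, unparsed content; empty for self-closing references.
                "Reference_direct_code": (match.group("body") or "").strip(),
            })
    return rows
```

Because reference names are unique only within an article, a later join would need the pair (Wikipedia_article, Reference_name) as its key rather than the name alone.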

Włodek will upload the code. There are also some examples in the output folder, ca. 10,000 rows for selected languages. We can upload these samples directly to GitLab, just as an overview. For the full dumps we need to discuss the destination: where should the data be uploaded?

Tina Schmeissner (talk) 09:11, 17 June 2019 (UTC)

Factual Consensus Finder - UI

I understand what the FCF does, but there are still a bunch of questions:

1. How or where do I enter the subject / entity that the infobox belongs to on the page? Do I always need the DBpedia identifier?

2. How will the user be able to reach this page from a Wikipedia page? I assume the ideal scenario would be an eventual link to the FCF page somewhere in the infoboxes.

3. Using DBpedia as an example:

predicate	# of values and sources	questions	Feedback from Marvin
description	1	Result is “semantic web” for German wiki, but this is not shown anywhere in the infobox of the German wiki.	“semantic web” is listed in the IB with the predicate "Beschreibung", but not shown in the actual IB
latest release version	5	First value is empty, with 4 wikis as sources.	There are empty but valid triples being extracted
developer	5	Why are the universities listed in all these languages (why not just in the language of the respective wiki?), and why are they linked to their respective FCF pages?	not yet discussed

Tina Schmeissner (talk) 12:07, 18 June 2019 (UTC)

Two docs about fixing mappings

You can also see https://docs.google.com/document/d/1yZLNKZ802pC-U0PYMqnyem9KZn5qADccXR2Te2wlr6Q/edit and https://svn.aksw.org/papers/2018/SAC_DBpedia_mappings_alignment/public.pdf Sent from Dimitris, 13:00, 8 July 2019 (UTC)

MusicBrainz - SameAs Problem

Found this paper: Automatic Interlinking of Music Datasets on the Semantic Web ftp.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-369/paper18.pdf SebastianHellmann (talk) 08:16, 9 July 2019 (UTC)

DBpedia extractor + Infobox references extractor

Example of extracting references from the article about Facebook in the English Wikipedia:

http://dbpedia.informatik.uni-leipzig.de:8111/infobox/references?article=https://en.wikipedia.org/wiki/Facebook&format=json
Works for: de, en, es, fr, it, nl, pl, pt, ru, sv
TODO: a clear list of all related infoboxes in each language (with redirects)
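
For scripted use, the request URL can be assembled instead of typed by hand. This is only a sketch: `references_url` is a hypothetical helper, and it assumes the service accepts exactly the query layout of the example URL above (`article`, `format`) plus the valueless `dbpedia` flag mentioned under Updates. The characters ':' and '/' are left unescaped to mirror the URLs posted here; whether the service also accepts percent-encoded values is not known.

```python
from urllib.parse import urlencode

BASE = "http://dbpedia.informatik.uni-leipzig.de:8111/infobox/references"


def references_url(article_url, fmt="json", use_dbpedia=False):
    """Build a request URL for the infobox references service."""
    # Keep ':' and '/' readable, as in the example URLs on this page.
    query = urlencode({"article": article_url, "format": fmt}, safe=":/")
    if use_dbpedia:
        # The flag has no value, so it is appended by hand.
        query += "&dbpedia"
    return f"{BASE}?{query}"


if __name__ == "__main__":
    print(references_url("https://en.wikipedia.org/wiki/Facebook"))
```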

DBpedia extraction framework applied to the same article:

http://dbpedia.informatik.uni-leipzig.de:9998/server/extraction/en/extract?title=Facebook&format=json&extractors=mappings
Problems: there are a lot of parameters which are not extracted. Examples:

the parameter "rww źródło" is not extracted from the article about Aceton in PL Wiki: http://dbpedia.informatik.uni-leipzig.de:9998/server/extraction/pl/extract?title=Aceton&revid=&format=json&extractors=custom

the specific structure of the Taxobox in FR Wiki: http://dbpedia.informatik.uni-leipzig.de:9998/server/extraction/fr/extract?title=Lasaeidae&revid=&format=json&extractors=custom

Updates

It is now possible to see each parameter of the citation templates (if any exist): http://dbpedia.informatik.uni-leipzig.de:8111/infobox/references?article=https://en.wikipedia.org/wiki/Albert_Einstein&format=json
The parser can also use data from the DBpedia extraction framework with the custom option (by adding '&dbpedia' to the URL): http://dbpedia.informatik.uni-leipzig.de:8111/infobox/references?article=https://en.wikipedia.org/wiki/Albert_Einstein&format=json&dbpedia
- Problems:

parameter 'spouse' has two values (names) and each value has additional data (dates)
parameter 'award' is not parsed correctly (it contains a list in the template 'Plainlist') -> the values and the reference 'frs' are not found.

--Lewoniewski (talk) 09:06, 12 August 2019 (UTC)

Upgraded version of Python Infobox Reference Extractor (PIRE):

updated list of infoboxes
works with "r" templates (added new parameter "Reference_mode" for such templates): http://dbpedia.informatik.uni-leipzig.de:8111/infobox/references?article=https://pl.wikipedia.org/wiki/Andrespol&format=json&dbpedia
and others. --Lewoniewski (talk) 20:08, 8 September 2019 (UTC)

Extraction statistics in September 2019:

http://stats.infoboxes.net
Presentation at SEMANTiCS 2019, 14th DBpedia Community Meeting in Karlsruhe
all extracted reference data in JSON: http://dbpedia.informatik.uni-leipzig.de/repo/lewoniewski/gfs/infobox-refs/2019.09.01/

citation_id

In both versions of the parser, a special 'citation_id' parameter is generated for each citation template, based on the value of one of the following citation template parameters:

doi -> http://doi.org/...
jstor -> https://jstor.org/stable/...
pmc -> https://ncbi.nlm.nih.gov/pmc/articles/PMC...
pmid -> https://ncbi.nlm.nih.gov/pubmed/...
arxiv -> http://arxiv.org/abs/...
isbn -> http://books.google.com/books?vid=ISBN...
issn -> https://worldcat.org/ISSN/...
oclc -> https://worldcat.org/oclc/...
url -> http....
website -> http....

The order is important: the parser generates the ID from whichever of these parameters is found first. If none of them is present, the parser generates an ID with a hash, 'http://citation.dbpedia.org/hash2/...', based on the 'title' parameter or (if it is empty) on the citation template content. --Lewoniewski (talk) 09:05, 12 August 2019 (UTC)
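
The priority rule can be sketched as follows. This is an illustration, not the parser's actual code: it assumes that the '...' in each pattern stands for the parameter value appended to the prefix, that 'url' and 'website' values are used as they are, and it uses SHA-256 for the fallback because the hash function is not stated here. The names `citation_id`, `ID_PREFIXES` and `HASH_PREFIX` are hypothetical.

```python
import hashlib

# Checked in this order; the first non-empty parameter wins.
ID_PREFIXES = [
    ("doi", "http://doi.org/"),
    ("jstor", "https://jstor.org/stable/"),
    ("pmc", "https://ncbi.nlm.nih.gov/pmc/articles/PMC"),
    ("pmid", "https://ncbi.nlm.nih.gov/pubmed/"),
    ("arxiv", "http://arxiv.org/abs/"),
    ("isbn", "http://books.google.com/books?vid=ISBN"),
    ("issn", "https://worldcat.org/ISSN/"),
    ("oclc", "https://worldcat.org/oclc/"),
    ("url", ""),      # assumption: the value is already a full URL
    ("website", ""),  # assumption: used as given
]
HASH_PREFIX = "http://citation.dbpedia.org/hash2/"


def citation_id(params, raw_template=""):
    """Return an ID for one citation template.

    `params` holds the template parameters; `raw_template` is the
    template content used as a last resort for hashing.
    """
    for key, prefix in ID_PREFIXES:
        value = (params.get(key) or "").strip()
        if value:
            return prefix + value
    # Fallback: hash the title, or the template content if the title is empty.
    basis = (params.get("title") or "").strip() or raw_template
    # SHA-256 is an assumption; the real hash function may differ.
    return HASH_PREFIX + hashlib.sha256(basis.encode("utf-8")).hexdigest()
```

With this ordering, a citation that has both 'isbn' and 'url' gets the ISBN-based ID, so the same book cited through different URLs still maps to one ID.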

References names/metadata

<ref name="" />
https://en.wikipedia.org/wiki/Template:R
Specific templates for selected sources (metadata not directly available):

Are there other options? --Lewoniewski (talk) 10:32, 4 September 2019 (UTC)

Error handling in wikicode

The closing pair of brackets of a template is missing in the infobox about Warsaw in the Polish Wikipedia (this revision):

 |rok                       = 
 |liczba ludności           = 1 777 972 (31.12.2018)</small><ref name="GUS 2018">{{Cytuj stronę |url = http://demografia.stat.gov.pl/bazademografia/Tables.aspx</ref>
 |gęstość zaludnienia       = 3412 <small>(1.01.2018)</small><ref name="GUS 2018" />

There is no "=" between the name and the value of a parameter. Example: the wiceprezydent parameter in this revision:

|pierwsza dama = [[Margarita Penón]]
 |wiceprezydent<br />1. [[Jorge Manuel Dengo Obregón]] (1986-1990)<br />2. [[Victoria Garrón Orozco]](1986-1990)<br />1. [[Laura Chinchilla]] (2006-2010)<br />2. [[Kevin Casas Zamora]] (2006-2010)
 | quote =

Parameter separator in the wrong place (this revision):

'''R5 (silnik)|R5'''

- Pay attention to the following (marked in the code with the comment PPnPP):
  - length of the parameter name in the infobox.
  - length of the parameter value and the number of references.

These must be taken into account: see, for example, the large value of the parameter trasa in the infobox of Droga krajowa nr 11 (Polska).
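
A minimal sketch of checks that would flag the problems listed in this section, applied to one raw parameter line at a time. It is not the PPnPP code: `check_parameter` and both length limits are hypothetical, and the misplaced-separator case (a '|' that belongs to a link) is not covered.

```python
MAX_NAME_LEN = 100        # hypothetical limit for a parameter name
MAX_VALUE_LEN = 100_000   # hypothetical limit for a parameter value


def check_parameter(line, max_name_len=MAX_NAME_LEN, max_value_len=MAX_VALUE_LEN):
    """Return a list of problems found in one '|name = value' line."""
    problems = []
    text = line.strip().lstrip("|").strip()
    name, sep, value = text.partition("=")
    # An "=" that only appears after markup (e.g. inside <ref name="...">)
    # is not the name/value separator.
    if not sep or any(ch in name for ch in "<[{"):
        problems.append("missing '='")
        name, value = "", text
    if len(name.strip()) > max_name_len:
        problems.append("parameter name too long")
    if len(value) > max_value_len:
        problems.append("parameter value too long")
    # An unclosed template, such as {{Cytuj stronę ... without }}.
    if value.count("{{") != value.count("}}"):
        problems.append("unbalanced template brackets")
    return problems
```

Returning a list of problems rather than raising lets the extractor decide per case whether to skip the parameter, truncate it, or only log it.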

URLs extraction from references

Wikipedia infoboxes

Here are statistics on the extraction of reference URLs from infoboxes in different language versions of Wikipedia (based on dumps from September 2019):

http://dbpedia.informatik.uni-leipzig.de/repo/lewoniewski/gfs/infobox-refs/2019.09.01/stats/

Files with "_domains" in their names show how frequently each domain is used in the references; "all_domains.txt" sums the results over all considered language versions of Wikipedia.
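
The kind of counting behind the "_domains" files can be sketched like this. It is an illustration only: `domain_counts` is a hypothetical name, the sample URLs are made up, and stripping a leading "www." is an assumption that may not match how the published statistics were produced.

```python
from collections import Counter
from urllib.parse import urlsplit


def domain_counts(urls):
    """Count how often each domain occurs in a list of reference URLs."""
    counts = Counter()
    for url in urls:
        try:
            host = urlsplit(url).hostname
        except ValueError:
            continue  # malformed URL, skip it
        if not host:
            continue  # no domain part at all
        if host.startswith("www."):
            host = host[4:]  # assumption: treat www.example.org as example.org
        counts[host] += 1
    return counts


# Made-up sample input, one URL list per language version.
per_language = {
    "en": ["https://www.example.org/a", "http://example.org/b", "https://example.com/"],
    "pl": ["http://example.org/c", "not a url"],
}
by_language = {lang: domain_counts(urls) for lang, urls in per_language.items()}
# Summation over all languages, as in "all_domains.txt".
all_domains = sum(by_language.values(), Counter())
```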

Wikidata

Similar statistics for Wikidata (based on dumps from October 2019):

http://dbpedia.informatik.uni-leipzig.de/repo/lewoniewski/gfs/infobox-refs/2019.10.01/wikidata/stats/

In files with "_unique" in their names, each URL is counted only once per Wikidata item.
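
The difference between the plain and the "_unique" counts can be shown with a small sketch. The function name `count_urls`, the item IDs and the URLs are made up for illustration, and the input is assumed to be a mapping from a Wikidata item ID to all reference URLs found in that item.

```python
from collections import Counter


def count_urls(item_refs, unique_per_item=False):
    """Count reference URLs over all items.

    With unique_per_item=True a URL repeated inside one item counts once.
    """
    counts = Counter()
    for urls in item_refs.values():
        counts.update(set(urls) if unique_per_item else urls)
    return counts


# Made-up sample: the same URL is used twice in the first item.
item_refs = {
    "Q1": ["http://example.org/a", "http://example.org/a", "http://example.org/b"],
    "Q2": ["http://example.org/a"],
}
```

Here `count_urls(item_refs)` counts "http://example.org/a" three times, while `count_urls(item_refs, unique_per_item=True)` counts it twice, once per item.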