Grants:Project/Information extraction and replacement/Intuition on templates

The intuition behind template filling goes like this. Assume that we analyze articles about lakes in Oppland for elevation; we would then get something like the following. This is just for the example: real code would use the special page "What links here" on d:elevation above sea level and then limit the set to a specific type of instance, but using the category is simpler here. Note that some elevations are not linked, because they do not exist in Wikidata.

Lake | Initial constituents (prefix) | Value | Unit | Trailing constituents (postfix)
Aursjoen | It lies at an elevation of | 1,098 | m | above sea level.
Aursjøen | The 36.38-square-kilometre (14.05 sq mi) lake sits at an elevation of | 856 | metres | (2,808 ft) above sea level and is about 70.67 kilometres (43.91 mi) around.
Bygdin | Bygdin is regulated and its normal level lies between | 1,048 and 1,057 | meters | above sea level.
Dokkfløyvatn | It lies at an elevation of | 735 | m | above sea level
Einavatnet | It lies at an elevation of | 398 | metres | (1,306 ft) above sea level.
Helin | It is located at | 870 | m | above the sea, and has a volume of 18.6 million m³.
Langvatnet | It has an area of 0.3505 square kilometers (0.1353 sq mi) and is located at | 1,422 | meters | (4,665 ft) above mean sea level.
Losna | It lies | 181 | m | above sea level.
Mjøsa | It is 365 km² in area and its volume is estimated at 56 km³; normally its surface is | 123 | metres | above sea level, and its greatest depth is 468 metres.
Nedre Heimdalsvatn | It lies at an elevation of | 1,053 | m | above sea level.
Prestesteinsvatnet | The 4.12-square-kilometre (1,020-acre) lake sits at an elevation of | 1,357 | metres | (4,452 ft) above sea level.
Randsfjorden | The lake is | 135 | metres | (443 ft) above sea level.
Rauddalsvatn | It lies at an elevation of | 916 | m | above sea level.
Sandvatnet/Kaldfjorden/Øyvatnet | It is at an elevation of | 1,019 | m | above sea level.
Slidrefjord | It is at an elevation of | 366 | m | above sea level.
Steinbusjøen | It lies at an elevation of | 1,211 | m | above sea level.
Strondafjorden | It lies at an elevation of | 355 | m | above sea level.
Tisleifjorden | It has an elevation of | 819 | m | above sea level.
Tyin | The lake serves as a reservoir for Tyin kraftverk and the water level is regulated between | 1082.84 and 1072.50 | m | above sea level.
Vågåvatn | The lake is | 362 | meters | above sea level and has a surface area of 14.76 km², making it one of the 200 largest lakes in Norway.
Vangsmjøse | It is at an elevation of | 466 | m | above sea level.
Vinstre | It is at an elevation of | 1,032 | m | above sea level.

By inspection we find some repeated prefix constituents, «The lake is», «is located at», and «at an elevation of», and a repeated postfix constituent, «above sea level». The value can be a single value or a combined value, that is, a list. The units also vary: «m», «meters», and «metres».
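The split into prefix, value, unit, and postfix can be sketched with a regular expression. This is only an illustration: the pattern, the names `NUMBER`, `PATTERN`, and `split_constituents`, and the rule that a list value is written as "A and B" are assumptions made for this example, not part of the proposal.

```python
import re

# Assumed number format: digits with optional thousands separators and decimals.
NUMBER = r"\d[\d,]*(?:\.\d+)?"

# A value is a single number or a list "A and B", followed by a metre unit.
PATTERN = re.compile(
    rf"\b(?P<value>{NUMBER}(?:\s+and\s+{NUMBER})?)\s*(?P<unit>metres|meters|m)\b"
)


def split_constituents(sentence):
    """Split a sentence around the first value that carries a metre unit.

    Returns None when no such value is found.
    """
    match = PATTERN.search(sentence)
    if match is None:
        return None
    return {
        "prefix": sentence[:match.start()].strip(),
        "value": match.group("value"),
        "unit": match.group("unit"),
        "postfix": sentence[match.end():].strip(),
    }


print(split_constituents("It lies at an elevation of 1,098 m above sea level."))
```

Areas and volumes such as «365 km²» are skipped because the unit alternatives only accept metres; a sentence that gives a depth in metres before the elevation would be split at the wrong value, so real code would also need to look at the surrounding constituents.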

It is possible to estimate the probability of observing each constituent in a specific position relative to the value, and from that the probability of observing the overall patterns on external pages. With this we can search inside pages found on the web. It is not difficult to find pages mentioning w:Bygdin, but weeding out the pages that actually mention the elevation is more difficult, or at least very time-consuming.
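One simple way to get such estimates is to count how often each word sequence occurs directly next to the value. The sketch below uses plain relative frequencies without smoothing; that choice, the function name `constituent_frequencies`, and the `max_n` limit are assumptions for illustration, since the text does not fix an estimator.

```python
from collections import Counter


def constituent_frequencies(contexts, side, max_n=4):
    """Relative frequency of word n-grams adjacent to the value.

    contexts: prefix or postfix strings, one per analyzed sentence.
    side: "prefix" counts the words just before the value,
          anything else counts the words just after it.
    """
    counts = Counter()
    for text in contexts:
        # Lower-case and drop surrounding punctuation so "level." equals "level".
        tokens = [t.strip(".,;") for t in text.lower().split()]
        tokens = [t for t in tokens if t]
        for n in range(1, min(max_n, len(tokens)) + 1):
            gram = tokens[-n:] if side == "prefix" else tokens[:n]
            counts[" ".join(gram)] += 1
    total = len(contexts)
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.items()}


prefixes = ["It lies at an elevation of", "It is at an elevation of",
            "The lake is", "It lies"]
print(constituent_frequencies(prefixes, "prefix")["at an elevation of"])
```

Each sentence contributes at most once to any given n-gram, so the resulting numbers stay between 0 and 1 and can be read as the fraction of analyzed sentences in which that constituent touches the value.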

External pages are kept if they contain text fragments that satisfy a minimum probability; that is, one or more previously found constituents occur on the page, so that an overall probability can be calculated. Note that the actual value for the article in question is used together with the found constituents. Matching pages can then be ranked by probability, with the more likely pages on top. The user can then check the excerpt, or even open the page, and manually accept the page as a source. If it is accepted, a reference is made by mw:citoid and inserted at the end of the sentence that contains the factoid.
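The filtering and ranking step could look roughly like the following. How the overall probability is combined is not specified here, so the sketch assumes a noisy-OR over the constituents found near the value; the functions `score_page` and `rank_pages`, the character window, the threshold, and the page identifiers are all hypothetical.

```python
def score_page(text, value, constituents, window=60):
    """Score a page by the known constituents found close to the known value.

    constituents maps a phrase to its estimated probability.
    Only the first occurrence of the value on the page is inspected.
    """
    lowered = text.lower()
    pos = lowered.find(value.lower())
    if pos < 0:
        return 0.0  # the value itself is missing, so the page cannot match
    context = lowered[max(0, pos - window): pos + len(value) + window]
    miss = 1.0
    for phrase, probability in constituents.items():
        if phrase in context:
            miss *= 1.0 - probability  # assumed noisy-OR combination
    return 1.0 - miss


def rank_pages(pages, value, constituents, minimum=0.1):
    """Return (score, page id) pairs above the minimum, most likely first."""
    scored = [(score_page(text, value, constituents), page_id)
              for page_id, text in pages.items()]
    return sorted((item for item in scored if item[0] >= minimum), reverse=True)


constituents = {"at an elevation of": 0.5, "above sea level": 0.8}
pages = {
    "page-a": "Bygdin lies at an elevation of 1,048 m above sea level.",
    "page-b": "Bygdin had 1,048 visitors last year.",
}
print(rank_pages(pages, "1,048", constituents))
```

Here only «page-a» survives: «page-b» contains the value but none of the constituents, so it scores zero and falls below the minimum.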


  • It is not necessary to rewrite any text to insert the reference; it is only necessary to scan forward to the end of the sentence and insert the reference there.
  • It will not be necessary to manually verify this insertion; it is only necessary to manually verify the insertion of the final references tag if it is missing.
  • It might be necessary to add the reference to the statement at Wikidata, and then reimport the value with the reference.
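The scan-forward insertion can be sketched as follows. The function names, the rule that a sentence ends at a full stop followed by whitespace or the end of the text, and the example reference markup are assumptions for illustration; a real reference would come from mw:citoid.

```python
import re

# Assumed sentence end: a full stop followed by whitespace or end of text.
# This skips the decimal point in numbers such as "18.6".
SENTENCE_END = re.compile(r"\.(?=\s|$)")


def insert_reference(text, value, reference):
    """Insert the reference after the sentence that contains the value."""
    pos = text.find(value)
    if pos < 0:
        return text  # factoid not found, leave the text untouched
    match = SENTENCE_END.search(text, pos + len(value))
    end = match.end() if match else len(text)
    return text[:end] + reference + text[end:]


def needs_references_tag(text):
    """True when the page has no final references tag yet."""
    return "<references" not in text


article = ("Helin is a lake. It is located at 870 m above the sea, "
           "and has a volume of 18.6 million m³. It is regulated.")
print(insert_reference(article, "870", "<ref>Example source</ref>"))
```

Abbreviations that end in a full stop would still be mistaken for a sentence end under this rule, which is one reason the result of `needs_references_tag` and the final placement are left for a person to confirm.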