Community Wishlist Survey 2019/Archive/Easy to use external data collector

From Meta, a Wikimedia project coordination wiki

Easy to use external data collector

Outside the scope of Community Tech

  • Problem: It is often painful to gather data from external sources by hand when the process could be automated.
  • Who would benefit: This idea has the potential to quickly improve the completeness of Wikidata.
  • Proposed solution: The idea is to make it easier to create tools that automatically gather data from similarly structured pages.

    A Wikidata contributor opens the tool generator and enters a URL such as https://www.iso.org/standard/45481.html. They click on different parts of the page, such as the title, the baseline, the edition number, and the publication date. Each time they click on a field, the system asks which Wikidata field or property should be linked to that content. When they are finished, the tool shows what the resulting item would look like. Validating it creates a proposal page on Wikidata containing the generated code of the tool's module, which the community can review. After review, an admin can create the tool. The tool is then run against a specific URL pattern, for example https://www.iso.org/standard/i.html with i from 1 to 99999, to automatically gather all data from the ISO site, compare it with existing items, and create or complete the relevant items.

  • More comments: Clicking on the different fields of the page is the easy part. The important parts that have to be left to experienced users are the code verification and the algorithm that detects already existing items.
  • Phabricator tickets:
  • Proposer: Thibdx (talk) 22:18, 5 November 2018 (UTC)
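
The workflow described in the proposal can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the proposed tool: the element ids in SAMPLE_PAGE are invented (the real ISO pages are not inspected here), FIELD_MAP stands in for the mapping the contributor would build by clicking, known_items is a hypothetical in-memory stand-in for a Wikidata lookup, and matching on title alone is a deliberately naive placeholder for the duplicate-detection step that the proposal leaves to experienced users. Nothing is fetched from the network.

```python
from html.parser import HTMLParser

# Hypothetical mapping produced by the click-through step:
# element id on the source page -> Wikidata property.
FIELD_MAP = {
    "title": "P1476",     # title
    "pub-date": "P577",   # publication date
}


class FieldExtractor(HTMLParser):
    """Collect the text of elements whose id appears in the mapping.

    Simplification: capture stops at the first closing tag, so nested
    markup inside a mapped element is not handled.
    """

    def __init__(self, field_map):
        super().__init__()
        self.field_map = field_map
        self.current = None  # property currently being captured
        self.claims = {}

    def handle_starttag(self, tag, attrs):
        element_id = dict(attrs).get("id")
        if element_id in self.field_map:
            self.current = self.field_map[element_id]
            self.claims[self.current] = ""

    def handle_data(self, data):
        if self.current is not None:
            self.claims[self.current] += data

    def handle_endtag(self, tag):
        self.current = None


def extract_claims(html, field_map=FIELD_MAP):
    """Return {property: text} for every mapped element found in html."""
    parser = FieldExtractor(field_map)
    parser.feed(html)
    parser.close()
    return {prop: text.strip() for prop, text in parser.claims.items()}


def candidate_urls(template, start, stop):
    """Expand a URL pattern such as .../standard/{i}.html over a range."""
    return [template.format(i=i) for i in range(start, stop + 1)]


def find_existing(claims, known_items, key="P1476"):
    """Return the id of a known item whose key claim matches, else None."""
    wanted = claims.get(key, "").casefold()
    for item_id, item_claims in known_items.items():
        if wanted and item_claims.get(key, "").casefold() == wanted:
            return item_id
    return None


# Invented sample page; the ids are assumptions for this sketch.
SAMPLE_PAGE = """<html><body>
<h1 id="title">Example standard title</h1>
<span id="pub-date">2011-06</span>
<p id="other">not mapped</p>
</body></html>"""

claims = extract_claims(SAMPLE_PAGE)
print(claims)
print(candidate_urls("https://www.iso.org/standard/{i}.html", 1, 3))
print(find_existing(claims, {"Q_EXAMPLE": {"P1476": "example standard title"}}))
```

A real tool would need a far more robust matching step (for example, comparing an external identifier rather than a title) and the community review of the generated module that the proposal describes.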

Discussion

  • This is really similar to how you make Citoid work with specific pages, except it requires more programming skill. Maybe there's a technical wish in that direction that makes sense. --Izno (talk) 02:50, 6 November 2018 (UTC)
  • Hi @Thibdx. Creating a tool to automatically scrape data-related pages is quite a feat; it requires a lot of work to make the tool adjustable enough that users can apply it usefully to pages that probably differ more than we realize. A "scraping" tool is always a challenge, and building one flexible enough to meet the needs of this task is a huge undertaking. As such, this is out of scope for the Community Tech team to do as part of the wishlist process. Thank you for your participation, MSchottlender-WMF (talk) 18:23, 15 November 2018 (UTC)