What is the problem you're trying to solve?
Currently, the various citation generation tools all produce different citations, because the algorithms used to obtain the data are developed independently for each tool. These tools have varying capabilities for acquiring data from the internet (page scraping), and the specific websites supported vary from tool to tool. When the layout of a website changes, either the tool's ability to acquire citation data degrades (the page no longer matches the algorithm used), or the tool's developers must put in constant effort to keep the data acquisition working. When the ability to gather this data degrades, automatically generated citations can contain information that is inaccurate, or even detrimental to the ability to find the actual reference. In addition, a large amount of developer time is consumed simply re-developing the algorithms for each individual website, separately in each tool. Among other issues, this is a considerable drain on a resource that could instead be used to provide additional or improved capabilities in these or other tools.
What is your solution?
The proposed solution is to move the definitions of the algorithms for acquiring data from each website out of the individual tools and into a shared resource stored on-wiki, similar to the way AutoWikiBrowser stores descriptions of General fixes, Template redirects, and Typos. This provides several significant benefits:
- The algorithms can be maintained by the Wikipedia community instead of being bottlenecked with the developer(s) of a particular tool.
- The algorithms can be localized to each Wikipedia project, and/or shared across projects (e.g. which parameter name is used, or even how it is used, differs between Wikipedia projects/languages).
- Harnessing the work of the Wikipedia community makes a much larger pool of effort available from the people most interested in having the data obtained from a particular website be accurate: the person citing a website is much more motivated to update its algorithm when the site changes, or even to develop a new algorithm for a previously unsupported website.
- The updates needed when a website's layout changes can be kept far more current across multiple tools than is possible when every update must pass, separately for each tool, through the limited resource of that tool's developer(s).
- A much wider breadth of websites can be supported by citation tools, because more people are available to develop and update algorithms.
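To make the idea concrete, a descriptor in such a language might declaratively map citation-template parameters to locations on a page. The sketch below is purely illustrative: the actual description language is yet to be designed under this proposal, and every field name, rule type, and site name shown here is hypothetical.

```python
import json

# Hypothetical on-wiki scraping descriptor: a declarative mapping from
# citation-template parameters to page locations. "meta" rules read
# <meta> tags, "selector" rules would use a CSS selector, and "constant"
# rules supply a fixed value. All names are assumptions for illustration.
example_descriptor = json.loads("""
{
  "site": "example.com",
  "template": "cite web",
  "fields": {
    "title":   {"source": "meta",     "name": "og:title"},
    "author":  {"source": "selector", "value": ".byline .author-name"},
    "date":    {"source": "meta",     "name": "article:published_time"},
    "website": {"source": "constant", "value": "Example News"}
  }
}
""")

print(example_descriptor["fields"]["title"]["name"])
```

Because the descriptor is plain data rather than code, it could be stored on a wiki page, edited by non-developers, and interpreted by any tool that implements the engine, which is the key property the proposal relies on.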
Improve the overall quality of citations whose information is obtained automatically or semi-automatically, and expand the breadth of websites for which citation information can be automatically obtained.
- Develop a generalized data structure/language to describe the algorithms for acquiring data from websites (page scraping) based on the prototype proof-of-concept in Cite4Wiki Phoenix (currently a Firefox add-on).
- Announce to other citation tool developers that this description language is in development. Engage in discussion about any changes or additional capabilities needed or desired.
- Make the description language and the algorithms described available on-wiki similar to how AutoWikiBrowser stores descriptions of General fixes, Template redirects, and Typos. Make it such that this location and the algorithms contained can be customized on a per Wikipedia project/language basis.
- Expand the prototype page scraping engine in Cite4Wiki Phoenix so that it implements the entire description language developed.
- Make the citation generation engine in Cite4Wiki Phoenix available to other developers as a reference design, packaged as a module that can be included in other projects. This enables other developers to add this functionality without duplicating the development of an engine that page scrapes based on these descriptions.
- Enhance Cite4Wiki Phoenix so that it uses the on-wiki storage of algorithms established under this proposal.
- Develop a user interface for a non-developer to write such algorithms within Cite4Wiki Phoenix as an example implementation.
- Expand the number of websites for which algorithms described in this language exist and the quality of these descriptions. Encourage the participation of other Wikipedians to develop and enhance these algorithms.
- Enhance the Cite4Wiki Phoenix user interface and capabilities based on user input.
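The engine referred to in the plan above could, at its core, be quite small: parse the page, resolve each field rule in the descriptor, and emit a filled-in citation template. The following is a minimal sketch of that idea, not the actual Cite4Wiki Phoenix implementation (which is a Firefox add-on written in JavaScript); it handles only hypothetical "meta" and "constant" rule types, whereas a real engine would also support selectors, URL patterns, fallbacks, and so on.

```python
from html.parser import HTMLParser

class MetaCollector(HTMLParser):
    """Collects <meta name/property=... content=...> pairs from a page."""
    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            a = dict(attrs)
            key = a.get("property") or a.get("name")
            if key and "content" in a:
                self.meta[key] = a["content"]

def apply_descriptor(descriptor, html):
    """Resolve each field rule in the descriptor against the page and
    return wikitext for the citation template. Only "meta" and
    "constant" rules are sketched here."""
    collector = MetaCollector()
    collector.feed(html)
    fields = {}
    for param, rule in descriptor["fields"].items():
        if rule["source"] == "meta":
            value = collector.meta.get(rule["name"])
        elif rule["source"] == "constant":
            value = rule["value"]
        else:
            value = None  # rule type not supported in this sketch
        if value:
            fields[param] = value
    params = "".join(f" |{k}={v}" for k, v in fields.items())
    return "{{" + descriptor["template"] + params + "}}"

# A toy page and a hypothetical descriptor for it.
page = """<html><head>
  <meta property="og:title" content="Example Article"/>
  <meta property="article:published_time" content="2014-01-15"/>
</head><body></body></html>"""

descriptor = {
    "template": "cite web",
    "fields": {
        "title": {"source": "meta", "name": "og:title"},
        "date": {"source": "meta", "name": "article:published_time"},
        "website": {"source": "constant", "value": "Example News"},
    },
}

print(apply_descriptor(descriptor, page))
# {{cite web |title=Example Article |date=2014-01-15 |website=Example News}}
```

Packaging this interpretation step as a standalone module is what would let other citation tools reuse the engine without re-implementing the per-website scraping logic.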
- Developer/Technical writing time (full-time equivalent, 2 person-months): $8,500
- Machines for testing different OS/browser environments: PC/Mac (Windows XP, 7, 8; OS X; Linux) (all purchased used): $1,250
Post announcements and requests for comment on appropriate talk pages, and directly contact the developers of currently available citation generation tools to ask for and encourage their participation.
It is a specific goal of this project to enable the Wikipedia community to maintain the page scraping algorithms into the future, so the descriptions stay up to date with changes to the external websites being cited. Developers of citation tools will also have a public specification for the language and a reference design that permits easy inclusion in their projects. It is expected that this will grow to be used across multiple Wikipedia projects and languages.
Measures of success
- Page scraping language specification publicly available.
- Reference page scraping engine module publicly available (free and open-source).
- Storage for algorithms on-wiki.
- User interface for non-developer editing of algorithm descriptions operational in a Cite4Wiki Phoenix release (free and open-source).
Makyen: Makyen has software design experience in multiple programming languages and operating systems; has experience writing for Wikipedia and coding templates, etc.; has technical writing experience within the electronics industry; and developed the proof-of-concept page scraping engine in Cite4Wiki Phoenix, which uses page scraping descriptions stored external to the Cite4Wiki Phoenix Firefox add-on itself.
Please paste links below to where relevant communities have been notified of your proposal, and to any other relevant community discussions. Need notification tips?
Do you think this project should be selected for an Individual Engagement Grant? Please add your name and rationale for endorsing this project below! (Other constructive feedback is welcome on the discussion page).