Community Wishlist Survey 2019/Bots and gadgets/Machine readable diffs

From Meta, a Wikimedia project coordination wiki

The survey has concluded.


  • Problem: Diffs cannot be read without screen scraping. Even the API output requires HTML parsing to get at the wikitext changes.
  • Who would benefit: Any semi-automated or fully automated consumer of diffs (e.g. bot operators, data scientists, tools, researchers).
  • Proposed solution: Exactly what it says on the tin. Add a different diff format to the API that JSON/XML parsers can understand.
  • Proposer: MER-C (talk) 20:29, 30 October 2018 (UTC)
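
To illustrate the problem, here is a minimal sketch of the screen scraping a consumer has to do today. The HTML fragment is a hand-written, hypothetical sample in the style of the two-column diff table (cells marked with the `diff-deletedline` and `diff-addedline` classes); real output carries more markup, and the class and function names in the code are made up for this example.

```python
from html.parser import HTMLParser

# Hypothetical fragment in the style of the two-column diff table.
SAMPLE_DIFF_HTML = (
    '<tr>'
    '<td class="diff-marker">-</td>'
    '<td class="diff-deletedline"><div>The cat '
    '<del class="diffchange">sat</del>.</div></td>'
    '<td class="diff-marker">+</td>'
    '<td class="diff-addedline"><div>The cat '
    '<ins class="diffchange">slept</ins>.</div></td>'
    '</tr>'
)


class DiffTableScraper(HTMLParser):
    """Collect the plain text of deleted and added cells."""

    def __init__(self):
        super().__init__()
        self.removed = []
        self.added = []
        self._target = None  # list currently receiving text, if any

    def handle_starttag(self, tag, attrs):
        if tag != "td":
            return
        classes = (dict(attrs).get("class") or "").split()
        if "diff-deletedline" in classes:
            self.removed.append("")
            self._target = self.removed
        elif "diff-addedline" in classes:
            self.added.append("")
            self._target = self.added

    def handle_endtag(self, tag):
        if tag == "td":
            self._target = None

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the open cell.
        if self._target is not None:
            self._target[-1] += data


def scrape_diff(html):
    """Return the removed and added line texts found in the HTML."""
    scraper = DiffTableScraper()
    scraper.feed(html)
    scraper.close()
    return {"removed": scraper.removed, "added": scraper.added}


print(scrape_diff(SAMPLE_DIFF_HTML))
```

A structured format would make this parser unnecessary: the consumer would read the removed and added text straight from the decoded response.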

Discussion

I like this idea. Gryllida 22:17, 30 October 2018 (UTC)

  • Could you provide an example of how this API's output might look? MaxSem (WMF) (talk) 23:56, 30 October 2018 (UTC)
  • And examples of use cases where the lack of this format proved prohibitively expensive or blocking? —TheDJ (talk • contribs) 08:00, 31 October 2018 (UTC)
  • I note there are likely four parts to this request:
    1. Determine the "machine readable" format that will be used. Is there an existing standard that could be used, or do we have to invent something?
    2. Create a DiffFormatter subclass that generates that format.
    3. Update the wikidiff2 extension to be able to output structured diff data, either in a format that can be handled by DiffFormatter or in the "machine readable" format directly.
    4. Adjust existing code to expose the structured output via the API.
    I note that third-party sites using $wgExternalDiffEngine will likely have to gracefully not support this feature. Anomie (talk) 15:54, 31 October 2018 (UTC)
    The problem I want to solve is that I want to perform analysis on the content added or removed. As for the output - I can work with something like this, although the text of the initial revision should also be returned for completeness. The refactoring required for this task will also knock out the technical debt preventing phab:T104072, phab:T117279 and phab:T38902 (moving MobileDiff into core, and making it available through the API) so there are more use cases than that given here. MER-C (talk) 19:39, 31 October 2018 (UTC)
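
As a rough illustration of steps 2 and 3 in the list above (a formatter that yields structured data instead of HTML), here is a small Python analogue. It is not MediaWiki's DiffFormatter and does not use wikidiff2: the class name `StructuredDiffFormatter` and the keys `op`, `old`, `new`, `start` and `lines` are hypothetical, and Python's standard `difflib` stands in for the diff engine.

```python
import difflib


class StructuredDiffFormatter:
    """Hypothetical formatter that returns plain data instead of HTML."""

    def format(self, old_lines, new_lines):
        # difflib is only a stand-in for the real diff engine here.
        matcher = difflib.SequenceMatcher(
            a=old_lines, b=new_lines, autojunk=False
        )
        ops = []
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            ops.append({
                "op": tag,  # 'equal', 'replace', 'delete' or 'insert'
                "old": {"start": i1 + 1, "lines": old_lines[i1:i2]},
                "new": {"start": j1 + 1, "lines": new_lines[j1:j2]},
            })
        return ops


old = ["Intro", "The cat sat.", "Outro"]
new = ["Intro", "The cat slept.", "Outro"]
for op in StructuredDiffFormatter().format(old, new):
    print(op)
```

The result contains only lists, dictionaries, strings and integers, so any serializer can turn it into JSON or XML without knowing anything about diffs.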


  • On benefit: Yeah, various applications would need such an interface to inspect and analyse edits.
  • On output format:
    • JSON or XML, or ideally both, since the code is on the workbench anyway.
    • The contents will be governed by the current wikidiff structure. Other structures would be possible if the collected information were not sent directly to the output stream.
      • Simply an array of difference groups, each containing the same information that is visible in the two-column output today.
      • Each group consists of two objects, before and after.
      • Each object has a line range, the last detected headline (as recently suggested), and an array of paragraphs.
      • Each paragraph has a +/- state and single-line content, if any.
      • Each line content is an array of tuples, each tuple holding a changed/unchanged flag and a string (escaped according to the output format).
    • The same as HTML today, just in different syntax according to JSON or XML.
    • The diff itself needs to be wrapped by some informative data:
      • Both revID involved.
      • Method/structure, currently constant wikidiff2 but may be subject to changes over decades.
      • The wikiID (can be derived from request URL, but for sake of completeness).
      • Other information, if present anyway, like pageID or nick or timestamp or page name, but these may be derived from revIDs later.
  • On special features:
    • An API request might provide control information, such as the number of context paragraphs before and after a change. These are constants for the HTML special pages, but they already exist as the parameter value numContextLines.
    • A research application might drop the surrounding paragraphs, which only help human readers to identify the context.
  • On implementation efforts:
    • Unfortunately, the 15-year-old procedure does not build a complete diff object first and then start formatting the output.
      • Formatted output is collected immediately when each diff is found.
      • Otherwise the entire output object could be just thrown into a serializer for JSON or XML.
    • The two column output is formatted by TableDiff.cpp today.
      • Two copies of this need to be made, JsonDiff.cpp and XmlDiff.cpp.
      • Then the appropriate atomic syntax is to be generated, as is done for HTML, with proper escaping of characters such as " and <.
    • The stream needs to be wrapped into output head and termination, and usual administrative business.

Greetings --PerfektesChaos (talk) 13:03, 4 November 2018 (UTC)
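
For illustration, the structure outlined above might look roughly like this as a Python dictionary that serializes directly to JSON. Every key name (`fromrevid`, `torevid`, `method`, `wiki`, `groups`, `before`, `after`, `lines`, `headline`, `paragraphs`, `state`, `content`) and every value is a hypothetical choice made for this sketch, not an agreed or existing format.

```python
import json

# Hypothetical example document; all names and values are assumptions.
diff_document = {
    "fromrevid": 1001,
    "torevid": 1002,
    "method": "wikidiff2",
    "wiki": "examplewiki",
    "groups": [
        {
            "before": {
                "lines": [12, 12],
                "headline": "History",
                "paragraphs": [
                    {
                        "state": "-",
                        # [changed flag, text] pairs for one line
                        "content": [
                            [False, "The cat "],
                            [True, "sat"],
                            [False, "."],
                        ],
                    }
                ],
            },
            "after": {
                "lines": [12, 12],
                "headline": "History",
                "paragraphs": [
                    {
                        "state": "+",
                        "content": [
                            [False, "The cat "],
                            [True, "slept"],
                            [False, "."],
                        ],
                    }
                ],
            },
        }
    ],
}


def changed_text(document, side):
    """Return the changed fragments on one side ('before' or 'after')."""
    fragments = []
    for group in document["groups"]:
        for paragraph in group[side]["paragraphs"]:
            for changed, text in paragraph["content"]:
                if changed:
                    fragments.append(text)
    return fragments


print(json.dumps(diff_document, indent=2))
print(changed_text(diff_document, "before"), changed_text(diff_document, "after"))
```

With a structure like this, a research tool can pull out only the changed fragments and ignore the context that exists for human readers.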

Note that the diff engine should not generate JSON or XML directly. It should generate a data structure built out of PHP associative arrays and other PHP native types, which the API will then turn into JSON, XML, or serialized PHP based on the 'format' parameter as it does for every other API request. Anomie (talk) 15:54, 5 November 2018 (UTC)
Or HTML for the front end in either desktop or mobile format. MER-C (talk) 18:58, 6 November 2018 (UTC)
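
The separation described here can be sketched in a few lines: the diff code returns native data, and a separate layer picks the serialization from a format parameter. This is a Python analogue under stated assumptions, not the MediaWiki API's actual code; the functions `to_xml` and `serialize`, the `diff` and `item` element names, and the sample data are all hypothetical, and the real API's XML layout differs.

```python
import json
import xml.etree.ElementTree as ET


def to_xml(name, value):
    """Recursively convert dicts, lists and scalars to an XML element."""
    element = ET.Element(name)
    if isinstance(value, dict):
        for key, child in value.items():
            element.append(to_xml(key, child))
    elif isinstance(value, list):
        for child in value:
            element.append(to_xml("item", child))
    else:
        element.text = str(value)
    return element


def serialize(data, fmt):
    """Serialize native data according to a 'format' parameter."""
    if fmt == "json":
        return json.dumps(data, sort_keys=True)
    if fmt == "xml":
        # ElementTree escapes characters such as < and & in text nodes.
        return ET.tostring(to_xml("diff", data), encoding="unicode")
    raise ValueError(f"unsupported format: {fmt}")


# Hypothetical structured output from a diff engine.
data = {"op": "replace", "lines": ["a<b"]}
print(serialize(data, "json"))
print(serialize(data, "xml"))
```

Because escaping happens in the serializer, the diff code never has to know which output format was requested.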
  • Thanks for the ping. Yes, MABS does machine-readable diffs. I'm still working on this project. --MarkAHershberger (talk) 15:10, 15 November 2018 (UTC)

Voting

  • Support MER-C (talk) 18:59, 16 November 2018 (UTC)
  • Support More moral than anything else. I doubt that a proposal like this is going to be too popular, despite how useful it seems it might be to particular editors. — Insertcleverphrasehere (or here) 00:28, 17 November 2018 (UTC)
  • Support Ellery (talk) 02:39, 17 November 2018 (UTC)
  • Support Liuxinyu970226 (talk) 03:42, 17 November 2018 (UTC)
  • Support Also, what ICPH said. Enterprisey (talk) 04:06, 17 November 2018 (UTC)
  • Support Fabiorahamim (talk) 07:01, 17 November 2018 (UTC)
  • Support Afernand74 (talk) 09:37, 17 November 2018 (UTC)
  • Support Victor Schmidt (talk) 16:59, 17 November 2018 (UTC)
  • Support Iluvatar (talk) 20:37, 17 November 2018 (UTC)
  • Support Dirk Beetstra T C (en: U, T) 03:57, 18 November 2018 (UTC)
  • Support Temp3600 (talk) 05:39, 18 November 2018 (UTC)
  • Support NMaia (talk) 10:17, 18 November 2018 (UTC)
  • Support ~ Amory (utc) 11:43, 18 November 2018 (UTC)
  • Support Sebastian Wallroth (talk) 13:16, 18 November 2018 (UTC)
  • Support β16 - (talk) 10:18, 19 November 2018 (UTC)
  • Support Benjamin (talk) 10:23, 19 November 2018 (UTC)
  • Support --Frozen Hippopotamus (talk) 11:25, 19 November 2018 (UTC)
  • Support Yes... Doc James (talk · contribs · email) 03:52, 20 November 2018 (UTC)
  • Support This would enable many useful tools to reject junk or number-changing editors. Johnuniq (talk) 06:17, 20 November 2018 (UTC)
  • Support Jamesmcmahon0 (talk) 10:29, 20 November 2018 (UTC)
  • Support Gareth (talk) 11:01, 20 November 2018 (UTC)
  • Support Philk84 (talk) 13:57, 20 November 2018 (UTC)
  • Support Lots of potential for this idea. I don't know what, but other people would. Headbomb (talk) 15:53, 20 November 2018 (UTC)
  • Support Lofhi (talk) 17:48, 20 November 2018 (UTC)
  • Support Novak Watchmen (talk) 23:59, 20 November 2018 (UTC)
  • Support Vulphere 07:26, 21 November 2018 (UTC)
  • Support Framawiki (talk) 19:46, 21 November 2018 (UTC)
  • Support Nihlus 22:17, 21 November 2018 (UTC)
  • Support ElanHR (talk) 22:45, 21 November 2018 (UTC)
  • Support Krinkle (talk) 01:24, 22 November 2018 (UTC)
  • Support as it seems to have a lot of applications. Anything that makes the Community Tech Team's future work easier (not to mention other people's) seems like a good idea. It might promote some bad reuses, too, not sure if anything can be done about that. HLHJ (talk) 04:01, 22 November 2018 (UTC)
  • Support A+ Gryllida 08:13, 23 November 2018 (UTC)
  • Support MisterSynergy (talk) 10:26, 23 November 2018 (UTC)
  • Support big time. Smjalageri (talk) 12:43, 23 November 2018 (UTC)
  • Support ~Cybularny Speak? 15:55, 23 November 2018 (UTC)
  • Support NaBUru38 (talk) 18:24, 23 November 2018 (UTC)
  • Support Mbrickn (talk) 21:26, 23 November 2018 (UTC)
  • Support Viztor (talk) 04:50, 24 November 2018 (UTC)
  • Support Winged Blades of Godric (talk) 06:24, 24 November 2018 (UTC)
  • Support Matěj Suchánek (talk) 08:45, 24 November 2018 (UTC)
  • Support Hmxhmx 09:58, 24 November 2018 (UTC)
  • Support Alexei Kopylov (talk) 18:22, 24 November 2018 (UTC)
  • Support Tgr (talk) 04:40, 25 November 2018 (UTC)
  • Support — AfroThundr (u · t · c) 01:50, 26 November 2018 (UTC)
  • Support Dreamy Jazz (talk) 08:48, 26 November 2018 (UTC)
  • Support Izno (talk) 01:08, 27 November 2018 (UTC)
  • Support PJTraill (talk) 01:09, 27 November 2018 (UTC)
  • Support Zache (talk) 03:58, 27 November 2018 (UTC)
  • Support Ahm masum (talk) 21:19, 28 November 2018 (UTC)
  • Support Kpjas (talk) 09:50, 29 November 2018 (UTC)
  • Support GravityUp (talk) 22:46, 29 November 2018 (UTC)