Community Tech/Improved diffs

From Meta, a Wikimedia project coordination wiki

NOTE: The original diff problem from the Wishlist proposal has been fixed, and Community Tech won't do any more diffs work this calendar year -- see the project page for more info and rationale.

Examples posted on this page can be used to write more specific proposals for the 2016 Community Wishlist Survey, which starts on November 7th, 2016. People who helped out by adding more examples to this page: Thank you, and I hope you're not too disappointed that these won't get attention this year. -- DannyH (WMF) (talk) 16:39, 30 September 2016 (UTC)


This is a page for notes and brainstorming about the #2 request in Community Tech’s 2015 Community Wishlist Survey: "Improved diff compare screen".

From the Wishlist proposal:

Don't you just love diffs like this one. It must be possible to improve this diffcompare-view. For inspiration you can look at the wikEdDiff gadget. http://i.imgur.com/R9ZfCA1.jpg -- The Quixotic Potato (talk) 06:41, 29 September 2015 (UTC)
[Combined from separate proposal:] When someone moves a paragraph and then edits it, this edit is not shown separately by the "Difference between revisions" functionality. This problem also occurs when someone inserts a blank line above a paragraph and then edits it. Thanks, --Gnom (talk) 10:12, 12 November 2015 (UTC)

Note for people watching this page: The WMDE Technical Wishes team is investigating a way to address the problem of an edit that moves a paragraph from one place to another and also changes a word in the paragraph. There's interesting (somewhat technical) discussion on Phabricator ticket T138922.

Examples of problems

Right now, this page is for collecting examples of unhelpful diffs, so we can understand the problems we’re trying to solve. Feel free to add more examples.

Collapsed example that has been fixed

Diff often fails to recognize an unchanged paragraph. The paragraph shows up as both removed text and added text, and it makes a mess diffing it with a different paragraph.[1] Note that this can trigger a chain reaction where several unchanged paragraphs in a row all fail to pair up, yielding an entire screen full of fictional diff. Suggestion: Hash each paragraph so you can match paragraphs up even if they are rearranged. Perhaps use green highlighting to indicate a moved paragraph? Bonus points if you can manage to pair up paragraphs that are 95% unchanged. I know that raises complexities but it would be real nice.
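A minimal sketch of the hashing suggestion, in Python. It assumes paragraphs are separated by blank lines, and the function names are made up for illustration; this is not how the MediaWiki diff engine is implemented. Identical paragraphs are paired by hash, and a pair is labelled "moved" when it falls outside the longest run of pairs that kept their relative order.

```python
import hashlib
from collections import defaultdict, deque


def paragraph_hash(paragraph):
    """Hash a paragraph, ignoring leading and trailing whitespace."""
    return hashlib.sha1(paragraph.strip().encode("utf-8")).hexdigest()


def increasing_positions(values):
    """Positions (indexes into values) of one longest strictly increasing subsequence."""
    n = len(values)
    if n == 0:
        return set()
    best_len = [1] * n
    prev = [-1] * n
    for i in range(n):
        for k in range(i):
            if values[k] < values[i] and best_len[k] + 1 > best_len[i]:
                best_len[i] = best_len[k] + 1
                prev[i] = k
    i = max(range(n), key=best_len.__getitem__)
    keep = set()
    while i != -1:
        keep.add(i)
        i = prev[i]
    return keep


def match_paragraphs(old_text, new_text):
    """Pair identical paragraphs; return (old_index, new_index, status) tuples."""
    old_paras = [p for p in old_text.split("\n\n") if p.strip()]
    new_paras = [p for p in new_text.split("\n\n") if p.strip()]

    # hash -> queue of old paragraph indexes that are still unpaired
    available = defaultdict(deque)
    for i, para in enumerate(old_paras):
        available[paragraph_hash(para)].append(i)

    pairs = []
    for j, para in enumerate(new_paras):
        queue = available.get(paragraph_hash(para))
        if queue:
            pairs.append((queue.popleft(), j))

    # Pairs that kept their relative order are unchanged; the rest were moved.
    in_order = increasing_positions([old for old, _ in pairs])
    return [
        (old, new, "unchanged" if pos in in_order else "moved")
        for pos, (old, new) in enumerate(pairs)
    ]
```

Paragraphs left unpaired by this pass are the only ones that would still need a character-level or approximate comparison.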

Similar example triggered by adding a section heading[2]


It's annoying when diff pointlessly starts and ends diff highlighting inside <tags>.[3] Instead of generating the first diff below, start the diff one character later (on the tag boundary) and generate the second version. That will defragment both the beginning and ending tag:

><ref>[http://www.tripwolf.com/en/guide/show/308460/Cyprus/Nicosia/Arabahmet-Mosque Arabahmet Mosque sight in Nicosia, Cyprus], [http://www.tripwolf.com/en/guide/ Tripwolf travel guide].</ref
<ref>[http://www.tripwolf.com/en/guide/show/308460/Cyprus/Nicosia/Arabahmet-Mosque Arabahmet Mosque sight in Nicosia, Cyprus], [http://www.tripwolf.com/en/guide/ Tripwolf travel guide].</ref>
In addition, diffs should be started at the beginning tag when possible to avoid stuff like this: [4][5][6]
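The "start one character later" idea can be sketched as sliding the highlighted range. An inserted (or removed) range can often be shifted left or right by a few characters without changing what the diff means; the sketch below lists those equivalent positions and prefers one that begins on `<` and ends on `>`. The function names are hypothetical, and this is only an illustration in Python, not the existing diff code.

```python
def equivalent_positions(text, start, end):
    """All (start, end) ranges the changed span text[start:end] can slide to
    while leaving the other side of the diff unchanged."""
    positions = [(start, end)]
    if start >= end:
        return positions
    # Slide left while the character entering the span equals the one leaving it.
    s, e = start, end
    while s > 0 and text[s - 1] == text[e - 1]:
        s, e = s - 1, e - 1
        positions.append((s, e))
    # Slide right under the mirrored condition.
    s, e = start, end
    while e < len(text) and text[s] == text[e]:
        s, e = s + 1, e + 1
        positions.append((s, e))
    return positions


def align_to_tags(text, start, end):
    """Prefer an equivalent range that starts with '<' and ends with '>'."""
    for s, e in equivalent_positions(text, start, end):
        if text[s] == "<" and text[e - 1] == ">":
            return s, e
    return start, end


new = "a<ref>x</ref><ref>y</ref>b"
start = new.index("><ref>y")            # fragmented range: ><ref>y</ref
end = start + len("><ref>y</ref")
s, e = align_to_tags(new, start, end)
print(new[s:e])
```

If no equivalent position lands on tag boundaries, the original range is returned untouched, so the adjustment never changes what the diff reports.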

A common diff problem is where someone inserts (or is it removes?) a blank line after a heading, and the diff compares the blank line on one side with the whole paragraph on the other. Sorry I can't find an example now! However, this shows simple gnoming which demonstrates the same issue as the original example above. The diff does a few things, one of which was to change "pages=874-9" to "pages=874–9" in a massive paragraph. Johnuniq (talk) 01:46, 23 December 2015 (UTC)


Sometimes a continuous change is fragmented into little bits by spaces. At some point, in some of these the entire paragraph should be marked as changed.

  • [7]. Additionally, the removed paragraph should have been compared to the second added paragraph and the first added paragraph should be a bulk addition.
  • [8]
  • [9]
  • [10]

I think this is a one-character change, but the diff shows the entire chunk of text as added and removed.[11]


This is a bizarre diff.[12] It should have shown up as a simple new ref added at the end, like this:

<ref>{{fr}} Bernard Wuthrich, [http://www.letemps.ch/suisse/2015/12/09/conseil-federal-un-romand-s-retrouve-elu "Conseil fédéral: comment un Romand s’est retrouvé élu"], ''[[Le Temps]]'', Wednesday 9 December 2015 (page visited on 9 December 2015).</ref><ref>{{fr}} Yves Petignat, [http://www.letemps.ch/suisse/2015/12/09/choix-parmelin-un-desaveu-direction-udc "Le choix de Parmelin, un désaveu pour la direction de l'UDC"], ''[[Le Temps]]'', Wednesday 9 December 2015 (page visited on 9 December 2015).</ref>

This diff hides the change, which was simple: two newlines are inserted and "after two days" is removed. This kind of edit is common and needs to be handled somehow. Has someone mentioned wikEdDiff, which has some sensational algorithms? wikEdDiff shows exactly what this diff does. Johnuniq (talk) 09:14, 24 December 2015 (UTC)


This diff[13] took too long to figure out. The yellow text and grey text switched places. The blue text was added. (The +text is an artifact of the switch, and the blue-yellow diff is a mismatch.)


  • In this diff two line breaks are deleted to join three short paragraphs, and "Though the modern" is replaced with "A". wikEdDiff shows exactly what changed.
  • This diff is completely broken, although wikEdDiff does a good job of showing that most of the many changes are quite simple. The diff covers 24 intermediate revisions from 31 May 2015 to 21 June 2015.

Fixing these might be awkward from a performance point of view. Consider showing a simple (but improved) diff, with a new button to do a more time-intensive diff. Johnuniq (talk) 04:58, 7 January 2016 (UTC)


Related to previous examples, diff isn't aware that an open-tag through to close-tag is a logical entity.[14]

Desired diff: <ref>foo</ref><ref>bar</ref>

Current diff: <ref>foo</ref><ref>bar</ref>
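One possible way to treat an open-tag through close-tag run as a single entity is to diff tokens instead of characters, with each `<ref>...</ref>` element and each simple `{{template}}` kept as one token. The sketch below uses Python's standard `difflib`; the token pattern is an assumption made for illustration and ignores nested templates and self-closing refs.

```python
import re
from difflib import SequenceMatcher

TOKEN_RE = re.compile(
    r"<ref(?:\s[^>/]*)?>.*?</ref>"  # a whole <ref>...</ref> element
    r"|\{\{[^{}]*\}\}"              # a non-nested {{template}}
    r"|\w+|\s+|[^\w\s]",            # words, whitespace runs, single punctuation marks
    re.DOTALL,
)


def tokenize(text):
    """Split wikitext into tokens, keeping refs and simple templates whole."""
    return TOKEN_RE.findall(text)


def added_chunks(old, new):
    """Return the text of each added or replaced token run in new."""
    a, b = tokenize(old), tokenize(new)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [
        "".join(b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag in ("insert", "replace")
    ]


print(added_chunks("<ref>foo</ref>", "<ref>foo</ref><ref>bar</ref>"))
```

Because a ref is a single token, a highlight can no longer begin or end inside it. The cost is that a small edit inside a ref marks the whole ref as changed, unless a second pass diffs the inside of replaced tokens.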


This may be tricky to fix.[15] Possibly a good way to catch this would be to specifically look for appended text, trying to identify a diff-chunk starting backwards from the *end* of the paragraph. Or maybe recognize date or signature as an "entity" and avoid splitting it.

Desired diff: Old message. Alsee (talk) 22:40, 13 January 2016 (UTC) New message. Alsee (talk) 06:56, 17 January 2016 (UTC)

Current diff: Old message. Alsee (talk) 22:40, 13 January 2016 (UTC) New message. Alsee (talk) 06:56, 17 January 2016 (UTC)
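The "look for appended text" idea can be sketched as a check that runs before the general diff: if the new paragraph is the old paragraph with something added at the end, report that tail as one insertion. This is a hypothetical helper in Python, shown only to illustrate the check.

```python
def appended_text(old, new):
    """Return the text appended to old to produce new, or None if the edit
    is not a pure append."""
    if len(new) > len(old) and new.startswith(old):
        return new[len(old):]
    return None


old = "Old message. Alsee (talk) 22:40, 13 January 2016 (UTC)"
new = old + " New message. Alsee (talk) 06:56, 17 January 2016 (UTC)"
print(appended_text(old, new))
```

With this check in front, the repeated signature text cannot pull the highlight into the middle of the old signature, because the whole tail is treated as a single added chunk.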


Many mismatches, and missed-opportunities-to-match, here.[16]

  • Every {{further}} template should have been an insertion. In each case the paragraph at that spot was a near-exact match and should have been diffed with itself. (The algorithm worked sequentially: it grabbed the next available very poor % match, and missed an extremely high % match.)
  • "cheerfully" text should be diff-paired with "cheerfully" text - in fact this resulted in three paragraphs in a row being mismatched.
  • "unprincipled" chunk of text could have been helpfully diff-paired with itself.
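The mismatches listed above suggest pairing paragraphs by similarity rather than by position. A rough sketch in Python: score every old/new paragraph combination with `difflib.SequenceMatcher.ratio()`, then accept pairs from the highest score down. The 0.6 threshold and the function name are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def pair_by_best_match(old_paras, new_paras, threshold=0.6):
    """Pair paragraphs by descending similarity instead of by position.

    Returns sorted (old_index, new_index) pairs. Paragraphs left unpaired
    are pure removals or pure additions.
    """
    scored = []
    for i, old in enumerate(old_paras):
        for j, new in enumerate(new_paras):
            ratio = SequenceMatcher(None, old, new, autojunk=False).ratio()
            if ratio >= threshold:
                scored.append((ratio, i, j))

    # Best matches first; ties are broken by position for determinism.
    scored.sort(key=lambda item: (-item[0], item[1], item[2]))

    used_old, used_new, pairs = set(), set(), []
    for _, i, j in scored:
        if i not in used_old and j not in used_new:
            used_old.add(i)
            used_new.add(j)
            pairs.append((i, j))
    return sorted(pairs)
```

Comparing every combination is quadratic in the number of paragraphs, so in practice it would probably be limited to the paragraphs that an exact-match pass left unpaired.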

Untouched section heading == Glaring omission ==[17] should obviously be paired with itself rather than appearing as + and - text.


Non-breaking-space markup can pollute diffs: on both mobile and desktop, strange spaces are shown as removed and then added when a non-breaking space is replaced by a regular space. That's weird for a user.


Unmodified text which has been moved is shown as generic +/- text. It should have an = symbol instead, and use a different color to indicate it is unmodified. (Maybe green?) See this diff. Note that text moves are inherently ambiguous about whether one chunk was moved up, or the other chunk was moved down. A good heuristic is to assume the larger chunk is unmoved and the smaller chunk is moved (as shown in that diff), however I would suggest adding some weight to discourage moving chunks containing =section= title(s). In that particular diff my intention was to move a chunk of text from one section to another. I think the diff would read better if the =section= chunk had been defined as the unmoved chunk.
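The heuristic could be expressed as a cost: when two adjacent chunks swap places, show the cheaper one as moved, where cost is the chunk length plus a penalty for each section heading it contains. In this Python sketch the penalty of 500 characters and the heading pattern are arbitrary assumptions, not values from any existing implementation.

```python
import re

# A wikitext heading line such as "== Title ==" (simplified pattern).
HEADING_RE = re.compile(r"^=+[^=\n].*?=+[ \t]*$", re.MULTILINE)


def move_cost(chunk, heading_penalty=500):
    """Cost of presenting this chunk as the one that moved."""
    return len(chunk) + heading_penalty * len(HEADING_RE.findall(chunk))


def choose_moved_chunk(chunk_a, chunk_b, heading_penalty=500):
    """Return "a" or "b": the chunk that is cheaper to show as moved."""
    cost_a = move_cost(chunk_a, heading_penalty)
    cost_b = move_cost(chunk_b, heading_penalty)
    return "a" if cost_a <= cost_b else "b"


heading_chunk = "== Section ==\nA short paragraph."
body_chunk = "x" * 200
print(choose_moved_chunk(heading_chunk, body_chunk, heading_penalty=0))
print(choose_moved_chunk(heading_chunk, body_chunk))
```

With no penalty the shorter heading chunk is treated as moved. With the penalty the heading chunk stays put and the longer body text is shown as moved, which matches the intent described above.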


[18] The minus text should be paired with the paragraph after the =Background=. If possible the end of that paragraph should be shown as moved to =Rebel Soldier= section.


Approximately twenty instances of ></nowiki [19]


Similar to previous examples, pointlessly fragmented ref plus pointlessly fragmented template:[20]

  • Desired diff: <ref>{{Foo}}</ref> Bar<ref>{{Baz}}</ref>
  • Current diff: <ref>{{Foo}}</ref> Bar<ref>{{Baz}}</ref>

Several unchanged lines are mismatched and incorrectly shown as +/- text. Specifically year_completed, construction_cost, spire_quantity, and spire_height.[21]


This diff should be a prime test case! The diff is completely unreadable. I would have to pull it into a text editor and check lines one by one to figure out what DIDN'T change. The true diff is:

  • 3 lines added (Brunei, Kota Batu, Kota Batu)
  • Either 2 lines moved up (Kota Negeri and Kota district moved) or 3 lines moved down (Kota Kinabalu, India, Kota Rajasthan)
  • 21 lines with a one character diff (15 with the leading * removed and 6 with the leading * to ;)

IMPORTANT! We allow people to edit other people's comments. The community very deliberately wants to judge and enforce such edits at the social level, not at the technical-capability-enforcement level. Diff fails us badly in this edit. The modified line is mismatched, meaning that we get no diff of changes for that line. This makes it difficult to spot that one editor changed the content of another editor's comment. We really want that sort of change to be highly visible.


Another pointlessly fragmented ref.[22]

  • Desired diff: <ref>{{foo|access-date=July 16, 2016}}</ref><ref>{{bar|access-date=July 16, 2016}}</ref>
  • Current diff: <ref>{{foo|access-date=July 16, 2016}}</ref><ref>{{bar|access-date=July 16, 2016}}</ref>

Useless diff[23]. Diff was confused by a mere blank newline, and fails to show what changed.


Another fragmented diff, this one involving two images. [[File:MTA Flatbush South 15.jpg|128px]]<br>[[File:New Flyer Q70 SBS.jpg|128px]]
It looks like most of these fragmented cases are fixed by highlighting-to-end-of-line when possible. The more general fix is to balance punctuation () [] {} <> in a diff whenever possible.
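A balance check of the kind suggested here might look like the following Python sketch. When several placements of a highlight are equivalent, the diff could prefer one whose highlighted text has balanced () [] {} <>. The helper names are made up, and a stray < or > in ordinary prose would count as unbalanced, so this can only ever be a preference, not a hard rule.

```python
CLOSERS = {")": "(", "]": "[", "}": "{", ">": "<"}
OPENERS = set(CLOSERS.values())


def is_balanced(span):
    """True if every bracket in span is properly opened and closed within it."""
    stack = []
    for ch in span:
        if ch in OPENERS:
            stack.append(ch)
        elif ch in CLOSERS:
            if not stack or stack.pop() != CLOSERS[ch]:
                return False
    return not stack


def prefer_balanced(candidates):
    """Return the first balanced candidate, else the first candidate."""
    for span in candidates:
        if is_balanced(span):
            return span
    return candidates[0]


fragmented = "]]<br>[[File:New Flyer Q70 SBS.jpg|128px"
whole = "<br>[[File:New Flyer Q70 SBS.jpg|128px]]"
print(prefer_balanced([fragmented, whole]))
```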


See the first sentence in this diff. A blank line was deleted, and a hyphen was added. Diff shows the entire sentence as removed and added, instead of diffing the two virtually-identical sentences. (Note: The diff is ok if only one of the two changes is present.) Perhaps diff needs to consider percentage-change when deciding whether to pair up the old and new text, or whether to show unpaired removed-text-block & added-text-block.


Here's a diff I examined in relation to an RFC. A blank line apparently threw the diff algorithm out of whack and it didn't even attempt to diff any of the paragraphs in section === IQOS ===. In section == Regulations == the first paragraph is diffed, but the next paragraph isn't diffed. This is again apparently due to a blank line.


Another diff mis-aligned with the actual change. It would help if diff attempted to balance markup when there's more than one possible way to align the diff.


The minus paragraph on the left is an exact match with the second + paragraph on the right. The paragraphs should be aligned as unchanged text.


Starting at line 975 the misalignment makes the diff much more difficult to interpret. It's very awkward that the table above the edit was needlessly cracked open, and awkwardly re-closed two tables later.


Several valuable examples of diff problems are posted at Phabricator task T15462.


Paragraph "On May 7, 2020, two days after... " moved with NO CHANGES, paragraph "...ABS-CBN's flagship news program TV Patrol..." had minimal changes. Diff badly mismatched the paragraphs. An unchanged paragraph should NEVER get mismatched. Simply hashing every paragraph and pairing exact matches would fix this. Approximate-match pairings should only be done after exact matches have been exhausted.


[24] This is another example of inserted content causing unmodified paragraphs to misalign, resulting in garbage diffs between unrelated paragraphs.


[25] Splitting a paragraph (adding a linebreak, or replacing whitespace with line break) should show as a split paragraph rather than a deletion and addition. Similarly removing a line break should show as a paragraph merge. It should also be able to display minor changes during a split as a split with only the minor changes highlighted. Also notable is that the diff is including the wrong period at the beginning/end of the highlight. A possible way to detect that here is that the correct highlight-range runs up against a line break at the end - a natural division point. Dividing text at whitespace is also preferable to dividing it in the middle of non-whitespace.
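Detecting a plain split could be sketched like this in Python: an old paragraph counts as split if two consecutive new paragraphs, joined together, reproduce it once whitespace is ignored. The function names are hypothetical. The sketch covers only exact splits; allowing minor edits during the split would need a similarity threshold instead of the equality test.

```python
def squash(text):
    """Remove all whitespace so line breaks and spaces compare as equal."""
    return "".join(text.split())


def find_split(old_para, new_paras):
    """Return i if new_paras[i] and new_paras[i + 1] together equal old_para
    (ignoring whitespace), otherwise None."""
    target = squash(old_para)
    for i in range(len(new_paras) - 1):
        if squash(new_paras[i]) + squash(new_paras[i + 1]) == target:
            return i
    return None


old = "First sentence. Second sentence."
new = ["Intro.", "First sentence.", "Second sentence."]
print(find_split(old, new))
```

A paragraph merge is the same check with the roles of old and new swapped.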


This diff should NOT show any paragraphs as moved. When blank lines are deleted - or even when entire paragraphs are deleted - it is implied that the content below shifts up to close the gap. It is nonsensical to show a paragraph as moving/arriving above a now nonexistent blank line.
A second thing, although it may be hard to automatically select the preferred result here:

  • Actual diff:
    *[[Talk:Dan Wagner#Should Dan Wagner’s role as CEO of “Attraqt” be included in the article?| Should Dan Wagner’s role as CEO of “Attraqt” be included in the article?]]
    *[[Talk:Dan Wagner#Should Dan Wagner’s role as CEO of “Rezolve” be included in the article?| Should Dan Wagner’s role as CEO of “Attraqt” be included in the article?]]
    *[[Talk:Dan Wagner#Should the first paragraph of the Powa Technologies subsection be replaced?| Should the first paragraph of the Powa Technologies subsection be replaced?]]
  • Preferred diff:
    *[[Talk:Dan Wagner#Should Dan Wagner’s role as CEO of “Attraqt” be included in the article?| Should Dan Wagner’s role as CEO of “Attraqt” be included in the article?]]
    *[[Talk:Dan Wagner#Should Dan Wagner’s role as CEO of “Rezolve” be included in the article?| Should Dan Wagner’s role as CEO of “Attraqt” be included in the article?]]
    *[[Talk:Dan Wagner#Should the first paragraph of the Powa Technologies subsection be replaced?| Should the first paragraph of the Powa Technologies subsection be replaced?]]

I marked it as I actually edited it. The displayed diff twice results in an extremely awkward single-character segment (the # character). Not only is that a less intuitive way to split the diff, single character segments are very difficult to see. The middle line is split in the middle of two sentences, when instead it is much more natural to use | and ]] as split points. So the suggested heuristics are (first priority) prefer to split on punctuation, (secondary) try to avoid single character or other extremely short segments.

WikidiffLX 2011

May I draw your attention to mw:User:PerfektesChaos/WikidiffLX?

  • It solves at least some of the problems mentioned here.
  • Suitable C++ code implementing the suggestions has been available since summer 2011.
  • It has been submitted to Bugzilla, but no one was assigned by WMF to look into these difficult matters.
  • The code worked fine for a pile of easy test scenarios, but needs more testing against real wiki edits.

Greetings --PerfektesChaos (talk) 17:08, 20 December 2015 (UTC)

Oh, thank you! We'll check that out. -- DannyH (WMF) (talk) 19:05, 22 December 2015 (UTC)