Community Wishlist Survey 2023/Larger suggestions/Search for sections, not pages (discussion thread search)

From Meta, a Wikimedia project coordination wiki

  • Problem: Searching for two or more words or phrases in discussion pages returns a lot of noise that you have to comb through and ignore, because it returns all pages that contain the search terms, even if no thread contains all of the terms.
  • Proposed solution: A CirrusSearch parameter/filter that specifies the minimum heading level where all keywords must appear, e.g. toclevel=2, which limits the results to pages where all keywords appear under the same == ... ==. If multiple sections on one page match, each section is given a separate link in the results.
  • Who would benefit: Readers
  • More comments: Ideally it should be both a URL parameter and a search filter, so it can be used in <inputbox> as well as in the search query itself (e.g. toclevel:2), much like prefix.
  • Phabricator tickets:
  • Proposer: Nardog (talk) 18:36, 23 January 2023 (UTC)
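
To make the proposed behaviour concrete, here is a minimal sketch of section-level matching, assuming plain wikitext input and simple case-insensitive substring matching. The function names `split_sections` and `matching_sections` are hypothetical and are not part of CirrusSearch; real search would work on an index and on tokens rather than raw text. The `level` argument plays the role of the proposed toclevel value.

```python
import re

# Matches a wikitext heading line such as "== Title ==".
# The heading level is the number of "=" characters on each side.
HEADING_RE = re.compile(r"^(={1,6})\s*(.+?)\s*\1\s*$")


def split_sections(wikitext, level):
    """Split wikitext into (heading, body) pairs at the given heading level.

    Deeper subsections stay inside the body of their parent section.
    Text that is not under a heading of exactly this level is skipped.
    """
    sections = []
    heading, body = None, []
    for line in wikitext.splitlines():
        m = HEADING_RE.match(line)
        if m and len(m.group(1)) <= level:
            # A heading at this level or shallower closes the open section.
            if heading is not None:
                sections.append((heading, "\n".join(body)))
            # Only a heading of exactly the requested level opens a new one.
            heading = m.group(2) if len(m.group(1)) == level else None
            body = []
        elif heading is not None:
            body.append(line)
    if heading is not None:
        sections.append((heading, "\n".join(body)))
    return sections


def matching_sections(wikitext, terms, level=2):
    """Return the headings of sections that contain every term."""
    wanted = [t.lower() for t in terms]
    hits = []
    for heading, body in split_sections(wikitext, level):
        text = (heading + "\n" + body).lower()
        if all(t in text for t in wanted):
            hits.append(heading)
    return hits
```

With level 2, a term found in a level-3 subsection still counts towards its parent thread, and each returned heading could become a separate link (Page#Heading) in the results.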

Discussion

  • Just noting that apparently the completion of T315510 will make this easier to implement — TheresNoTime (talk • they/them) 21:29, 23 January 2023 (UTC)
    That task is about the discussiontools_persistent table, which contains the comment author and timestamp but not the comment text (the goal is to resolve permalinks, which are built from author and timestamp). So I don't think it's useful here. Tgr (talk) 06:33, 5 February 2023 (UTC)
  • Storing the nested structure of the section might be particularly complicated and costly for a search index. Is specifying the nesting level a crucial component of this request, or would it solve most of the use cases if CirrusSearch provided a keyword insection:"one two" that filters for pages that have one and two anywhere in the same direct section? DCausse (WMF) (talk) 17:03, 24 January 2023 (UTC)
    It doesn't have to store the nested structure. It can just perform a search the normal way and then narrow the results down. That too can be costly when the result set is large, of course, but it can just time out in that case, which is how regex search already works.
    I do think specifying the nesting level is a crucial component, because not all pages use the same level (cf. CfD, TfD, etc.), and looking in the same immediate section only will miss many wanted results, because many discussions have subsections such as "arbitrary breaks". See e.g. w:WP:VPR to get an idea. (Also, how is one supposed to search for phrases in that scenario you suggest?) Nardog (talk) 19:12, 24 January 2023 (UTC)
    Thanks for the clarification; knowing that the section level is crucial will help a lot in determining the feasibility of your proposal. Regarding my suggestion, sorry if it was a bit vague. You are correct that the syntax is not very handy for combining with other search features, but perhaps something with parentheses around the terms might be more appropriate: insection:(word1 word2 "one two"). DCausse (WMF) (talk) 10:47, 25 January 2023 (UTC)
  • Semi-relatedly, it is worth noting that CirrusSearch does not do a very good job of presenting section names in the search results: it is not able to correlate highlighted words with their corresponding section (phab:T131950). DCausse (WMF) (talk) 17:03, 24 January 2023 (UTC)
  • Wouldn't the right solution to the problem be to sufficiently boost hits where the keywords are in the section title? (Which I think might be happening already to some extent.) Readers won't use advanced search keywords, and an inputbox would restrict all the entered words to the section title, which probably isn't a great experience. It's just like when you are searching for an article: you want the search results to prioritize full or close title matches, but if some terms you have given don't appear in the title but appear a lot in the article, you don't want to disqualify that page from the results. --Tgr (talk) 06:38, 5 February 2023 (UTC)
    It wouldn't. The proposal comes from frustration at not being able to find (usually archived) threads that contain this term and that term (e.g. names of participants) without having to comb through the results using lots of tabs and Ctrl-F. Nardog (talk) 08:16, 5 February 2023 (UTC)
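
The two ideas raised in this thread, the insection:(...) syntax and "search normally, then narrow down with a time limit", can be sketched together as below. This is only an illustration under stated assumptions: insection is a suggestion made above, not an existing CirrusSearch keyword, and `parse_insection`, `narrow`, and the `budget_seconds` parameter are hypothetical names. Phrases are matched as plain case-insensitive substrings.

```python
import re
import time

# Hypothetical syntax from the thread: insection:(word1 word2 "one two")
INSECTION_RE = re.compile(r'insection:\(([^)]*)\)')
# A token is either a double-quoted phrase or a run of non-space characters.
TOKEN_RE = re.compile(r'"([^"]+)"|(\S+)')


def parse_insection(query):
    """Return the words and quoted phrases inside insection:(...), or []."""
    m = INSECTION_RE.search(query)
    if not m:
        return []
    return [phrase or word for phrase, word in TOKEN_RE.findall(m.group(1))]


def narrow(candidates, terms, budget_seconds=1.0):
    """Post-filter (title, section_text) pairs from an ordinary search.

    Keeps the titles whose section text contains every term, and stops
    early once the time budget is used up. Returns (hits, timed_out) so
    the caller can tell partial results from complete ones.
    """
    deadline = time.monotonic() + budget_seconds
    wanted = [t.lower() for t in terms]
    hits = []
    for title, text in candidates:
        if time.monotonic() > deadline:
            return hits, True
        lowered = text.lower()
        if all(t in lowered for t in wanted):
            hits.append(title)
    return hits, False
```

Returning a timed_out flag rather than raising mirrors the "it can just time out" behaviour described above: the caller can still show whatever sections were matched before the budget ran out.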

Voting