Community Wishlist Survey 2017/Search/Unlimited number of search results


Unlimited number of search results

  • Problem: Elasticsearch (the Wikipedia search engine) has a hard limit of 10,000 search results, imposed to prevent DDoS attacks. However, it means anyone who wants more than 10,000 results must download a full Database Dump and use AWB or homegrown tools, which is costly and slow. The same limit applies to API:Search.
  • Proposed solution: Deploy a solution that lifts the Elasticsearch limit for trusted users/developers.
  • Who would benefit: Bot writers and anyone needing more than 10,000 search results.

Discussion

Probably a good idea to bundle this with the bot flag? Headbomb (talk) 00:33, 10 November 2017 (UTC)

Perhaps it would be nice to elaborate on the use cases here; as described in the Phabricator ticket, there are technical limitations that may be hard to circumvent with the current API parameters. For example: is ranking still important for such use cases? Would the API client be OK with maintaining more state on its side to help the search engine? The main blocker here is that the search engine needs to hold offset+size results in memory on multiple machines. In short, to make this happen we will certainly have to drop some features or make a dedicated API endpoint with a limited set of features. That's why I suggest discussing the use cases here, so that we can evaluate the feasibility. Thanks! DCausse (WMF) (talk) 09:09, 10 November 2017 (UTC)
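To make the blocker concrete, here is a minimal sketch contrasting naive deep paging with Elasticsearch's search_after pattern. The URL and index name are hypothetical (Wikimedia's cluster is not publicly queryable); it only illustrates the mechanism and trade-off described above, not any actual Wikimedia endpoint:

```python
# Minimal sketch of the deep-paging cost and the usual Elasticsearch
# workaround. ES_URL and the index name are hypothetical; this is
# illustrative only, not a Wikimedia-provided endpoint.
import requests

ES_URL = "http://localhost:9200/enwiki_content/_search"

# Naive deep paging: to serve from=9500, size=500, every shard must score
# and hold 10,000 hits in memory, and the coordinator merges them all, so
# cost grows with the offset -- hence the 10,000-result hard limit.
naive_body = {"query": {"match": {"text": "example"}}, "from": 9500, "size": 500}

# search_after: each request resumes from the sort values of the last hit,
# so shards only ever hold one page. The trade-off is exactly the feature
# drop mentioned above: results come in a stable sort order (_doc here),
# not relevance ranking.
def iterate_all(query):
    body = {"query": query, "size": 500, "sort": ["_doc"]}
    while True:
        hits = requests.post(ES_URL, json=body).json()["hits"]["hits"]
        if not hits:
            break
        yield from hits
        body["search_after"] = hits[-1]["sort"]  # resume past the last hit
```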

Hi User:DCausse (WMF), ok great, thanks for exploring this more. In my experience the use case is only for generating a list of article titles whose body (or optionally title) contains the search string. No snippets, ranking, etc., just a list of titles. It's the same use case as for AWB users, who currently need to download the entire Dump, and whose searches can take a long time. A dedicated API endpoint would be great; it could use the standard API maximum of 500 results per request. The search would ideally support regex via the insource:/<regex>/ syntax. -- GreenC (talk) 15:34, 11 November 2017 (UTC)
@GreenC: What specific use case do you have in mind for this? Ryan Kaldari (WMF) (talk) 00:21, 21 November 2017 (UTC)
@Ryan Kaldari (WMF): Use case: a dedicated API endpoint that generates a list of article titles whose body (or optionally title) contains the given search string. This is a very common task for bot operators. For example, a bot that fixes articles containing a doubled 'the' ("the The Washington Post") would search on the regex /[Tt]he[ ][Tt]he/ and generate a list of article titles. This list is then fed to your bot or AWB so it knows which articles to target for correction. -- GreenC (talk) 00:51, 21 November 2017 (UTC)
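For illustration, this workflow with today's documented list=search API (using the regex example above) looks roughly like the sketch below; it pages through results 500 at a time and simply stops once the 10,000-result cap is reached:

```python
# Sketch of the current bot workflow: page through list=search with
# continuation and collect titles only. Uses documented API parameters;
# the run ends silently once the 10,000-result cap is reached.
import requests

API = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "query",
    "list": "search",
    "srsearch": "insource:/[Tt]he[ ][Tt]he/",  # regex search, expensive server-side
    "srlimit": "max",   # 500 with apihighlimits, otherwise 50
    "srprop": "",       # titles only -- no snippets or ranking metadata
    "format": "json",
}

titles = []
while True:
    data = requests.get(API, params=params).json()
    titles.extend(hit["title"] for hit in data["query"]["search"])
    cont = data.get("continue")
    if cont is None:    # results exhausted -- or the offset cap was hit
        break
    params.update(cont)

print(len(titles), "titles collected")
```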
@GreenC: I'm not sure that having more than 10K results of 'the The' would be helpful. Wouldn't you want them in manageable batches, letting the bot/person go in and make corrections, and then run another query to show the next set of 10K issues? deb (talk) 19:56, 21 November 2017 (UTC)
@DTankersley (WMF): The batching is handled by the API endpoint, which allows one to pull 500 results per request (defined in the API request). For a number of reasons, it doesn't work well (or at all) to build an application around an API that won't return the full search results. For example, depending on the complexity of the search and the number of API calls, the full list needs to be deduplicated before the bot processes it, to avoid processing an article multiple times. It is also challenging to build an application that must do a full run every 10k titles before starting over to get a new set of articles. And there are applications that only read from the database rather than writing to it, where the 10k limit is a barrier. I could probably think of other reasons, but these are all things I've experienced. -- GreenC (talk) 20:16, 21 November 2017 (UTC)
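As a small sketch of the deduplication point (the helper name is made up): when a run spans many API calls or several overlapping queries, and the index shifts between calls, the same title can come back more than once, so the combined list has to be uniqued before the bot edits anything:

```python
# Illustrative only: merge batches of titles from multiple API calls or
# overlapping queries, keeping the first occurrence so no article is
# processed twice. The function name is hypothetical.
def merge_result_batches(batches):
    seen = set()
    for batch in batches:          # each batch is one API page of titles
        for title in batch:
            if title not in seen:  # index churn between calls can repeat titles
                seen.add(title)
                yield title
```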
  • Bundle it with any advanced user level (bot, admin, template editor, edit filter manager, maybe file mover and page mover), since these already require a level of trust and competence. The ability to do this would be especially useful for those doing big piles of maintenance work.  — SMcCandlish ¢ ≽ʌⱷ҅ʌ≼  08:17, 4 December 2017 (UTC)

Voting