Category math

From Meta, a Wikimedia project coordination wiki

Category math is a feature for performing set-logic operations on categories to retrieve a list of pages, for example the pages that are in both category A and category B (intersection). This feature is under development and is not available on Wikipedia or in the standard MediaWiki install.

Types of category math

There are various types of category math we would like to be able to do:

  • Category intersection - (implemented)
    • two categories - (implemented)
    • more than two categories - (implemented)
  • Category union
    • two categories
    • more than two categories
  • Category subtraction - (implemented)
  • Full set logic formulae like (!A)|B or (!A)&(!B)
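The operations listed above map directly onto ordinary set operations. The following is a minimal sketch, assuming each category is held in memory as a set of page titles; the page titles and the `categories` mapping are invented for illustration and are not part of any existing implementation.

```python
# Hypothetical sample data: category name -> set of page titles.
categories = {
    "Music": {"Opera", "Libretto", "Jazz"},
    "Literature": {"Novel", "Libretto", "Opera"},
    "Party_Favors_and_Jokes": {"Opera"},
}

music = categories["Music"]
literature = categories["Literature"]
jokes = categories["Party_Favors_and_Jokes"]

# Intersection: pages in both categories.
both = music & literature

# Union: pages in either category.
either = music | literature

# Subtraction: pages in Music and Literature but not in Party_Favors_and_Jokes.
filtered = (music & literature) - jokes

# A complement such as !A only has a meaning relative to the set of all pages,
# which is why it is expensive on a large wiki.
all_pages = set().union(*categories.values())
not_music = all_pages - music
```

Note that the complement is the only operation here that needs the full page set; the others touch only the categories named in the query.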

The Implementation

Aerik has created a partial implementation: User:Aerik/Intersections_code. You can try it out on his server.

It does not do unions yet (pages in category A or category B).

User Interface

User interface design for complicated category math options is a little tricky. Aerik's implementation could be improved in this area.

Currently it is implemented as a 'special page' that presents a query builder in table format. Clicking links in the table builds a query, which is displayed at the top and looks something like:

Results having categories: Music AND Literature
And not having categories: Party_Favors_and_Jokes

In the URL, such a query is expressed as

catlist=Music+Literature&catexclude=Party_Favors_and_Jokes

Query results are shown at the bottom.
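As an illustration of how such a query string could be decoded, here is a minimal sketch using Python's standard library. It assumes that category names are written with underscores instead of spaces, so that the "+" separator (which decodes to a space) can be used to split the list; the function name `parse_category_query` is hypothetical and is not taken from Aerik's code.

```python
from urllib.parse import parse_qs


def parse_category_query(query_string):
    """Split a catlist/catexclude query string into include and exclude lists."""
    params = parse_qs(query_string)
    # parse_qs decodes "+" to a space, so several categories in one
    # parameter arrive as a single space-separated string.
    include = params.get("catlist", [""])[0].split()
    exclude = params.get("catexclude", [""])[0].split()
    return include, exclude


include, exclude = parse_category_query(
    "catlist=Music+Literature&catexclude=Party_Favors_and_Jokes"
)
print(include)  # ['Music', 'Literature']
print(exclude)  # ['Party_Favors_and_Jokes']
```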

Other UI ideas

Other ideas for user interfaces have been discussed. See the discussion page for more on these.

  • The basic 'intersections' queries should be easier and clearer.
  • Extend the category page display, instead of or in addition to this Special:Intersections page.
  • Show a formula and allow users to edit it directly (for advanced users). We would need to decide on the best/standard notation for this. See also the scalability issues below.

Yet another idea: several input boxes with a logical operator selected between each pair of inputs. Something like:

[.......] OR  [.......] OR  [.......] OR  [.......] OR  [.......]
          AND           AND           AND           AND
          XOR           XOR           XOR           XOR

Search (French OR English BUT (XOR) German) AND (Poets AND Writers): French or English, but not German, poets and writers.

Cat:French OR  Cat:English OR  Cat:German OR  Cat:Poet OR  Cat:Writer
           AND             AND            AND          AND
           XOR             XOR            XOR          XOR

Scalability and Server load

For large wiki installations such as Wikipedia (lots of pages in lots of categories), scalability is an issue. For starters, the current implementation would yield gigantic lists, the HTML for which would take a long time to render and would be unusable.

The actual query CPU time is also a consideration. For this to work at a reasonable efficiency, each category should probably be stored as a list of the articles belonging to it (rather than each article storing the categories it belongs to). How are MediaWiki categories presently implemented? Caching would be useful even if intersections were limited to just two topics, because that already allows such things as Finnish Biologists and Bridges in Las Vegas. Common two-topic intersections should be cached.

Problematic set formulae

Requesting something like "!A" (e.g. all pages not in the 'fruit' category) could put a lot of load on the server, since it would have to go through every category except 'fruit'. We should probably disallow expressions like this. Other expressions such as (!A)|B or (!A)&(!B) would also load the server. We would need to think of a way to detect and block such formulae.
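One possible way to detect such formulae is to check whether the result is guaranteed to lie inside the categories named in the formula. The sketch below is only one approach, under the assumption that a formula has already been parsed into nested tuples of the form ("cat", name), ("not", x), ("and", x, y), and ("or", x, y); this representation and the function names are hypothetical.

```python
def boundedness(expr):
    """Return (bounded, co_bounded) for a formula.

    bounded: the result is contained in a union of named categories, so it
    can be computed from category lists alone.
    co_bounded: the complement of the result is bounded in that sense.
    """
    op = expr[0]
    if op == "cat":
        return True, False
    if op == "not":
        bounded, co_bounded = boundedness(expr[1])
        # Negation swaps the two properties.
        return co_bounded, bounded
    left_b, left_c = boundedness(expr[1])
    right_b, right_c = boundedness(expr[2])
    if op == "and":
        # An intersection is contained in each operand, so one bounded
        # operand is enough.
        return left_b or right_b, left_c and right_c
    if op == "or":
        # A union is bounded only if both operands are.
        return left_b and right_b, left_c or right_c
    raise ValueError(f"unknown operator: {op!r}")


def is_allowed(expr):
    """Accept only formulae whose result is bounded by named categories."""
    return boundedness(expr)[0]


A = ("cat", "A")
B = ("cat", "B")

print(is_allowed(("not", A)))                       # False: !A
print(is_allowed(("or", ("not", A), B)))            # False: (!A)|B
print(is_allowed(("and", ("not", A), ("not", B))))  # False: (!A)&(!B)
print(is_allowed(("and", A, ("not", B))))           # True: A&(!B), i.e. subtraction
```

Under this rule, subtraction stays allowed because the result can never be larger than category A, while any formula that could match pages outside the named categories is rejected.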

See also