Beyond categories

Update: Much has changed since 2013. Among other things, the Wikimedia Foundation just received a grant for US$3 million to address many of the challenges described here. The "Beyond Categories" project as imagined is now historical, but the intent behind it was a contribution to the discussion and awareness that led to the grant. Blue Rasberry (talk) 00:18, 11 January 2017 (UTC)

Numerous problems have been identified with the MediaWiki category system. DBpedia, the WMF implementation of Wikidata, and Semantic MediaWiki offer hints of a solution. This page links to current discussions, including listservs, and offers a centralized place to discuss problems and proposed solutions, both strategically and technically.

This page is an outgrowth of a discussion on glam-us-l that followed a presentation by Jarekt at the 2013 GLAMWiki Boot Camp D.C. That presentation in turn drew on ideas that have been discussed on Commons, on and off, for years. Some of those ideas were also voiced by User:Multichill here and here.

Problems with the existing category system

This is a list of problems with the existing system that might be addressed by something that utilizes Wikidata and Semantic Web technologies.

Maintenance of subcategory trees (see, e.g. commons:Category:Painted portraits of men of France)
Over- and under-categorization (see Commons:Categories#Over-categorization)
The possibility of implied sexism or racism by creating some subcategories but not others (see, e.g., "What’s missing from the media discussions of Wikipedia categories and sexism", "Wikipedia's Sexism Toward Female Novelists" (NY Times) and "Women Novelists Wikipedia: Female Authors Absent From Site's 'American Novelists' Page?" (Huffington Post))
Categories are in a single language; on multilingual projects this is generally English. This works against goals of multilingualism and tends to promote the dominance of native English speakers in the administration of the project.
Inconsistent category hierarchies on the various language Wikipedias
The perennial ambiguity about what should be a category and what should be a list. For example, the recent consensus (lost the link) that it's okay to create subcategories for birth years, but not for birth days, the latter being more appropriate for a list page.
For some additional issues see page 10 of Jarekt's presentation
For another recent off-wiki commentary on the issue: Wikimedia Commons et système de classement 26 April 2013 by Jean-Frédéric.

Links to stakeholder sites

This is a (tentative) list of groups that we should contact to involve in this discussion.

Goal of this effort

Ideally, we would be able to come up with a new system that replaces the existing category system, across most or all of the Wikimedia projects, but that provides at least as much, or nearly as much, flexibility and ease of use.

(Using Wikipedia as an example, but keep in mind that this should apply to all the projects.) Each article would be marked up with a set of assertions that make statements of fact about: the subject matter of the article (for example, Bob Dylan), and the article itself (this Wikipedia article about Bob Dylan).

Assertions about the subject of the article would be things like, "Bob Dylan was a musician", "Bob Dylan was born on May 24, 1941".

Assertions about the article itself would largely correspond with the existing hidden categories, for example, "This article has inconsistent citation formats".

The data should be stored in Wikidata, for several reasons: so that it can be used across all the wikis, because Wikidata is tightly integrated with the MediaWiki projects already, ...

These assertions should be encoded as RDF triples using a defined logic-based ontology. That would allow very useful relationships to be discerned. For example, if someone is an architect, then they are also a person. Or, if Kate is the child of Paul, then Paul is a parent. Approaches to figuring out what this ontology should look like are described below.
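As a rough illustration of the kind of inference described above, here is a small Python sketch that stores assertions as (subject, predicate, object) tuples and applies two hand-written rules until nothing new can be derived. The predicate names (`instance_of`, `subclass_of`, `child_of`), the class names, and "Alice" are all made up for this example. They are not taken from Wikidata, OWL, or any existing ontology, and a real system would presumably hand this job to an OWL reasoner rather than to hand-coded rules.

```python
# Minimal sketch: triples as tuples plus naive forward-chaining inference.
# Every predicate, class, and individual name below is hypothetical.

def infer(triples):
    """Return the closure of `triples` under two illustrative rules."""
    facts = set(triples)
    while True:
        new = set()
        for s, p, o in facts:
            # Rule 1: X instance_of C and C subclass_of D  =>  X instance_of D
            if p == "instance_of":
                for s2, p2, o2 in facts:
                    if p2 == "subclass_of" and s2 == o:
                        new.add((s, "instance_of", o2))
            # Rule 2: X child_of Y  =>  Y instance_of Parent
            if p == "child_of":
                new.add((o, "instance_of", "Parent"))
        if new <= facts:
            # Nothing new was derived, so the closure is complete.
            return facts
        facts |= new


assertions = {
    ("Architect", "subclass_of", "Person"),
    ("Alice", "instance_of", "Architect"),
    ("Kate", "child_of", "Paul"),
}
closure = infer(assertions)
```

With these inputs, `closure` also contains `("Alice", "instance_of", "Person")` and `("Paul", "instance_of", "Parent")`, neither of which was asserted directly.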

The user interface to allow users to create and edit these assertions, and to assign them to pages, must be very intuitive and easy to use. In particular, the maintainers of these assertions shouldn't have to know anything about OWL or SPARQL queries. To them, it should be just as simple as maintaining a list of tags. Having said that, the types of assertions need to be well defined, and constrained somewhat (does it?), so that frivolous or redundant tags (for example, "Bob Dylan has five fingers on his right hand") don't become a problem. Probably, this issue is not much different from the existing issues and policies around creating new categories.
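One possible way to constrain the types of assertions, sketched here in Python, is to check each proposed assertion against a small agreed vocabulary of predicates before it is saved. The vocabulary below (`occupation`, `date_of_birth`) and the `validate` helper are hypothetical and only meant to show the shape of the idea. How such a vocabulary would actually be agreed on is a policy question, much like the existing policies around creating new categories.

```python
# Hypothetical vocabulary: the predicates editors may use, with the Python
# type each value is expected to have. Not an existing Wikidata schema.
ALLOWED_PREDICATES = {
    "occupation": str,
    "date_of_birth": str,  # e.g. "1941-05-24"
}


def validate(assertion):
    """Return a list of problems found in a (subject, predicate, value) assertion."""
    subject, predicate, value = assertion
    problems = []
    if predicate not in ALLOWED_PREDICATES:
        problems.append(f"unknown predicate: {predicate}")
    elif not isinstance(value, ALLOWED_PREDICATES[predicate]):
        problems.append(f"wrong value type for {predicate}")
    return problems


ok = validate(("Bob Dylan", "occupation", "musician"))
frivolous = validate(("Bob Dylan", "fingers_on_right_hand", 5))
```

Here `ok` is an empty list, while `frivolous` reports the unknown predicate, so the interface could refuse the tag or ask the editor to propose it for the vocabulary first.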

Michael Hale, in his email to wikidata-l, had some ideas about the user interface.^[1]

To minimize disruption (a certain amount of disruption is unavoidable), the new data should allow for the recapitulation of the existing categories. So, "Painted portraits of men of France" should be a query that could be run against the data that this project produces. Furthermore, ideally, we could auto-generate pages for each of these queries corresponding to existing categories, and have them replace the existing category pages (presumably with a redirect). In short, this would be a form of round trip:

Derive assertions from the existing category system
For each page in an existing category, populate Wikidata with the appropriate assertions derived from its category
For each existing category, design a SPARQL query whose result set corresponds to all the pages in that category
Redirect the existing category to a page that gives those query results.
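The round trip above can be sketched in a few lines of Python, with plain sets standing in for Wikidata and a simple subset test standing in for the SPARQL query. Everything specific here is an assumption made for illustration: the predicate names, the file titles, and the hand-written mapping from the category name to assertions. In practice, deriving that mapping (the first step) is the hard part.

```python
# Sketch of the round trip. All predicates and page titles are placeholders.
CATEGORY = "Painted portraits of men of France"

# Step 1: assertions derived (here, by hand) from the category name.
CATEGORY_RULES = {
    CATEGORY: {
        ("genre", "painted portrait"),
        ("sitter_gender", "male"),
        ("sitter_country", "France"),
    },
}

# Existing category membership.
CATEGORY_MEMBERS = {
    CATEGORY: {"File:Example A.jpg", "File:Example B.jpg"},
}


def populate(rules, members):
    """Step 2: give every page the assertions implied by its categories."""
    store = {}
    for category, pages in members.items():
        for page in pages:
            store.setdefault(page, set()).update(rules[category])
    return store


def query(store, required):
    """Step 3: pages whose assertions include every required pair."""
    return {page for page, facts in store.items() if required <= facts}


store = populate(CATEGORY_RULES, CATEGORY_MEMBERS)
# A page that shares only one assertion should not show up in the results.
store["File:Example C.jpg"] = {("genre", "painted portrait")}

result = query(store, CATEGORY_RULES[CATEGORY])
round_trip_ok = result == CATEGORY_MEMBERS[CATEGORY]
```

If `round_trip_ok` holds for a category, the query results could safely take the place of the category page (the last step). A category for which it fails would need its derived assertions revisited before any redirect.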

It is pretty ambitious, but maybe it can be done. Note that all of the work can be done behind the scenes, up until the last step, without disrupting existing category infrastructure.

Selecting an ontology

Patrick Cassidy wrote this to the wikidata-l mailing list:^[2]

If one is interested in a functional category system, it would be very helpful to have a good logic-based ontology as the backbone.

I haven't looked recently, but when I inquired about the ontology used by DBpedia a year ago, I was referred to dbpedia-ontology.owl, an ontology in the format of the semantic web ontology format OWL. The OWL format is excellent for simple purposes, but the dbpedia-ontology.owl (at that time) was not well-structured (being very polite). I did inquire as to who was maintaining the ontology, and had a hard time figuring out how to help bring it up to professional standards. But it was like punching jello, nothing to grasp onto. I gave up, having other useful things to do with my time.

Perhaps it is time now, with more experience in hand, to rethink the category system starting with basics. This is not as hard as it sounds. It may require some changes where there is ambiguity or logical inconsistency, but mostly it is only necessary to link the Wikipedia categories to an ontology based on a well-structured and logically sound foundation ontology (also referred to as an upper ontology), that supplies the basic categories and relations. Such an ontology can provide the basic concepts, whose labels can be translated into any terminology that any local user wants to use. There are several well-structured foundation ontologies, based on over twenty years of research, but the one I suggest is the one I am most familiar with (which I created over the past seven years), called COSMO. The files at http://micra.com/COSMO will provide the ontology itself (COSMO.owl, in OWL) and papers describing the basic principles. COSMO is structured to be a primitives-based foundation ontology, containing all of the semantic primitives needed to describe anything one wants to talk about. All other categories are structured as logical combinations of the basic elements. Its inventory of primitives is probably incomplete, but is able to describe everything I have been concerned with for years (7000 categories and 800 relations thus far) and can always be supplemented as required for new fields. With an OWL ontology, queries can be executed by any of several logic-based utilities. Making the query system easy for those who prefer not to build SPARQL queries (including myself) would require some programming, but that is a minuscule effort compared to what has already been put into the DBpedia database. Tools such as Protégé make it easy to work with an OWL ontology, and there is a web site where an OWL ontology can be developed collaboratively.

Jona Christopher Sahnwaldt responded, regarding the ontology used by DBpedia:^[3]

The ontology is maintained by a community that everyone can join at http://mappings.dbpedia.org/ . An overview of the current class hierarchy is here: http://mappings.dbpedia.org/server/ontology/classes/ . You're more than welcome to help! I think talk pages are not used enough on the mappings wiki, so if you have ideas, misgivings or questions about the DBpedia ontology, the place to go is probably the mailing list: https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion

Processing existing categories

The CatScan tool can be used to automatically process existing article categories.

Further information can be found at Help:Category#Extensions. Additional extensions can be installed on instances of the MediaWiki software outside the Wikimedia Foundation cluster; some of these are listed at mw:Category:Category intersection extensions.

References

Semantic MediaWiki page, Semantic MediaWiki and Wikidata
Semantic MediaWiki FAQ, What is the relationship between Semantic MediaWiki and Wikidata?
Wikidata query tool

[hale-1] Email from Michael Hale to the wikidata-l list

[cassidy-2] Email from Patrick Cassidy to the wikidata-l list

[sahnwaldt-3] Email from Jona Christopher Sahnwaldt to the wikidata-l list
