Talk:Categorization requirements

From Meta, a Wikimedia project coordination wiki

I have expanded this a bit, leaving room for what I do not know about: implementation. So, it would be great if we could cover both the design and implementation details, and note where they are separate and where they overlap, and for those better at one or the other to concentrate on that. Brent Gulanowski 08:14, 18 Dec 2003 (UTC)

I'd really rather prefer to keep Categorization as a list of possible categorization implementations, and concentrate here on what we want to accomplish w/r/t categorization. --Evan 17:34, 18 Dec 2003 (UTC)

The most important thing for a working category scheme is simplicity of use. I'm afraid that if people need to think about any kind of special markup just to make the categories work, they won't use it. Even writing something like [[Category:foo]] may be too much for them. I have consistently maintained that having a category box, like the summary box that is already there on the edit page, would be the simplest. Many contributors still ignore the summary box, and just as many will ignore the category box; there's no way to prevent that.

You're right, but then again, many users will enjoy doing the categorization work more than the article writing. I am hoping that, under the "Summary" box, there will not be a category box but a "page this page is related to" box. The category of that page (or its primary category, if it has many), becomes the (primary) category of the new page. Brent Gulanowski
Certainly some people will like putting things into categories as much as others enjoy writing articles, or editing. I would count on that as our most reliable basis for getting things done. I still prefer "category" to a relationship list. Many of the relationships are already there through the links on the page and the "what links here" function. I see a category as something that expresses a feature that pages with that category have in common. The term "primary category" can be misleading. Some articles may be categorized in several primary categories. José Echegaray became independently famous in three areas: as a mathematician, as a playwright, and as a politician; which will be primary? Eclecticology 23:31, 20 Dec 2003 (UTC)
I admit that my terminology is too general. I was thinking of a "primary" category, not as a measure of importance, but an organizational tool, for example to be used when simplifying the category graph to a tree for simplified rendering (say, in pure text). I understand your desire to distinguish a category relationship versus a link relationship, yet a category is still a relationship. As has been noted, category membership can easily be expressed in terms of links, so it is simply up to editors to have a good sense of a categorical relationship.
We may not have the same perceptions. I would find a tree useful as a series of links that connect an article with the main page, and one should be able to expect, perhaps by using "What links here", to be able to trace a path or paths to the main page to develop a measure of proximity. But I see this and your approach (if I understand it correctly) as a link-centred perspective. In the sense that any two articles are equal because they are both articles, a link defines a relationship between equals. A category is not an article, and therefore is not equal. It perhaps serves some sort of meta function. In set-theoretical terms one must not confuse the set (i.e. the category) with its elements (i.e. the articles). Eclecticology 10:45, 21 Dec 2003 (UTC)
My approach is not defined by hyper-links between articles. I was merely making a comparison. I said that a category is a relationship, and a link is a relationship, but they are different kinds (which I'm sure we agree on). My approach is like this: article A is in set S if P(A) is true for some P. If P(B) is also true, then B is in set S as well. I am focussing on P, instead of S (in the implementation), because you can define P in terms of A and B. So an editor notes that B is "in the same category as A", and uses this relationship, effectively telling the system that P(B). A by-product is that the categories are more wiki-centric, for good or ill. The mechanism for defining P(x), x from {A,B, ...} is an implementation detail. (Also, P need never be explicit, and anyway I would not know how to define it, although it is probably a significant question for epistemology.)
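One way to picture the "in the same category as" relationship described above is as a union-find structure, where the predicate P is never written down and each group of related articles forms an implicit category. This is only a sketch under assumptions: the class name SameCategory and its methods are hypothetical, and it models a single predicate, so an article that belongs to several categories would need one such structure per relationship.

```python
class SameCategory:
    """Union-find over article titles; each group is one implicit category."""

    def __init__(self):
        self.parent = {}  # article -> representative pointer

    def find(self, article):
        # Unknown articles start out alone in their own group.
        self.parent.setdefault(article, article)
        while self.parent[article] != article:
            # Path halving keeps later lookups short.
            self.parent[article] = self.parent[self.parent[article]]
            article = self.parent[article]
        return article

    def relate(self, a, b):
        """Record that an editor marked b as 'in the same category as' a."""
        root_a = self.find(a)
        root_b = self.find(b)
        self.parent[root_b] = root_a

    def same(self, a, b):
        """True if a and b have been linked, directly or through other articles."""
        return self.find(a) == self.find(b)


s = SameCategory()
s.relate("A", "B")  # editor asserts P(B) by pointing at A
s.relate("B", "C")
print(s.same("A", "C"))
```

The relationship is transitive here, which matches the idea that P(A) and P(B) put both articles in the same set S without anyone naming S.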
Also, I got confused above, talking about graphs -- I think it is apparent that a category only has one parent (thus the graph is already a tree), although an article can have more than one category. Filing an article in two categories could be interpreted as defining a hybrid category. If you graphed the categories and included the hybrids, you would get a more complex graph. I'm not sure if you would want to do that or not. Brent Gulanowski
I perfectly understand the confusion about graphs; using words to represent something pictorial is bound to be confusing. "Graph" is a very generic term for the pictorial representation of data, of which a "tree" is only one kind. Rather than to say that "the graph is already a tree" I would say that the reverse is true. Eclecticology 19:21, 23 Dec 2003 (UTC)
This raises a question: should the categories in the wiki be defined by the articles in the wiki (which I, somewhat puritanically, prefer), or in terms of categories already in existence "out there"? I can see the argument for using categories from other media, libraries or other collections, but it would make browsing the wiki by category more difficult, because too many pages would be in categories all alone.
Both! They aren't incompatible. How people browse may be a bigger difficulty. Some people never get the knack of how to find things with Google. Looking for a book in the library stacks is a qualitatively different experience from looking for it in a card catalog. Categories of one with no growth prospects can often be consolidated into appropriate "others" categories. Eclecticology 10:45, 21 Dec 2003 (UTC)
I also agree that my suggested mechanism for defining the pages in a shared category has weaknesses. Its strength is that it tries to reinforce the idea that a category is only useful in relation to some wider collection of knowledge (i.e., to the other articles in the category, and to super-categories). So my approach is meant to emphasize that a category defines a partition of a larger category, and that a partition is meant to be made of multiple articles (although single-article categories cannot be ruled out if articles can be added to a category one at a time). Do you think that these are worthwhile goals? Brent Gulanowski 00:12, 21 Dec 2003 (UTC)
The top level categories are themselves partitions of the universal set. I see your explanation more as a statement of fact than as a goal. Eclecticology 10:45, 21 Dec 2003 (UTC)
Some have said they don't want a "universal" set (i.e.: a default "all" category), by which I mean an explicit main category. Implicit categories are probably meaningless, so can you clarify whether you mean explicit or implicit. Brent Gulanowski
In effect the universal set is the implicit set of all articles. It's not meaningless; it's only useless. It merely serves to make theoretical discussions like this one workable. :-)

If they want to put something in that they believe fairly reflects the article they should also be able to do that without the need to look around for a list of approved categories. At this stage developing habits is far more important. Somebody else can always edit the categories later if need be. Not prescribing categories will allow for an organic development of the categories that reflects need. Some format issues will need to be decided: most importantly what delimiter should be used to separate categories. How do we indicate subcategorization?

Can we allow for a case-sensitive coding system for sub-categories that allows for wild-card characters? A coding system can easily fit in approval codes for something like the printed Wikipedia. This would allow for a search in the category box for "ABC*" to turn up anything that begins with "ABC". In a plain text category system the search under "math*" should be able to give "math", "maths", "mathematics" as well as several unpredictable misspellings.
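The wildcard search described above could behave roughly as follows. This is a minimal sketch, not a proposal for the real search code: the function name search_categories and the sample category list are made up for illustration, and it relies on Python's standard fnmatch.fnmatchcase for case-sensitive matching.

```python
from fnmatch import fnmatchcase


def search_categories(pattern, categories):
    """Return the categories matching a case-sensitive wildcard pattern."""
    return sorted(c for c in categories if fnmatchcase(c, pattern))


# Hypothetical mix of plain-text categories and codes.
categories = ["math", "maths", "mathematics", "mathmatics",
              "Math", "music", "ABC12", "abc12"]

print(search_categories("math*", categories))
# ['math', 'mathematics', 'mathmatics', 'maths']
print(search_categories("ABC*", categories))
# ['ABC12']
```

Note that a trailing wildcard only catches misspellings that leave the prefix intact; "mathmatics" is found, but something like "mahtematics" would not be.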

If categories are determined by page relationships, prescribed categories aren't needed. I agree that they are not desirable. A subcategory is merely a set of articles in a category that are chosen to be related in a closer way than the rest of the articles in the original category. By delimiter, are you referring to a syntactical token? See Evan's comment above re:implementation. BTW, my hope is that users will not be permitted to simply add random category titles to pages, thereby reducing (but not eliminating) duplication. Merging duplicate categories will have to be an active task. Searching for categories should be as simple as finding the parent category and looking at the list of sub-categories, but I can see where a name search might be desirable. Brent Gulanowski
I suppose that "syntactical token" could be equivalent to my "delimiter". I see subcategories as subsets of categories without any polyphyletic features. We differ in that I believe that it would be healthy to allow "random" categories. Some duplication will be inevitable, but it allows people to put things in the categories where they feel comfortable putting them. "Movie", "film" and "cinema" can all refer to the same thing; a contributor should have the opportunity to use whichever one he wants, without the need to be always looking in some instruction sheet for the right category. If they are forced to do that, they won't. With time one of those categories may become dominant, and adjustments can be made then. Favoring particular categories from the beginning, however logical, remains a top down approach. My position has shifted considerably from what it was a year ago when categories were first discussed. At that time I favoured a particular category scheme, but now I see that scheme as only one of several possibilities that should be allowed in addition to plain text categories. Eclecticology 00:04, 21 Dec 2003 (UTC)
There is a tension between keeping categorization orderly and making it flexible. What if my idea to use unique IDs for categories and separate titles is expanded to allow for multiple titles? Also, what if authors were permitted to add categories of any kind, but editors were to do a bit of consolidation, for example identifying overly similar categories with different names and merging them into one category, but preserving both titles, so that a search for "movie" would get articles categorized as either "movie" or "film". Otherwise you might get authors adding both categories to the same article.
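The idea of unique category IDs with several titles, plus merging by editors, might look something like the sketch below. Everything here is an assumption made for illustration: the class CategoryRegistry, its method names, the integer IDs, and the sample article titles are not part of any existing software.

```python
class CategoryRegistry:
    """Categories identified by ID; each ID may carry several titles."""

    def __init__(self):
        self.title_to_id = {}  # title -> category ID
        self.titles = {}       # category ID -> set of titles
        self.members = {}      # category ID -> set of article names
        self._next_id = 1

    def category_for(self, title):
        """Return the ID for a title, creating a new category if needed."""
        if title not in self.title_to_id:
            cid = self._next_id
            self._next_id += 1
            self.title_to_id[title] = cid
            self.titles[cid] = {title}
            self.members[cid] = set()
        return self.title_to_id[title]

    def add_article(self, article, title):
        """Authors may file an article under any title they like."""
        self.members[self.category_for(title)].add(article)

    def merge(self, keep_title, drop_title):
        """Editors fold one category into another, preserving both titles."""
        keep = self.category_for(keep_title)
        drop = self.category_for(drop_title)
        if keep == drop:
            return
        self.members[keep] |= self.members.pop(drop)
        for t in self.titles.pop(drop):
            self.title_to_id[t] = keep
            self.titles[keep].add(t)

    def search(self, title):
        """Find articles by any title of their category."""
        cid = self.title_to_id.get(title)
        return sorted(self.members[cid]) if cid is not None else []


reg = CategoryRegistry()
reg.add_article("Casablanca", "movie")
reg.add_article("Metropolis", "film")
reg.merge("film", "movie")
print(reg.search("movie"))  # ['Casablanca', 'Metropolis']
print(reg.search("film"))   # ['Casablanca', 'Metropolis']
```

After the merge an author can keep using either word, and an article tagged with both titles ends up in the category only once.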
Yes. But you can't impose orderliness on categories that have not yet been defined. Your idea here appears sound, but as much as I support the IDs (which will almost certainly need to be some kind of code to ensure uniqueness and consistency) I can foresee resistance to any scheme that would require or even seem to require looking up codes. The dictates of orderliness inevitably will lead us to some kind of classification system that is not based on plain language. Without it libraries would be useless, because nobody would be able to find anything. This is why I say that at this stage I would be satisfied with a category system that allows both coded classifications and plain text without prejudice to the eventual development of either. Eclecticology 10:45, 21 Dec 2003 (UTC)
The orderliness comes from constraining the way categories are created. Brent Gulanowski
Yes. I support co-existing multiple categorization schemes, with the understanding that disused ones will simply fade away. As a code indicator I would suggest that any category with the first two letters capitalized would be a code; anything else would be plain text. Eclecticology 19:21, 23 Dec 2003 (UTC)
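The "first two letters capitalized" rule suggested above is easy to state as a test. The sketch below is one possible reading of it; the function name is_code and the sample values are hypothetical.

```python
def is_code(category):
    """True if the first two characters are both uppercase letters."""
    head = category[:2]
    return len(head) == 2 and head.isalpha() and head.isupper()


print(is_code("QA76"))         # True: treated as a classification code
print(is_code("Poets"))        # False: treated as plain text
print(is_code("UK politics"))  # True, although it was meant as plain text
```

The last case shows where the rule would need a refinement or an agreed workaround: plain-text categories that begin with an acronym look like codes.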
My main motivation for categories is to allow users to find articles more quickly without having to guess the right category name -- either by searching with similar terms or by browsing the category graph. Categories with different names don't show up in searches unless some mechanism is introduced to pull in similar terms -- if the category system can do that, it might save having to add it to the search facility. Categories that are not properly defined as sub-categories, "orphan categories", do not show up in browsing (which is why I am in favour of auto-categorization for uncategorized articles -- although I know this is a lame reason). Brent Gulanowski 00:28, 21 Dec 2003 (UTC)
"Tree" would seem to be a more self-evident term than "graph". "Browsing" can be an unsatisfactory process when the amount of material is very large, though it should remain available. I think that the search facility will remain the most popular technique for finding things even with a category scheme, though I would like to see a more sophisticated search function than what we have now. "Orphan categories" will have similar problems to what "orphan articles" have now; they will frequently need human intervention to repair. I remain suspicious of anything that is done by automatic process. Still some limited controlled application could be acceptable. If we now have a List of poets it would be acceptable to put "poet" in the category box for every person that appears on the list who already has an article; that would be a controlled and limited application of that process. It still would not catch those poets who are not on the list even though they have an article.
The urgency at this stage is just to get something happening. Without that it seems that our theoretical discussions are living in a vacuum. Eclecticology 10:45, 21 Dec 2003 (UTC)
If an article has more than one category, the category graph is no longer a tree, by definition.
I would also not advocate converting lists into categories directly. I was just pointing out a similarity. I do recommend automatic categorization by some very simple criterion, like alphabetical by article title or date added to wiki -- a stopgap to quickly categorize all articles. Please see my proposal page and its talk page for details. Brent Gulanowski
My concern with anything automatic is as always based on bots that behave like loose cannons. The criteria that you list are already there, but not ordinarily useful. The difficulty with Colon and UDC is the way they use punctuation. LCC and DDS use only the decimal point as punctuation, and it is easy to accept that as an allowable character in a code. UDC especially has a number of unique usages of punctuation, which may conflict with usages that we may want to define. There is likely to be little support for anything that would necessitate frequent "nowiki" type tags. I would tend to want to use and modify something like LCC. Leaving aside its big disadvantage of being too Americentric, as an alphanumeric system it has the advantage of being more compact than anything that is strictly numeric. Eclecticology 19:21, 23 Dec 2003 (UTC)

I would avoid using a bot to automate any of the categorization. It has to have ample human intervention. It may take six months to categorize all the 200K articles but that's fine. If need be though there can be an "uncategorized" category (or an appropriate code like "ZZZZ" to that effect) that can be used as a default so that workers can more easily find the articles that still need to be categorized. Eclecticology 02:24, 19 Dec 2003 (UTC)

Whatever is best for the software. I thought that, since alphabetical lists are already generated, it would be harmless to at least make such lists using the same underlying structure as categories, if possible. It is also consistent, but I admit that may be only of aesthetic import. Brent Gulanowski 06:01, 20 Dec 2003 (UTC)
I think that lists as we already know them serve a different purpose than categories. Many of them are already nothing more than wish lists, and although they will often serve to generate categories, they should be used with caution. Eclecticology 00:04, 21 Dec 2003 (UTC)
I think that there should be category tags (like the lang tags) that are put on the top of the article, rather than an edit box for the category. User:Noldoaran 18:18, 16 Feb 2004 (UTC) (I can't log in to meta)

I have tried to take points from the above discussion and add them to the article. Now we can consider controversial or unclear points on an individual basis. Or just move them around as you see fit! Brent Gulanowski 03:10, 22 Dec 2003 (UTC)

Too unwiki

I added my comments to the main page.

So, my big problem with these requirements is how bossy they are. Maximum numbers, must be, must define, yadda yadda. It's probably good to remember that categorization is going to happen one article at a time, done by volunteer editors over the Internet. If categorization is too pushy, they won't do it. If they can't edit without going through categorization hoops, they won't edit anymore.

That's certainly one of my points. Eclecticology 19:21, 23 Dec 2003 (UTC)

Also, remember MeatBall:SoftSecurity. We don't have to build lots of checks into the software: editors will make sure categorization works.

Casual editors are the lifeblood of wiki. It's important to make this iterative, unobtrusive, and most of all not mandatory. --Evan 17:31, 22 Dec 2003 (UTC)

Goodness they're just ideas, not personal attacks. I have approached this as software requirements. They must be clear, concise, and exact. You are reading "bossy" into them. If you don't like some of them, that's OK. Replace them with better ones. I only hope that they are definitive, otherwise what is the point? The kinds of categories you seem to be in favour of are basically just suggestions or key words.
Mandatory does not have to mean obtrusive. Why don't you justify your interpretation with some facts to back up the opinions?
The requirement of unique IDs is not an implementation detail if it is mandated by the logic of the design, but design is a sticky subject so opinions are sure to differ. Brent Gulanowski 00:21, 23 Dec 2003 (UTC)
The important thing is to have at least something tangible to try our various theories with. Much may fall into place when that is there. In the early stages I would prefer flexible over definitive. "Definitive" will evolve in the course of usage. Key words have the advantage of being more intuitive; codes have the advantage of being more compact. Eclecticology 19:21, 23 Dec 2003 (UTC)

I'm quite new to MediaWiki but familiar with the wiki concept. I have also looked into categorization and tested a few things for both private and larger projects, so I thought I would contribute some opinions / suggestions of my own.

I am very much for non-hierarchical categorization for the following reasons:

  • categorization is difficult - if users have to spend time on choosing "Do I put this document into category A or B?", I believe that they will choose not to categorize at all. So why not let them categorize the document as both A and B?
  • creating a hierarchical categorization scheme is difficult, especially with a growing collection such as a wiki.

I am well aware of the fact that hierarchical organization among categories brings order to the system, but I believe order can be established in a softer way by categorizing categories. To answer some questions on scalability, let me use an example:

Say that I categorize a document as "related to birds". Say that later on the category "biology" shows up. The document can retain its category "related to birds" while receiving a relationship to the category "biology" through categorization of the category "related to birds".
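The birds example can be sketched as a lookup that follows "is categorized as" links from a document through its categories. This is only an illustration: the function all_categories, the dictionary layout, and the document name "Sparrow" are assumptions, not part of any existing implementation.

```python
def all_categories(item, parents):
    """Collect every category reachable from item via 'is categorized as'."""
    seen = set()
    stack = list(parents.get(item, ()))
    while stack:
        cat = stack.pop()
        if cat not in seen:  # also guards against accidental cycles
            seen.add(cat)
            stack.extend(parents.get(cat, ()))
    return seen


# Both documents and categories can be categorized.
parents = {
    "Sparrow": {"related to birds"},
    "related to birds": {"biology"},  # added later, document untouched
}

print(sorted(all_categories("Sparrow", parents)))
# ['biology', 'related to birds']
```

The document itself never has to be edited when "biology" appears; only the category "related to birds" gains a category of its own.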

On the question of helping users select categories, I believe that seeing a list of all categories would help. However, this list will grow and become too large to present while editing, so I suggest a system where it is possible for the user to choose (write) any categories he or she wants and then have the system (if possible) display a list of possible existing categories. If the categories are relatively unique, use them, and let the editors, as you have already discussed, sort the rest out. Jody 22:15, 27 Apr 2004 (GMT)
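The suggestion step described above (the user writes any category, the system shows similar existing ones) could be approximated with fuzzy string matching. The sketch below uses difflib.get_close_matches from the Python standard library; the function name suggest, the cutoff of 0.6, and the sample categories are assumptions chosen for illustration.

```python
from difflib import get_close_matches


def suggest(typed, existing, limit=5):
    """Return existing categories that look similar to what the user typed."""
    lowered = {c.lower(): c for c in existing}  # compare case-insensitively
    hits = get_close_matches(typed.lower(), list(lowered), n=limit, cutoff=0.6)
    return [lowered[h] for h in hits]


existing = ["mathematics", "biology", "related to birds"]

print(suggest("mathmatics", existing))  # offers "mathematics"
print(suggest("zzzz", existing))        # nothing similar, so the new category stands
```

When nothing similar comes back, the typed category is simply used as written, and editors can sort out any later overlap.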