Toolhub/Data model/Research and design

From Meta, a Wikimedia project coordination wiki

Design principles

Minimum viable set of attributes:

  • The taxonomy should be "just enough" to cover the tools that exist. It should not cover concepts in great detail if the corpus of tools doesn't merit it.
  • The taxonomy should not cover attributes that don't require human curation of the attribute values. Information that can be automatically extracted, or which cannot be controlled/curated, is not appropriate for inclusion in the taxonomy, though in many cases it should still be available for cataloging tools or filtering them in the UI.
  • The taxonomy should not cover concepts already included in the toolinfo.json schema, and for which data has already been populated. "Tool type" is the prime example here.

Useful attributes only: The taxonomy should not include attributes simply because we *could* model them. The attributes should align with the personas and goals documented in Toolhub use cases. For attributes that we intend to include in the Toolhub browsing UI, there's a limited amount of screen real estate to use. Too many attributes reduce the utility of the browsing UI and may not display well on small screens. Similarly, attribute values should not exist for concepts that only apply to a small set of tools. For example, attribute values like "photo contests" or "new page patrolling" are too specific to be useful – there are no more than 5 tools that would fit each of those classifications. (Those examples would be valuable as free-text annotations or tags, not as controlled terms.)

Modular, atomic concepts: Attributes and their values must not combine multiple facets of a tool into one concept. This is the main area in which v2 of the taxonomy differs from v1. The initially-proposed "use case" attribute combines concepts like "audience" and "task" into one attribute. The principle of "one attribute per concept" ensures that attributes are mutually exclusive, and that attribute values are of the same type (this is sometimes called "level homogeneity"). Meanwhile, this approach still supports complex concepts through combination of multiple attributes (compositionality).
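
A minimal sketch of how "one attribute per concept" and compositionality might look in data. The `ToolAnnotation` class, the `matches` helper, and the tool names are hypothetical and not part of the Toolhub schema; the attribute values are only illustrative. Each facet is stored separately, and a compound idea is recovered by combining facets at query time.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ToolAnnotation:
    """Hypothetical record: one field per concept, never a combined value."""
    name: str
    audience: frozenset = field(default_factory=frozenset)
    content_type: frozenset = field(default_factory=frozenset)
    task: frozenset = field(default_factory=frozenset)


def matches(tool, **facets):
    """Return True if the tool carries the requested value for every facet."""
    return all(value in getattr(tool, facet) for facet, value in facets.items())


# Invented tools for illustration; these are not real catalog entries.
tools = [
    ToolAnnotation("example-uploader",
                   content_type=frozenset({"Media"}),
                   task=frozenset({"Upload"})),
    ToolAnnotation("example-patroller",
                   content_type=frozenset({"Diffs"}),
                   task=frozenset({"Patrol"})),
]

# A compound idea such as "media upload" is a combination of two facets,
# not a single attribute value.
media_upload = [t.name for t in tools if matches(t, content_type="Media", task="Upload")]
print(media_upload)  # ['example-uploader']
```

Because the facets are independent, the same records also answer single-facet questions (for example, every tool whose task is "Patrol") without any new attribute values.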

No:

"Media upload" combines the content type ("media") and the action a tool might apply to that object ("upload").

Yes:

  • "Content Type": values like "Media", "Data", "Text", "Links", "Diffs"
  • "Task": values like "Upload", "Patrol", "Delete", "Rank", "Analyze"

No:

Developers

  • APIs
  • Coding environments
  • Data services
  • ...

Consumers

  • Reading
  • Data and metrics
  • Visualization and remixing
  • Large-scale content analysis
  • ...

Yes:

  • "Audience": values like "Developers", "Consumers"
  • "Tool type": values like "APIs", "Coding environments"
  • "Task": values like "Reading", "Visualization", "Content analysis"

Concrete concepts and definitions: The taxonomy should not cover concepts that will or may evolve over time, like the stability of a tool or its number of active maintainers. While these design principles attempt to avoid ambiguous concepts, there is always some ambiguity in any such system. Consequently, attributes should be documented with clear definitions to prevent semantic drift over time. For example, concepts like "use case" or "tool purpose" can be interpreted many ways. If included, these attributes must be defined with clear semantics that constrain their potential attribute values. This also enables both future data model maintainers and data contributors to easily identify values that don't belong.
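
One way to make such definitions enforceable is to store each attribute's definition next to its allowed values. The sketch below is only an illustration: `AttributeDefinition`, `values_that_do_not_belong`, and the wording of the definition are assumptions, while the allowed values reuse the "Task" examples given earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttributeDefinition:
    """Hypothetical container pairing an attribute with its documented scope."""
    name: str
    definition: str
    allowed_values: frozenset


def values_that_do_not_belong(attribute, proposed):
    """Return the proposed values that fall outside the attribute's vocabulary."""
    return sorted(set(proposed) - attribute.allowed_values)


# The definition text is an assumed wording, not the taxonomy's official one.
task = AttributeDefinition(
    name="Task",
    definition="An action that a tool helps its user perform.",
    allowed_values=frozenset({"Upload", "Patrol", "Delete", "Rank", "Analyze"}),
)

# "Media upload" mixes a content type into a task, so it is flagged.
flagged = values_that_do_not_belong(task, ["Upload", "Media upload", "Analyze"])
print(flagged)  # ['Media upload']
```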

Research process

This and the following sections describe how User:TBurmeister_(WMF) drafted the taxonomy and controlled vocabulary.

Concept extraction process

From all of the above sources, I extracted:

  • Concepts people use to talk about tool discovery or navigating a catalog of tools
  • Concepts people use to talk about where, when, why, and how they use tools
  • Categories and attributes people used when they create lists or catalogs of tools

The result of this concept extraction is an uncontrolled list of concepts (attributes) and attribute values.

Analysis, concept mapping, and terminology standardization

Using the "megalist" of uncontrolled concepts, I then analyzed the semantics of the various attributes and their values. As part of this process, I:

  • Mapped most low-level concepts to a small set of high-level conceptual themes.
  • Reviewed the uncontrolled values and attribute types that these themes clustered together.
  • Analyzed the semantics of the various attributes and their values, looking specifically for recurring concepts, overlapping meanings, differing levels of granularity, and different tactics for approaching the same underlying concept.
  • Documented common areas of ambiguity that the final taxonomy must address.
  • Started building the controlled vocabulary as I encountered clear cases of synonymy.
  • Developed a standardized set of attributes. The "Task" attribute required the most standardization, because it combined many values that appeared as different attributes in the various legacy categorization schemes. To make this attribute useful and manageable, I mapped all the uncontrolled task terms to higher-level categories that seek to capture the most common and important themes in the data.

Task attribute term mapping

For a shorter list of just the proposed values for the "Task" attribute, or to provide feedback on this item, see the main Data Model page.

Task attribute values for controlled vocab

Uncontrolled value → Controlled value
administrative work → Project management and reporting
analysis → Analysis
Annotating → Annotating and linking
Anti-vandalism and user warning → Identifying vandalism; Warning users
Archive content → Archiving and cleanup
Assessment → Analysis
attribution → Citing and referencing
Automated editing → Editing or updating
Blocking → User management
bulk / quick editing → Editing or updating
Categorizing → Categorizing and tagging
Change → exclude due to vagueness
citation → Citing and referencing
Clean up sandbox → Archiving and cleanup
Collection curation (curating datasets, curating image sets) → Categorizing and tagging
Conduct (interacting with user task) → Communication and supporting users
Connect Wikipedia with other sites → Annotating and linking
Connect Wikipedia with other wikis → Annotating and linking
consuming content → Downloading or reusing content
content migration → Migrating content
Contest organizing → Event and contest planning
contributing content → Creating new content; Uploading or importing
conversion → Converting and formatting content
convert → Converting and formatting content
copy → Editing or updating
Copy editing → Editing or updating
Copyediting → Editing or updating
Copyright management → Identifying policy violations
Counseling and social support → Communication and supporting users
Create → Creating new content
curation / organization → Categorizing and tagging
data curation → Categorizing and tagging
data upload → Uploading or importing content
Deleting → Deleting and reverting
Deliver article alert → Warning users
deployment → Hosting and maintaining tools
Destroy → Deleting and reverting
developing content → Creating new content
disambiguation → Disambiguation
dispute resolution → Communication and supporting users
Document user data → User management
Drafting → Editing or updating
edit → Editing or updating
Editing → Editing or updating
enhance - categorization → Categorizing and tagging
Event planning → Event and contest planning
Expanding → Editing or updating
Fix content → Editing or updating
Fix files → Archiving and cleanup
Fix links → Annotating and linking
Fix parameters in template/category/infobox → Annotating and linking
Format conversion (e.g. OCR, video conversion) → Converting and formatting content
Formatting → Converting and formatting content
gamification → exclude due to over-specificity
generate attribution → Citing and referencing
Generate pages based on other sources → Creating new content
Generate redirect pages → Annotating and linking
get source media / metadata → Annotating and linking
Greeting the newcomers → Communication and supporting users
Identify policy violations → Identifying policy violations
Identify spam → Identifying spam
Identify vandals → Identifying vandalism
Illustrating → Creating new content
importing → Uploading or importing content
In-place editing → Editing or updating
Large-scale content analysis → Analysis
Maintenance tagging → Categorizing and tagging
matching with Wikidata → Annotating and linking
measure → Analysis
media upload → Uploading or importing content
Merging → Merging content
Moving and merging → Merging content
New page patrolling → Patrolling recent changes
Online project planning (WikiProjects, etc.) → Event and contest planning
organizing projects → Project management and reporting
Page creation → Creating new content
patrol → Patrolling recent changes
Prepare → exclude - too vague
Previewing → exclude - too vague
Project communication → Project management and reporting
Provide suggestions for users → Recommending content
Provide suggestions for Wikiprojects → Recommending content
Purging → Archiving and cleanup
ranking → Listing and ranking
Reading → Reading
Recent changes patrolling → Patrolling recent changes
reconciliation → Disambiguation
Renaming → Editing or updating
reporting → Project management and reporting
reuse → Downloading or reusing content
reuse / visualization → Downloading or reusing content
Reverting → Deleting and reverting
Rollback/reverting → Deleting and reverting
search → too broad
Send user notifications → Communication and supporting users
Socializing users → Communication and supporting users
source data cleaning → Converting and formatting content
source text transcription → Converting and formatting content
Splitting → Converting and formatting content
Suppressing → exclude - too vague
Tag article assessment → Categorizing and tagging
Tag article status → Categorizing and tagging
Tag multimedia status → Categorizing and tagging
Tag Wikiprojects → Categorizing and tagging
Tagging and flagging → Categorizing and tagging
Talk page discussion → Communication and supporting users
Template editing → Editing or updating
Template insertion → Editing or updating
thanks → Communication and supporting users
track → Project management and reporting
tracking → Project management and reporting
transfer → Migrating content
translation → Translating and localizing
Update maintenance pages → Project management and reporting
Update statistics → Project management and reporting
upload → Uploading or importing content
Uploading → Uploading or importing content
User activity analysis → Analysis
user analysis → Analysis
User rights (admin, rollback, etc.) → User management
vandalism patrol → Identifying vandalism
Warning → User management
Welcoming → Communication and supporting users
Worklist development → Listing and ranking
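
The mapping above lends itself to a simple lookup. The following sketch assumes a hypothetical `normalize_task` helper and copies only a handful of rows from the table; it treats matching as case-insensitive, splits multi-valued mappings on semicolons, and returns an empty list for excluded terms. None of this is Toolhub code.

```python
# A few rows copied from the table above, with keys lowercased for lookup.
# None marks a term that the table excludes from the controlled vocabulary.
TASK_MAPPING = {
    "analysis": "Analysis",
    "assessment": "Analysis",
    "anti-vandalism and user warning": "Identifying vandalism; Warning users",
    "copy editing": "Editing or updating",
    "copyediting": "Editing or updating",
    "gamification": None,  # excluded due to over-specificity
    "media upload": "Uploading or importing content",
}


def normalize_task(uncontrolled):
    """Map an uncontrolled term to a list of controlled Task values.

    Raises KeyError for terms that are not in the mapping, so unmapped
    input is surfaced for review rather than silently dropped.
    """
    controlled = TASK_MAPPING[uncontrolled.strip().lower()]
    if controlled is None:
        return []
    return [value.strip() for value in controlled.split(";")]


print(normalize_task("Copy editing"))                     # ['Editing or updating']
print(normalize_task("Anti-vandalism and user warning"))  # ['Identifying vandalism', 'Warning users']
print(normalize_task("gamification"))                     # []
```

Raising on unknown terms is a deliberate choice in this sketch: a new uncontrolled term is a prompt for a curation decision, not something to guess at.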

Content type attribute term mapping

Content type attribute values for controlled vocab

Uncontrolled value → Controlled attribute value
label → Categories and labels
articles → Articles
articles for creation → Articles
audio → Audio
automated contributions → exclude
batch → exclude
Body → exclude
books → Books
Categories → Categories and labels
category → Categories and labels
Code → Software or code
Commons and files → too vague; can't map
Content pages (encyclopedia articles, original texts) → Articles
Contributions → Diffs and revision data
coordinates → Geographic data
csv → file format, not a content type
Data (Wikidata items, structured file data) → Structured data
Diffs → Diffs and revision data
Discussions → Discussions
Documentation → too broad
Drafts → Drafts
edit count → Diffs and revision data
Edit filters → Diffs and revision data
Edit form → Diffs and revision data
Edit summary → Diffs and revision data
edits → Diffs and revision data
email → Email
Feeds → too specific
file → too broad; indicate more specific content type
Files → too broad; indicate more specific content type
Flagged revisions → Diffs and revision data
image → Images
images → Images
infobox → too specific
isbn → Bibliographic data
lexeme → Linguistic data
links → Links
list → Lists
Listings → Lists
lists → Lists
Logs → Logs
map → Maps
maps → Maps
media → too broad
Media (images, videos, sound recordings) → too broad
missing → exclude
Modules, scripts and stylesheets → Software or code
ogg → Audio
open license text → exclude
Page information → Page metadata
page views → Page metadata
pages → Page metadata
pageviews → Event data
pdf → file format, not a content type
photos → Images
projects → exclude
properties → Structured data
Queries → Event data
query → Event data
random → exclude
Recent changes → Diffs and revision data
redirect → Links
Redirects → Links
redlinks → Links
reference → References
References → References
report → too broad
reports → too broad
rss → exclude
Search form → too broad
Shortcuts → exclude
statistics → too broad
statistics (Commons) → too broad
statistics (Wikidata) → too broad
Stats → too broad
svg → Images
table → Structured data
template → Templates
Templates → Templates
timeline → Event data
user edits → Diffs and revision data
User information → User data
vandalism → Diffs and revision data
video → Videos
videos → Videos
views → Event data
Watchlist → Watchlist
web resource → Webpages
What links here → Links
wikitext → Wikitext
written content → Articles
youtube → Videos

Platform attribute term mapping

Platform attribute values for controlled vocab

Uncontrolled value → Controlled attribute value
command line tool → Command-line
Command-line tools → Command-line
desktop app → Desktop
Image software extensions → Extension for existing software (non-MediaWiki)
Integrated tools → MediaWiki
mediawiki → MediaWiki
mobile → Mobile / smartphone
On external website → Web app
on mobile devices → Mobile / smartphone
Smartphone apps → Mobile / smartphone
standalone desktop applications → Desktop
Standalone software → exclude
web app → Web app
web tools → Web app

Common areas of semantic ambiguity

To improve the existing data model, it's important to understand areas where previous attempts to model this space have generated ambiguity or shown inconsistency in how they handled similar concepts. These areas of ambiguity are the most important areas for the new taxonomy to standardize and clarify.

Audience vs. target area / domain vs. wiki project:  Previous categorization schemes and data models have often mixed together attribute values related to the people, fields, wikis, wiki projects, and locales for which (or for whom) a tool may be especially useful.  The final taxonomy must have a clear definition of what any "Audience", "Domain", or "Project" attribute captures, and those attributes should be clearly mutually exclusive. Examples:

  • GLAM, Education → can be modeled as an audience (the humans who are in the GLAM field), but also as a use case or application domain.
  • Chinese Wikipedia, English Wikipedia → these examples follow the anti-pattern of combining two concepts – language/locale and a specific wiki project (Wikipedia) – into one attribute value.
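
The second example can be made concrete with a small sketch. It assumes hypothetical `language` and `project` attribute names and a tiny hand-written list of project names; the point is only that a combined value such as "Chinese Wikipedia" decomposes into two independent facets.

```python
# Assumed, incomplete list of project names used only for this illustration.
KNOWN_PROJECTS = ("Wikipedia", "Wikidata", "Commons")


def split_combined_value(value):
    """Split a value like "Chinese Wikipedia" into language and project facets.

    Returns a dict with "language" and "project" keys; "language" is None
    when the value names a project without a language qualifier.
    """
    for project in KNOWN_PROJECTS:
        if value.endswith(project):
            language = value[: -len(project)].strip() or None
            return {"language": language, "project": project}
    raise ValueError(f"no known project in {value!r}")


print(split_combined_value("Chinese Wikipedia"))
# {'language': 'Chinese', 'project': 'Wikipedia'}
print(split_combined_value("Wikidata"))
# {'language': None, 'project': 'Wikidata'}
```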

Subjects, verbs, objects: Previous data models frequently differ in how they model the type of task a tool helps with vs. who it helps vs. what content it helps them work on. Intersections of audience, content format, and task were sometimes grouped into a "use cases" attribute to avoid this complexity. Examples:

  • "categories" as a content format that a tool acts upon, vs. "Categorizing" as a contributor task.  In the data I reviewed, there were 3 instances of noun usage and only one of gerund, so the final taxonomy prefers modeling "categories" as a noun object in an attribute like "Content type".
  • "contributing content" vs. "Contributions" vs. "Contributors" (verb vs. noun object vs. noun subject). This set of ambiguous concepts probably suffers from the overall concept being too vague. It is better to model the type of content being contributed more specifically, along with a more specific set of contributor types or audiences, like "editors" or "developers".
  • "vandalism" as a content format vs. "Patrolling" as a task. In most existing data models, patrolling has high prominence as a task or activity, and "vandalism" can come in many different content formats, so this ambiguity is best resolved by keeping "Patrolling"-related activities as verbs in an attribute focused on user tasks. However, we should also ensure that all tools that involve patrolling have vandalism as a keyword in their description or tags.
    • A similar, but less clear example: is "statistics/metrics" a content format or a task (i.e. "analyzing")?  For cases like this, we must be guided by user research, feedback, and data like search queries to determine the most likely way that users would expect to navigate to the relevant tools in this topic area. The current taxonomy uses the verb form and puts "Analysis" under the Task attribute, but we should test and iterate on this.

Audience vs. characteristics of audience vs. characteristics of the UI: Some previous data models highlighted "non-English speakers" as an audience, and many recommended a "language" attribute.  The current Toolhub data model includes "available_ui_languages".  The taxonomy must clearly differentiate and define the scope of any language- or locale-related attributes, so that it's clear whether they're describing attributes of the tool itself (UI language) or attributes of the audience it's meant to help.

Attributes of the tool itself vs. attributes of its use or context: In previous data models,  "tool type" attributes have often conflated specific tool attributes with more generic attributes describing the type of general usage or application the tool has.  These should be separate attributes. Examples:

  • APIs, bots, coding environments, displays
  • Productivity tools, editing tools

Levels of granularity: Past data models and existing, project-specific schemes often vary in how specific they are in capturing the following types of concepts:

  • Projects and sub-projects: Wikidata, Structured Data, Commons, Structured Commons → There's limited utility in subdividing projects into sub-projects. Tools that serve these sub-domains of work on structured data can be easily found by combining a Project attribute like "Wikidata" with a Content Type attribute like "images" or "media".
  • How specific to get in the wiki page structure: because there are scripts that operate on and help with specific subsections or pages, some data models have specified these elements.  Is the granularity of "page" vs. "discussion" too much or too little to provide effective access to these tools? The taxonomy must definitely subdivide the tool space by the content types of  "articles" vs. "media" vs. "data", so we need to be careful about whether we add too much granularity by modeling the subsections of pages as a content type.  
  • How specific to get in modeling file types and file formats? Data models or categorization schemes in specific domains have historically been more specific about the content formats most relevant to them. For example, Commons-focused tool descriptions and data models are likely not just to specify image vs. video; they also care whether the file type is svg, jpg, png, tiff, ogg, or midi. The question for the Toolhub taxonomy is whether it can support discovery in those specific domains without modeling that level of granularity, and whether there are enough tools in the catalog to make detailed file formats a useful level of granularity.

From a UX perspective, the more shallow the taxonomy is, the better it is likely to work in the current Toolhub UI – there's not a ton of horizontal space for expanding subclasses and showing long lists of possible attributes for filtering.

Nominalizations vs. verb forms: A common inconsistency in past data models appears in how they formalize concepts that can be described with either a noun or a verb form.  For example: "content recommenders" is a nominalization of the action "recommending content". The nominalization makes it easy to think of that attribute as a "tool type", and such nominalizations are common when describing bots (like in this article).  Maybe we do this because we like to anthropomorphize bots, but fundamentally these types of attributes are describing actions that a user is trying to do with the help of a tool. Our taxonomy should choose one formulation and use it consistently.  The proposed taxonomy uses verb forms to model the tasks that a tool helps its users do, rather than using ambiguous nominalizations of such actions.