
Learning patterns/Treasures or landmines: detecting uncategorized, language-specific uploads in Commons

From Meta, a Wikimedia project coordination wiki
Problem: Inexperienced users often upload uncategorized, poorly described files to Commons and then abandon them, leaving the files invisible and unused for years.
Solution: Use the PetScan tool with carefully selected settings.
Created on: 06:09, 9 January 2016 (UTC)

What problem does this solve?

[Figure: Inexperienced user + Description in a small language + Uncategorized upload + Unused, abandoned upload = Learning pattern!]

Inexperienced users often upload content to Commons without adding any categories, which generates a huge backlog of media needing categories. The special page Special:UncategorizedFiles lists at most 5000 images, while Category:Media needing categories is simply too vast to review as a whole.

First-time uploaders may upload anything, from very valuable content for a "Wiki Loves X" contest to candidates for speedy deletion such as copyright violations, spam, nonsense, test uploads, or thumbnails of unusably low quality. Evaluation becomes even more difficult when the file names are not specific and meaningful enough. Uncategorized content may thus hide both treasures and landmines.

Inexperienced users also often describe their uploads only in their mother tongue, either because they do not know foreign languages or because they are unaware of, or indifferent to, the international nature of Commons and the future usability of their files. Such uploads therefore need to be reviewed by users or administrators who know the language. This makes the chronological organization of "Category:Media needing categories" of little help and calls for other approaches.

Finally, this content often remains unused, because newcomers abandon it after upload and do not take the final step of including it in Wikipedia or a sister project. They abandon it for various reasons, most often frustration with the rather complicated process of licensing, uploading and using files on-wiki (compared to other websites and applications), or a "my job here is done" attitude. Content also remains unused when there is no sufficiently active Commons community that speaks the language and patrols regularly.

Whatever the reasons, the result is a huge mass of literally invisible content that may remain in Commons unused or unattended for years.

What is the solution?

Magnus Manske's PetScan tool (version 3.0 at the time of writing) is very useful for extracting uncategorized Commons uploads described in a specified language.


The following settings are required:

Language = commons
Project = wikimedia
Depth = 1
Categories = Media needing categories
Combination = ☑ Subset
Namespaces = ☑ File
Templates : Has all of these templates = <your language code> 
Format: ☑ Extended data for files     ☑ File usage data

The remaining settings are optional, depending on your preference for further filtering and organization of the output.
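The settings above can also be expressed as a query URL, which is handy for bookmarking a query per language. The following is a minimal sketch only: the base URL and every parameter name (`templates_yes`, `ext_image_data`, `file_usage_data`, `ns[6]`, `doit`) are assumptions about how the form fields map to URL parameters, so compare the result with a URL generated by the tool itself before relying on it.

```python
from urllib.parse import urlencode

# Assumed base URL of the tool; adjust if the tool is hosted elsewhere.
PETSCAN_BASE = "https://petscan.wmflabs.org/"


def build_petscan_url(language_code: str, output_format: str = "wiki") -> str:
    """Build a query URL mirroring the settings listed above.

    All parameter names are assumptions, not a documented API.
    """
    params = {
        "language": "commons",                     # Language = commons
        "project": "wikimedia",                    # Project = wikimedia
        "depth": 1,                                # Depth = 1
        "categories": "Media needing categories",  # Categories
        "combination": "subset",                   # Combination = Subset
        "ns[6]": 1,                                # Namespaces = File (namespace 6)
        "templates_yes": language_code,            # Has all of these templates
        "ext_image_data": 1,                       # Extended data for files
        "file_usage_data": 1,                      # File usage data
        "format": output_format,                   # Output format
        "doit": "",                                # Assumed "run the query" flag
    }
    return PETSCAN_BASE + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_petscan_url("bg"))
```

Here `"bg"` stands for the Bulgarian language template; substitute your own language code.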


Output processing

The option

Format: ☑ Extended data for files     ☑ File usage data

returns for all detected files the following extended data:

Column | Explanation | Notes (red notes indicate a potential problem)
Title | File name | Some filenames alone may indicate vandalism, spam, a non-free commercial logo, or nonsense.
Page ID | Numeric ID of the file description page |
Namespace | Always "6" because of the setting "Namespaces = ☑ File" |
Size (bytes) | Size of the file description page | Small sizes may indicate a poor, or even missing, file description.
Last change | Timestamp of the last edit to the file description page | If it equals the value of "Uploaded", no one has edited the file description page since the upload.
Groups | Always "1" |
Namespace name | Always "File" because of the setting "Namespaces = ☑ File" |
File size (bytes) | File size in bytes | Small sizes may indicate thumbnails found on the internet, which are possible copyvios.
Width | Image width | See above.
Height | Image height | See above.
Media type | Depends on the filetype, e.g. "BITMAP" or "OFFICE" |
MIME (major) | Depends on the filetype, e.g. "image" or "application" |
MIME (minor) | Filetype, e.g. "jpeg" or "pdf" | PDFs are rarely uploaded and may indicate possible copyvios or spam.
Uploader | Username of the uploader |
Uploaded | Timestamp of the first edit of the media/file description |
SHA1 checksum | SHA1 checksum |
fileusage | Pages that embed the file | No fileusage may indicate that a user has uploaded valuable content to Commons but did not manage to include it in Wikipedia or a sister project.
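The comparison of "Last change" with "Uploaded" can be sketched in a few lines. This is an illustration, not part of the tool: it assumes that both values use the 14-digit `YYYYMMDDHHMMSS` form seen in the sample output on this page, and that each result row is available as a dictionary keyed by column name (a hypothetical representation).

```python
from datetime import datetime

# Assumed timestamp layout, inferred from values such as "20151221044313".
TIMESTAMP_FORMAT = "%Y%m%d%H%M%S"


def parse_timestamp(value: str) -> datetime:
    """Convert a 14-digit timestamp string into a datetime object."""
    return datetime.strptime(value, TIMESTAMP_FORMAT)


def untouched_since_upload(row: dict) -> bool:
    """True if the description page was never edited after the upload."""
    return parse_timestamp(row["Last change"]) == parse_timestamp(row["Uploaded"])


def days_since_last_change(row: dict, now: datetime) -> int:
    """Whole days between the last edit and a reference time."""
    return (now - parse_timestamp(row["Last change"])).days


if __name__ == "__main__":
    # Values taken from the first sample row on this page.
    row = {"Last change": "20151221044313", "Uploaded": "20080825170039"}
    print(untouched_since_upload(row))
```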

Refining the output

1. Set output format to wiki markup:

Format: ◯ HTML   ◯ CSV   ◯ TSV   ● Wiki   ◯ PHP   ◯ XML   ◯ JSON   ◯ Gallery

2. Generate the query. The result will be something like:

{| border='1'
!Title !! Page ID !! Namespace !! Size (bytes) !! Last change !! Groups !! Namespace name !! File size (bytes) !! Width !! Height !! Media type !! MIME (major) !! MIME (minor) !! Uploader !! Uploaded !! SHA1 checksum !! fileusage
|-
|[[:File:Vartop-Ustav.jpg]] || 4629155 || 6 || 422 || 20151221044313 || 1 || File || 30452 || 316 || 320 || BITMAP || image || jpeg || Подпоручикъ || 20080825170039 || pghvlf0v25s8oflakwviesza3sllzr4 || bgwiki:0::Вътрешна_западнокрайска_революционна_организация
|-
|[[:File:Fuzzyproductions.jpg]] || 5305220 || 6 || 411 || 20151216205823 || 1 || File || 31501 || 300 || 431 || BITMAP || image || jpeg || Biso || 20081129062947 || su3dmbh0d1qs5njdxmlfuqans6apku6 || bgwiki:0::Кока-Кола|skwikiquote:0::Práca
|-
|[[:File:Robko01elem.JPG]] || 5969349 || 6 || 327 || 20151216205823 || 1 || File || 2563851 || 2480 || 3296 || BITMAP || image || jpeg || Entusiast~commonswiki || 20090216204421 || 6dznv5k8mhqwwg5luy33s7imhdp3s6u || bgwiki:0::РОБКО-1
|}

3. Copy to a text editing application and apply Find and Replace:

!! Namespace !!                    -->      !!
|| 6 ||                            -->      || 
!! Groups !! Namespace name !!     -->      !!
|| 1 || File ||                    -->      ||

4. Add to the definition of the table the class "wikitable sortable"

{| border='1'         -->      {| class="wikitable sortable"

5. You can now sort the table by column. Start with the columns whose notes are marked in red above.
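Steps 3 and 4 can also be scripted instead of done by hand. The sketch below is one possible approach, not the method used by the tool: rather than literal find and replace, it drops the three constant columns by name, which avoids accidentally matching a cell that happens to contain "6" or "1". The function name `refine_petscan_wikitable` is hypothetical, and the code assumes the header starts with `!` and cells are separated by `!!` and `||`, as in the sample output.

```python
# Columns that are constant for this query and can be dropped.
DROP_COLUMNS = {"Namespace", "Groups", "Namespace name"}


def refine_petscan_wikitable(wikitext: str) -> str:
    """Drop constant columns and make the table sortable."""
    keep = None  # indexes of the columns to keep, set from the header row
    out = []
    for line in wikitext.splitlines():
        if line.startswith("{|"):
            # Step 4: replace the table definition.
            out.append('{| class="wikitable sortable"')
        elif line.startswith("!"):
            headers = [h.strip() for h in line[1:].split("!!")]
            keep = [i for i, h in enumerate(headers) if h not in DROP_COLUMNS]
            out.append("!" + " !! ".join(headers[i] for i in keep))
        elif (
            keep is not None
            and line.startswith("|")
            and not line.startswith(("|-", "|}"))
        ):
            # Data row: single "|" inside the fileusage cell is left intact,
            # because only the double "||" separates cells.
            cells = [c.strip() for c in line[1:].split("||")]
            out.append("|" + " || ".join(cells[i] for i in keep if i < len(cells)))
        else:
            # Row separators, the table end and anything else pass through.
            out.append(line)
    return "\n".join(out)
```

Paste the tool's wiki output into a string, pass it to the function, and save the result to a wiki page to get a sortable table.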

Detecting copyvios

Small file size and/or small width and height often indicate possible copyvios. Use the "Search by image" option of Google Images and/or the reverse image search at tineye.com to confirm or refute this suspicion.

If you find a copyvio, make sure to check the user's other uploads.
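As a rough illustration of this heuristic, the sketch below flags rows whose file size or dimensions fall under a threshold. The thresholds (`max_bytes`, `min_side`) are hypothetical values chosen for the example, not recommendations from this pattern, and a flagged file is only a candidate for a manual reverse image search, never proof of a copyvio.

```python
def is_copyvio_candidate(row: dict, max_bytes: int = 100_000, min_side: int = 640) -> bool:
    """Flag small files: small byte size and/or small width or height.

    Both thresholds are hypothetical and should be tuned to your backlog.
    """
    too_light = row["File size (bytes)"] < max_bytes
    too_small = min(row["Width"], row["Height"]) < min_side
    return too_light or too_small


# The three sample rows from the output shown earlier on this page.
SAMPLE_ROWS = [
    {"Title": "Vartop-Ustav.jpg", "File size (bytes)": 30452, "Width": 316, "Height": 320},
    {"Title": "Fuzzyproductions.jpg", "File size (bytes)": 31501, "Width": 300, "Height": 431},
    {"Title": "Robko01elem.JPG", "File size (bytes)": 2563851, "Width": 2480, "Height": 3296},
]

if __name__ == "__main__":
    for row in SAMPLE_ROWS:
        if is_copyvio_candidate(row):
            print("Check with reverse image search:", row["Title"])
```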

Detecting uncategorized candidates in photo contests

Sometimes users do not upload their photo contest submissions through the provided campaign forms, but use the standard upload form instead. Their uploads are then not tagged and categorized automatically. The query will still detect them, as long as they have a relevant description in the specified language and there are additional clues, such as a suggestive file name.

Note that such uploads can reasonably be expected to be of high quality, meaning a large file size and sufficiently large width and height.

Detecting successful uploads and unsuccessful embeddings

Some users manage to upload illustrative content to Commons, and then fail to include it in, e.g., the respective target Wikipedia article. Check the files that have an empty "fileusage" column, and check the global contributions of the uploader. If they have edits in a sister project soon after the upload, you will get an idea of which article was supposed to be illustrated with the uploaded file.

This is especially useful when neither the file name nor the file description gives enough information about the depicted object.
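Finding the files with an empty "fileusage" column can be sketched as follows. The cell layout is an assumption inferred from the sample output on this page, where entries look like `bgwiki:0::Кока-Кола` and several entries are joined with a single `|`; the function names are hypothetical.

```python
def parse_fileusage(cell: str) -> list[tuple[str, int, str]]:
    """Split a fileusage cell into (wiki, namespace number, page title) tuples.

    Assumes the "wiki:namespace::Title" layout seen in the sample output.
    """
    entries = []
    for item in cell.split("|"):
        item = item.strip()
        if not item:
            continue  # an empty cell means the file is not used anywhere
        wiki, rest = item.split(":", 1)
        namespace, title = rest.split("::", 1)
        entries.append((wiki, int(namespace), title.replace("_", " ")))
    return entries


def unused_files(rows: list[dict]) -> list[str]:
    """Titles of files that are not embedded in any page."""
    return [row["Title"] for row in rows if not parse_fileusage(row["fileusage"])]


if __name__ == "__main__":
    print(parse_fileusage("bgwiki:0::Кока-Кола|skwikiquote:0::Práca"))
```

For each title returned by `unused_files`, the uploader's global contributions still have to be checked by hand.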

Sample scenarios

Treasure scenario: Successful upload, unsuccessful embedding
  1. 2010-12-05: A file is uploaded to Commons under a random name and without a category: File:Zx450y250 753141.jpg. This is the user's third upload.
  2. 2010-12-05: The user makes a series of unsuccessful attempts to include the file in the respective template on the target page: [1], [2], [3], [4]. The user abandons the file.
  3. 2011-07-02: The user makes another attempt half a year later: [5]. This fifth attempt places a gallery between the template parameters in such a way that it remains completely invisible in reading mode. The user abandons the file again.
  4. 2014-11-15: A second user removes the invisible piece of code, which shows up only in edit mode: [6].
  5. 2016-01-09: A third user, a Commons admin, discovers the image using the PetScan tool. Based on the file description, provided only in Bulgarian, and on the history of the article about the subject, the admin renames the file to the more meaningful File:Iskren Veselinov.jpg, categorizes it, edits its description [7], and finally places it in the article as an illustration [8].
Treasure scenario: Two different useful, but overwritten files, none used
  1. 2013-06-02: Two images were uploaded in succession under the same name, File:Село Тученица.JPG. The first upload depicts a small Bulgarian village, and the filename refers to that village. The second upload depicts the nearby river of the same name. Though namesakes, the two images depict different objects, and both have articles on Bulgarian Wikipedia: bg:Тученица (село) vs. bg:Тученица (река). Since then (as of January 2016), the uploader has never attempted to upload any other file, or to use even the latest upload on any Wikipedia page.
  2. 2016-01-23: A Commons admin detected the image because of the discrepancy between the filename (referring to the village) and the current version (depicting the river). Following the instructions for file history splitting, the file history was split successfully, and now both uploads are described, categorized and in use.

Things to consider

When to use

  • This learning pattern will benefit projects that have disabled or restricted local uploads, so that users who want to add multimedia content can only do so on Commons.
  • It is also helpful when a given language is spoken by a relatively small community, and very few patrolling editors or administrators speak the language.
  • It is useful when countries organize photo contests, like "Wiki Loves Monuments" or "Wiki Loves Earth", which are targeted mainly at newcomers, who are by definition not experienced with categorizing, describing and using uploads.

Note: The tool cannot parse or recognize natural language, so a Bulgarian description inside the "en" template, or an English description inside the "bg" template, cannot be detected.

