Wikimedia Conference 2010/Developers' Workshop/Notes/MetaDataSearch

Problem: search on Commons is source-text only, which is inadequate for searching images.

Things that need to be included: licences, meta-data, categories.

Goal for today's discussion: can we cover all three things with one solution, given that four different groups (SMW people, GSoC student, WMDE contractor & multimedia usability team) are each working separately on these problems?

Current concurrent projects:

  • WMDE: hired a contractor to rewrite catscan / category intersection
    • recursive category evaluation/intersection remains a standalone component. It aids in evaluating meta-data but does not provide any.
  • GSoC project: IPTC/XMP metadata extraction (mentor: ^demon)
  • multimedia usability project (guillom / neilk)
    • searching for images by meta-data properties
    • suggesting categories based on meta-data provided on upload
  • SMW folks
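
The recursive category intersection mentioned for the CatScan rewrite can be sketched roughly as follows. This is only an illustration of the idea, not the contractor's implementation (the notes describe that as a stand-alone server). The function names and the toy category data are hypothetical; the graph deliberately contains a cycle, since category graphs are not guaranteed to be trees.

```python
from collections import deque


def descendants(subcats, root):
    """Return root plus every category reachable below it (cycle-safe)."""
    seen = {root}
    queue = deque([root])
    while queue:
        cat = queue.popleft()
        for child in subcats.get(cat, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def intersect(subcats, members, cat_a, cat_b):
    """Pages that sit somewhere under cat_a and also somewhere under cat_b."""
    def pages_under(root):
        pages = set()
        for cat in descendants(subcats, root):
            pages.update(members.get(cat, ()))
        return pages

    return pages_under(cat_a) & pages_under(cat_b)


# Hypothetical toy data: category -> subcategories, category -> member pages.
subcats = {
    "Bonn": ["Landesvertretung Bayern Bonn"],
    "Photographs by photographer": ["Photographs by Jens Gathmann"],
    "Landesvertretung Bayern Bonn": ["Bonn"],  # deliberate cycle
}
members = {
    "Landesvertretung Bayern Bonn": ["File:A.jpg", "File:B.jpg"],
    "Photographs by Jens Gathmann": ["File:A.jpg", "File:C.jpg"],
}

result = intersect(subcats, members, "Bonn", "Photographs by photographer")
```

Here `result` contains only `File:A.jpg`, the one page found under both category trees. As the notes say, this kind of component helps evaluate meta-data but does not supply any itself.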

EXIF info in the DB is currently stored as a serialized PHP array, which doesn't facilitate search. Goal: do better for IPTC/XMP in order to make it searchable.

Two ways of exposing data to the Lucene search backend:

  • make a special table in database with key/value pairs, query it
  • expose data about the article in XML dump (export-0.4.xsd)
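
The first option, a key/value table, might look something like the sketch below. It uses an in-memory SQLite database purely for illustration; the table name `image_metadata`, its column names, and the keys `author`, `date` and `location` are assumptions, not an agreed schema. The sample values are borrowed from the Bundesarchiv example further down.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: one row per (page, key, value) triple.
conn.execute(
    """
    CREATE TABLE image_metadata (
        page_id    INTEGER NOT NULL,
        meta_key   TEXT    NOT NULL,
        meta_value TEXT    NOT NULL
    )
    """
)
# Index on (key, value) so lookups by property do not scan the whole table.
conn.execute("CREATE INDEX idx_meta_kv ON image_metadata (meta_key, meta_value)")

rows = [
    (5453390, "author", "Gathmann, Jens"),
    (5453390, "date", "1967-03-15"),
    (5453390, "location", "Bonn"),
    (1, "author", "Someone Else"),  # made-up second page for contrast
]
conn.executemany("INSERT INTO image_metadata VALUES (?, ?, ?)", rows)


def find_pages(conn, key, value):
    """Return the ids of pages whose metadata has key == value."""
    cur = conn.execute(
        "SELECT DISTINCT page_id FROM image_metadata "
        "WHERE meta_key = ? AND meta_value = ? ORDER BY page_id",
        (key, value),
    )
    return [row[0] for row in cur]


matches = find_pages(conn, "author", "Gathmann, Jens")
```

Unlike a serialized array stored in a single column, each property here is a separate row that can be indexed and matched directly.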

Current schema for XML dumps: docs/export-0.4.xsd

Conclusions

  • Exposing via XML dump can be quite complicated: how should the metadata be structured?
  • We will expose image metadata via XML dumps (Chad's student). Step 1 is extracting the data and storing it in some sane format. Serialized arrays aren't sane.
  • Semantic stuff needs more pondering.
  • CatScan remains a stand-alone TCP server written in C.


Examples

Current format (real example)

<page><title>
File:Bundesarchiv B 145 Bild-F024214-0006, Bonn, Landesvertretung Bayern, Kommunalpolitiker.jpg
</title>
<id>5453390</id><revision>
<id>37106144</id>
<timestamp>2010-04-01T00:00:04Z</timestamp><contributor>
<username>BotMultichill</username>
<id>211386</id>
</contributor>
<minor/><comment>
Adding author from {{BArch-description}} to {{BArch-License}}
</comment><text xml:space="preserve">
== {{int:filedesc}} ==
{{Information
|Description={{BArch-description
|comment= <!-- add translations and/or more description -->
|biased=<!-- if the original description text is biased, write here why! -->
|headline=Bonn, Landesvertretung Bayern, Kommunalpolitiker
|caption=Bayerische Kommunalpolitiker mit Minister Höcherl in der Landesvertretung Bayern und Jugendgruppe Volkshochschule Hesselberg
|extra=
|people=
}}
|Source=Deutsches Bundesarchiv (German Federal Archive), {{BArch-link|B 145 Bild-F024214-0006}}
|Author=Gathmann, Jens
|Date=1967-03-15
|Permission=[[Commons:Bundesarchiv]]
|other_versions=
}}

=={{int:license}}==

{{BArch-License
|signature=B 145 Bild-F024214-0006
|batch=B 145
|author=Gathmann, Jens
|year=1967
|month=<!-- 03 (omitted to avoid overly detailed category structure) -->
|location=Bonn <!-- Please leave as is, add appropriate categories directly. Exception: if needed, change "location=" to "topic=". -->
|topic=    <!-- Please leave as is, add appropriate categories directly. Exception: if needed, change "topic=" to "location=". -->
|PD=<!-- set this if you are sure the image is PD -->
}}


[[Category:Landesvertretung Bayern Bonn]]
[[Category:Photographs by Jens Gathmann]]
</text>
</revision><upload>
<timestamp>2008-12-10T23:32:51Z</timestamp><contributor>
<username>BArchBot</username>
<id>465132</id>
</contributor><comment>
== {{int:filedesc}} ==
{{Information
|Description={{BArch-description
|comment= <!-- add translations and/or more description -->
|biased=<!-- if the original description text is biased, write here why! -->
|headline=Bonn, Landesvertretung Bayern, Kommuna
</comment><filename>
Bundesarchiv_B_145_Bild-F024214-0006,_Bonn,_Landesvertretung_Bayern,_Kommunalpolitiker.jpg
</filename><src>
http://upload.wikimedia.org/wikipedia/commons/9/98/Bundesarchiv_B_145_Bild-F024214-0006%2C_Bonn%2C_Landesvertretung_Bayern%2C_Kommunalpolitiker.jpg
</src>
<size>47316</size>
</upload>
</page>
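
In the current format, categories and licence details are only present as wikitext inside the `<text>` element, so a consumer has to scrape them out. The sketch below shows what that scraping looks like. The regular expressions are a simplification and an assumption of this example (they ignore nested templates and multi-line values), and the sample is a trimmed excerpt of the page above with one comment shortened.

```python
import re

# Simplified patterns; real wikitext needs a proper parser.
CATEGORY_RE = re.compile(r"\[\[Category:([^\]|]+)(?:\|[^\]]*)?\]\]")
PARAM_RE = re.compile(r"^\|[ \t]*(\w+)[ \t]*=[ \t]*(.*)$", re.MULTILINE)
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)


def extract(wikitext):
    """Return (categories, template parameters) scraped from wikitext."""
    text = COMMENT_RE.sub("", wikitext)  # drop HTML comments first
    categories = [name.strip() for name in CATEGORY_RE.findall(text)]
    params = {}
    for key, value in PARAM_RE.findall(text):
        value = value.strip()
        if value and key not in params:  # skip empty values, keep first hit
            params[key] = value
    return categories, params


sample = """{{BArch-License
|signature=B 145 Bild-F024214-0006
|batch=B 145
|author=Gathmann, Jens
|year=1967
|month=<!-- 03 (omitted to avoid overly detailed category structure) -->
|location=Bonn <!-- Please leave as is -->
}}

[[Category:Landesvertretung Bayern Bonn]]
[[Category:Photographs by Jens Gathmann]]
"""

categories, params = extract(sample)
```

Every consumer of the dump would have to repeat this kind of fragile parsing, which is the motivation for exposing the metadata in a structured form.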

Daniel's suggestion for XML format of metadata:

<page...>
  <revision>...
    <data ref:about="revision-uri">
      <rdf:item rdf:property="some uri">...value...</rdf:item>
    </data>
    <data ref:about="page-subject-uri">
      <rdf:item rdf:property="some uri">...value...</rdf:item>
    </data>
    <data ref:about="upload-uri">
      <rdf:item rdf:property="some uri">...value...</rdf:item>
    </data>
  </revision>

  <data ref:about="page-uri">
    <rdf:item rdf:property="some uri">...value...</rdf:item>
  </data>

  <upload>
    <data ref:about="upload-uri">
      <rdf:item rdf:property="some uri">...value...</rdf:item>
    </data>
  </upload>
</page>
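
As a rough illustration of how a dump writer could emit such `<data>` blocks, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The helper name `data_element` and the `http://example.org/prop/...` property URIs are hypothetical, and `upload-uri` is kept as the same placeholder used in the outline. The prefixed names are written literally, so a real dump would still need to declare the `ref` and `rdf` namespaces somewhere.

```python
import xml.etree.ElementTree as ET


def data_element(about_uri, properties):
    """Build one <data> block holding property/value pairs about a subject."""
    data = ET.Element("data", {"ref:about": about_uri})
    for prop_uri, value in properties:
        item = ET.SubElement(data, "rdf:item", {"rdf:property": prop_uri})
        item.text = value
    return data


upload = ET.Element("upload")
upload.append(
    data_element(
        "upload-uri",  # placeholder, as in the outline
        [
            # Hypothetical property URIs, for illustration only.
            ("http://example.org/prop/author", "Gathmann, Jens"),
            ("http://example.org/prop/date", "1967-03-15"),
        ],
    )
)

xml_text = ET.tostring(upload, encoding="unicode")
```

The open question from the conclusions remains which property URIs to use and how deeply to nest values; this sketch only covers flat property/value pairs.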