Wikimedia Conference 2010/Developers' Workshop/Notes/MetaDataSearch
Problem: search on Commons covers only the page source text, which is inadequate for searching images.
Things that need to be included: licences, meta-data, categories.
Goal for today's discussion: can we cover all three things with one solution, given that four different groups (SMW people, GSoC student, WMDE contractor and multimedia usability team) are each working separately on these problems?
Current concurrent projects:
- WMDE: hired a contractor to rewrite catscan / category intersection
- recursive category evaluation/intersection remains a standalone component. It aids in evaluating meta-data but does not provide any.
- GSoC project: IPTC/XMP metadata extraction (mentor: ^demon)
- multimedia usability project (guillom / neilk)
- searching for images by meta-data properties
- suggesting categories based on meta-data provided on upload
- SMW folks
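As an illustration of the recursive category evaluation/intersection mentioned in the list above, here is a minimal sketch. It is not the catscan implementation; the function names and the in-memory dictionaries standing in for the category graph are assumptions made for the example.

```python
from collections import deque


def members_recursive(category, subcats, files):
    """Collect the files in `category` and in all of its subcategories."""
    seen = {category}
    queue = deque([category])
    found = set()
    while queue:
        current = queue.popleft()
        found.update(files.get(current, ()))
        for child in subcats.get(current, ()):
            if child not in seen:  # guard against category cycles
                seen.add(child)
                queue.append(child)
    return found


def intersect(categories, subcats, files):
    """Return the files that fall under every one of the given categories."""
    sets = [members_recursive(c, subcats, files) for c in categories]
    return set.intersection(*sets) if sets else set()


# Hypothetical toy data, not taken from Commons.
subcats = {
    "Bonn": ["Buildings in Bonn"],
    "Photographs": ["Photographs by Jens Gathmann"],
}
files = {
    "Buildings in Bonn": ["A.jpg", "B.jpg"],
    "Photographs by Jens Gathmann": ["B.jpg", "C.jpg"],
}
result = intersect(["Bonn", "Photographs"], subcats, files)
```

The `seen` set matters because category graphs can contain cycles; without it the traversal might not terminate.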
EXIF info is currently stored in the DB as a serialized PHP array, which doesn't facilitate search. Goal: do better for IPTC/XMP in order to make it searchable.
Two ways of exposing data to Lucene search backend:
- make a special table in database with key/value pairs, query it
- expose data about the article in XML dump (export-0.4.xsd)
current schema for xml dumps: docs/export-0.4.xsd
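The first option above (a dedicated key/value table) could look roughly like the sketch below. It uses an in-memory SQLite database purely for illustration; the table name, column names and sample rows are assumptions, not an agreed schema.

```python
import sqlite3

# In-memory database used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE image_metadata ("
    " file_name TEXT NOT NULL,"
    " prop TEXT NOT NULL,"
    " value TEXT NOT NULL)"
)
# An index on (prop, value) lets lookups by property avoid a full scan.
conn.execute("CREATE INDEX idx_prop_value ON image_metadata (prop, value)")

# Hypothetical sample rows: one row per (file, property, value).
rows = [
    ("Example_A.jpg", "author", "Gathmann, Jens"),
    ("Example_A.jpg", "year", "1967"),
    ("Example_B.jpg", "author", "Someone Else"),
]
conn.executemany("INSERT INTO image_metadata VALUES (?, ?, ?)", rows)


def find_files(conn, prop, value):
    """Return the names of files whose property `prop` equals `value`."""
    cur = conn.execute(
        "SELECT file_name FROM image_metadata"
        " WHERE prop = ? AND value = ? ORDER BY file_name",
        (prop, value),
    )
    return [r[0] for r in cur]


matches = find_files(conn, "author", "Gathmann, Jens")
```

Unlike a serialized array, each property is its own row here, so a query on a single property does not need to deserialize anything.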
Conclusions
- exposing via XML dump can be quite complicated: how should the metadata be structured?
- we will expose image metadata via xml dumps (Chad's student). Step 1 is extracting the data and storing it in some sane format. Serialized arrays aren't sane.
- semantic stuff needs more pondering
- CatScan remains a stand-alone TCP server written in C
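One possible reading of "some sane format" in step 1 is a flat list of key/value pairs instead of a nested serialized array. The sketch below flattens nested metadata into dotted keys; the field names in the sample are hypothetical and no particular extraction library is assumed.

```python
def flatten(metadata, prefix=""):
    """Yield (key, value) pairs from nested dicts/lists using dotted keys."""
    if isinstance(metadata, dict):
        for key, value in metadata.items():
            yield from flatten(value, f"{prefix}.{key}" if prefix else str(key))
    elif isinstance(metadata, (list, tuple)):
        for index, value in enumerate(metadata):
            yield from flatten(value, f"{prefix}.{index}" if prefix else str(index))
    else:
        # Leaf value: store it as text so every row has the same shape.
        yield prefix, str(metadata)


# Hypothetical extracted metadata; the keys are illustrative only.
extracted = {
    "IPTC": {"Byline": "Gathmann, Jens", "Keywords": ["Bonn", "1967"]},
    "XMP": {"dc": {"rights": "example"}},
}
pairs = list(flatten(extracted))
```

Each resulting pair can be stored as one row or emitted as one element in a dump, which is what makes it searchable.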
Examples
Current format (real example)
<page>
  <title>File:Bundesarchiv B 145 Bild-F024214-0006, Bonn, Landesvertretung Bayern, Kommunalpolitiker.jpg</title>
  <id>5453390</id>
  <revision>
    <id>37106144</id>
    <timestamp>2010-04-01T00:00:04Z</timestamp>
    <contributor>
      <username>BotMultichill</username>
      <id>211386</id>
    </contributor>
    <minor/>
    <comment>Adding author from {{BArch-description}} to {{BArch-License}}</comment>
    <text xml:space="preserve">== {{int:filedesc}} ==
{{Information
|Description={{BArch-description
|comment= <!-- add translations and/or more description -->
|biased=<!-- if the original description text is biased, write here why! -->
|headline=Bonn, Landesvertretung Bayern, Kommunalpolitiker
|caption=Bayerische Kommunalpolitiker mit Minister Höcherl in der Landesvertretung Bayern und Jugendgruppe Volkshochschule Hesselberg
|extra=
|people=
}}
|Source=Deutsches Bundesarchiv (German Federal Archive), {{BArch-link|B 145 Bild-F024214-0006}}
|Author=Gathmann, Jens
|Date=1967-03-15
|Permission=[[Commons:Bundesarchiv]]
|other_versions=
}}
=={{int:license}}==
{{BArch-License
|signature=B 145 Bild-F024214-0006
|batch=B 145
|author=Gathmann, Jens
|year=1967
|month=<!-- 03 (omitted to avoid overly detailed category structure) -->
|location=Bonn <!-- Please leave as is, add appropriate categories directly. Exception: if needed, change "location=" to "topic=". -->
|topic= <!-- Please leave as is, add appropriate categories directly. Exception: if needed, change "topic=" to "location=". -->
|PD=<!-- set this if you are sure the image is PD -->
}}
[[Category:Landesvertretung Bayern Bonn]]
[[Category:Photographs by Jens Gathmann]]</text>
  </revision>
  <upload>
    <timestamp>2008-12-10T23:32:51Z</timestamp>
    <contributor>
      <username>BArchBot</username>
      <id>465132</id>
    </contributor>
    <comment>== {{int:filedesc}} ==
{{Information
|Description={{BArch-description
|comment= <!-- add translations and/or more description -->
|biased=<!-- if the original description text is biased, write here why!
-->
|headline=Bonn, Landesvertretung Bayern, Kommuna</comment>
    <filename>Bundesarchiv_B_145_Bild-F024214-0006,_Bonn,_Landesvertretung_Bayern,_Kommunalpolitiker.jpg</filename>
    <src>http://upload.wikimedia.org/wikipedia/commons/9/98/Bundesarchiv_B_145_Bild-F024214-0006%2C_Bonn%2C_Landesvertretung_Bayern%2C_Kommunalpolitiker.jpg</src>
    <size>47316</size>
  </upload>
</page>
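In the current format, categories are only available as links inside the wikitext of the <text> element. The sketch below shows one way a consumer might pull them out with a regular expression; the pattern and function name are assumptions for illustration, and categories assigned indirectly by templates would not be found this way.

```python
import re

# Matches [[Category:Name]] and [[Category:Name|sort key]] links.
CATEGORY_LINK = re.compile(
    r"\[\[\s*Category\s*:\s*([^\]|]+?)\s*(?:\|[^\]]*)?\]\]",
    re.IGNORECASE,
)


def extract_categories(wikitext):
    """Return the category names linked directly in the wikitext."""
    return [m.group(1) for m in CATEGORY_LINK.finditer(wikitext)]


# The two category links from the example page above.
sample = (
    "[[Category:Landesvertretung Bayern Bonn]] "
    "[[Category:Photographs by Jens Gathmann]]"
)
categories = extract_categories(sample)
```

Having to scrape the text like this is part of why the notes look for a structured way of exposing metadata.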
Daniel's suggestion for XML format of metadata:
<page...>
  <revision>...
    <data ref:about="revision-uri">
      <rdf:item rdf:property="some uri">...value...</rdf:item>
    </data>
    <data ref:about="page-subject-uri">
      <rdf:item rdf:property="some uri">...value...</rdf:item>
    </data>
    <data ref:about="upload-uri">
      <rdf:item rdf:property="some uri">...value...</rdf:item>
    </data>
  </revision>
  <data ref:about="page-uri">
    <rdf:item rdf:property="some uri">...value...</rdf:item>
  </data>
  <upload>
    <data ref:about="upload-uri">
      <rdf:item rdf:property="some uri">...value...</rdf:item>
    </data>
  </upload>
</page>
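To make the suggested shape concrete, here is a minimal sketch that renders one <data> block from a list of property/value pairs. The element and attribute names are copied from the suggestion above; the function name, the property URIs and the omission of namespace declarations are assumptions of this example, not part of any dump schema.

```python
from xml.sax.saxutils import escape, quoteattr


def data_element(about, properties):
    """Render one <data> block in the shape suggested above.

    `about` is the URI the block describes; `properties` is an iterable of
    (property_uri, value) pairs. Namespace declarations are left out, as in
    the notes.
    """
    lines = [f"<data ref:about={quoteattr(about)}>"]
    for prop_uri, value in properties:
        lines.append(
            f"  <rdf:item rdf:property={quoteattr(prop_uri)}>"
            f"{escape(value)}</rdf:item>"
        )
    lines.append("</data>")
    return "\n".join(lines)


# Hypothetical property URIs, used only for illustration.
xml_block = data_element(
    "upload-uri",
    [("urn:example:author", "Gathmann, Jens"), ("urn:example:year", "1967")],
)
```

Escaping the values is the important detail: metadata extracted from files can contain characters such as & or < that would otherwise break the dump.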