Wikibase Community User Group/Meetings/2021-12-30

Online meeting of the Wikibase Community User Group.

Schedule

16:00 UTC (17:00 Berlin), 1 hour, Thursday 30th December 2021
Notes on Etherpad
Video

Agenda

This month we will have a guest presentation by the team at The Semantic Lab at Pratt Institute. They will present how they are using Linked Open Data (LOD) in their projects with the help of Wikibase. All are welcome!

Participants

Sam Donvil (meemoo institute for archives) (BE)
Mohammed/Masssly (WMDE)
Julie Carlson (Semantic Lab at Pratt)
Cristina Pattuelli (Semantic Lab at Pratt)
Giovanni Bergamin
Laurence 'GreenReaper' Parry (WBUG/WikiFur)
Rob Hudson (Carnegie Hall Rose Archives)
Lozana Rossenova (TIB)
Jessika Davis (Semantic Lab at Pratt)

Notes

The Semantic Lab at Pratt Institute
- Walking through projects and how Wikibase is used in support of this project
- Presentation by Julie
Session recorded after participant agreement
Julie is a graduate student at Pratt in New York and has been a research assistant at Pratt for the last year.
Focused on the intersection of linked open data and cultural heritage in GLAM
Team led by Cristina
Linked Jazz: began in 2011, then joined the Semantic Lab, which leads it and other projects.
- Other projects: DADAlytics, ....
  - Presented at WikidataCon and LD4Con
- Has used Wikibase (WB) as a data repository since 2019, with Linking Lost Jazz Shrines and Women of Jazz
- Team has worked with RDF since 2011 and is familiar with it and with graph data; also saw Rhizome using Wikibase
- Flexible data modelling: properties are proposed after discussion, as on Wikidata. The team models multiple projects and needs that flexibility for multiple data sets. There is also a lot of turnover among grad students, each adding input.
- Versatile data sources: oral histories, newspapers, photos across art and music domains. Wikibase makes it easy to ingest new data.
E.A.T. + LOD Project - Art and Technology, now collaborating with Robert Rauschenberg Foundation Archives
Transforming archival docs into linked data. Example: six individuals in a photo looking at a piece of technology, a wireless transmission tool; all the information in it needs to be represented as linked data.
Photos, notebooks, project descriptions, etc. Could manually enter data but would be cumbersome at scale, so developed tools to scale this process.
Digitized document -> Pomodoro (OCR) generates .txt -> hand-coding relationships -> QuickStatements creates entities & properties -> custom tool 'Selavy' to build triples -> Wikibase
https://pomodoro.semlab.io/
http://159.89.242.202:3000/
Example: a tool created by E.A.T. for performance art: tennis rackets outfitted with RF so that the vibrations from a racket hit are transmitted to speakers in the event space.
Pomodoro example (demo at pomodoro.semlab.io): after a high-quality JPG is obtained, upload it and Pomodoro will put the text in orange blocks; highlight and select to move text into a separate block. Designed to work with newspapers or even handwritten documents. Notice anomalies in the text and preserve it exactly as it appears (including misspellings and crossed-out sections); add these as aliases in Wikibase. Follow the LoC model for illegible text (???) or crossing out [in brackets]. The output is text manually formatted by the user.
Relationship coding: identify entities (people, places, technology) and model triples. This is often the bulk of the work; as on Wikidata, there are many discussions about appropriate properties. It is time-intensive because of the research into things mentioned in documents: who is W Kaminsky, etc.? Bill (William) Kaminsky. Research other properties from Wikidata, Carnegie Hall, the Music Ontology and Schema.org, hoping to make it easier to be interoperable down the line. The E.A.T. project discusses modelling with the larger Semantic Lab team, Linked Jazz, etc., to make sure it fits a variety of projects and there is consistency between them. Create a relationship coding spreadsheet with QIDs and PIDs listed alongside, to see the statements that you hope to create. 'TBD' marks entities/properties not already in the Wikibase, which are then created with QuickStatements. This includes aliases and variations that may be associated with other documents.
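The spreadsheet-to-QuickStatements step described above can be sketched as follows. This is a minimal illustration, not the Semantic Lab's actual workflow: the row layout and the IDs P1 and Q5 are hypothetical placeholders, and it assumes the tab-separated QuickStatements V1 syntax (a CREATE line followed by LAST lines for label, alias, and statements).

```python
# Sketch: turn relationship-coding rows into QuickStatements V1 commands.
# The row layout and all IDs (P1, Q5, Q42) are hypothetical placeholders.

def new_entity_commands(label, aliases=(), statements=(), lang="en"):
    """Return QuickStatements V1 lines that create one item.

    statements is an iterable of (property_id, value_id) pairs.
    """
    lines = ["CREATE", f'LAST\tL{lang}\t"{label}"']
    for alias in aliases:
        lines.append(f'LAST\tA{lang}\t"{alias}"')
    for prop, value in statements:
        lines.append(f"LAST\t{prop}\t{value}")
    return lines


def rows_to_quickstatements(rows):
    """Emit commands only for rows whose QID is still marked 'TBD'."""
    out = []
    for row in rows:
        if row["qid"] == "TBD":
            out.extend(
                new_entity_commands(
                    row["label"],
                    row.get("aliases", ()),
                    row.get("statements", ()),
                )
            )
    return "\n".join(out)


rows = [
    {
        "qid": "TBD",
        "label": "William Kaminsky",
        "aliases": ["Bill Kaminsky", "W Kaminsky"],
        "statements": [("P1", "Q5")],
    },
    # Rows that already have a QID are skipped.
    {"qid": "Q42", "label": "already in the Wikibase"},
]
batch = rows_to_quickstatements(rows)
print(batch)
```

Name variants found in the documents go in as aliases, so later documents that use a different spelling can still be matched to the same entity.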
After the entities/properties are created, return with the text file to Selavy (demo). You can upload the text and go through four steps, including cleanup, block breaks (in a longer document you can reference the exact section a triple comes from), and named entity recognition (the text is processed in an attempt to tag entities from the Wikibase or from Wikidata generally; it tries to cluster persons, organizations, and other things, in this case consumer goods such as tennis racket and FM transceiver). Things with a dotted underline might be entities but were not matched. Coloured sections have been linked and matched. Entities that were missed can be added on the right (load entities). So for Engineer, for example, you can add the QID for Engineer and for Open Score (the project being described); they are "painted" (and will become coloured in a future version). Tennis rackets are used in Open Score, so you can add qualifiers, e.g. Kaminski is a contributor, in the role of engineer, to the tennis rackets entity.
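The qualifier example above (a contributor statement on the tennis rackets entity, qualified with a role) can be modelled roughly like this. It is a sketch under assumptions, not Selavy's output format, which the notes do not specify: all IDs are hypothetical, and the serialization assumes the QuickStatements V1 convention of appending qualifier property/value pairs after the main statement.

```python
from dataclasses import dataclass, field


@dataclass
class Statement:
    """One subject-property-value triple with optional qualifiers."""
    subject: str
    prop: str
    value: str
    # (property_id, value_id) pairs that qualify the main statement
    qualifiers: list = field(default_factory=list)

    def to_quickstatements(self):
        """Serialize as one tab-separated QuickStatements V1 line."""
        parts = [self.subject, self.prop, self.value]
        for q_prop, q_value in self.qualifiers:
            parts.extend([q_prop, q_value])
        return "\t".join(parts)


# Hypothetical IDs: Q10 = tennis rackets, P20 = contributor,
# Q30 = Kaminski, P40 = role, Q50 = engineer.
stmt = Statement("Q10", "P20", "Q30", qualifiers=[("P40", "Q50")])
line = stmt.to_quickstatements()
print(line)
```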
Eventually these will be fed directly into Wikibase, but they are currently added manually. References are to come alongside them, so you can see the link from the statements to the archival doc. You can then run a query on e.g. depictions or main subject of tennis rackets. The documents found by the SPARQL query (using the standard query interface) are three photographs linked to that project, with people who may also be associated with the project. You can also run queries to list all the people associated with such photographs, which can then be used to illustrate the web of relations between them (and depicted in different formats).
The possibilities increase as more data is ingested. For researchers this can be very useful, and the links between projects can be explored, e.g. a board member of E.A.T. is a member of the American Federation of Musicians. Pedagogical value of Wikibase: it allows students to learn about the principles of linked open data and then implement them on the same platform. Appreciate the collaboration with the Wikibase community.
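A query of the kind described above (documents that depict a given entity) might look roughly like the following. This is only a sketch: the base URI, the property ID P100 for "depicts" and the item ID Q10 for tennis rackets are hypothetical, and the prefixes assume the usual Wikibase RDF layout of `/entity/` and `/prop/direct/` under the instance's concept URI.

```python
def build_depicts_query(base_uri, depicts_pid, subject_qid, lang="en"):
    """Build a SPARQL query listing items that depict a given entity.

    base_uri, depicts_pid and subject_qid are site-specific; the values
    used below are placeholders, not the Semantic Lab's real IDs.
    """
    return f"""PREFIX wd: <{base_uri}/entity/>
PREFIX wdt: <{base_uri}/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?doc ?docLabel WHERE {{
  ?doc wdt:{depicts_pid} wd:{subject_qid} .
  OPTIONAL {{
    ?doc rdfs:label ?docLabel .
    FILTER(LANG(?docLabel) = "{lang}")
  }}
}}"""


query = build_depicts_query("https://wikibase.example.org", "P100", "Q10")
print(query)
```

The resulting string can be pasted into the query interface of the Wikibase in question once the placeholder IDs are replaced.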

Questions

The issue of federation arose: we wonder whether federation is appropriate and how it would work. Some entities are niche, but there might be overlap in others.
We have questions about Wikibase Docker: how easy it is to upgrade, and how to customize the Wikibase to add extensions.
Lastly, there is a bit of a learning curve for SPARQL and QuickStatements. As a student-based team we are constantly retraining people, and those who are not familiar with linked open data and SPARQL might find it hard to interact with the data.
Next steps: What parts of the data might be valuable to share with the community? New ways of engaging users are needed; queries are a barrier unless users are familiar with them. Subproject example: citation information. An interface was created (by a specific person on the team, Matt?) to search and sort the data as desired.
Pomodoro is up and running; Selavy is still in development. You would need a login for the Wikibase, and we would want to know what the use-case is.
Lozana: Thanks for presentation, for context I've worked on Rhizome Artbase, was interested to learn it was an influence.
Selavy: What is the output format? How does the ingest work? The tools I work with usually use a specific format; are you outputting raw RDF that you upload with scripts, how does the tool work? -- Matt Miller is finalizing it, but we will add notes to the discussion and loop back to this group. -- Would be great to learn more and to try to abstract the tool so it is not just for your own Wikibase, but open for others to use. People in the GLAM community in particular, do you want to collaborate? -- Matt is hoping that it will be more widely useful; don't know about the timeline, but it's something we are focussed on. -- Cristina: The final goal is a versatile tool that others can use for their own use-cases; we are not there yet, but this is the goal.
Lozana: How are you handling media, are you simply describing media files or including them in the database? -- At this point we don't have images uploaded; there has been some discussion of copyright issues. We'd love to be able to host the images of the archival documents, but are first working out which rights we do and don't have.
Lozana: The federation question was interesting to me. In WBSG we talked with the EU Knowledge Graph team, who have a syncing facility to draw on Wikidata; this is one-way federation, but within the group we discussed how it could be extended to a two-way method. Do you want to push data to Wikidata, pull data, or federate with other Wikibases or culture organizations? -- Mostly discussed pushing data to Wikidata; might be interested in properties or other cultural institutions. -- L: What we talked about is using OpenRefine, but it's not that automated; the EU Knowledge Graph tool offers automated syncing and checks items/properties every five minutes. The problem is always keeping the data fresh, so that it is not just a single upload after which things fall off. Also, for federation, is it about pulling/pushing data or is it about search? If search, since you do have the SPARQL endpoint, it is potentially possible to query across SemLab and Rhizome. -- Haven't tried it but would be of interest. -- Docker updates: Mohammed would be the best person to talk about that? On extensions, WBSG is focusing on building extensions in the following year, and has already released the Local Media extension. (Lozana then says she will stop being marketing for WBSG ;-)
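Querying across two Wikibases, as suggested above, is usually done with a SPARQL 1.1 `SERVICE` clause. The sketch below only illustrates the shape of such a query and was not discussed in the meeting: the endpoint URL and both property IRIs are hypothetical, and it assumes each Wikibase stores a shared external identifier (for example a Wikidata ID) that can be used to join items. Whether the local query service is permitted to call a given remote endpoint depends on its configuration.

```python
def build_federated_query(remote_endpoint, local_id_prop, remote_id_prop):
    """Build a SPARQL query joining local and remote items on a shared ID.

    All three arguments are full IRIs; the values used below are
    placeholders, not real SemLab or Rhizome addresses.
    """
    return f"""SELECT ?local ?remote ?sharedId WHERE {{
  ?local <{local_id_prop}> ?sharedId .
  SERVICE <{remote_endpoint}> {{
    ?remote <{remote_id_prop}> ?sharedId .
  }}
}}"""


federated = build_federated_query(
    "https://remote.example.org/sparql",
    "https://local.example.org/prop/direct/P1",
    "https://remote.example.org/prop/direct/P7",
)
print(federated)
```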
Mohammed: What tools would you like help with, do you have specifications etc.? -- Not currently, but will take that to Matt Miller to see if there are tools they would need help with.
Laurence: [Proposed that the second version of federation might be worth investigating as you could then import/export just individual properties, +notes that security may also be an issue, got to keep up to date]
Lozana: If you want, can you scan a newspaper and curate the flow of blocks manually? -- Yes, the demo includes that: when ingesting a newspaper article, you highlight blocks in a certain order to preserve the flow of text. -- Is it only working with English or with other languages too? -- Haven't tried putting other languages through, will have to try! -- Was asking because, if you plan to open-source it, I know of a use-case in Germany: research in Asian art and culture of the Chinese Republican era. I have been talking a lot with them about OCR recently; they handle very complex layouts, and Chinese is an extra challenge. I really like the Pomodoro tool, and if open-sourced it might be extended to other languages. -- Will take a look. -- Mohammed: Especially languages outside Europe and North America, that would be amazing. -- Lozana: Yeah, I was just thinking about academic projects, but I'm also talking with Bengali community members on Wikimedia Commons; books are uploaded from the British Library but the OCR is pretty bad, so local community members are doing manual work on manuscripts, illustrations, etc.
Mohammed: Telegram chat for Wikibase Community which may help.
Session recording