Massively-Multiplayer Online Bibliography

From Meta, a Wikimedia project coordination wiki

Massively-Multiplayer Online Bibliography (MMOB) is the name for a series of projects aiming to perform significant feats of online bibliography in a fun, collaborative, and principled way, that would be useful to everyone and acceptable to professionals. It will rely on volunteer labor, free software, and open Web standards.

The Aboutness Project

The Need

There are hundreds of millions of essays and articles out there. Many of them are already available online, in one or another of the large free repositories, such as Project Gutenberg, the Internet Archive, the Hathi Trust, etc.

However, while their full text is available, searchable, and indexed by search engines -- hence discoverable when searching by words in the text -- there is no good way to distinguish between the tens of thousands of articles that mention Timbuktu and the significantly fewer that are about Timbuktu.

"Oh, but Google will send you to good relevant articles about Timbuktu", you might say. That's most probably true for most topics. However, because of the way PageRank and similar algorithms work, it would only send you to those resources already identified as relevant or useful by people linking to them, which reinforces and recirculates largely the same group of resources for most queries. How would we ever discover additional resources already available to us in the large and growing open repositories?

Traditional library catalogues offer (human-curated) "aboutness" statements for catalogue items. However, catalogue items are typically book volumes rather than individual essays and articles. Thus, a library catalogue will tell us Francis Bacon's Essays (Dover edition) is about "1. English essays -- Early modern. 1500-1700." Not terribly useful, is it? That alone does not tell us that some of these essays are about "truth", "envy", "sedition", "revenge", etc.

T. S. Eliot's The Sacred Wood is about "Criticism" and "Literature", according to the Library of Congress, but this does not help us discover the influential essays on "Hamlet and His Problems" or "Tradition and the Individual Talent" inside. Let us also remember that even tables of contents are not enough: another essay in Eliot's book is called "A Romantic Aristocrat"; a fine title, but it gives no clue as to who it is about. Lord Byron, perhaps?

Conversely, if we had an extensive data collection of "aboutness" statements (essay X is about topic Y), much of the currently-invisible cultural wealth already available online would become discoverable, and could therefore be found, read, used, discussed, and built upon, once again enriching our present and future culture and research. It would tell us, for example, that "A Romantic Aristocrat" is in fact about George Wyndham, and it would contribute to a large collection's ability to answer the question "What works do you have that are about (not just mention) George Wyndham?". Wouldn't that be tremendously helpful?

The Proposed Solution

Wouldn't it be nice to be able to browse a huge list of essays -- by language, period, author, title -- pick one that you'd like to read, read it, and then pick one or more topics it was about, from a standardized tree-like list of topics? Or to go over other people's previous classifications and endorse or question them with a single click, to help create a more robust result?

We can build such a collection, one essay and one "aboutness" statement at a time. And we can do so in a way that builds on and interoperates with other large-scale bibliographic efforts, so nothing is wasted.

Essentially, we would build a crowdsourced curation system that would attach multiple "aboutness" values to each individual work (article, essay):

  • Volunteers would read a work, pick a language to classify in (remember: different classification schemes break the universe down into different ontologies), pick a classification source to classify by, where more than one is available (e.g. Library of Congress Subject Headings, Wikipedia article titles, Library of Congress item titles), be presented with a convenient, browsable, navigable, searchable tree-like view of classifications, and select one or more classifications to attach to the work.
  • Volunteers would also be able to "upvote" or "downvote" other volunteers' classifications, to help gain confidence in some classifications over others. (This later allows a user searching for material to constrain the search to, for example, only works that have a particular classification at confidence level 3 or more, if an unconstrained search produced too many false positives.)
  • Users would be able to search for materials on the open Web according to one or more of these classifications.
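
The voting and confidence-constrained search described above could be modelled roughly as follows. This is only a minimal sketch: the `Classification` class, the `search` function, the sample work URIs, and the rule that confidence equals upvotes minus downvotes are all assumptions made for illustration, since the proposal does not yet define how a confidence level is computed.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class Classification:
    """One "aboutness" statement: the work at work_uri is about topic_uri."""
    work_uri: str
    topic_uri: str
    upvotes: int = 0
    downvotes: int = 0

    @property
    def confidence(self) -> int:
        # Assumption: confidence is simply upvotes minus downvotes.
        return self.upvotes - self.downvotes


def search(statements: Iterable[Classification], topic_uri: str,
           min_confidence: Optional[int] = None) -> List[str]:
    """Return the sorted URIs of works classified under topic_uri.

    When min_confidence is given, only classifications at or above that
    level are counted; otherwise every classification matches.
    """
    return sorted({
        s.work_uri
        for s in statements
        if s.topic_uri == topic_uri
        and (min_confidence is None or s.confidence >= min_confidence)
    })


# Hypothetical sample data; the work URIs are placeholders.
HAMLET = "http://www.wikidata.org/wiki/Q41567"
statements = [
    Classification("http://aboutness.org/work/example_1", HAMLET, upvotes=4, downvotes=1),
    Classification("http://aboutness.org/work/example_2", HAMLET, upvotes=1, downvotes=0),
]

unconstrained = search(statements, HAMLET)
constrained = search(statements, HAMLET, min_confidence=3)
print(unconstrained)
print(constrained)
```

With this sample data, the unconstrained search returns both works, while constraining to confidence level 3 or more keeps only the first.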

Bibliographical Aspects

Stable URIs

To create an aboutness statement, we need stable identifiers for both the individual work and the topic. Most databases do not, today, catalog at the individual work level, so there's much work to be done.

  • We can begin with ad-hoc subdivisions (e.g. Project Gutenberg text number N, article number M, can be made into an ID of the form http://aboutness.org/work/pg_N_item_M)
  • We can also begin by working only on essays contained in databases that do catalog at the work level (e.g. Project Ben-Yehuda [disclosure: Ijon is its founding editor])
  • Gradually, the Table of Contents for Everything project will provide us with stable URIs for more and more essays and articles we can classify.
  • As always in linked open data, sameAs relationships can subsequently be established between whatever URIs we end up using and the URIs of major databases (e.g. Library of Congress), when they get around to cataloging at work level.
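
As a rough illustration of the ad-hoc ID form mentioned in the first bullet, the sketch below builds and parses such identifiers. The function names are hypothetical, and the text and item numbers used are placeholders rather than real Project Gutenberg entries.

```python
import re
from typing import Tuple

# Base taken from the example ID form http://aboutness.org/work/pg_N_item_M
BASE = "http://aboutness.org/work/"
_PG_PATTERN = re.compile(r"^http://aboutness\.org/work/pg_(\d+)_item_(\d+)$")


def gutenberg_work_uri(text_number: int, item_number: int) -> str:
    """Build a work-level ID for item M inside Project Gutenberg text N."""
    if text_number < 1 or item_number < 1:
        raise ValueError("text and item numbers must be positive")
    return f"{BASE}pg_{text_number}_item_{item_number}"


def parse_gutenberg_work_uri(uri: str) -> Tuple[int, int]:
    """Recover (text_number, item_number) from an ID built above."""
    match = _PG_PATTERN.match(uri)
    if match is None:
        raise ValueError(f"not a pg_N_item_M work URI: {uri}")
    return int(match.group(1)), int(match.group(2))


# Placeholder numbers, for illustration only.
uri = gutenberg_work_uri(1234, 5)
print(uri)
print(parse_gutenberg_work_uri(uri))
```

Keeping the ID reversible in this way would make it straightforward to establish sameAs links later, because the source text and item number can always be recovered from the URI.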

Subject authority data

There are already several authority files (i.e. sets of data including possible "subject headings" one might assign to a work for an "aboutness" statement) from libraries and related institutions published as (linked) open data on the web. For an overview see the datasets tagged with "authorities" on the Data Hub. Datasets from other institutions (e.g. Wikidata) might be relevant as well.

Simple examples

T.S. Eliot's "Hamlet and His Problems" -- could be classified as ABOUT (or dcterms:subject, etc.) --

  1. http://id.loc.gov/authorities/names/n80008522 -- "Hamlet (work)" (from the Library of Congress Name Authority File)
  2. http://id.loc.gov/authorities/subjects/sh85058566 -- "Hamlet (Legendary character)" (this is from the Library of Congress Subject Headings)
  3. http://id.loc.gov/authorities/subjects/sh2008112835 -- "Theater--England--History--16th century" (likewise)
  4. http://www.wikidata.org/wiki/Q2447542 -- "Prince Hamlet" (an item on Wikidata, about the fictional character Hamlet) -- sufficient to retrieve multi-lingual labels, link to Wikipedia articles, etc.
  5. http://www.wikidata.org/wiki/Q41567 -- "Hamlet" (an item on Wikidata, about the play by Shakespeare) -- likewise
  6. http://viaf.org/viaf/176993890 -- "Hamlet (work)" (from VIAF)
  7. http://d-nb.info/gnd/4099350-4 -- "Hamlet (work)" (from GND)
  8. http://d-nb.info/gnd/118545345 -- "Hamlet (fictive person/legendary figure)" (from GND)
  9. http://data.bnf.fr/ark:/12148/cb11936813g -- "Hamlet (work)" (from Rameau)
10-19. other examples in English or other languages

All of these classifications are stored (either as Linked Data triples or in some conventional RDBMS [exposable as triples]) and can then be reviewed, revised, upvoted/downvoted, and of course searched.
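
To make the idea of storing classifications as triples more concrete, here is a minimal sketch that serializes aboutness statements as N-Triples using dcterms:subject as the predicate. The function name and the work URI are hypothetical; the topic URIs are taken from the list above.

```python
from typing import Iterable

# Dublin Core Terms "subject" property, i.e. dcterms:subject
DCTERMS_SUBJECT = "http://purl.org/dc/terms/subject"


def aboutness_ntriples(work_uri: str, topic_uris: Iterable[str]) -> str:
    """Serialize one work's aboutness statements, one N-Triples line per topic."""
    return "\n".join(
        f"<{work_uri}> <{DCTERMS_SUBJECT}> <{topic}> ."
        for topic in topic_uris
    )


# Hypothetical work URI standing in for "Hamlet and His Problems".
work = "http://aboutness.org/work/example_hamlet_essay"
topics = [
    "http://id.loc.gov/authorities/subjects/sh85058566",
    "http://www.wikidata.org/wiki/Q41567",
]
print(aboutness_ntriples(work, topics))
```

The same statements could equally be kept as rows in a relational table and exported in this form on demand.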

Important note: The above is a mixture of library metadata and linked data geekery. If it makes no sense to you, please don't worry, you can still be involved in the project!

Relevant Data sets

Please help collect some information about available text repositories we might begin classifying.

| Web site | Open texts? | Stable URIs? | Work-level URIs? | Notes |
|---|---|---|---|---|
| Wikisource | yes | yes | it's complicated :)[1] | multi-lingual |
| Project Gutenberg | yes | yes | no | mostly in English, but other languages as well |
| Project Runeberg | yes | yes | no | mostly in Swedish |
| Project Ben-Yehuda | yes | yes | yes | all in Hebrew |
| Internet Archive | yes | yes | no | all languages |
| PubMed Central (Open Access Subset) | yes | yes | yes | English; biomedical research articles |
| ... | ... | ... | ... | ... |

There is a list of open collections curated by the OpenGLAM initiative that may be of interest in this context. Only a few of them are collections of textual material (mostly manuscripts); most are collections of digitized works of art, digitized photographs (sometimes containing manuscripts), digital sound, or digitized comics.

Technological Principles

  • All work will happen on the Web, via a modern browser. (i.e. no required downloads, no Flash, no IE6 :))
  • The Aboutness Project is humble: it seeks to create value in an underserved area (discoverability of non-academic non-fiction resources), in a non-exclusive and non-authoritative manner, and it makes no claim for being comprehensive (yet).
  • The Aboutness Project is a good netizen: we build on free software and open resources, and we aim to not duplicate efforts or reinvent wheels. We give back: our code and data will be placed in the public domain (and/or CC0).
  • The Aboutness Project starts with low-hanging fruit: We start with resources that are readily available with work-level URIs (e.g. some works on English Wikisource), and with authority data that's available and open. We'll learn as we go, and will gradually reach for higher fruit.
  • much more TBD

Technical Questions

  • Where do we have the conversation? -- on this wiki page? On a mailing list (which?)?
  • Shall we store the aboutness triples on Wikidata? We can share them or publish them in any number of ways, but what is to be our primary store? (storing them on Wikidata means a Wikidata item for every essay!)
  • Consider using CiTO[2] or something like that?

How can I help?

Right now we're still hatching the idea. But down the road we'll need:

  • library metadata and linked open data geeks (MARC, Dublin Core, FRBR, SKOS, RDA, OAI-PMH, etc.)
  • Web hackers (Ruby, Python, Javascript, PHP)
  • UI designers, usability experts, graphics artists
  • outreach volunteers (bloggers, social media gurus, Wikimedians, librarians)

I'm interested!

Great! Please sign your username below, and we'll get in touch when we set up a mailing list or something. Also, add this page to your watchlist and participate in the brainstorming! :)

  1. Asaf Bartov
  2. Ole Palnatoke Andersen
  3. Noopur
  4. Aubrey
  5. Bob Kosovsky
  6. Ed Summers
  7. Kevin Clarke
  8. Mathias Schindler (talk)
  9. Adrian Pohl
  10. Pascal Christoph
  11. Rene Wiermer
  12. User:brest
  13. Micru
  14. Pepato
  15. Lukas Koster
  16. Laura Akerman
  17. Hila Levy
  18. Lambert Heller
  19. Jonathan Gray
  20. Sannita
  21. Luc Gauvreau
  22. Scott Morrison
  23. Susanna Giaccai
  24. William Gunn
  25. Ocaasi
  26. Chris Maloney (Klortho)
  27. Juergen Bunzel
  28. Peter Murray-Rust
  29. User:Maximilianklein
  30. User:Freemoth
  31. User:Ham II
  32. Hay Kranen (Husky)
  33. Ryan Shaw
  34. Liang (WMTW)

Parallel project: The Table of Contents for Everything

A volunteer project to make detailed digital tables of contents freely available to all, with stable URIs for each work, to serve e.g. in the Aboutness Project above.

The Need

A huge number of books are now available as either scans/PDFs or text, thanks to massive digitization projects such as the ones by the Internet Archive, Google, the Hathi Trust, etc.

Those projects focus on quantity over quality, leaving the meticulous improvement of metadata for later or, quite probably, never.

Among those books, the ones that are least well described by metadata are non-fiction collections -- essay collections, article anthologies, digests. That's because book-level metadata can never do justice to the multiple items inside.

The Aboutness Project (above) can help classify these individual works by their content, but it needs a way to refer to them in the first place, and that is not available for individual essays in the book-level resources exposed by the aforementioned services.

The solution

The "proper" solution would, of course, be to change the way the content hosts (Internet Archive etc.) operate, and add work-level cataloging and content-management. Since that is a formidable task and beyond our control, what MMOB can do about it is this:

We can create an extrinsic catalogue for these works, all pointing at the one (book-level) resource, but featuring individual data entities for every work (essay, article) inside. Our data entities (themselves metadata for the actual content at the original host) can then be used in The Aboutness Project. The data would be created by volunteers, typing (or proofreading OCRed) tables of contents and identifying authors (with VIAF etc.).

No less importantly, the data entities we produce can serve as the basis for the original hosts' catalogue, if and when they begin supporting work-level cataloguing.
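
One possible shape for such an extrinsic catalogue is sketched below: every entry points at the same book-level resource but carries its own work-level identifier. The `WorkEntry` class, the `catalogue_from_toc` helper, the URI scheme, and the book URL are all hypothetical, and the three titles are only a partial, illustratively ordered excerpt of the essays in The Sacred Wood.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional

BASE = "http://aboutness.org/work/"  # assumed base for work-level IDs


@dataclass(frozen=True)
class WorkEntry:
    """Metadata for one work (essay, article) inside a book-level resource."""
    book_id: str                       # short local ID for the book (hypothetical)
    book_url: str                      # the one book-level resource at the original host
    position: int                      # 1-based position in the typed table of contents
    title: str
    author_viaf: Optional[str] = None  # VIAF URI of the author, once identified

    @property
    def uri(self) -> str:
        return f"{BASE}{self.book_id}_item_{self.position}"


def catalogue_from_toc(book_id: str, book_url: str,
                       titles: Iterable[str]) -> List[WorkEntry]:
    """Turn a typed (or proofread OCRed) table of contents into work entries."""
    cleaned = [t.strip() for t in titles if t.strip()]  # drop blank lines first
    return [
        WorkEntry(book_id, book_url, position, title)
        for position, title in enumerate(cleaned, start=1)
    ]


# Placeholder book URL; titles are a partial list in illustrative order.
entries = catalogue_from_toc(
    "sacred_wood",
    "https://example.org/books/sacred-wood",
    ["Tradition and the Individual Talent", "Hamlet and His Problems", "A Romantic Aristocrat"],
)
for entry in entries:
    print(entry.uri, "->", entry.title)
```

Because each entry keeps the book-level URL alongside its own identifier, the records could later be handed to the original host as a starting point for work-level cataloguing.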

More description TBD.

See also

References