Wikicite/Wikicat

This project was described in 2006, related to the Wikicite proposal of that time.

Wikicat is the bibliographic catalog used by the Wikicite and WikiTextrose projects. It will be implemented as a Wikidata dataset using a datamodel design based upon IFLA's Functional Requirements for Bibliographic Records: final report (FRBR)[1], the various ISBD standards, the Library of Congress's MARC 21 specification, the Anglo-American Cataloguing Rules' The Logical Structure of the Anglo-American Cataloguing Rules and Resource Description and Access (RDA), and the International Committee for Documentation (CIDOC)'s Conceptual Reference Model (CRM)[2]. The history and inter-relation of these various cataloging standards are described in RDA presentations.

Purpose

Though Wikicat is an adjunct to several other projects, the reason for having a bibliographic catalog goes beyond the need to cite: there must also exist the ability to annotate and, at a minimum, summarize those works that are obscure, hard to obtain, or in a language different from the user's native one. There is also the need to facilitate the navigation of the information universe so that significant related works (sequels, alternate editions, criticisms and overviews) can be easily discovered and correctly situated with respect to the task at hand.

Wikicat will support these important functions and so become a useful resource in its own right.

Design

Wikicat employs a hybrid design, indebted as much to IFLA's Functional Requirements for Bibliographic Records (FRBR) as to AACR2's The Logical Structure of the Anglo-American Cataloguing Rules. Though the two specifications naturally overlap at times, in many cases they are perfectly complementary, with FRBR concentrating on the conceptual and content characteristics of resources and The Logical Structure... carefully detailing their physical and embodied aspects, particularly their fabrication and distribution. For example, though an entity for artistic/intellectual content is present in both models, only in FRBR is it abstracted, through the Work entity, into the commonalities that exist between largely identical representations of that content (e.g. alternate editions) as the creation evolves over time. Only AACR2's The Logical Structure..., though, describes precisely how that content is formatted, imprinted on a physical carrier, and then released into the world.

By embodying both models Wikicat will thus be able to describe a resource from both perspectives: the conceptual as well as the "real-world".

Entities

The primary entity within Wikicat is the resource. Though the most typical sort of resource is a collection of writing (i.e. a book), the concept extends beyond solely linguistic artifacts to include such objects as recorded music, artwork, maps, and even naturally occurring realia such as fossils.

A resource is modeled using the following entities: Work, Expression, Manifestation, and Item. In brief, a Work is a "distinct intellectual or artistic creation" (FRBR, p. 16); an Expression is "the intellectual or artistic realization of a work" in a fixed format such as alpha-numerical, musical, or cartographic notation (FRBR, p. 18); a Manifestation is the fixed "physical embodiment of an expression of a work" (FRBR, p. 20); and an Item is "a single exemplar of a manifestation" (FRBR, p. 23).

These four entities correspond to the important ways in which resources are produced, handled, and experienced. The Work and Expression entities consider only the artistic/intellectual qualities of a resource, while the Manifestation and Item entities describe its physical and embodied characteristics.

In particular, the Work entity is nothing more than the intellectual/artistic commonality of several basically similar Expressions and reflects the fact that when we speak of a resource we usually do not bother distinguishing between the minor variations that exist across different editions or revisions of it. For example, references in English literature to The Odyssey often do not bother specifying the particular translation being used; and even academic references to Rawls's A Theory of Justice omit the particular edition cited, despite some variations in content between them, etc.

On the other hand, the Manifestation entity reflects the fact that for every Item there is a specification which completely describes the Item's intellectual content and physical characteristics. For many pre-modern resources, a Manifestation will have only a single Item exemplifying it, while modern, mass-produced resources have thousands of essentially identical Items for each Manifestation. Just as Expression is typically the lowest-level entity of concern in the context of intellectual and creative activity, the Manifestation entity is the primary entity in such domains as curatorship: outside of particularly strong idiosyncrasies in an individual Item (missing pages from a bad print-run, author's signature on the copy, etc.), librarians do not typically distinguish between particular copies of the exact same edition of Gravity's Rainbow in their collections.

The utility of this model is that it segments the aspects of a resource into an (optimally?) significant functional hierarchy: Manifestations are grouped according to the Expression they embody, as in most cases users are primarily interested in the content of a resource and not its form, and Expressions are grouped into a common Work to connect similar, historically-related content. Yet there are certain deficiencies in the model. One is that Work is not granular enough or, conversely, that Expression is too sharply defined. For example, FRBR does not directly support finding a Work in a particular language or, conversely, the set of all Expressions of a particular Work for one language. In the FRBR model, translation into a foreign language is no more significant than modification of the foreword to a new edition: both are treated as new Expressions of the same Work. Another issue is that the 4-tier FRBR model is often problematic for describing serials or continuing resources.
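One way to work around the language limitation is to record language as an attribute of each Expression and filter on it. The following is only a minimal sketch under that assumption: the class, its field names, the language codes, and the catalog entries are illustrative placeholders, not part of the Wikicat datamodel.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Expression:
    work: str         # title of the Work this Expression realizes
    language: str     # assumed language code, e.g. "en"
    description: str  # free-text label, for illustration only


def expressions_in_language(
    expressions: List[Expression], work: str, language: str
) -> List[Expression]:
    """Return the Expressions of one Work that are fixed in one language."""
    return [e for e in expressions if e.work == work and e.language == language]


# Placeholder catalog entries, not real bibliographic records.
catalog = [
    Expression("The Odyssey", "grc", "Greek text"),
    Expression("The Odyssey", "en", "an English verse translation"),
    Expression("The Odyssey", "en", "an English prose translation"),
    Expression("A Theory of Justice", "en", "first edition text"),
]

english_odysseys = expressions_in_language(catalog, "The Odyssey", "en")
```

Here english_odysseys holds the two English entries for The Odyssey, which is the "all Expressions of a particular Work for one language" query that FRBR does not directly support.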

Work

A Work, defined as a "distinct intellectual or artistic creation", is an abstract entity in that there is no single physically or linguistically fixed object representing that Work. Rather, a Work is the artistic and intellectual commonality of one or more resources as they are multiplied through translation, abridgment, revision, or any other process which does not substantially alter core content.

For example, A Christmas Carol is a distinct Work with the following Expressions:

w: Charles Dickens' A Christmas Carol
- e₁: text and illustrations for the first edition of A Christmas Carol by Charles Dickens and John Leech
- e₂: text and illustrations for A Christmas Carol: The Heirloom Edition by Charles Dickens, Jane Parker Resnick, Charles Birmingham

All film or TV adaptations of A Christmas Carol, however, are distinct Works since they include significant original and independent content:

w₁: Charles Dickens' A Christmas Carol
w₂: 1938 motion picture A Christmas Carol
w₃: 1988 motion picture Scrooged

In general, the following processes applied to a resource result in another resource realizing the same Work:

translation
annotation
abridgement
revision

Processes applied to a resource resulting in a resource realizing a different Work are:

parody
paraphrase
adaptation (for children, to different literary form, to different medium)

The Wikicat datamodel provides implicit clues as to which processes result in distinct Works. For example, form of notation is an Expression entity attribute, not a Work entity attribute. Therefore a conversion of A Christmas Carol into spoken word form does not necessarily constitute a new Work. Despite such guidelines it should be noted that "the concept of what constitutes a work and where the line of demarcation lies between one work and another may in fact be viewed differently from one culture to another" (FRBR, p. 16).
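The two lists of processes above can be sketched as a simple lookup. This is a deliberate simplification: as the FRBR quotation notes, the line between Works is culturally dependent, so a real cataloging rule would need editorial judgment. The function name and the set names are hypothetical.

```python
# Processes listed above as producing another realization of the same Work.
SAME_WORK_PROCESSES = {"translation", "annotation", "abridgement", "revision"}

# Processes listed above as producing a resource that realizes a different Work.
NEW_WORK_PROCESSES = {"parody", "paraphrase", "adaptation"}


def realizes_same_work(process: str) -> bool:
    """Return True if the process keeps the result within the same Work."""
    name = process.strip().lower()
    if name in SAME_WORK_PROCESSES:
        return True
    if name in NEW_WORK_PROCESSES:
        return False
    # Anything not covered by the two lists is left to a human cataloger.
    raise ValueError(f"unclassified process: {process!r}")
```

For example, realizes_same_work("translation") is True, while realizes_same_work("adaptation") is False, matching the treatment of the film versions of A Christmas Carol as distinct Works.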

There is no equivalent to Work in the AACR2 Logical Structure.

Expression

An Expression is an intellectually/artistically concrete entity, being the realization of a Work in fixed alpha-numeric, musical, choreographic, cartographic, etc., notation. Thus it is somewhat similar to the final galley proof used for a published book. Yet it is different from it in that an Expression has no physical characteristics: in the case of textual Expressions, for example, the Expression encompasses the words, sentences, and paragraphs of the creation, but not its font and font size and hence the number of pages it constitutes when in a particular physical format.

Expressions may be related by being realizations of the same Work. As the notation used to fix an Expression is one of its major attributes, different Expressions of the same Work can be created by fixing it in different forms. For example:

w: The Da Vinci Code by Dan Brown
- e₁: final proof text by author
- e₂: audio-book narration (unabridged) by Paul Michael
- e₃: audio-book narration (abridged) by Paul Michael

w: Madama Butterfly by Puccini
- e₁: original composer's score and lyrics
- e₂: June 16, 2006 performance by the San Francisco Opera

Finally, it should be noted that despite their abstractness, Expressions and Works are still historically-contingent entities in which the manner in which the content is produced is just as important as what that content consists in. For example, even if two virtually identical resources were realized independently by different creators they would still constitute distinct Works. Every Expression of the same Work (except the first, originating one) must therefore be created under the influence of some previous Expression, though it may be realized by a different creator each time, as in the case of translation into new languages.
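The derivation constraint described above can be checked mechanically. The sketch below assumes, purely for illustration, that each Expression of a Work records the identifier of the Expression it was derived from (None for the originating one); neither the mapping nor the function name comes from the Wikicat datamodel.

```python
from typing import Dict, Optional


def is_valid_derivation(derived_from: Dict[str, Optional[str]]) -> bool:
    """Check that one Work's Expressions form a single derivation tree.

    derived_from maps an Expression id to the id of the Expression it was
    created under the influence of, or to None for the originating one.
    """
    originals = [e for e, parent in derived_from.items() if parent is None]
    if len(originals) != 1:
        return False  # exactly one originating Expression is expected
    for expression in derived_from:
        seen = set()
        current: Optional[str] = expression
        # Walk back toward the originating Expression.
        while current is not None:
            if current in seen or current not in derived_from:
                return False  # a cycle, or a parent outside this Work
            seen.add(current)
            current = derived_from[current]
    return True
```

A chain such as {"e1": None, "e2": "e1", "e3": "e2"} passes, while two independent originals fail: under the rule above they would belong to distinct Works.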

Expression is equivalent to the Content entity in the AACR2 Logical Structure.

Manifestation

A Manifestation is the physical embodiment of an Expression of a Work. Thus it represents a physical specification for embodying fixed content. Because every exemplar of a particular Manifestation is effectively identical, Manifestation is typically the lowest-level entity of concern for mass-produced resources outside of specialized domains such as curatorship. In some instances there may only be a single Item exemplifying the Manifestation, as in the case of antiquarian manuscripts or most works of art. In such situations it is important to realize that the Manifestation is still not the Item itself, even if realistically there is no possibility of additional Items ever coming into existence to exemplify that Manifestation, as in the case of an oil painting.

Changes to the physical specification resulting in new Manifestations include:

formatting changes (e.g. typeface, font size, or page layout)
physical carrier changes (e.g. impression onto a CD rather than a vinyl record)

For example:

w: Blade Runner by Ridley Scott
- e₁: U.S. theatrical release
- e₂: director's cut
  - m₁: DVD release
  - m₂: HD-DVD release

Even inadvertent changes that occur during the production process which result in different exemplars (Items) constitute distinct Manifestations (FRBR, p. 22). Changes to Items which occur after production and release do not constitute different Manifestations, however.

Manifestation corresponds to the aggregate of the Infixion, Physical Carrier, and Container entities of the AACR2's Logical Structure.

Item

Item is the only absolutely concrete entity in the model, and is a single exemplar of a Manifestation. This entity captures the history of a resource after the production process. Sometimes an Item is significant because of its uniqueness, as in the case of most works of art. Other times the Item becomes significant through its embodied, post-production history, such as when a book is signed by its author. For example:

w: King James Bible
- e₁: 1611 text
  - m₁: American printing
    - i₁: copy owned by Herman Melville
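The four-tier hierarchy in the example above can be sketched as nested records. This is a minimal illustration only: the class and field names are assumptions, and the real Wikicat entities carry many more attributes and relations than a single label.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Item:
    note: str  # what makes this single exemplar notable


@dataclass
class Manifestation:
    label: str
    items: List[Item] = field(default_factory=list)


@dataclass
class Expression:
    label: str
    manifestations: List[Manifestation] = field(default_factory=list)


@dataclass
class Work:
    title: str
    expressions: List[Expression] = field(default_factory=list)


def items_of(work: Work) -> List[Item]:
    """Collect every Item reachable from a Work through the hierarchy."""
    return [
        item
        for expression in work.expressions
        for manifestation in expression.manifestations
        for item in manifestation.items
    ]


# The King James Bible example from the text, one entity per tier.
melville_copy = Item("copy owned by Herman Melville")
american_printing = Manifestation("American printing", [melville_copy])
text_1611 = Expression("1611 text", [american_printing])
king_james_bible = Work("King James Bible", [text_1611])
```

Navigating from the Work down to its Items is then a matter of walking the nested lists, as items_of does.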

Since an Item can consist of several distinct physical objects, its unity is pragmatically determined, typically according to how it is physically packaged as well as its release history. A box set of CDs is obviously one Item, but so are two separately bound volumes with no common sleeve/box that were issued and sold together.

Item is most closely equivalent to the Copy entity in the AACR2's Logical Structure.

Serials

A serial is something which exhibits seriality, i.e. it is issued incrementally over time. This can occur either over a fixed period (e.g. a series of 6 books, each treating different aspects of the series topic) or indefinitely, as in the case of most newspapers and periodicals. The issues of a serial can either be discrete with regard to the whole (both logically and physically, again as in the case of most newspapers and periodicals) or integrated, as in the case of databases or websites such as Wikipedia.

A fundamental design decision is whether to model serials or seriality. The former option requires the introduction of a new entity, Serial, while the latter entails adding seriality attributes to the entities of the 4-tier FRBR model. Unfortunately, issues with attempting the latter are not fully explored in FRBR ("In particular, the notion of 'seriality' and the dynamic nature of entities recorded in digital formats merit further analysis", FRBR, p. 6).

Operational Stages

Initially, Wikicat will only support data population through its import function, whereby a bibliographic record is fetched using the Z39.50 protocol and then mapped to the Wikicat data model, resulting in the creation of a new Manifestation entity. This process is necessary partly because Wikidata will not be sufficiently functional in its first iteration to support data entry.

Yet there are also organizational reasons for making Wikicat initially "read-only" to the community, the most important of these being the need for a training or at least orientation regime for new editors. Such a regime is not about preventing vandalism so much as ensuring that well-meaning users contribute content that follows Wikicat's own cataloging rules; without such standards in place, the data in Wikicat will quickly become incoherent. Thus cataloging rules and a training/orientation regime are among the most important "soft" deliverables of the project.

The following is the operational road-map for Wikicat as new functionality is delivered:

Stage 0.5

Import function test. The Cite extension is enhanced to support automatically fetching bibliographic header information when only a key (e.g. an ISBN) is provided to the Cite reference tag. The imported bibliographic data replaces the key within the article wiki-text, but is not stored within any database.

The purpose of stage 0.5 is to validate the correctness of MARC 21 import capabilities before such data begins populating the (initially) read-only Wikicat database. As the results of import are immediately visible within article wiki-text, errors should be quickly spotted by a large and vocal user-base.
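Before a key is sent to a bibliographic server, it could be checked locally so that obvious typing errors never trigger a fetch. The sketch below applies the standard ISBN-10 and ISBN-13 check-digit rules; the function name is hypothetical, and nothing in the proposal states that the Cite extension performs such a check.

```python
DIGITS = "0123456789"


def is_valid_isbn(key: str) -> bool:
    """Validate an ISBN-10 or ISBN-13 by its check digit."""
    s = key.replace("-", "").replace(" ", "").upper()
    if len(s) == 10:
        # ISBN-10: the last character may be X, standing for the value 10.
        if not all(c in DIGITS for c in s[:9]) or s[9] not in DIGITS + "X":
            return False
        values = [int(c) for c in s[:9]] + [10 if s[9] == "X" else int(s[9])]
        # Weights run from 10 down to 1; the weighted sum must divide by 11.
        return sum(v * w for v, w in zip(values, range(10, 0, -1))) % 11 == 0
    if len(s) == 13:
        if not all(c in DIGITS for c in s):
            return False
        # ISBN-13: weights alternate 1, 3; the weighted sum must divide by 10.
        return sum(int(c) * (3 if i % 2 else 1) for i, c in enumerate(s)) % 10 == 0
    return False
```

A key that fails this check can be rejected immediately; a key that passes may still not correspond to any record, so the import step has to handle an empty result as well.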

Stage 0.75

Wikicat imports high-quality bibliographic records from open bibliographic servers, typically in a MARC-flavored format using the Z39.50 protocol. The main source of data will be the Library of Congress's Voyager server, though other servers are available as well.

Despite its quality, such data is inherently limited: only the two lowest-level entities of the model (Manifestation and Item) are fully supported, and then only in a "flattened"/aggregate manner which makes populating the internal entities of a Manifestation (e.g. Infixion and Physical Carrier) impossible. Also, it will be difficult to correctly normalize data within the Wikicat model so that, for example, all Manifestations written by the same author point to the same Person entity in the datamodel. Though the Library of Congress addresses this issue through authority control of its records, it must be determined whether its system is consistent with those of other institutions whose records will also be imported into Wikicat.
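The flattened mapping described above might look roughly like the following sketch. It assumes the fetched record has already been parsed into a plain dictionary of MARC 21 tags and subfield codes; that input shape, the function names, and the output keys are illustrative assumptions, and the sample record is a hand-written placeholder rather than real server output. The tags used (020, 100, 245, 250, 260) are standard MARC 21 bibliographic fields.

```python
from typing import Dict, Optional

# Assumed input shape: MARC tag -> subfield code -> raw value.
MarcRecord = Dict[str, Dict[str, str]]


def subfield(record: MarcRecord, tag: str, code: str) -> Optional[str]:
    """Return one subfield value with trailing ISBD punctuation removed."""
    value = record.get(tag, {}).get(code)
    return value.strip(" /:;,") if value else None


def to_manifestation(record: MarcRecord) -> Dict[str, Optional[str]]:
    """Flatten a MARC 21 record into a single Manifestation-level entry."""
    return {
        "isbn": subfield(record, "020", "a"),       # ISBN
        "author": subfield(record, "100", "a"),     # main entry, personal name
        "title": subfield(record, "245", "a"),      # title statement
        "edition": subfield(record, "250", "a"),    # edition statement
        "publisher": subfield(record, "260", "b"),  # name of publisher
        "date": subfield(record, "260", "c"),       # date of publication
    }


# Hand-written placeholder record, for illustration only.
sample = {
    "100": {"a": "Rawls, John,"},
    "245": {"a": "A theory of justice /"},
    "260": {"c": "1971"},
}
entry = to_manifestation(sample)
```

Note that the author comes out as a plain string ("Rawls, John"); resolving such strings to a shared Person entity is exactly the normalization problem raised above and is not attempted here.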

The main goal at this stage of operations is to support citing sources in Wikipedia through the Wikicite project. As citation information is relatively minimal, Wikicat will create sparse entries in its dataset, taking particular care to avoid creating duplicate entries for an entity, which would require an involved process of manual unification later on.

Data will typically enter Wikicat during the process of citation creation in Wikipedia. When a source is first cited, its bibliographic record is pulled into Wikicat. Subsequent queries of the source, such as during article rendering, will pull the data from Wikicat directly, keeping the load on the external bibliographic servers reasonable. Since Wikicite will initially only support referencing sources by unique identifiers such as ISBN or LCCN, it will be simple to avoid creating duplicate records in Wikicat.
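The fetch-once behaviour described above can be sketched as a small cache keyed by the normalized identifier. Everything named here is hypothetical: the class, its methods, and the fetch callable, which stands in for whatever performs the actual Z39.50 request.

```python
from typing import Callable, Dict


class CitationStore:
    """Keep one record per identifier, fetching remotely only on first use."""

    def __init__(self, fetch: Callable[[str], dict]) -> None:
        self._fetch = fetch                  # stand-in for the remote lookup
        self._records: Dict[str, dict] = {}  # stand-in for the Wikicat dataset
        self.remote_calls = 0

    @staticmethod
    def normalize(identifier: str) -> str:
        """Drop hyphens and spaces so variant spellings share one key."""
        return identifier.replace("-", "").replace(" ", "").upper()

    def lookup(self, identifier: str) -> dict:
        key = self.normalize(identifier)
        if key not in self._records:
            # First citation of this source: pull the record in once.
            self.remote_calls += 1
            self._records[key] = self._fetch(key)
        return self._records[key]


def fake_fetch(key: str) -> dict:
    """Placeholder for a remote lookup; returns a dummy record."""
    return {"id": key}


store = CitationStore(fake_fetch)
first = store.lookup("0-306-40615-2")
again = store.lookup("0306406152")
```

Both lookups resolve to the same stored record and only one remote call is made, which is how keying on a unique identifier avoids both duplicate entries and repeated load on the external servers.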

Stage 1.0

Wikicat becomes a fully-editable Wikidata dataset. At this point abstract entities such as Expression and Work can be populated and the creative as well as physical relations between entities in the model correctly expressed. Though data import is now no longer the main mechanism of adding content to the model, it should still be used to, in effect, helpfully "pre-populate" certain data fields.

As a prerequisite of Wikicat 1.0, cataloging standards and a training regime should already be in place so that new editors' contributions are consistent with existing content. WikiProject Librarians would be ideal to take ownership of this regime and its standards.

Stage 1.25

Export functions are added to allow easy export to such formats as ONIX and, eventually, MARC. These software extensions can be applied to any MediaWiki installation and so may perhaps even be implemented by publishers. This would be especially beneficial if it helped small serial publishers add article-level descriptions to their ONIX for Serials SRN documents, and thus avoid the need for onerous cataloging work within Wikicat itself.

Stage 1.5

Enough data exists in the model to make the computation of certain metrics, such as authority within a literature, useful. This is done by offline programs or bots.

Stage 2.0

Multivalency is added to the Wikidata infrastructure, allowing real-time, controlled data integration between Wikicat and other resource databases (e.g. content websites, OPACs, etc.).

Milestones

Stage 0.5:

create initial datamodel
- status: done
submit model to WikiProject Librarians for functional feedback
- status: in progress...
submit project proposal to foundation list
- status: done[3]
solicit technical feedback from heads of the Wikidata and OmegaWiki projects, respectively, for the purposes of interoperability/integration
- status: in progress...
solicit technical feedback from the wikitech-l list for general and performance/scalability issues
- status: to be done...
implement datamodel
- status: in progress...
implement data import functions
- status: beta (see screenshot)

Appendix: Original Requirements

Need for Live Data

Card catalogs which will allow users to annotate the work, and to link to other works, which could include later editions, bibliography and textual apparatus. To make the card catalog live data, rather than dead data.
Supported either through the Wikicat datamodel (editions) itself or through WikiTextrose (bibliography/citation). Annotations in the form of free text should only be allowed as a last resort, however, since it is preferable to store such information in structured form.
Support a footnote system in wikimedia. To improve the ability to assess credibility and standards compliance of articles and their information.
To be supported through Wikicite extensions; Wikicat will act as a back-end for the storage and retrieval of bibliographic data.

Additional Functionality

Add journal articles, at least for the major journals; this is particularly important in the case of many fields in the sciences where the paper, rather than the book, is the basic means of information distribution.
Serials/journals supported by the datamodel.
Edition linking, so that editions of the same book could be compared.
Supported through the 4-tier entity model.
Bibliography project, to add the bibliographies of books themselves, so that searches can go down, and not just up, the chain.
Supported through WikiTextrose.