Wikicite/Wikicat Technical Design

From Meta, a Wikimedia project coordination wiki

The following is a technical design for the Wikicat bibliographical database. It will be implemented as a Wikidata dataset designed according to the functional specifications set out in IFLA's Functional Requirements for Bibliographic Records (FRBR) [1], its various ISBD standards, the Library of Congress's MARC 21 specification and its derivatives (particularly IFLA's UNIMARC), and especially the logical model of the Anglo-American Cataloguing Rules (AACR2) set out in The Logical Structure of the Anglo-American Cataloguing Rules.

Bibliographic vs. "Real" Entities

In The Logical Structure of the Anglo-American Cataloguing Rules, a distinction is made between purely bibliographic entities, which are "the abstract concepts used... as points of reference or as structuring devices for the [bibliographic] rules", and the "real world" entities (i.e., agents, processes, and objects) which the bibliographic record describes (p. 4).

This distinction is particularly important in a relational database setting, where the process of normalization leads to the sharing of entities across applications with very different areas of concern. For example, in the bibliographic model, the Person entity (typically referenced as the creator of some work) has minimal information associated with it, usually just a name and date of birth. "Normalization" of this entity in the bibliographic context occurs through the use of an authority-controlled version of its name, so that all works authored by "Mark Twain" can be accessed consistently, regardless of the actual form his name takes on a work.
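The name-based normalization described above can be sketched as a lookup from variant name forms to a single authorized heading. This is only an illustration: the `AUTHORITY_FILE` table, its entries, and the `authorized_heading` function are hypothetical and are not part of the Wikicat design.

```python
from typing import Optional

# Hypothetical authority file: each variant form of a name, lowercased and
# whitespace-normalized, maps to one authorized heading.
AUTHORITY_FILE = {
    "mark twain": "Twain, Mark, 1835-1910",
    "samuel langhorne clemens": "Twain, Mark, 1835-1910",
    "s. l. clemens": "Twain, Mark, 1835-1910",
}


def authorized_heading(name_on_work: str) -> Optional[str]:
    """Return the authorized heading for a name as it appears on a work.

    Returns None when the name has no entry in the authority file.
    """
    # Collapse runs of whitespace and ignore case before the lookup.
    key = " ".join(name_on_work.lower().split())
    return AUTHORITY_FILE.get(key)
```

Under this scheme two works are grouped together whenever their names resolve to the same heading, even though nothing beyond the name itself is modeled.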

In a Wikidata relational model, though, all bibliographic works having the same author must not only point to the same entity, but that entity must be shareable across multiple applications, each with its own functional concerns. Thus the Person entity referenced within Wikicat must ultimately support the complex set of relations and attributes that are of interest to other applications, and indeed accrue as much "real-world" detail as possible to eventually support such applications as Semantic MediaWiki and the semantic web [2].

Yet taking the time to properly design every entity that will either be shareable or ultimately go into some other Wikidata dataset is not feasible, especially without the input of domain experts. Nor is it clear that satisfactory models can even be supported by the traditional relational database paradigm. For example, a Corporate Body can be the creator or facilitator of a work just as easily as a Person. Yet how should this entity be modeled when it must represent modern corporations and research universities as well as the committees which produced the King James Bible? Even more perplexing is how to properly model the generative context of a Work, defined in FRBR as "the historical, social, intellectual, artistic, or other context within which the work was originally conceived (e.g. the 17th century restoration of the monarchy in England, the aesthetic movement of the late 19th century, etc.)" (p. 42).

Because of such conceptual and technical difficulties (not to mention the fact that Wikidata may not be fully functional at the time of Wikicat's deployment), a "hybrid" model is used in which purely nominal bibliographic data co-exists with "real-world" data. For example, rather than capturing authorship by instantiating a link between a Person entity and the artistic or intellectual content that person is responsible for, we instead record the purely bibliographic title statement of responsibility and store it in an attribute itself defined in purely bibliographic terms: "the statement of responsibility text as it appears on the item". Because this attribute is defined in a completely nominal way, the various burdens of modeling the propositional content of its text (e.g. data normalization) are avoided.

Nominality vs. Actuality

An issue raised in the above discussion is the nominality vs. actuality of bibliographic data with propositional content. Most bibliographic data come from the cataloged item itself and can obviously be in error or even intentionally dishonest, e.g. in cases of pseudepigraphy. Yet such data, because they appear on the item, are inherently significant and must be captured even if their propositional content is never modeled. For example, an exact transcription of the statement of responsibility can help identify a particular manifestation. Exact transcription also guards against later changes being retroactively ascribed to resources, such as new names for territorial entities (e.g., from "Russia" to "USSR" and back to "Russia" again) or changed names for persons (through marriage, etc.). Wikicat accommodates this through information nominality co-attributes, whose function is to signal whether the assertion in the modified "purely bibliographic" attribute/column is both nominal and actual (i.e. the assertion both appears on the item and is true), nominal only (on the item but not true), actual only (i.e. the data was supplied by the cataloger, who one hopes is trustworthy, and does not appear on the item), or mixed (i.e. data on the item that the cataloger has altered to make it true). A good illustration of the use of such attributes may be found on the Manifestation publication and fabrication-related entities.
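One way to picture a nominality co-attribute is as an enumerated flag stored next to the transcribed text. The sketch below is a minimal illustration, not the Wikicat schema: the names `Nominality` and `TranscribedAttribute` and both helper properties are assumptions introduced here.

```python
from dataclasses import dataclass
from enum import Enum


class Nominality(Enum):
    """The four cases described above (identifiers are hypothetical)."""
    NOMINAL_AND_ACTUAL = "nominal_and_actual"  # on the item and true
    NOMINAL_ONLY = "nominal_only"              # on the item but not true
    ACTUAL_ONLY = "actual_only"                # supplied by the cataloger
    MIXED = "mixed"                            # item data altered to be true


@dataclass(frozen=True)
class TranscribedAttribute:
    """A purely bibliographic text value paired with its co-attribute."""
    text: str
    nominality: Nominality

    @property
    def is_verbatim_transcription(self) -> bool:
        # Only these two cases store the text exactly as it appears on the item.
        return self.nominality in (
            Nominality.NOMINAL_AND_ACTUAL,
            Nominality.NOMINAL_ONLY,
        )

    @property
    def is_asserted_true(self) -> bool:
        # Every case except "nominal only" claims the stored text is true.
        return self.nominality is not Nominality.NOMINAL_ONLY


# A pseudepigraphic statement of responsibility (invented example text):
# transcribed exactly, but not asserted to be true.
statement = TranscribedAttribute("by A. Pseudonym", Nominality.NOMINAL_ONLY)
```

Keeping the flag beside the text lets the transcription stay exact while still recording what the cataloger believes about it.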

Terms and Definitions

Data Model

The Wikicat data model takes FRBR's 4-tier conceptual Work-Expression-Manifestation-Item structure and elaborates upon it using the "real-world", production- and distribution-focused entities of The Logical Structure of the Anglo-American Cataloguing Rules. Detailed descriptions of each of these 4 main entities are found in the design section of the main Wikicat project page. The design is engineered using a model-driven architecture approach.
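The four-tier structure can be illustrated with a chain of records in which each tier points to the one above it. This is a simplified sketch: the attribute names (`title`, `language`, `publisher`, `year`, `shelf_mark`) and all sample values are placeholders, and the actual Wikicat entities carry far more detail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Work:
    """The abstract intellectual or artistic creation."""
    title: str


@dataclass(frozen=True)
class Expression:
    """A realization of a work, e.g. a particular text or translation."""
    work: Work
    language: str


@dataclass(frozen=True)
class Manifestation:
    """A physical embodiment of an expression, e.g. one published edition."""
    expression: Expression
    publisher: str
    year: int


@dataclass(frozen=True)
class Item:
    """A single exemplar (copy) of a manifestation."""
    manifestation: Manifestation
    shelf_mark: str


def work_of(item: Item) -> Work:
    """Walk from a single copy up through the tiers to its work."""
    return item.manifestation.expression.work


# Placeholder data, for illustration only.
work = Work(title="Example Work")
expression = Expression(work=work, language="eng")
edition = Manifestation(expression=expression, publisher="Example Press", year=2000)
copy = Item(manifestation=edition, shelf_mark="EX 123")
```

Because each tier holds a reference to its parent, every copy of every edition of every translation resolves to the same Work record.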

Enumerated Values

Various attributes in the Wikicat model are typed according to controlled sets of values. These values will be stored in their own tables and seeded at deployment time. See Wikicat Technical Design/Enumerated Types for more details.
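As a rough sketch of what seeding controlled values at deployment time might look like, the following uses Python's built-in `sqlite3` module. The table name `nominality_type`, the two-column layout, and the `seed_enumerations` function are assumptions made for illustration; the real tables are described on the Enumerated Types page, and the target database need not be SQLite.

```python
import sqlite3

# Hypothetical seed data: table name -> controlled values. The values reuse
# the four nominality cases described earlier in this document.
SEED_VALUES = {
    "nominality_type": [
        "nominal_and_actual",
        "nominal_only",
        "actual_only",
        "mixed",
    ],
}


def seed_enumerations(conn: sqlite3.Connection, seeds=SEED_VALUES) -> None:
    """Create one lookup table per enumerated type and insert its values.

    Safe to run more than once: existing tables and rows are left alone.
    Table names come from the trusted constant above, never from user input.
    """
    for table, values in seeds.items():
        conn.execute(
            f'CREATE TABLE IF NOT EXISTS "{table}" ('
            "id INTEGER PRIMARY KEY, value TEXT NOT NULL UNIQUE)"
        )
        conn.executemany(
            f'INSERT OR IGNORE INTO "{table}" (value) VALUES (?)',
            [(value,) for value in values],
        )
    conn.commit()
```

Attributes typed by an enumeration would then reference the `id` column of the matching lookup table rather than storing free text.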

Common Entities

See Wikicat Technical Design/Common Entities for shared datamodel elements.

Series Entities

See Wikicat Technical Design/Series Entities for the series datamodel.

Work Entities

See Wikicat Technical Design/Work Entities for the work datamodel.

Expression Entities

See Wikicat Technical Design/Expression Entities for the expression datamodel.

Manifestation Entities

See Wikicat Technical Design/Manifestation Entities for the manifestation datamodel.

Item Entities

See Wikicat Technical Design/Item Entities for the item datamodel.


Compositional Entities

See Wikicat Technical Design/Resource Relationship Entities for the model composition datamodel.

Data Import

An important feature of Wikicat is the bibliographic record import function, whereby records are imported into Wikicat from open bibliographic catalogs, such as the Library of Congress's Voyager server, as well as from publishing industry data feeds. The import function will initially support the retrieval of MARC 21 records via the Z39.50 communication protocol for open catalogs, and ONIX for industry data feeds.
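To give a feel for what the import function has to decode, here is a minimal sketch of reading one binary MARC 21 (ISO 2709) record that has already been retrieved; the Z39.50 retrieval itself is not shown. The function names are hypothetical, error handling is omitted, and a real importer would more likely rely on an established MARC library.

```python
from typing import List, Optional, Tuple

FIELD_TERMINATOR = b"\x1e"
SUBFIELD_DELIMITER = b"\x1f"


def parse_marc_fields(record: bytes) -> List[Tuple[str, bytes]]:
    """Return (tag, raw field data) pairs from one MARC 21 record."""
    # Leader positions 12-16 hold the base address of the data area.
    base = int(record[12:17])
    # The directory follows the 24-byte leader and ends with a field terminator.
    directory = record[24:base - 1]
    fields = []
    # Each directory entry is 12 bytes: tag (3), length (4), start offset (5).
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12]
        tag = entry[0:3].decode("ascii")
        length = int(entry[3:7])
        start = int(entry[7:12])
        # The stored length includes the trailing field terminator; drop it.
        fields.append((tag, record[base + start:base + start + length - 1]))
    return fields


def parse_subfields(data: bytes) -> List[Tuple[str, str]]:
    """Split a data field (two indicators, then subfields) into (code, value)."""
    chunks = data[2:].split(SUBFIELD_DELIMITER)[1:]
    return [(chr(c[0]), c[1:].decode("utf-8")) for c in chunks if c]


def statement_of_responsibility(record: bytes) -> Optional[str]:
    """Return field 245 subfield $c exactly as transcribed, if present."""
    for tag, data in parse_marc_fields(record):
        if tag == "245":
            for code, value in parse_subfields(data):
                if code == "c":
                    return value
    return None
```

Field 245 subfield $c is where MARC 21 carries the statement of responsibility, so its text can be stored as transcribed without first resolving it to a Person entity.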

Connection Management

Connections to open bibliographic servers will be managed by YAZ Proxy, a stand-alone GPL-licensed server developed by Index Data. The purpose of YAZ Proxy will be to control access to the various targets queried by Wikicat and keep their loads at a reasonable level. During deployment of Wikicat, a Wikimedia representative should contact each target and negotiate an acceptable maximum load.

Mapping

See Wikicat Technical Design/MARC21 Mappings for mappings between MARC 21 and the Wikicat datamodel, and Wikicat Technical Design/ONIX Mappings for mappings between the Wikicat datamodel and the ONIX standard.

Technical Dependencies

See Wikicat Technical Dependencies.

See also

Appendix: Technical Specifications

MARC

UNIMARC

Anglo-American Cataloguing Rules (AACR2)

FRBR

Networking

Catalog Servers