From Meta, a Wikimedia project coordination wiki
This is a proposal for a new Wikimedia sister project.
Status of the proposal
Reason: Inactive proposal. --Sannita (talk) 18:09, 21 September 2013 (UTC)

Page created by Miguel Andrade in January 2006. This page is hosted on a Wikimedia wiki. Use the tabs above this page to discuss this project, edit this page, and see its version history.


MEDLINE is the major online database of scientific references, currently holding over 11 million references to articles in biomedical journals. The Entrez server maintained by the NCBI allows researchers to query the database and find manuscripts relevant to their interests. MEDLINE is also useful for data mining of biological information and for research on scientific trends (see, for example, this paper or Pubnet).


One problem that is hard to solve in this database is the identification of authors. Two difficulties arise: the same author may have published under different names, and different authors may share the same name (e.g., did Smith, J. really publish 2,060 papers?).

There are several reasons why identifying MEDLINE authors unambiguously would be useful.

  1. Data mining algorithms could use authorship as a measure of the relation between two publications.
  2. Creation of databases of manuscripts associated with particular researchers would become trivial. Academic institutions often need to compile such collections. See for example this abstract from a report by Scoville et al. (University of Missouri-Columbia).


The solution would be to assign a unique identifier to each author with at least one publication in MEDLINE and to use those identifiers to annotate the references in MEDLINE. However, this is an impossible task for a small group of annotators. The best annotators for this would be the authors themselves or their colleagues.

Why use a Wiki for this?

  • Many features of a Wiki are useful for this project: the mechanisms of user management, the discussion pages associated with users and entries, and the tracking of entry history.

Why a Wikimedia project?

  • I don't have the technical knowledge to set up a Wiki server, nor would I have the time even if I knew how. But I think I can support the start of the project with the initial MEDLINE analysis needed, and advertise and advocate for it in the biomedical scientific community.
  • I feel that this project should have continuity (exist for as long as MEDLINE exists) unless it were picked up by the NCBI. I think the support of the Wikimedia Foundation would be necessary for that.

Practical protocol (the real thing)

This is how I think the project could be implemented.

  • I would take a given version of MEDLINE, make a list of unique author names, assign a unique number to each one (author_id), and finally make a table with the author_ids corresponding to each MEDLINE reference. I have the knowledge and means to do this in a couple of days.
  • Each MEDLINE reference would be translated into a Wiki entry whose identifier would be the existing MEDLINE identifier (a number), and which would display perhaps just the list of authors of the reference, each one followed by its author_id.
  • Users of the resource would then be allowed to modify the author_ids attached to authors in any reference. Entries could not be modified in any other way, deleted, or created. Users would be allowed only to change author_ids to their own unique id, thereby claiming authorship.
  • Note that users could claim authorship on behalf of another researcher. I hope colleagues of deceased researchers would do this.
  • Unclaimed authorships would retain their original numerical author_id. In any case, even a partial annotation of the authorships in MEDLINE would be better than the current situation.
  • New updates of the MEDLINE database would require only the inclusion of the new entries. The new entries would receive numerical author_ids according to the original algorithm.
  • Authors/users could watch their references and periodically log in to the system to claim newer authorships.
  • The system could include the possibility of showing all references by a given author.
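The first steps of this protocol (numbering unique author names, building the table, and claiming an authorship) could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names `assign_author_ids` and `claim_authorship`, the in-memory dictionaries, and the second reference with its PMID 99999999 are all made up for the example, and pairing the three author names with PMID 16413578 is assumed for the sake of the example.

```python
from itertools import count


def assign_author_ids(references):
    """Give every distinct author name a numeric author_id.

    `references` maps a PMID to its list of author names.
    Returns (name_to_id, table), where `table` maps each PMID to a
    list of [name, author_id] pairs in author order.
    """
    next_id = count(1)
    name_to_id = {}
    table = {}
    for pmid, names in references.items():
        row = []
        for name in names:
            if name not in name_to_id:
                name_to_id[name] = next(next_id)
            row.append([name, name_to_id[name]])
        table[pmid] = row
    return name_to_id, table


def claim_authorship(table, pmid, position, user_id):
    """Replace the author_id at one author position of one reference.

    This is the only kind of modification the protocol allows.
    """
    table[pmid][position][1] = user_id


# Illustrative data only; the second reference is invented.
references = {
    "16413578": ["Xia Y", "Lu LJ", "Gerstein M"],
    "99999999": ["Gerstein M", "Xia Y"],
}
name_to_id, table = assign_author_ids(references)
claim_authorship(table, "16413578", 0, "Xia_Yu_5")

print(table["16413578"])  # [['Xia Y', 'Xia_Yu_5'], ['Lu LJ', 2], ['Gerstein M', 3]]
print(table["99999999"])  # [['Gerstein M', 3], ['Xia Y', 1]]
```

Note that a claim affects a single reference: the "Xia Y" in the second, unclaimed reference keeps its numerical author_id until someone claims it.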

Example entry

This is a made-up example based on a real MEDLINE reference with PMID 16413578.

The address for this entry in WikiAuthors would be http://en.wikiauthors.org/wiki/16413578

In this example, user Xia_Yu_5 would have claimed authorship. No other action would be required or needed from him. The other author identifiers are numbers because authorship would not have been claimed yet for them.

PMID 16413578

Technical details

MEDLINE is distributed in XML format and in flat file format.

In the flat file format, author names in MEDLINE entries are in a field named AU (one line per author). For example, in this entry:

 AU  - Xia Y
 AU  - Lu LJ
 AU  - Gerstein M

Some MEDLINE entries may also contain a field named FAU (one line per author) that gives more detailed author names. For example, in this entry:

 FAU - Ramani, Arun K
 AU  - Ramani AK
 FAU - Bunescu, Razvan C
 AU  - Bunescu RC
 FAU - Mooney, Raymond J
 AU  - Mooney RJ
 FAU - Marcotte, Edward M
 AU  - Marcotte EM

We would use the detailed author names (FAU) when present, and fall back to the AU names otherwise.
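A minimal sketch of that rule for the flat file format, assuming (as in the example above) that each FAU line immediately precedes the AU line of the same author. The function name `parse_authors` is hypothetical, and continuation lines and all other MEDLINE fields are ignored here.

```python
def parse_authors(record_text):
    """Return one name per author, using FAU when it is present."""
    authors = []
    pending_fau = None
    for line in record_text.splitlines():
        # Split on the first hyphen only, so hyphenated names stay intact.
        tag, sep, value = line.partition("-")
        if not sep:
            continue  # not a "TAG - value" line
        tag, value = tag.strip(), value.strip()
        if tag == "FAU":
            pending_fau = value
        elif tag == "AU":
            authors.append(pending_fau if pending_fau is not None else value)
            pending_fau = None
    return authors


short_entry = """\
 AU  - Xia Y
 AU  - Lu LJ
 AU  - Gerstein M
"""

detailed_entry = """\
 FAU - Ramani, Arun K
 AU  - Ramani AK
 FAU - Bunescu, Razvan C
 AU  - Bunescu RC
"""

print(parse_authors(short_entry))     # ['Xia Y', 'Lu LJ', 'Gerstein M']
print(parse_authors(detailed_entry))  # ['Ramani, Arun K', 'Bunescu, Razvan C']
```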

Doing a bit of the work automatically

The problem of disambiguating author names automatically has already been approached or thought about. There are many possibilities, based on authors' addresses, emails, subjects, co-authorship, etc.

Of those properties, the one I find most appealing is co-authorship, because it is itself a property of authorship. In short, the idea is that if two papers are both co-authored by names A and B, it is likely that A is the same person in both papers, and likewise B. Of course, this is not necessarily true.

This idea could be used as follows: given the set of papers associated with author name A, we could pre-partition it into subsets such that no paper in one subset shares any other author name with a paper in another subset.

For example, five manuscripts co-authored by "A B C", "A D", "A E", "A F B", "A F" would be separated into set 1: "A B C", "A F B", "A F"; set 2: "A D"; set 3: "A E".
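One way to compute such a pre-partition is to treat the papers as nodes, link two papers whenever they share a co-author name other than A, and take the connected components. The sketch below does this with a small union-find structure. That choice and the function name `partition_by_coauthors` are assumptions made for illustration, not part of the proposal.

```python
def partition_by_coauthors(papers, target):
    """Group the papers of `target` by shared co-author names.

    Two papers end up in the same group when they are linked, directly
    or through a chain of papers, by a co-author name other than `target`.
    """
    parent = list(range(len(papers)))

    def find(i):
        # Follow parent links to the group root, shortening the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_paper_with = {}
    for i, authors in enumerate(papers):
        for name in authors:
            if name == target:
                continue
            if name in first_paper_with:
                # Merge this paper's group with the earlier paper's group.
                parent[find(i)] = find(first_paper_with[name])
            else:
                first_paper_with[name] = i

    groups = {}
    for i, authors in enumerate(papers):
        groups.setdefault(find(i), []).append(authors)
    return list(groups.values())


papers = [["A", "B", "C"], ["A", "D"], ["A", "E"], ["A", "F", "B"], ["A", "F"]]
for group in partition_by_coauthors(papers, "A"):
    print(group)
# [['A', 'B', 'C'], ['A', 'F', 'B'], ['A', 'F']]
# [['A', 'D']]
# [['A', 'E']]
```

The result reproduces the three sets of the example. Such groups would only be a starting point: papers in different groups may still belong to the same person, and papers in the same group are only likely, not certain, to share an author.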



I have discussed my ideas with Jane Rosov from NCBI, who is responsible for the distribution of MEDLINE. Although she acknowledged that they had considered it, she indicated that it is not currently on their list of priorities.

Finally, a bulletin from the NLM (posted on 1 November 2010) describes a PubMed Author ID Project.


I see some problems with my proposal.

  • Possible lack of usage by the people involved, that is, the authors or their representatives.
  • If there is no effort by other resources such as NCBI to pick up the identifiers, I think that the project would eventually die.
  • Interest limited to users of MEDLINE.
  • Maybe using a Wiki for this is overdoing it and there are other, more appropriate collaborative tools. Suggestions welcome.


  • This scheme could be extended to other databases of authors or names. Maybe you have ideas for this. I am only aware of the problem in MEDLINE because that is my area of expertise.
  • A dump of the database of relations (that is, the identifiers of MEDLINE references and the identifiers of the authors involved) should be easy to generate and useful for standalone applications. That would be a public resource.
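As a rough illustration of how simple such a dump could be, the sketch below writes one tab-separated line per authorship. The column names, the tab-separated layout, and the function name `dump_relations` are assumptions, not a format defined by the proposal.

```python
import csv
import io


def dump_relations(table, out):
    """Write one (pmid, position, author_id) row per authorship.

    `table` maps each PMID to its list of author_ids in author order.
    """
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["pmid", "position", "author_id"])
    for pmid in sorted(table):
        for position, author_id in enumerate(table[pmid], start=1):
            writer.writerow([pmid, position, author_id])


# Illustrative relations: one claimed authorship, two unclaimed ones.
relations = {"16413578": ["Xia_Yu_5", 2, 3]}
buffer = io.StringIO()
dump_relations(relations, buffer)
print(buffer.getvalue(), end="")
```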

Researchers supporting this proposal

Mark Gerstein (Yale University)

Andrey Rzhetsky (Columbia University)

Jonathan Wren (Oklahoma University)

Peer Bork (European Molecular Biology Laboratory - Heidelberg)

Barend Mons (Erasmus University Rotterdam)

Erik van Mulligen (Knewco, Inc) and (Erasmus Medical Center)

Burkhard Rost (Columbia University)

Marc Weeber (Knewco, Inc)

Related projects

Author disambiguation

External links