Research talk:Understanding the context of citations in Wikipedia

From Meta, a Wikimedia project coordination wiki

Notes from meeting today

Hey folks, I just wanted to capture some notes from the meeting yesterday. I got in this morning and noticed I hadn't posted this.

citation: <ref>""</ref>
source:  <ref> --> (source)
article: "....." ~ concept

(source, article)

(citation, article) -- {{cite}}

--Halfak (WMF) (talk) 15:13, 18 April 2017 (UTC)

Operationalizing a "reference"

A bare reference
has no name and is identified by the content between the <ref> tags
<ref>...</ref>
A named reference
has a "name" attribute and is identified by that name. Name changes will need to be tracked between revisions. Named references can be re-used multiple times in the article.
<ref name="...">...</ref> or <ref name="..." />
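A minimal sketch of how the bare/named distinction above could be operationalized. It assumes plain regular expressions are good enough for a first pass (a real wikitext parser would be more robust), and the names `REF_RE`, `NAME_RE`, and `extract_refs` are hypothetical, not part of any existing tool.

```python
import re

# Matches <ref ...>content</ref> as well as the self-closing <ref ... /> form.
# The \b keeps this from matching <references />.
REF_RE = re.compile(
    r'<ref\b(?P<attrs>[^>]*?)(?:/>|>(?P<content>.*?)</ref\s*>)',
    re.DOTALL | re.IGNORECASE,
)
# Pulls the value of a name attribute (double-quoted, single-quoted, or unquoted).
NAME_RE = re.compile(
    r"""\bname\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))""",
    re.IGNORECASE,
)


def _ref_name(attrs):
    """Return the value of the name attribute, or None if there is none."""
    m = NAME_RE.search(attrs)
    if not m:
        return None
    return next(g for g in m.groups() if g is not None)


def extract_refs(wikitext):
    """Group <ref> tags by identity.

    Named references are keyed by their name; bare references are keyed
    by the content between the tags.
    """
    refs = {}
    for m in REF_RE.finditer(wikitext):
        name = _ref_name(m.group('attrs'))
        content = m.group('content')  # None for the self-closing form
        if name:
            key = ('named', name)
        else:
            key = ('bare', (content or '').strip())
        entry = refs.setdefault(
            key, {'name': name or None, 'content': None, 'offsets': []})
        # The primary tag of a named reference carries the content.
        if content and entry['content'] is None:
            entry['content'] = content.strip()
        entry['offsets'].append(m.start())
    return refs
```

For example, calling `extract_refs` on text that contains `<ref name="jones">Jones 2012</ref>` followed later by `<ref name="jones" />` yields a single named entry with two offsets. Tracking name changes between revisions is not handled here.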
A reference has the following fields
  • content -- appears between the <ref></ref> tags (in the primary tags if named)
    • raw_content
    • templated -- boolean, whether the entirety of the content is a template call
    • cite_template -- citation templates are usually named something like {{cite book|...}}. This field will contain the type of the citation template, if present.
    • urls -- all the urls that appear in the content
      • domain --
    • identifiers -- all of the structured identifiers contained in the content
  • occurrences -- the locations in the text that the <ref> tags appear
    • section -- the section # as defined by MediaWiki (starts at 0, split by headers)
    • text_offset -- the # of chars between the beginning of the article and the beginning of the <ref> tag
    • preceding_text -- the immediate 250 chars before the beginning of the <ref> tag.
    • header_level -- the level of the header immediately above the <ref> tag (0 for lead)
    • header_text -- the text of the closest header
    • level_2_text -- the text of the nearest level 2 header (see Research:Investigate_frequency_of_section_titles_in_5_large_Wikipedias)
    • level_2_offset -- the # of chars between the nearest level 2 header and the beginning of the <ref> tag
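As a rough illustration of the occurrence fields listed above, here is a sketch that computes them from raw wikitext. It makes several assumptions that are not settled on this page: headers are detected with a simple `== ... ==` regex, the section number is the count of headers above the tag, and `level_2_offset` is measured from the start of the level 2 header line. The function name `ref_occurrences` is hypothetical.

```python
import re

# A header line such as "== History ==" or "=== Early years ===".
HEADER_RE = re.compile(r'^(={1,6})[ \t]*(.+?)[ \t]*\1[ \t]*$', re.MULTILINE)
# Any opening (or self-closing) <ref> tag; closing </ref> tags do not match.
REF_OPEN_RE = re.compile(r'<ref\b[^>]*>', re.IGNORECASE)


def ref_occurrences(wikitext, context_chars=250):
    """Return one dict of location fields per <ref> tag, in document order."""
    # (start offset, level, text) for every header in the page.
    headers = [(m.start(), len(m.group(1)), m.group(2))
               for m in HEADER_RE.finditer(wikitext)]
    occurrences = []
    for m in REF_OPEN_RE.finditer(wikitext):
        offset = m.start()
        above = [h for h in headers if h[0] < offset]
        level_2 = [h for h in above if h[1] == 2]
        occurrences.append({
            # Section 0 is the lead; each header starts a new section.
            'section': len(above),
            'text_offset': offset,
            'preceding_text': wikitext[max(0, offset - context_chars):offset],
            'header_level': above[-1][1] if above else 0,
            'header_text': above[-1][2] if above else None,
            'level_2_text': level_2[-1][2] if level_2 else None,
            # Assumption: measured from the start of the header line.
            'level_2_offset': offset - level_2[-1][0] if level_2 else None,
        })
    return occurrences
```

Note that `preceding_text` here is raw wikitext, so it may include markup from earlier `<ref>` tags or templates; stripping that markup would be a separate step.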

From this, I think we can get a lot of signal about the reference. --Halfak (WMF) (talk) 22:42, 24 April 2017 (UTC)


A source has the following fields
  • type (categorical) -- books, journal articles, conference proceedings, magazines, mainstream newspapers, etc.
  • level (categorical) -- primary, secondary, and tertiary
  • style (categorical) -- full, inline, short, in-text, and general reference -- this property should be bound to reference
  • quality
    • verifiability (see Wikipedia Verification Check: A Chrome Browser Extension )
      • technical verifiability (boolean) -- the trustworthiness of a source's identifier (e.g. ISBN, DOI)
      • practical verifiability (boolean) -- whether a source is openly accessible. This may raise an interesting tension between open-access sources and those that require payment, in terms of the value each adds to the body of sources in Wikipedia.
    • persistence (numerical) -- how long a source has persisted, based on the history of reference revisions
      • revision history -- timestamp, article title, article ID, user name, user IP, added edits, and removed edits
  • relevance
    • content similarity
      • raw distance (numerical) -- a source's textual proximity to titles, headers, specific sections, preceding text, etc.
      • semantic distance (numerical) -- semantic proximity as measured with WordNet
    • topological similarity (numerical) -- the neighbor distance between sources, shortest paths, sigma of network diameter, etc.
  • importance
    • cumulative measures (numeric) -- cited frequency, citation burst, etc.
    • topological measures (numeric) -- PageRank, HITS, betweenness centrality, sigma of modularity, network diameter, clustering coefficient, etc.
    • possible measures: TBD
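One possible reading of the persistence feature above, sketched under stated assumptions: each revision is reduced to a timestamp plus the set of reference keys present in it, and a reference counts as "present" from a revision's timestamp until the next revision. The input shape and the name `persistence_seconds` are hypothetical; the list does not yet pin down how persistence should be measured.

```python
from datetime import datetime


def persistence_seconds(revisions, as_of):
    """Total time each reference was present in the article.

    revisions: list of (timestamp, set_of_ref_keys), sorted oldest first.
    as_of: datetime used as the end of the most recent revision.
    """
    totals = {}
    for i, (timestamp, ref_keys) in enumerate(revisions):
        # A revision stays live until the next one is saved.
        end = revisions[i + 1][0] if i + 1 < len(revisions) else as_of
        live_for = (end - timestamp).total_seconds()
        for key in ref_keys:
            totals[key] = totals.get(key, 0.0) + live_for
    return totals


history = [
    (datetime(2017, 1, 1), {'a'}),
    (datetime(2017, 1, 2), {'a', 'b'}),
    (datetime(2017, 1, 4), {'b'}),
]
totals = persistence_seconds(history, as_of=datetime(2017, 1, 7))
```

This sums all the time a reference was present, so a reference that is removed and later restored accumulates both stretches. Measuring only the longest continuous stretch would be an equally defensible choice.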

We spelled out the possible feature set, based on the conceptual separation between a reference and a source. --Scienceinpython (talk) 16:40, 25 April 2017 (EDT)

Adding field for "cited text"?

Preceding / Following text hint at what text a citation refers to, but we'll eventually have this identified precisely (at worst, by hand). It would be nice to have a field for the precise string the cite is associated with [often the preceding word, clause, sentence, or para]. SJ talk  14:12, 25 May 2017 (UTC)

Welcome

Hi Andicat and Scienceinpython. I just dropped by to say welcome and thanks for kicking off this research. Looking forward to learning from your research in the coming months! :) --LZia (WMF) (talk) 20:12, 1 May 2017 (UTC)

thanks! looking forward to learning from it too :) --Andicat (talk)

What about direct links?

So, if I stick a URL into an article and I think that's supporting a statement [1], we won't capture that now, but in the example I just used here, it could easily be linked to a full set of metadata. I don't imagine that will happen much these days given the tools we have for reference insertion, but especially looking back in time, should we incorporate formats other than ref tags to capture relevant text? --Andicat (talk) 20:06, 2 May 2017 (UTC)

Decided in the 5/8/17 meeting: no. The most interesting data will come from the years after citation syntax and standards stabilized. --Andicat (talk) 15:49, 9 May 2017 (UTC)
On top of what Andicat (talk) spelled out above, many Wikipedia articles use other Wikipedia articles as sources to support their arguments whereas Verifiability says, "Do not use articles from Wikipedia as sources." Should we consider them as supporting sources in our study? --Scienceinpython (talk) 14:55, 8 May 2017 (UTC)

Potential relevant paper

Hi, during a discussion of this project, I was reminded of a paper reviewed in the Research Newsletter that might be relevant. Here's a link to the review: Research:Newsletter/2016/February#Test of 300k citations: how verifiable is "verifiable" in practice? Cheers, Nettrom (talk) 16:44, 12 May 2017 (UTC)

Are the preceding 250 characters enough?

The point was raised in discussion at WikiCite that sometimes people cite a source to support a sentence and then continue with information from that same source in the following sentence. So for example, "Author established a thing in 2016 (Author, 2016). Furthermore, she found that the thing had important characteristic Y." Perhaps check a sample of citations and see if/how often this kind of "pre-referencing" happens in WP? Maybe not common enough to worry about. --Andicat (talk) 08:55, 23 May 2017 (UTC)
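One rough way to run the check suggested above on a sample: flag every sentence that carries no citation of its own but directly follows a cited sentence. This is only a heuristic sketch. The naive sentence splitter breaks on abbreviations such as "e.g.", a flagged sentence is a candidate rather than a confirmed case, and the name `uncited_followups` is hypothetical.

```python
import re

# A whole reference: <ref ...>...</ref> or the self-closing <ref ... />.
REF_RE = re.compile(r'<ref\b[^>]*?(?:/>|>.*?</ref\s*>)',
                    re.DOTALL | re.IGNORECASE)
MARK = '\x00'  # placeholder that cannot occur in normal article text


def uncited_followups(paragraph):
    """Return sentences with no <ref> that directly follow a cited sentence."""
    # Swap each reference for a marker so punctuation inside the
    # reference does not confuse the sentence splitter.
    marked = REF_RE.sub(MARK, paragraph)
    # Naive split: text up to end punctuation, plus any trailing markers.
    sentences = re.findall(r'[^.!?]+[.!?]+' + MARK + '*', marked)
    followups = []
    for previous, current in zip(sentences, sentences[1:]):
        if MARK in previous and MARK not in current:
            followups.append(current.strip())
    return followups
```

Dividing the number of flagged sentences by the number of cited sentences in the sample would give a first estimate of how common the pattern is, which could then be verified by hand.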

Citation Span Distribution

The plot shows the citation span for a random sample of 500 Wikipedia citations.

Bfetahu (talk) 11:33, 25 May 2017 (UTC)