Grants:IdeaLab/All the Dates in the English Wikipedia

From Meta, a Wikimedia project coordination wiki
All the Dates in the English Wikipedia
We have collected half a gigabyte of compressed data containing all the dates in the English Wikipedia, associated with their article titles and paragraphs.
idea creator: Jroehl
this project needs: volunteer, developer, designer, project manager, community organizer, advisor, researcher
created on 15:15, 20 February 2017 (UTC)

Project idea

What is the problem you're trying to solve?

We would like to better understand how history has been recorded over time.

What is your solution?

We think it would be interesting to put together a master timeline of all the history in the English Wikipedia. We have extracted 49 million dates from the English Wikipedia. Compressed, the data is half a gigabyte; it includes the titles of most articles, broken down by paragraphs, sentences and dates (with date ranges). We would like to give this data to anyone who requests it.

Our preliminary data is here (the file is half a gigabyte compressed, something to consider before you download it):

https://drive.google.com/file/d/0BwW3GI4uVWLjclJiWWpRdEZJVU0/view?usp=sharing

OR

https://www.dropbox.com/s/4vsaoidwlpaq8oi/allwikipedia.rar?dl=0

We have 3 tables that are all related to each other.

Titles.CSV - 176 megabytes uncompressed


The Titles table is related to the Para (paragraphs) table by the ID field, so [Para.id = Titles.id]. The [countfound] field is the number of times this title was linked to from another Wikipedia article. The [dates] field is the number of dates we found in the Wikipedia article with this title. The structure of this table is as follows:


Create table titles (id c(9), title c(100), countfound n(10), dates n(5))


Para.CSV – 742 megabytes uncompressed


These are all of the paragraphs we found. The ID field corresponds to the Titles table, as in [Para.id = Titles.id]. The [Para.Para] field is a unique identifier used to find the paragraph's sentences in the [Sen] table, so [Para.Para = Sen.para]. The [Order] field is the position of the paragraph in the article, where 3 would indicate the third paragraph found in the article with this title. Some paragraphs are not present (there may be no fourth paragraph, but there may be a fifth) because we did not add records for paragraphs in which we found no dates. The [Dates] field indicates the number of dates found in this paragraph. The structure of the [Para] table is as follows:


Create table para (id c(9), para c(9), order n(5), dates n(5))


Sen.CSV – 1.6 gigabytes uncompressed, 48 million records


The [Sen] (sentence) table contains the dates. The [Sen] table is related to the [Para] table on the field [Para], so [Para.Para = Sen.Para]. The field [Startd] is the initial date found in the sentence. To save space, it is stored as year, month, day, or YYYYMMDD. The [Endd] field is only populated if we found what we would consider a date range within the sentence, which is a rather complex topic. Only 1 in 10 or so records in the [Sen] table will have a populated [Endd] field. It has the same format as the [Startd] field. The field [First] is populated with a [1] if this is the first sentence in the corresponding paragraph. The field [Born] is populated with a [1] if the word [born] was found in the sentence. The structure of the [Sen] table is as follows:


Create table sen (id c(9), para c(9), startd c(8), endd c(8), first n(1), born n(1))
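As an illustration of how the three tables relate, here is a minimal Python sketch that loads a few made-up rows into SQLite and follows the joins described above. The SQLite column types, the sample rows, and the helper names `parse_yyyymmdd` and `dates_for_title` are assumptions for illustration only; the source does not say whether the CSV files have header rows or what the [Sen.id] field refers to, so placeholder values are used there.

```python
import sqlite3
from datetime import date

# Assumed SQLite equivalents of the table definitions above:
# c(n) -> TEXT, n(n) -> INTEGER. "order" is quoted because it is an SQL keyword.
SCHEMA = """
CREATE TABLE titles (id TEXT, title TEXT, countfound INTEGER, dates INTEGER);
CREATE TABLE para   (id TEXT, para TEXT, "order" INTEGER, dates INTEGER);
CREATE TABLE sen    (id TEXT, para TEXT, startd TEXT, endd TEXT,
                     first INTEGER, born INTEGER);
"""


def parse_yyyymmdd(value):
    """Convert a YYYYMMDD string to a date, or None if blank or invalid."""
    value = (value or "").strip()
    if len(value) != 8 or not value.isdigit():
        return None
    try:
        return date(int(value[:4]), int(value[4:6]), int(value[6:]))
    except ValueError:
        return None


def dates_for_title(conn, title):
    """Return (paragraph order, start, end) for each dated sentence of an article."""
    rows = conn.execute(
        """
        SELECT p."order", s.startd, s.endd
        FROM titles t
        JOIN para p ON p.id = t.id        -- Para.id = Titles.id
        JOIN sen  s ON s.para = p.para    -- Para.Para = Sen.Para
        WHERE t.title = ?
        ORDER BY p."order", s.startd
        """,
        (title,),
    ).fetchall()
    return [(order, parse_yyyymmdd(a), parse_yyyymmdd(b)) for order, a, b in rows]


# Made-up rows, only to show the shape of the data.
conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO titles VALUES ('T1', 'Example article', 12, 2)")
conn.execute("INSERT INTO para VALUES ('T1', 'P1', 3, 2)")
conn.executemany(
    "INSERT INTO sen VALUES (?, ?, ?, ?, ?, ?)",
    [("S1", "P1", "19881127", "", 1, 0),          # single date, no range
     ("S2", "P1", "19651015", "19700915", 0, 0)], # date range
)
result = dates_for_title(conn, "Example article")
```

With the real files, the same queries would apply after importing each CSV into its table; an index on `para.id` and `sen.para` would probably be worth adding given the 48 million sentence records.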



Methodology

How did we collect these dates?

The dates were collected using a rather complex set of algorithms and date template tables. The titles table only includes Wikipedia articles that are linked to by other Wikipedia articles. Thus, we do not spend time acquiring and scanning short “stubs” or “widowed” articles, which typically have no dates. With this methodology you can start with just about any random article and recursively gather the links from every article scanned. We depend on the fact that Wikipedia uses plain paragraph tags. Wikipedia also typically enforces a set of rules for date formats. So, we pull the paragraphs from the articles, separate the paragraphs into sentences (which is a lot more complex than it sounds), then scan the sentences for dates and date pattern indicators. Date ranges are deduced by finding more than one date in a sentence and identifying some sort of linguistic correlation between the dates, such as [from, to] or [born, died] and so forth.
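The pipeline described above (paragraphs, then sentences, then dates) can be sketched roughly as follows. This is a simplified Python illustration, not the algorithm actually used: the sentence splitter is deliberately naive, only the [Month D, YYYY] format is recognized, and the names `ParagraphExtractor` and `scan_article` are hypothetical.

```python
import re
from datetime import date
from html.parser import HTMLParser

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
MONTH_ALT = "|".join(MONTHS)
# Matches dates such as "May 26, 1984".
FULL_DATE = re.compile(rf"\b({MONTH_ALT})\s+(\d{{1,2}}),\s+(\d{{4}})\b")
# Naive split: sentence punctuation, whitespace, then a capital letter.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


class ParagraphExtractor(HTMLParser):
    """Collect the text of each <p>...</p> element, in document order."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            self._in_p = False
            self.paragraphs.append(" ".join("".join(self._buffer).split()))

    def handle_data(self, data):
        if self._in_p:
            self._buffer.append(data)


def scan_article(html):
    """Return (paragraph order, first-sentence flag, YYYYMMDD) for each date found."""
    parser = ParagraphExtractor()
    parser.feed(html)
    rows = []
    for order, paragraph in enumerate(parser.paragraphs, start=1):
        for index, sentence in enumerate(SENTENCE_END.split(paragraph)):
            for month, day, year in FULL_DATE.findall(sentence):
                try:
                    found = date(int(year), MONTHS.index(month) + 1, int(day))
                except ValueError:
                    continue  # impossible date such as "February 31, 1990"
                stamp = f"{found.year:04d}{found.month:02d}{found.day:02d}"
                rows.append((order, 1 if index == 0 else 0, stamp))
    return rows


SAMPLE_HTML = (
    "<p>Intro text. He was born on <a href='#'>May 26, 1984</a>.</p>"
    "<p>No dates here.</p>"
    "<p>She took office on November 27, 1988.</p>"
)
rows = scan_article(SAMPLE_HTML)
```

As in the Para table, the second paragraph produces no rows because it contains no dates, so the paragraph order jumps from 1 to 3.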


Of course, all of this is imperfect, but we did the best we could with the time we had. We believe we achieved an accuracy rate of at least 96% to 97%. This is really hard to gauge: there is so much data here that it is impossible for one human to verify the majority of it. That would entail reading a LOT of Wikipedia.




[Figure: a count of all the dates in the English Wikipedia between 1000 AD and today]

Project goals

A master timeline of all of Wikipedia would allow anyone to understand history better. For example, it is interesting to note that Leonardo da Vinci was 3 years old when Gutenberg printed his first book. How would Leonardo's life have been different because of this? Connections like this would be much easier to discover if they were displayed graphically on a timeline.

Who will you be doing outreach with?

Anyone interested in history.

Get Involved

About the idea creator

I am a database analyst who specializes in dates.


Participants

Endorsements

  • When pondering a date I frequently wish I could browse forward and backward from that date to understand the context better. I also wish I could filter events by their type, and degrees of significance. For example, to ask what new technology or most recent war would have been familiar to some person at a certain time. 104.5.72.176 03:33, 21 February 2017 (UTC)

All good points. Let us work on your suggestions! Jroehl (talk) 20:27, 23 February 2017 (UTC)

  • Support - This is an awesome proposal! I'm the co-founder of Histropedia, which started with a similar aim. The path we're taking is more focused on creating a directory of separate timelines on all topics in history, using Wikidata as the primary source of dates, with Wikipedia categories, Wikidata queries, and community editing used to define the contents of each timeline. But this is missing a wealth of data from within Wikipedia articles that will not be available in Wikidata for a long while. A couple of points/questions:
  1. If you're not already aware of it, you should check out histography.io - it might be worth contacting the developer as the approach is similar in terms of extracting dates from within articles.
  2. Have you extracted any data about the precision of the dates? e.g. whether they are precise to a day, month, year, decade etc?
  3. Have you thought about using data from Wikidata to enrich or error check the data? Going the other way, this could be a great opportunity to output lists of suggested dates that could be added to Wikidata!
NavinoEvans (talk) 22:09, 27 February 2017 (UTC)

NavinoEvans, Well, right now I have a demonstration website at:

http://184.72.231.130/

That has the ability to create a timeline out of any Wikipedia article. You put in the URL for any Wikipedia article, it pulls the article, parses out all of the dates and creates a timeline. You can place multiple articles on any one timeline to compare them with any other articles contemporaneously.

The problem with it is, as always, I don’t know who to show it to. I don’t know anybody who would find this interesting. I have a website that is of no interest to anybody. Lol

But I do like http://histropedia.com/ and would like to talk to you about this.

And for your questions:

1) Yes, I have seen https://histography.io/ and I think it is very cool. And he can use my 40 million dates from Wikipedia, if he likes.

2) All the dates I have extracted are precise dates. Are there any other kinds?

3) I have no idea what Wikidata is. So I must educate myself.

Jroehl (talk) 19:10, 28 February 2017 (UTC)

Excellent, that's already looking super useful for comparison of topics! I'll be sure to show it around to anyone who will be interested. Happy to connect via email if you want to discuss Histropedia or Wikidata :) navino at histropedia dot com.
Regarding the date precision, I just meant that lots of dates are not known precise to a single day - e.g. a date given as "1957" (precise to a year), or 16th century (precise to a century). Was just curious whether you'd attempted to extract any of that info along with the full dates you've found. NavinoEvans (talk) 14:02, 1 March 2017 (UTC)


NavinoEvans,

Regarding date precision: we tried to be as precise as possible, and you understand the issues involved. Most dates specifying a precise day are in the form [May 26, 1984], in keeping with the Wikipedia Manual of Style. These are unambiguous as to precision, and there are millions of them. When our system encounters dates in this format, it can pick them out and store them in milliseconds. Of course, we had to take other date expressions into account as well. When we encounter a date given as a single year (we know these are dates because the 4 digits have to be preceded by a known keyword like “in” or “the”), we assign a random time of the year, typically sometime in the summer of that year. This is the only thing we can do, as the writer did not intend to be more specific. The same protocol is used when an article cites a single month. But for this we had a little trick: we would keep something I called a “widowed year” as we navigated the article. An example of this would be:

“He took office on November 27, 1988. But resigned the following February.”

As we navigate through this, we make “1988” our “widowed year”, because [November 27, 1988] is a precise date and we are sure of the writer’s intention as to the year. Then, when we encounter a single month that has no year, we can assume that, most likely, this month is in 1988, since implying any other year would make no sense. So we save this date with a random day of the month, around the 15th or so, because we cannot infer any more accuracy in these situations.
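A rough Python sketch of the “widowed year” idea, under stated assumptions: the function name `extract_dates` is hypothetical, fixed placeholders (July 1 for a year-only date, the 15th for a month-only date) stand in for the random values described above, and only full dates update the widowed year. The year is carried forward unchanged, as described; a phrase such as “the following February” may in fact refer to the next calendar year, which this sketch does not attempt to resolve.

```python
import re

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
MONTH_ALT = "|".join(MONTHS)

# One pass over the text; alternatives are tried in this order at each position.
TOKEN = re.compile(
    rf"\b(?P<fm>{MONTH_ALT})\s+(?P<fd>\d{{1,2}}),\s+(?P<fy>\d{{4}})\b"  # May 26, 1984
    rf"|\b(?:[Ii]n|[Tt]he)\s+(?P<y>\d{{4}})\b"                          # in 1957
    rf"|\b(?P<m>{MONTH_ALT})\b"                                         # February
)


def extract_dates(text):
    """Return YYYYMMDD strings, resolving month-only mentions via the widowed year."""
    widowed_year = None
    found = []
    for match in TOKEN.finditer(text):
        if match.group("fy"):
            # Full date: precise, and it sets the widowed year.
            year = int(match.group("fy"))
            month = MONTHS.index(match.group("fm")) + 1
            day = int(match.group("fd"))
            widowed_year = year
        elif match.group("y"):
            # Year only: assumed placeholder of July 1 ("sometime in the summer").
            year, month, day = int(match.group("y")), 7, 1
        else:
            # Month only: needs a widowed year, otherwise it is skipped.
            if widowed_year is None:
                continue
            year = widowed_year
            month = MONTHS.index(match.group("m")) + 1
            day = 15  # assumed placeholder for "around the 15th or so"
        found.append(f"{year:04d}{month:02d}{day:02d}")
    return found


example = extract_dates(
    "He took office on November 27, 1988. But resigned the following February."
)
```

A real implementation would also need to avoid false positives such as the word “May” used as a verb at the start of a sentence.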

And about 1 in 10 dates are actually date ranges. Consider the following:

“The building project continued between October, 1965 to September, 1970.”

We would save this in our database as 2 dates, a date range. In some instances, deducing that a date is actually part of a date range turned out to be very complicated, resorting to almost artificial intelligence to discern the writer’s intent. A discussion of this is obviously out of scope for this type of forum (and we don’t want to give away all of our secrets, anyway).
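For the simplest case only, a date range could be detected along these lines. This Python sketch is far cruder than what is described above: it looks at the first two dates in a sentence and treats them as a range only when an opener word ([from] or [between]) directly precedes the first date and a link word sits between the two. The word lists, the function names, and the use of the 15th for month-only dates are assumptions; cues such as [born, died] are not handled.

```python
import re

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
MONTH_ALT = "|".join(MONTHS)
# Matches "May 26, 1984", "October, 1965" and "October 1965".
DATE = re.compile(rf"\b({MONTH_ALT})(?:\s+(\d{{1,2}}))?,?\s+(\d{{4}})\b")

OPENERS = {"from", "between"}               # assumed cue words before the first date
LINKS = {"to", "and", "until", "through"}   # assumed cue words between the dates


def to_yyyymmdd(match):
    """Format a DATE match; a missing day becomes the 15th (placeholder)."""
    month = MONTHS.index(match.group(1)) + 1
    day = int(match.group(2)) if match.group(2) else 15
    return f"{int(match.group(3)):04d}{month:02d}{day:02d}"


def find_range(sentence):
    """Return (startd, endd), with endd '' when no range is detected, or None."""
    matches = list(DATE.finditer(sentence))
    if not matches:
        return None
    first = matches[0]
    if len(matches) >= 2:
        second = matches[1]
        words_before = sentence[:first.start()].split()
        opener = words_before[-1].lower() if words_before else ""
        link = sentence[first.end():second.start()].strip().lower()
        if opener in OPENERS and link in LINKS:
            return to_yyyymmdd(first), to_yyyymmdd(second)
    return to_yyyymmdd(first), ""


building = find_range(
    "The building project continued between October, 1965 to September, 1970."
)
```

The returned pair maps directly onto the [Startd] and [Endd] fields of the [Sen] table, with an empty [Endd] when no range is found.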

We have been working on this for 7 years now.

Jroehl (talk) 15:27, 1 March 2017 (UTC)

Did anybody download the 40 million dates and spot check them to make sure I am for real? Jroehl (talk) 19:01, 3 March 2017 (UTC)

Well, I am not so sure what to do now. Jroehl (talk) 19:01, 3 March 2017 (UTC)
