Grants:IdeaLab/All the Dates in the English Wikipedia
What is the problem you're trying to solve?
We would like to better understand how history has been recorded over time.
What is your solution?
We think it would be interesting to assemble a master timeline of all the history in the English Wikipedia. We have extracted 49 million dates from the English Wikipedia. Compressed, the data is about half a gigabyte; it covers the titles of most articles, broken down into paragraphs, sentences and dates (including date ranges). We would like to give this data to anyone who requests it.
Our preliminary data is here (the file is half a gigabyte compressed, something to consider before you download it):
The data consists of 3 related tables:
Titles.CSV - 176 megabytes uncompressed
Para.CSV - 742 megabytes uncompressed
Sen.CSV - 1.6 gigabytes uncompressed, 48 million records
How did we collect these dates?
The dates were collected using a fairly complex set of algorithms and date template tables. The titles table includes only Wikipedia articles that are linked to by other Wikipedia articles, so we do not spend time acquiring and scanning short “stubs” or “widowed” articles, which typically have no dates. With this method you can start from just about any article and recursively gather the links from every article scanned. We depend on the fact that Wikipedia uses plain paragraph tags and typically enforces a set of rules for date formats. So we pull the paragraphs from the articles, split the paragraphs into sentences (which is a lot more complex than it sounds), then scan the sentences for dates and date pattern indicators. Date ranges are deduced by finding more than one date in a sentence and identifying a linguistic correlation between the dates, such as [from, to] or [born, died].
Of course, all of this is imperfect, but we did the best we could with the time we had. We believe we achieved an accuracy rate of at least 96% to 97%. This is really hard to gauge: there is so much data that it is impossible for one human to verify the majority of it. That would entail reading a LOT of Wikipedia.
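The collection steps described above (follow links, pull paragraphs, split sentences, scan for dates) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual system: the `pages` dictionary stands in for fetched article HTML, the sentence splitter is deliberately naive, only the "Month D, YYYY" format is recognized, and all function and class names are hypothetical.

```python
import re
from collections import deque
from datetime import date
from html.parser import HTMLParser

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
MONTH_NUM = {name: i + 1 for i, name in enumerate(MONTHS)}

# Matches dates written like "May 26, 1984".
FULL_DATE = re.compile(r"\b(" + "|".join(MONTHS) + r") (\d{1,2}), (\d{4})\b")


class ParagraphParser(HTMLParser):
    """Collects the text of each <p> element and every href found inside one."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self.links = []
        self._buf = None  # None while outside a <p>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._buf = []
        elif tag == "a" and self._buf is not None:
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "p" and self._buf is not None:
            self.paragraphs.append("".join(self._buf).strip())
            self._buf = None

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)


def split_sentences(paragraph):
    # Naive rule: end punctuation, whitespace, then a capital letter or quote.
    # Real text (abbreviations, initials) needs far more care than this.
    return [s for s in re.split(r'(?<=[.!?])\s+(?=[A-Z"“])', paragraph) if s]


def find_full_dates(sentence):
    found = []
    for month, day, year in FULL_DATE.findall(sentence):
        try:
            found.append(date(int(year), MONTH_NUM[month], int(day)))
        except ValueError:
            pass  # impossible dates such as "February 31, 1988"
    return found


def crawl(pages, start):
    """Walk linked pages breadth-first; `pages` maps a title to its HTML."""
    seen, queue, rows = {start}, deque([start]), []
    while queue:
        title = queue.popleft()
        parser = ParagraphParser()
        parser.feed(pages[title])
        for p_no, para in enumerate(parser.paragraphs):
            for s_no, sent in enumerate(split_sentences(para)):
                for d in find_full_dates(sent):
                    rows.append((title, p_no, s_no, d))
        for link in parser.links:
            if link in pages and link not in seen:
                seen.add(link)
                queue.append(link)
    return rows
```

The sketch uses a queue rather than literal recursion so that a long chain of links cannot exhaust the call stack. Each row keeps the title, paragraph number and sentence number, which loosely mirrors the titles, paragraphs and sentences breakdown of the data.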
A master timeline of all of Wikipedia would allow anyone to understand history better. For example, it is interesting to note that Leonardo da Vinci was 3 years old when Gutenberg printed his first book. How would Leonardo's life have been different because of this? Information like this would be much easier to discover if it were displayed graphically on a timeline.
Who will you be doing outreach with?
Anyone interested in history.
About the idea creator
I am a database analyst who specializes in dates.
- When pondering a date I frequently wish I could browse forward and backward from that date to understand the context better. I also wish I could filter events by their type, and degrees of significance. For example, to ask what new technology or most recent war would have been familiar to some person at a certain time. 18.104.22.168 03:33, 21 February 2017 (UTC)
- Support - This is an awesome proposal! I'm the co-founder of Histropedia, which started with a similar aim. The path we're taking is more focused on creating a directory of separate timelines on all topics in history, using Wikidata as the primary source of dates, with Wikipedia categories, Wikidata queries, and community editing used to define the contents of each timeline. But this is missing a wealth of data from within Wikipedia articles that will not be available in Wikidata for a long while. A couple of points/questions:
- If you're not already aware of it, you should check out histography.io - it might be worth contacting the developer as the approach is similar in terms of extracting dates from within articles.
- Have you extracted any data about the precision of the dates? e.g. whether they are precise to a day, month, year, decade etc.?
- Have you thought about using data from Wikidata to enrich or error check the data? Going the other way, this could be a great opportunity to output lists of suggested dates that could be added to Wikidata!
NavinoEvans, well, right now I have a demonstration website at:
It can create a timeline out of any Wikipedia article. You put in the URL for any Wikipedia article; it pulls the article, parses out all of the dates and creates a timeline. You can place multiple articles on one timeline to compare what was happening in each of them at the same time.
The problem with it is, as always, I don’t know who to show it to. I don’t know anybody who would find this interesting. I have a website that is of no interest to anybody. Lol
But I do like http://histropedia.com/ and would like to talk to you about this.
And for your questions:
1) Yes, I have seen https://histography.io/ and I think it is very cool. And he can use my 40 million dates from Wikipedia, if he likes.
2) All the dates I have extracted are precise dates. Are there any other kinds?
3) I have no idea what Wikidata is. So I must educate myself.
- Excellent, that's already looking super useful for comparison of topics! I'll be sure to show it around to anyone who will be interested. Happy to connect via email if you want to discuss Histropedia or Wikidata :) navino at histropedia dot com.
- Regarding the date precision, I just meant that lots of dates are not known precise to a single day - e.g. a date given as "1957" (precise to a year), or 16th century (precise to a century). Was just curious whether you'd attempted to extract any of that info along with the full dates you've found. NavinoEvans (talk) 14:02, 1 March 2017 (UTC)
Regarding date precision: we tried to be as precise as possible, and you understand the issues involved. Most precise dates are in the form [May 26, 1984], in keeping with the Wikipedia Manual of Style. Their precision is unambiguous and there are millions of them. When our system encounters dates in this format, it can pick them out and store them in milliseconds. Of course, we had to take other date expressions into account as well. When we encounter a date given as a single year (we know these are dates because the 4 digits have to be preceded by a known keyword like “in” or “the”), we assign a random time of the year, typically sometime in the summer of that year. This is the only thing we can do, as the writer did not intend to be more specific. The same protocol is used when an article cites a single month. But for this we had a little trick: we keep something I call a “widowed year” as we navigate the article. An example of this would be:
“He took office on November 27, 1988. But resigned the following February.”
As we navigate through this, we make “1988” our “widowed year”, because [November 27, 1988] is a precise date and we are sure which year the writer intended. Then, when we encounter a single month that has no year, we can assume that, most likely, this month is in 1988, since placing it in any other year would make no sense. So we save this date with a random day of the month, around the 15th or so, because we cannot infer any more accuracy in these situations.
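A rough sketch of the “widowed year” idea, for illustration only. The assumptions here are mine: fixed placeholders (July 1 for a year-only date, the 15th for a month-only date) replace the random choice described above, only full dates update the widowed year, and the `TOKEN` pattern, the `extract_dates` name and the precision tag in each result are hypothetical. Note that under this simple carry-forward the February in the example is stored in 1988; cues such as “the following” are not handled.

```python
import re
from datetime import date

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
MONTH_NUM = {name: i + 1 for i, name in enumerate(MONTHS)}
MONTH_ALT = "|".join(MONTHS)

TOKEN = re.compile(
    # Full date, e.g. "November 27, 1988"
    r"\b(?P<fm>" + MONTH_ALT + r") (?P<fd>\d{1,2}), (?P<fy>\d{4})\b"
    # Year preceded by a keyword, e.g. "in 1957"
    r"|\b(?:[Ii]n|[Tt]he) (?P<y>\d{4})\b"
    # Bare month with no number after it, e.g. "February."
    r"|\b(?P<m>" + MONTH_ALT + r")\b(?!,? \d)"
)


def safe_date(year, month, day):
    """Return a date, or None for impossible values such as year 0."""
    try:
        return date(year, month, day)
    except ValueError:
        return None


def extract_dates(text):
    widowed_year = None  # year of the most recent full date seen so far
    out = []
    for m in TOKEN.finditer(text):
        if m.group("fy"):
            d = safe_date(int(m.group("fy")), MONTH_NUM[m.group("fm")],
                          int(m.group("fd")))
            if d:
                widowed_year = d.year
                out.append((d, "day"))
        elif m.group("y"):
            d = safe_date(int(m.group("y")), 7, 1)  # mid-year placeholder
            if d:
                out.append((d, "year"))
        elif widowed_year is not None:
            # Month-only mention: borrow the widowed year, use the 15th.
            out.append((date(widowed_year, MONTH_NUM[m.group("m")], 15),
                        "month"))
    return out
```

For the sentence pair quoted above, `extract_dates` yields the full date November 27, 1988 followed by February 15, 1988 tagged as month precision. A month mentioned before any full date has been seen is skipped, since there is no year to borrow.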
And about 1 in 10 dates are actually date ranges. Consider the following:
“The building project continued between October, 1965 to September, 1970.”
We would save this in our database as 2 dates, a date range. In some instances, deducing that a date is actually a date range turned out to be very complicated, requiring something close to artificial intelligence to discern the writer’s intent. A discussion of this is obviously out of scope for this type of forum (and we don’t want to give away all of our secrets, anyway).
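For the simple cases, the range rule (two dates in one sentence plus a linguistic cue such as [from, to] or [born, died]) might look something like the sketch below. It is a toy version under my own assumptions, not the actual method: the cue list is tiny, a missing day defaults to the 15th, and the names `DATE`, `CUES` and `find_range` are hypothetical.

```python
import re
from datetime import date

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
MONTH_NUM = {name: i + 1 for i, name in enumerate(MONTHS)}

# Accepts "October, 1965", "October 1965" and "May 26, 1984".
DATE = re.compile(r"\b(" + "|".join(MONTHS) + r")(?: (\d{1,2}))?,? (\d{4})\b")

# (cue before the first date, cue between the two dates)
CUES = [
    (r"\bfrom\b", r"\b(?:to|until|through)\b"),
    (r"\bbetween\b", r"\b(?:and|to)\b"),
    (r"\bborn\b", r"\bdied\b"),
]


def to_date(match):
    month, day, year = match.groups()
    try:
        # A missing day falls back to the 15th of the month.
        return date(int(year), MONTH_NUM[month], int(day) if day else 15)
    except ValueError:
        return None


def find_range(sentence):
    """Return (start, end) if the sentence looks like a date range, else None."""
    hits = list(DATE.finditer(sentence))
    if len(hits) != 2:
        return None
    first, second = hits
    before = sentence[:first.start()].lower()
    between = sentence[first.end():second.start()].lower()
    for opener, connector in CUES:
        if re.search(opener, before) and re.search(connector, between):
            start, end = to_date(first), to_date(second)
            if start and end:
                return start, end
    return None
```

With the building-project sentence above, `find_range` returns October 15, 1965 and September 15, 1970 as the two ends of the range. Two dates with no recognized cue are left as separate dates.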
We have been working on this for 7 years now.