Research:University of Virginia/sort a billion documents
The Internet Archive is a digital library of websites, other digitally native media, and traditional media in digital form. Since its establishment in 1996, it has been one of the most popular web destinations. Its nonprofit mission includes providing as many library services as possible to as many people worldwide as possible.
A simple, imprecise automated process has selected one billion documents from the Internet Archive's collection that appear to concern scholarly research of some kind, whether in the sciences, humanities, medicine, business, or any other discipline. Some of these documents have metadata, and many have a digital text transcription. The lack of standardized data across items in this collection is a barrier to further analysis and information discovery. These documents will need sorting many times and in many ways, but for now, sort them into these groups:
- This document is scholarly research, including
  - academic papers from journals
  - white papers
  - preprints or unpublished drafts of research
- This document talks about research or technical topics, but is not scholarly
  - casual essays
  - high school student research reports
Accomplish the sorting in any way that seems appropriate. Subobjectives could include creating a database of metadata, splitting the collection into categories to sort subsets in different ways, and compiling appropriate test datasets to use for data modeling.
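As one possible starting point, the sketch below pairs a crude keyword baseline for the two groups with a small metadata database. It is only an illustration under stated assumptions: the cue phrases, the threshold of two hits, the table layout, the field names (`identifier`, `title`, `text`), and the sample records are all hypothetical and do not come from the Internet Archive or from this project.

```python
import re
import sqlite3

# Hypothetical cue phrases (an assumption for illustration); a real project
# would more likely learn features from a labeled test dataset.
SCHOLARLY_CUES = ("abstract", "references", "doi", "et al", "methodology")


def classify(text: str) -> str:
    """Return 'scholarly' or 'not_scholarly' from a crude count of cue phrases."""
    lowered = text.lower()
    hits = sum(
        1
        for cue in SCHOLARLY_CUES
        if re.search(r"\b" + re.escape(cue) + r"\b", lowered)
    )
    # The threshold of 2 is arbitrary and chosen only for this sketch.
    return "scholarly" if hits >= 2 else "not_scholarly"


def build_database(documents, path=":memory:"):
    """Store one row of metadata plus the predicted label per document."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents ("
        "identifier TEXT PRIMARY KEY, title TEXT, label TEXT)"
    )
    rows = [
        (doc["identifier"], doc.get("title"), classify(doc.get("text", "")))
        for doc in documents
    ]
    conn.executemany("INSERT OR REPLACE INTO documents VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


# Made-up sample records, not real Internet Archive items.
sample = [
    {
        "identifier": "doc-1",
        "title": "A study",
        "text": "Abstract. We measure X. References: Smith et al. doi:10.0000/example",
    },
    {
        "identifier": "doc-2",
        "title": "My essay",
        "text": "I think science is fun and here is my report.",
    },
]

conn = build_database(sample)
counts = dict(conn.execute("SELECT label, COUNT(*) FROM documents GROUP BY label"))
print(counts)
```

A keyword baseline like this will misclassify many documents. Its main use would be to give a first split of the collection and a reference point against which better models can be compared.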
- Late September 2020: Proposal presentation
- May 2021: Project ends
- Research Proposal
- Data Product
- Technical Paper
- Research Poster
- Presentation of research
- video presentation?
- essay on ethics?
- method documentation?
- For general Wikimedia questions, please contact Lane Rasberry, Wikimedian at the University of Virginia, rasberry@virginia.edu, user:bluerasberry