# Grants:IEG/PlanetMath Books Project

status: not selected

project:

PlanetMath Books Project

idea creator:

Arided

project contact:

holtzermann17@gmail.com

participants:

grantees:

• Joseph Corneli (Arided)
• Raymond Puzio

volunteers:

• Deyan Ginev

• Jon Borwein
• Michael Kohlhase
• Murdoch James Gabbay
• Lee Worden
• Christoph Lange

summary:

As an early wiki encyclopedia, PlanetMath provided lots of material for Wikipedia. We want to do it again, for Wikibooks.

engagement target:

Wikibooks, Wikisource, Wikipedia (English for now, but ultimately relevant to other languages)

strategic priority:

Improving Quality, Increasing Participation, Encouraging Innovation

total amount requested:

21,400 USD

2013 round 2

## Project idea

The purpose of this project is to make it easy to produce mathematical textbooks from pre-existing, freely available material.

PlanetMath is a free/open mathematics community that uses the same license as Wikipedia. It is best known for its encyclopedia, and for its cutting-edge in-browser interaction with and rendering of mathematics content. In recent years, PlanetMath's software system has been entirely rebuilt, with a new focus on free/open problem sets, course outlines, and textbooks. We plan to reuse existing material from PlanetMath, Wikipedia, and math.stackexchange.com, along with other free/open and public domain sources, to expand our collection of mathematical learning materials. PlanetMath's special-purpose software makes it an ideal place to assemble content -- and publish "downstream", e.g. to Wikibooks. This kind of content exchange has a precedent in the earlier PlanetMath Exchange, which brought hundreds of articles from PlanetMath to Wikipedia. If we can do something similar with books, it will be a big step forward for Open Educational Resources, and for the Wikimedia movement.

An IEG grant would support the technical work that will make this possible.

## Project goals

Overview

• Technical platform improvements to PlanetMath's user interface to streamline the production of books.
• Work with off-the-shelf Optical Character Recognition (OCR) systems and build custom enhancements that will make it easier for people to retrodigitize public domain texts (which could then be uploaded, e.g., to Wikisource).
• Crawling and light-weight semantic linking to connect up existing free materials across the Web (e.g. questions and answers, problem sets, encyclopedia articles) to build a "unified index" of free/open mathematics content.
• A proof-of-concept demonstration that this raw material can be efficiently turned into expository textbooks.
• Format-shifting tools for exporting books from PlanetMath to Wikibooks.

### Narrative: Deliverables

This is a technically focused part of a major distributed crowdsourcing project, whose ultimate goal is the production of many free/open technical books. In fact, much of the required material is already available, in the form of encyclopedia articles, Q&A posts, and existing texts in the public domain. The purpose of this project is to build tools that can be used to make this material accessible and organize it into textbooks.

A central feature will be improvements to PlanetMath's software, Planetary, which make it into an efficient tool for assembling mathematics textbooks. The current "live" prototype on PlanetMath has partial support for the features we need: in particular, we'll add features that support automation of content import, assembly, and downstream publishing. The process of ingesting content will be supported by new tools that complement and enhance off-the-shelf Optical Character Recognition software, making them more user-friendly and less prone to error, and helping future editors draw from the rich materials available in the public domain.

We will build a few demonstrator texts using existing material available on PlanetMath, Wikipedia, Wikibooks, StackExchange, and from public domain sources. The purpose of this portion of the effort is not content building per se, but rather testing and debugging the technical platform. To this end, we will also run a brief user trial, to further check the robustness of tools and see what adaptations may be needed in order to teach others how to use them. Technology transfer is an important goal in the grant, and the user trials form part of our strategy for ensuring that others will be able use the tools we develop after the grant is over (we ourselves plan to contribute subsequent content work as volunteers).

The initial demonstrator texts will deal with a few of the core components of an undergraduate mathematics curriculum -- for instance, Advanced Calculus, Abstract Algebra, Real Analysis, and Differential Equations. Where they exist, the current coverage of these topics on Wikibooks is indicated as "half-finished" or "partially developed" [1], [2], [3], [4]. In order to build a collection of educational texts that are suitable for self-study or classroom use, a new approach is needed.

The requisite new approach is what we will deliver in this project.

### A key challenge and potential benefit

To generate a digital version of an old book from the public domain, one could run it through an OCR program and clean up the result. We tried this on a calculus book from 1912, and we found an error rate of approximately one error per 25 characters. Cleaning this up required approximately 100 hours of proofreading. We would like to improve this figure by an order of magnitude: one error every 250 characters, and 10 hours of proofreading.

### Planned future work that will be made possible by our efforts

Subsequent work building on our efforts might seek "feature parity" with the mathematics section of the popular Schaum's Outlines series -- building approximately 50 free outlines like the demonstrator texts we will develop in this pilot phase. One of these Schaum's books, selected at random, had 97 expository sections, 877 problems, and 420 worked solutions; feature parity with the series as a whole would therefore require approximately 5,000 expository articles, 45,000 problems, and 22,000 solutions. Importing and appropriately linking content from math.stackexchange.com and MathOverflow -- which use the same CC-By-SA license as Wikipedia and PlanetMath -- would provide a source of interesting problems and worked solutions. As of June 16, 2013, these sites contained 147,768 and 42,735 questions, respectively. Our project will help to organize this material and make it useful for students and teachers.

## Get involved

Welcome, brainstormers! Your feedback on this idea is welcome. Please click the "Discussion" link at the top of the page to start the conversation and share your thoughts.

## Project plan

### Scope:

#### Scope and activities

We will develop several custom modules for the Drupal content management system, on which Planetary is based, some improvements to the NNexus concept indexer and autolinker, and custom OCR pre- and post-processing routines. This technical work will be evaluated by producing a few example books demonstrating this workflow:

OCR+proofreading | web scraping | content reuse → semantic linking → content assembly → import → editing → downstream distribution.
##### Raw Materials
• **Old textbooks and monographs whose copyrights have expired**: Using the Library of Congress catalog and copyright renewal records, we have identified hundreds of mathematics books that are available in the public domain, many of them already scanned and available as graphics files via the Internet Archive.
• **Wikipedia**: Well known for readable, friendly exposition, often suited to a general audience.
• **PlanetMath**: Known for shorter, more technical articles, and detailed proofs.
• **Math.StackExchange.Com and MathOverflow**: Source of questions and answers that we plan to convert into exercises and examples.
##### Retrodigitization workflow enhancements
**Preprocessing**: Before feeding a graphics file through the OCR program, we will apply image transformations to clean it up so as to produce better output. The work underwritten by the grant would consist of determining which combinations and settings of off-the-shelf graphics programs produce the best results.

**Coprocessing**: While Infty does a good job of identifying where the characters are located on the page and assembling identified characters into structures, its weak point with old books is identifying the characters themselves. For instance, with our calculus book, it was not able to distinguish 5's from 6's in the font used, even though the difference is quite obvious to anyone seeing the text. To deal with this weakness, we will supplement Infty with a procedure in which we take the characters as located by Infty, then use a clustering algorithm to collect together similar-looking characters. A small pilot study suggests that this will provide considerably better character identification.

**Postprocessing**: Context can be used to improve the results of character recognition after the fact. For instance, suppose that "tbe" appears in the output. While "the" is a common word, "tbe" does not appear in the dictionary; "b" and "h" are similar in appearance; this similarity can be quantified using the results of the clustering described above, and one can even assign probabilities of misidentification to particular occurrences of characters. Thus, one can conclude that "tbe" is a scan-o and correct it accordingly. Depending on how obvious the correction is, one might want to have a human double-check it before making the change -- for instance, if the dubious character meant that the word was just as likely to be "bat" as "hat".

**Copyediting**: Based on reading the archives of the Project Gutenberg listserver, we have identified a range of time-saving techniques. Using these techniques, PG contributors were able to proofread a book in four hours. They noticed that the straightforward approach of reading through a text line by line, page by page, and marking errors as they appear is highly inefficient. A much better approach is one in which the computer identifies places where an error is likely and presents this information to the human. Each pass should focus on errors of one type, such as spelling, capitalization, or punctuation. To make this work efficiently, we will implement an interface in which the text in question is highlighted and centered, with surrounding text lowlighted, and where the user is presented with multiple-choice options as to what the text says (with "other" as one of the options).
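As a concrete illustration of the post-processing step, here is a minimal sketch in Python of dictionary-plus-confusion-table correction. The word list, confusion probabilities, and thresholds are all invented for the example; in practice the probabilities would come from the glyph-clustering statistics described above, and the functions shown are hypothetical, not part of any existing tool.

```python
# Sketch: flag or fix likely OCR "scan-os" by combining a dictionary
# check with a (toy) character-confusion table.

WORDS = {"the", "bat", "hat", "cat"}  # stand-in for a real dictionary

# P(true_char | observed_char): toy values; real ones would come from
# the clustering statistics computed during coprocessing.
CONFUSION = {
    "b": {"h": 0.4, "b": 0.6},
    "h": {"b": 0.3, "h": 0.7},
}

def candidate_corrections(word):
    """Yield (candidate, probability) pairs for single-character swaps."""
    for i, ch in enumerate(word):
        for alt, p in CONFUSION.get(ch, {}).items():
            cand = word[:i] + alt + word[i + 1:]
            if cand != word and cand in WORDS:
                yield cand, p

def correct(word, auto_threshold=0.35):
    """Return (correction, needs_review); dictionary words pass through."""
    if word in WORDS:
        return word, False
    cands = sorted(candidate_corrections(word), key=lambda c: -c[1])
    if not cands:
        return word, True  # no plausible fix: flag for a human
    best, p = cands[0]
    # If two candidates are nearly as likely ("bat" vs "hat"), ask a human.
    ambiguous = len(cands) > 1 and cands[1][1] > 0.8 * p
    return best, ambiguous or p < auto_threshold

print(correct("tbe"))  # → ('the', False)
```

Whether a correction is applied automatically or routed to a proofreader is controlled by the threshold and the ambiguity check, mirroring the human double-checking described above.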
##### Light-weight Semantic Linking and Indexing
NNexus can help us connect texts into useful configurations, and build an index of available free/open mathematical texts. NNexus spots the technical mathematical terms in a given text: for example, it would identify "abelian group" as a technical term, and link it to its definition. NNexus includes features for contextual disambiguation in the case of overloaded technical terms, like "distribution". NNexus also includes its own web-crawler and indexer. Using this tool, we will be able to build a large graph that connects, for instance, the questions and answers from StackExchange with the terms that are defined in the PlanetMath and Wikipedia encyclopedias. This will help connect "worked examples" to expository text.
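To make the linking idea concrete, here is a toy sketch of longest-match concept spotting. The concept table, URLs, and `autolink` helper are invented for illustration; the real NNexus does much more (morphological matching, contextual disambiguation, crawling).

```python
# Toy concept autolinker in the spirit of NNexus: scan text for known
# terms, longest match first, and report links to their definitions.

CONCEPTS = {
    "abelian group": "https://example.org/AbelianGroup",
    "group": "https://example.org/Group",
    "distribution": "https://example.org/Distribution",
}

def autolink(text):
    """Return (term, url) pairs found in text, longest terms first."""
    links = []
    lowered = text.lower()
    # Try longer terms before shorter ones so "abelian group" beats "group".
    for term in sorted(CONCEPTS, key=len, reverse=True):
        if term in lowered:
            links.append((term, CONCEPTS[term]))
            lowered = lowered.replace(term, " " * len(term))  # mask matched span
    return links

print(autolink("Every abelian group is a group."))
```

Applied at scale across StackExchange questions and the PlanetMath and Wikipedia encyclopedias, this kind of term-to-definition linking is what produces the large graph described above.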
##### Content Assembly and Import
**Simple LaTeX documents will be used as a text-based input format**: We will store documents coming from various sources in LaTeX form, using simple sectioning markup commands.

**Custom code will slice and reassemble documents into collections**: Planetary currently includes a feature called "collections", which are essentially lists of Drupal nodes. We will write custom code to turn LaTeX documents into (potentially nested) collections. We will also write custom code for displaying collections in user-friendly ways, e.g. with all of the sections and exercises from one chapter on a single page.
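A minimal sketch of the slicing step, assuming documents use plain \section/\subsection markup. The `slice_latex` helper is hypothetical (our illustration, not part of Planetary), and real import code would need to handle much more of LaTeX.

```python
import re

# Sketch: split a simple LaTeX document into a flat list of section
# records, which could then be turned into (nested) collection nodes.

SECTION_RE = re.compile(r"\\(section|subsection)\{([^}]*)\}")

def slice_latex(source):
    """Split LaTeX source into [{'level', 'title', 'body'}, ...]."""
    parts = []
    matches = list(SECTION_RE.finditer(source))
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(source)
        parts.append({
            "level": 1 if m.group(1) == "section" else 2,
            "title": m.group(2),
            "body": source[m.end():end].strip(),
        })
    return parts

doc = r"""\section{Groups}Basic definitions.
\subsection{Examples}The integers under addition."""
for node in slice_latex(doc):
    print(node["level"], node["title"])
```

The `level` field is what would drive nesting when the flat list is reassembled into a collection hierarchy.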
##### Improved support for editing mathematics texts on PlanetMath and sharing the results downstream
**Custom features for interacting with documents in Planetary**: We will develop some straightforward enhancements, like per-collection recent-changes pages, that will help editors work together on revising the content of a PlanetMath collection.

**Generating wiki markup from LaTeX markup**: Pandoc can export simple LaTeX documents in MediaWiki format; however, more complicated texts will need some custom code. If possible, we would like to collaborate with others working on the reverse export system, so that content can be round-tripped between the two platforms. w:User:Jmath666 and Oleg Alexandrov developed a program called latex2wiki.pl that can do this for individual articles. We would like to revive and extend this code.
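For simple documents, this conversion amounts to a handful of rewrite rules, as the following sketch shows. The rules here are toy examples only; Pandoc (`pandoc -f latex -t mediawiki`) or a revived latex2wiki.pl would handle real documents.

```python
import re

# Sketch: convert a few simple LaTeX constructs to MediaWiki markup.
# Real texts (nested macros, environments, tables) need a proper parser.

RULES = [
    (re.compile(r"\\section\{([^}]*)\}"), r"== \1 =="),
    (re.compile(r"\\subsection\{([^}]*)\}"), r"=== \1 ==="),
    (re.compile(r"\\emph\{([^}]*)\}"), r"''\1''"),
    (re.compile(r"\\textbf\{([^}]*)\}"), r"'''\1'''"),
    (re.compile(r"\$([^$]+)\$"), r"<math>\1</math>"),
]

def latex_to_wiki(source):
    """Apply each rewrite rule in turn to the LaTeX source."""
    for pattern, repl in RULES:
        source = pattern.sub(repl, source)
    return source

print(latex_to_wiki(r"\section{Groups} A \emph{group} satisfies $(ab)c=a(bc)$."))
```

Round-tripping would require an inverse rule set (wiki markup back to LaTeX), which is why collaborating on the reverse export system matters.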

#### Tools, technologies, and techniques

This is a software focused project: the main system is Planetary, a customized version of Drupal 7 that uses LaTeXML and MathJax for mathematics rendering, and which uses the NNexus concept-indexer and autolinker as a way to identify (weak) semantic links between user-contributed content. We will use off-the-shelf OCR tools, which interoperate with some custom code for extra OCR support that we have prototyped. These software projects are all free/open and managed through public issue trackers and mailing lists.

We have weekly phone conferences that have been ongoing for about a year – this is a good way to keep up to date with progress and also to bring in newcomers. Our plan of work is broken down in a Gantt chart (presented below).

##### Technology transfer

Technology transfer (which in this case, includes recruitment as well as training) will be important, because in addition to special-purpose software, it will take significant human effort to realize the promise of a vast free mathematics digital library using the tools we will develop.

We plan to recruit participants in WikiProject Mathematics into the user trial. This strategy will simultaneously achieve evaluation and "knowledge transfer" targets. It does not directly impact the wiki content as much as the wiki community, offering current users new strategies for building technical content. However, note that in the Gantt chart below, we have left a month "blank" at the end of the 6-month project. We would like to use this time (which we will contribute "pro bono") to collaborate with interested volunteers to build more content after the user trial.

### Budget:

$21,400

#### Budget breakdown

• $1,000 will purchase a license for InftyReader, a state-of-the-art math-aware OCR software package, and ABBYY FineReader, a mainstream OCR program that is interoperable with Infty
• $3,200 will pay for research to improve the efficiency of mathematics OCR
• $800 will pay for generation of an index of mathematical reference material and Q&A available under the CC-By-SA license
• $2,000 will pay for customizations to the Planetary platform to make it more useful for preparing books
• $1,600 will pay for prototyping proofreading features that interoperate with the OCR system
• $4,800 will pay for a contractor to improve the general usability of Planetary, including the proofreading features
• $800 will pay for code review and user docs
• $1,600 will pay for beta testing the system as a whole
• $1,600 will pay for user trials (we will recruit from Wikipedia, Wikiversity, and Wikibooks as well as PlanetMath)
• $4,000 will pay for proof-of-concept books exported to Wikibooks, along with other documentation and dissemination activities

Personnel costs correspond to the plan of work depicted in the following Gantt chart (billing at $400 per week for grantees, and $1,200 per week for a Drupal/Javascript contractor).

Gantt Chart for PlanetMath Books IEG proposal.

### Intended impact:

#### Target audience

We want to build a free, interactive replacement for expensive and technically outmoded textbooks. Mathematics students and teachers will be the first beneficiaries of this work -- particularly those living in the developing world.

The work will also be of interest to mathematicians and computer scientists who have been working for some time on building a "World Digital Mathematics Library". One venue for dissemination is the DML track of the yearly joint Conferences on Intelligent Computer Mathematics. This is an issue of considerable contemporary concern in the mathematics community; if funded, this grant will show that Wikimedia can get things done in a wiki wiki manner, and with the results free-as-in-freedom to boot :-). The idea of making books using StackExchange content has been discussed before, and some StackExchange users have shown interest in having a "hands-on" role in the downstream use of the material.

#### Fit with strategy

Improving Quality and Increasing Participation: Wikipedia has great mathematics content and is often a first-stop for high-level reference questions in mathematics. However, students still don't have a reliable "first-stop" for learning how to do mathematics. Mathematics teachers tend to agree that this is a subject that is best learned as part of an active process of problem solving. The content we're building will support this sort of active use.

Encouraging innovation: While we work primarily outside of Wikimedia, this grant will help us develop robust ways to work with Wikimedia, using alternative software to further Wikimedia's mission.

#### Sustainability

PlanetMath is one of the earliest online wiki-like communities – it launched in 2001 and has been running successfully since then, powered by a 100% free/open software system, which has recently been re-built using the popular and well-supported Drupal web content management system. A new workflow for producing mathematics textbooks will make PlanetMath an important tool for the next decade. In particular, the tools we will develop will help circulate content between other popular free/open platforms, including Wikipedia, Wikibooks, and math.stackexchange.com. This work will add significant value to the "free math" ecosystem as a whole.

#### Measures of success

In this phase of the project, we are focusing on technical implementation and only expect to produce a relatively small number of books. The key metric in this regard is how much more efficient we can make the book production process, when compared with by-hand authoring and OCR/proofreading.

We will also do significant pre-processing of material to feed into subsequent phases of work. In particular, we will build a unified index of the mathematical contents of Wikipedia and PlanetMath encyclopedias, and use the NNexus autolinker to crawl, index, and connect questions from math.stackexchange.com to this corpus. Metrics for evaluation include the percentage (and distribution) of questions, answers, and articles linked in this fashion, as well as time estimates on the amount of work required to transform the products generated via "automated" content assembly into usable books.

Community uptake of the tools is another important measure (which we suspect will be correlated with ease of use, as above). We will evaluate this measure through the user trial and the final month of donated content work.

## Participant(s)

Joe is one of the main developers on the Planetary project, and has been responsible for building and deploying most of the customizations needed for the re-built PlanetMath. He recently submitted a Ph. D. thesis at the Knowledge Media Institute of the Open University (UK), entitled "Peer Produced Peer Learning: A mathematics case study" – this thesis describes the primary considerations behind the PlanetMath rebuild/reboot, including a report on preliminary user trials of the new software after its deployment on PlanetMath. Joe holds a Bachelor's degree in mathematics from New College of Florida.

Ray is one of the top contributors to PlanetMath, having authored more than 450 articles; he also serves as PlanetMath's Operations Manager. He holds a Ph. D. in Physics from Yale.

Deyan, who will be participating in the project as a volunteer, is the current maintainer of the NNexus indexer and autolinker, and is also one of the core contributors to LaTeXML. He is working on a Ph. D. in computer science at Jacobs University Bremen, and is a board member at PlanetMath.