Grants:IEG/WikiBrainTools

WikiBrainTools
status: selected
summary: Enhance WikiBrain to make state-of-the-art algorithms from NLP, AI, and GIS available to Wikipedia tools, bots, and researchers.
target: All languages of Wikipedia
strategic priority: improving quality, encouraging innovation
theme: tools
amount: $29,500
grantee: Shilad, Brenthecht
advisor: EpochFail
contact: ssen(_AT_)macalester.edu
created on: 15:17, 25 September 2014 (UTC)


[Figure: Diagram for Wikipedia IEG grant.]


Project idea

What is the problem you're trying to solve?

While Wikipedia is widely known to be the world's primary encyclopedic reference, behind the scenes it has also become the leading source of world knowledge for computers. The knowledge that editors have encoded within Wikipedia has transformed algorithmic research in computer science fields such as natural language processing (NLP), artificial intelligence (AI), and geographic information systems (GIS). In doing so, Wikipedia has fundamentally improved how computers reason about the world.

Wikipedians have sought to turn researchers' algorithmic advances into a virtuous cycle. As Wikipedia's ecosystem of tools, bots, and cyborgs (we call these WikiTools) has grown in quantity and sophistication[1], these tools increasingly rely on Wikipedia-based algorithms. WikiTools use Wikipedia-based algorithms to improve the quality and quantity of knowledge encoded within Wikipedia articles, which in turn improves the data available to algorithms, which in turn improves WikiTools... and on and on.

However, a gap has emerged in this cycle. As Wikipedia has grown in scale and complexity, it has become increasingly difficult for researchers to algorithmically mine Wikipedia: they incur a substantial startup cost in developing their own custom Wikipedia processing pipelines. Because of this, virtually no researchers in NLP and AI publish reference implementations of their algorithms, and those who do release software that is extremely difficult to integrate into production Wikipedia bots and tools. Thus, developers of Wikipedia tools and bots rarely integrate state-of-the-art algorithms that could enhance the effectiveness of their projects and, in turn, improve the quality of the encyclopedia.

What is your solution?

This project seeks to close the gap among three constituencies: (1) algorithmic Wikipedia researchers in NLP, AI, and GIScience, (2) developers of WikiTools, and (3) researchers of the Wikipedia community. The project will leverage and enhance the existing WikiBrain software library to serve as a bridge between these communities.

WikiBrain is a Java software library we created to democratize access to state-of-the-art Wikipedia-based algorithms and technologies. WikiBrain downloads, parses, stores, and analyzes Wikipedia data in any language, providing access to state-of-the-art NLP, AI, and GIScience algorithms with the click of a button on commodity hardware. The project has a robust existing codebase, broad support from researchers, and has been well received by the Wikipedia research community. WikiBrain is described in more depth in our 2014 WikiSym / OpenSym publication [2].

For example, the WikiBrain software API can very quickly answer questions like the following (a brief code sketch follows the list):

  1. What English Wikipedia articles about people are most related to en:Volunteering?
  2. How many times were articles about Argentina viewed yesterday in the Spanish Wikipedia?
  3. What is the share of article views for each top-level category in each language edition of Wikipedia, and how does that relate to the supply of articles in those categories?
  4. Across all language editions of Wikipedia, which articles have fewer-than-expected wikilinks? What links should those articles contain (and where)?
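
For instance, a relatedness query like question 1 takes only a few lines of Java against the library. The sketch below is adapted from the WikiBrain tutorial and assumes a language edition (here, Simple English) has already been downloaded and imported; exact class and method names (e.g. SRMetric, mostSimilar) have varied across WikiBrain versions, so treat it as illustrative rather than exact.

  import org.wikibrain.conf.Configurator;
  import org.wikibrain.core.cmd.Env;
  import org.wikibrain.core.cmd.EnvBuilder;
  import org.wikibrain.sr.SRMetric;
  import org.wikibrain.sr.SRResult;
  import org.wikibrain.sr.SRResultList;

  public class RelatednessSketch {
      public static void main(String[] args) throws Exception {
          // Load a WikiBrain environment rooted in the current directory
          // (assumes the Simple English edition has already been imported).
          Env env = new EnvBuilder().build();
          Configurator conf = env.getConfigurator();

          // Retrieve the "ensemble" semantic-relatedness metric.
          SRMetric sr = conf.get(SRMetric.class, "ensemble", "language", "simple");

          // Relatedness between two phrases (false = skip explanations).
          SRResult score = sr.similarity("Volunteering", "Charity", false);
          System.out.println("Volunteering <-> Charity: " + score.getScore());

          // Question 1, roughly: the ten articles most related to "Volunteering".
          SRResultList similar = sr.mostSimilar("Volunteering", 10);
          System.out.println(similar);
      }
  }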

Although the exact form of this software bridge will result from a community feedback process with tools developers and researchers, we anticipate four technical products as outcomes:

  1. A public web-based API to WikiBrain hosted on a Wikimedia Labs instance (requires no additional technical support from WMF).
  2. Software in Python and PHP to connect to the WikiBrain web-based API (the shape of such a call is sketched after this list).
  3. A shared WikiBrain installation on Wikimedia Tool Labs (requires no additional technical support from WMF).
  4. Enhancements to WikiBrain to increase accuracy of algorithms for languages important to Wikipedia that are typically under-studied by the NLP community.
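
The client libraries (product 2) will target Python and PHP, but the shape of a call against the public web API (product 1) can be sketched in any language; below is a Java version using only the standard library. The host name, the /similarity path, the parameter names, and the JSON response are all illustrative assumptions, not a finalized interface.

  import java.io.BufferedReader;
  import java.io.InputStreamReader;
  import java.net.HttpURLConnection;
  import java.net.URL;
  import java.net.URLEncoder;

  public class ClientSketch {
      public static void main(String[] args) throws Exception {
          // Hypothetical endpoint and parameters; the real API will be
          // settled during the design task in the project plan.
          String query = String.format("lang=%s&phrase1=%s&phrase2=%s&token=%s",
                  "en",
                  URLEncoder.encode("Volunteering", "UTF-8"),
                  URLEncoder.encode("Charity", "UTF-8"),
                  "YOUR_ACCESS_TOKEN");
          URL url = new URL("https://wikibrain.wmflabs.org/similarity?" + query);

          HttpURLConnection conn = (HttpURLConnection) url.openConnection();
          conn.setRequestMethod("GET");

          // Print the (hypothetical) JSON response, e.g. {"score": 0.73}.
          try (BufferedReader in = new BufferedReader(
                  new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
              String line;
              while ((line = in.readLine()) != null) {
                  System.out.println(line);
              }
          }
      }
  }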

Project goals

High-level goals:

  1. Increase the intelligence of WikiTools (bots and cyborgs) by integrating state-of-the-art Wikipedia-based algorithms.
  2. Improve the productivity of Wikipedia researchers by reducing or eliminating the barriers to using intelligent Wikipedia-based algorithms.
  3. Increase the multilingual coverage of existing state-of-the-art algorithms.

Specific goals:

  1. Design and develop an integration API between WikiBrain and WikiTools (called WikiBrainTools).
  2. Using WikiBrainTools, enhance the intelligence of select early-adopter bots and cyborgs.
  3. Promote WikiBrain as a software platform that unites tools developers, Wikipedia researchers, and algorithmic researchers by engaging these constituencies online and in conference settings.
  4. Produce three human-coded datasets necessary for NLP algorithms for languages with strong Wikipedia communities that are underserved by NLP researchers.

We believe this project will lead to five lasting improvements:

  1. Wikipedia bots and cyborgs will become more intelligent, helping editors to work more effectively, leading to improved Wikipedia articles.
  2. Researchers who study the Wikipedia community will identify more complex patterns in editor behavior with less effort.
  3. Wikipedia-based algorithmic researchers in NLP / AI / GIScience will be able to shift effort away from Wikipedia data processing to the development of novel algorithms.
  4. Connecting algorithmic researchers to Tools developers and Wikipedia community researchers will encourage the development of more grounded and relevant algorithms.
  5. The scope of languages used to evaluate NLP techniques will be expanded to include several new languages with active Wikipedia communities.

Project plan

Task 1. Engage the WikiTools and Wikipedia research communities

  • Begins during feedback period - October 2014
  • Identify WikiTools developers and Wikipedia researchers who may benefit from WikiBrain.
  • Introduce this audience to WikiBrain's feature set, ask which features would be most valuable, and identify any new features that would be valuable.
  • Identify a WikiTool to serve as an early-adopter and alpha-tester for WikiBrainTools.

Task 2. Design the WikiBrainTools API

  • January 2015
  • Create a series of use cases driven by active conversations with WikiTools developers
  • Formulate an overall integration strategy. This is likely to include support for both the Tool Labs environment and a public web-based API.
  • Design particular API calls.
  • Determine resource sharing scheme for CPU, bandwidth, etc.

Task 3. Implement and refine the API

  • February - June 2015
  • Possible major code enhancements:
    • Shift from the existing custom Wikidata code to the IEG-supported Wikidata Toolkit.
    • Reduce resource requirements for large languages (e.g. EN) by publishing pre-analyzed datasets.
    • Improve algorithmic support for under-served languages.
    • Implement resource sharing module.
    • Develop a REST interface for each API call (see the sketch after this list).
    • Any additional features requested by WikiTools developers.
  • Write client modules in the most common WikiTool languages (Python and possibly PHP).
  • Test code: Unit testing, functional testing, performance testing.
  • Write documentation: Integration guide, WikiBrain conceptual guide, Documentation for each API call.
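
To make the "REST interface for each API call" concrete, the following sketch wraps a single relatedness lookup in a tiny JSON endpoint using the JDK's built-in HttpServer. The route and the hard-coded score are placeholders: a production handler would parse the query string, delegate to a WikiBrain SR metric, and enforce the resource sharing checks described under Sustainability.

  import com.sun.net.httpserver.HttpServer;
  import java.io.OutputStream;
  import java.net.InetSocketAddress;
  import java.nio.charset.StandardCharsets;

  public class EndpointSketch {
      public static void main(String[] args) throws Exception {
          HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);

          // One handler per API call; this one stands in for sr.similarity(...).
          server.createContext("/similarity", exchange -> {
              double score = 0.73; // placeholder for a real SR computation
              byte[] body = ("{\"score\": " + score + "}")
                      .getBytes(StandardCharsets.UTF_8);
              exchange.getResponseHeaders().set("Content-Type", "application/json");
              exchange.sendResponseHeaders(200, body.length);
              try (OutputStream os = exchange.getResponseBody()) {
                  os.write(body);
              }
          });

          server.setExecutor(null); // use the default executor
          server.start();
      }
  }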

Task 4. Deploy the WikiBrainTools API

  • April - June 2015
  • Deploy API instance on Wikimedia Labs
  • Work with one early-adopter tool to test and refine the API. Contribute to software integration work in the early-adopter tool if needed.
  • Conduct performance benchmarks. Optimize algorithms and adjust resource sharing modules as needed.
  • Semi-publicly launch the WikiBrainTools API with a limited number of access tokens. Monitor resource usage.
  • Publicly launch the WikiBrainTools API.

Task 5. Encourage adoption of WikiBrain

  • October 2014 - June 2015
  • Online:
    • Email forums associated with WikiTools developers, Wikipedia researchers, and algorithmic researchers.
    • Post online demos.
    • Hold online office hours.
    • Host a session at the Wiki Research hackathon.
  • Conferences: SIGIR, WWW, WikiSym / OpenSym
    • Organize demos.
    • Hold Birds of a Feather sessions.

Budget

  • Conference travel: Promote WikiBrain as a platform for the information retrieval, natural language processing, artificial intelligence, and Wikipedia research communities.
    • WikiSym 2015 - travel from MN to California ($600), conference registration ($600), accommodations ($600): $1,800
    • SIGIR 2015 - travel from MN to Chile ($1500), conference registration ($900), accommodations ($700): $3,100
    • WWW 2015 - travel from MN to Florence ($1500), conference registration ($900), accommodations ($800): $3,200
  • Software development costs:
    • Grantees: 15 weeks, 20 hours per week @ $60 / hour: $18,000
  • Dataset collection costs:
    • Extract "surface forms" of Wikipedia concepts[3] for non-English Wikipedias using Common Crawl and en:Amazon Web Services: $800.
    • Create NLP training datasets for three unavailable languages - $200 per dataset: $600
  • Server resources (may be disk + memory upgrades to the grantees' existing hardware or cloud-based options): $2,000
  • Total budget: $29,500 USD

Community engagement

This grant supports the creation of a technology bridge connecting three communities (algorithmic researchers, Wikipedia researchers, and WikiTool builders). Therefore, the engagement of these communities is critical to the success of the project. We have already made significant progress in connecting with algorithmic researchers and bringing state-of-the-art algorithms from NLP, GIS, and AI communities to WikiBrain. This grant supports increased engagement with these communities in addition to engaging the WikiTools and Wikipedia research communities. The details of this engagement are specifically described in the project plan above.

WikiTools constituencies that will be contacted:

Wikipedia research constituencies that will be contacted:

Sustainability

WikiBrain has been developed without support from the Wikimedia Foundation, and its development will continue beyond the scope of this IEG; bots will therefore benefit from future algorithmic enhancements to WikiBrain. In addition, over the next year, the grantees will seek funding for future "pure" algorithmic enhancements to WikiBrain from organizations such as the National Science Foundation, Macalester College, and the University of Minnesota. We will also encourage algorithmic researchers to develop their algorithms using WikiBrain, making integration of new algorithms easy (we have already seen some successes here).

The web API we create will be hosted on a Wikimedia Labs instance and administered by the grantees. We want to stress that we will work within the resource utilization guidelines provided by Wikimedia Labs administrators. While WikiBrain is already highly optimized (most API calls take milliseconds or less), this project will support performance optimizations targeted at common Wikipedia researcher and bot use cases, and an access-token-based resource sharing module that ensures API clients share server resources fairly (a sketch follows).
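
As an illustration of how such a module might work, the sketch below implements a per-token "token bucket": each access token accumulates request credit at a fixed rate up to a burst cap, and a request that finds its bucket empty is rejected. The class name, refill rate, and cap are illustrative assumptions rather than a finalized policy.

  import java.util.Map;
  import java.util.concurrent.ConcurrentHashMap;

  public class TokenBucketLimiter {
      private static final double CAPACITY = 10.0;      // maximum burst per client (illustrative)
      private static final double REFILL_PER_SEC = 2.0; // sustained requests/second (illustrative)

      private static class Bucket {
          double tokens = CAPACITY;
          long lastRefillNanos = System.nanoTime();
      }

      private final Map<String, Bucket> buckets = new ConcurrentHashMap<>();

      // Returns true if the client identified by apiToken may proceed.
      public boolean tryAcquire(String apiToken) {
          Bucket b = buckets.computeIfAbsent(apiToken, t -> new Bucket());
          synchronized (b) {
              long now = System.nanoTime();
              double elapsedSec = (now - b.lastRefillNanos) / 1e9;
              b.tokens = Math.min(CAPACITY, b.tokens + elapsedSec * REFILL_PER_SEC);
              b.lastRefillNanos = now;
              if (b.tokens >= 1.0) {
                  b.tokens -= 1.0;
                  return true;
              }
              return false;
          }
      }
  }

A scheme like this lets a shared Labs instance absorb short bursts from interactive tools while capping the sustained load any single bot can impose.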

Measures of success

  • Launch of the new WikiBrainTools API.
  • At least one pilot Wikipedia tool actively using the API.
  • Creation of at least three new datasets required by NLP algorithms in underserved languages with active Wikipedia communities.
  • 50 new WikiBrain installs for researchers (WikiBrain contains analytics that track this information).

Get involved

Participants

  • Grantee Shilad Sen is an Associate Professor of Computer Science at en:Macalester College in Minnesota. He has published research on Wikipedia both from an algorithmic perspective[2][4][5] and from a social-systems perspective, including work studying Wikipedia's gender gap.[6] Shilad led software development at an early recommender systems startup for six years before returning to academia and receiving a Ph.D. under en:John T. Riedl.
  • Co-Grantee Brent Hecht is an Assistant Professor of Computer Science at the University of Minnesota. Like User:Shilad, he both studies Wikipedia itself[7][8][9][10][11] and studies what artificial intelligence and NLP can gain from Wikipedia[5][12][13][14]. With regard to the former, he led and co-led some of the early work on the similarities and differences between the language editions[9][10], including the Omnipedia[8] project. With regard to the latter, Brent and colleagues leveraged Wikipedia in the Atlasify project, published at ACM SIGIR[12] (the top information retrieval venue). Atlasify should launch by the end of the year, with a WikiBrain-based backend! Brent was the closing keynote speaker at WikiSym 2012.
  • Volunteer (advice and API design): EpochFail (talk) 15:18, 27 September 2014 (UTC)

Community Notification

If you would like to support this proposal, please add a comment to the "Endorsements" section below. If you are a bot developer or Wikipedia community researcher who would like to provide feedback on the features in the new API, head over to the use cases page and add your ideas. The following communities / pages have been encouraged to comment on this proposal and add to the existing use cases:

Endorsements

Do you think this project should be selected for an Individual Engagement Grant? Please add your name and rationale for endorsing this project in the list below. (Other constructive feedback is welcome on the talk page of this proposal).

  • Community member: add your name and rationale here.
  • While the research community that uses Wikipedia as a datasource has been historically criticized for not contributing back to Wikipedia (e.g. here) these researchers are aggressively trying to close that gap. This project has great potential -- not only in bringing the power of the algorithms they specify to the hands of wiki-tool developers, but also to perpetuate the practice of completing the circle and giving back. --EpochFail (talk) 15:32, 27 September 2014 (UTC)
  • As a wiki researcher, I see great potential in this project. I can see how it will improve my data science efforts moving forward, streamline those efforts, and make these algorithms more accessible to a broader class of researchers. West.andrew.g (talk) 15:28, 29 September 2014 (UTC)
  • Promising concept, $29.5 K seems a shoe-string budget. Erik Zachte (WMF) (talk) 18:25, 30 September 2014 (UTC)
  • This project should have great potential for both tool developers and researchers. The first group should be able to both improve existing tools through access to state-of-the-art algorithms as well as come up with ideas for new and exciting tools based on access to new data/algorithms. For researchers, the grantees indicate that it can lower the bar to access data, as well as enable new and more complex research projects, which hopefully will be both of the analytic as well as the applied kind. As the maintainer of User:Suggestbot, I can easily see this project help improve the bot's performance and potentially extend the variety of suggestions we can make. Regards, Nettrom (talk) 20:09, 30 September 2014 (UTC)
  • Recommendation from User:EpochFail is valuable, and the project description makes it look quite interesting. I'd strongly stress that the authors need to make user interface as simple as possible; i.e. make the tool easy to use for people who don't know how to code! --Piotrus (talk) 02:51, 2 October 2014 (UTC)
  • Confident that this will be helpful for folks in the WikiTools community, and after comments on the talk page, I am happy to hear about outreach / publishing efforts to inform more of the scientific community to attract more research efforts in this fields. I JethroBT (talk) 18:49, 2 October 2014 (UTC)
  • As a long-time wikipedian with a PhD in a related field, I support this work. See talk page for more detailed comments. Stuartyeates (talk) 00:53, 3 October 2014 (UTC)
  • This looks promising. Helder 21:45, 3 October 2014 (UTC)
  • If this gets NLP researchers more involved in WP, that's fantastic. Dank (talk) 19:58, 4 October 2014 (UTC)
  • I just worked with Shilad and Brent on a project to understand the origins of sources and articles relating to places on Wikipedia in 40+ languages. WikiBrain was instrumental in the research and I'm hoping to use it in my own future projects as well. Having worked with Shilad and Brent, I can definitely say that they would do a great job with these resources - they have displayed a long-term commitment to the Wikipedia research community and have great integrity. --Hfordsa (talk) 16:17, 6 October 2014 (UTC)
  • Exactly the kind of project we should be funding 86.179.205.216 16:39, 8 October 2014 (UTC)
  • What Heather said. Ironholds (talk) 21:40, 8 October 2014 (UTC)
  • This would be very useful for me as a developer. Yes, please! -- 72.83.193.51 21:12, 8 December 2014 (UTC)

References

  1. Aaron Halfaker, John Riedl, "Bots and Cyborgs: Wikipedia's Immune System," Computer, vol. 45, no. 3, pp. 79-82, March, 2012
  2. Sen, S., Li, T., Hecht, B. 2014. WikiBrain: Democratizing computation on Wikipedia. Proceedings of the 10th International Symposium on Open Collaboration (OpenSym / WikiSym 2014). New York: ACM Press.
  3. http://nlp.stanford.edu/pubs/crosswikis.pdf
  4. Sen, Shilad; Charlton, Henry; Kerwin, Ryan; Lim, Jeremy; Maus, Brandon; Miller, Nathaniel; Naminski, Megan R.; Schneeman, Alex; Tran, Anthony; Nunes, Ernesto. "Macademia: Semantic visualization of research interests." Proceedings of the 16th International Conference on Intelligent User Interfaces (IUI 2011), pp. 457-458.
  5. Shilad Sen, Matt Lesicko, Margaret Giesel, Rebecca Gold, Ben Hillmann, Sam Naden, Jesse Russell, Zixiao Wang, and Brent Hecht. "Turkers, Scholars, 'Arafat' and 'Peace': Cultural Communities and Algorithmic Gold Standards." To appear in the proceedings of CSCW 2015.
  6. Lam, S. K.; Uduwage, A.; Dong, Z.; Sen, S.; Musicant, D. R.; Terveen, L.; Riedl, J. (2011) WP:Clubhouse? An Exploration of Wikipedia's Gender Imbalance. Proceedings of WikiSym 2011. Open access
  7. Warncke-Wang, M., Ayukaev, V., Hecht, B., and Terveen, L. (2015) The Success and Failure of Quality Improvement Projects in Peer Production Communities. Proceedings of the 18th ACM Conference on Computer-Supported Cooperative Work and Social Computing (CSCW 2015). New York: ACM Press.
  8. Bao, P., Hecht, B., Carton, S., Quaderi, M., Horn, M. and Gergle, D. (2012). Omnipedia: Bridging the Wikipedia Language Gap. Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI 2012). New York: ACM Press.
  9. Hecht, B. and Gergle, D. (2010). The Tower of Babel Meets Web 2.0: User-Generated Content and Its Applications in a Multilingual Context. Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI 2010), pp. 291–300. New York: ACM Press. Best Paper Award.
  10. Hecht, B. and Gergle, D. (2009). Measuring Self-Focus Bias in Community-Maintained Knowledge Repositories. Proceedings of the International Conference on Communities and Technologies (C&T 2009), pp. 11-19. New York: ACM Press.
  11. Hecht, B. 2013. The Mining and Application of Diverse Cultural Perspectives in User-Generated Content. Northwestern University.
  12. Hecht, B., Carton, S., Quaderi, M., Schöning, J., Raubal, M., Gergle, D., Downey, D. 2012. Explanatory Semantic Relatedness and Explicit Spatialization for Exploratory Search. Proceedings of ACM SIGIR 2012. New York: ACM Press.
  13. Hecht, B. and Moxley, E. 2009. Terabytes of Tobler: Evaluating the First Law of Geography in a Massive, Domain-Neutral Representation of World Knowledge. Proceedings of the 2009 International Conference on Spatial Information Theory, pp. 88-105. Berlin: Springer-Verlag.
  14. Hecht, B. and Raubal, M. 2008. GeoSR: Geographically explore semantic relations in world knowledge. Proceedings of the 2008 AGILE International Conference on Geographic Information Science, pp. 95-114. Berlin: Springer-Verlag.