User:Charles Matthews/Draft proposal2

From Meta, a Wikimedia project coordination wiki
status: please add a status
summary: Divetech applies text-mining techniques to scholarly articles on LGBT topics, to help identify high-quality sources.
target: Wikipedias in English, Portuguese, Spanish and Arabic. WikiLGBT. Wikidata.
type of grant: software
type of applicant: organization
grantee: ContentMine Ltd.

Project idea

What is the problem you're trying to solve?

"Improve the quality of LGBT-related articles on Wikipedia and Wikidata"

Gender and queer issues are now part of public debate and policy in almost all countries, and the Internet is increasingly seen as the natural resource for knowledge. Terminology is so varied, ambiguous and dynamic that people often use terms which are outdated or not relevant to the particular geographic or cultural region. Using an inappropriate term, even when well-intentioned, can create ambiguity and often causes hurt. Inappropriate vocabulary can also be used to denigrate legitimate communities.

Wikipedia is increasingly seen as a knowledge resource by most Internet users. It’s therefore critical that it’s well informed, multicultural and up-to-date. Wikidata feeds off Wikipedia and, in part, manages the “factual” data (especially terminology) through Semantic Web technology. It uses modern tools which support multilingualism and aims to provide navigation between different cultures. A reader interested in a term finds the Wikidata entry and from there the “equivalent” page in another language. This is not a translation: it’s an entirely separate page written by committed Wikipedia editors.

However, there are major problems:

  • Equivalent pages in multiple languages often do not exist.
  • Wikidata is often very sparse.
  • Pages are of very variable quality.
  • Terminology is not mapped between different languages.
  • Language and terminology are changing very rapidly.

The aim is to find a way to make measurable improvements to LGBT-related articles, on several language Wikipedias, through an in-depth study of quality sources. Quality metrics, such as FA/Good article on English Wikipedia, are available and are supported across WikiProjects. The problems listed above mean that keyword search, a library staple, is hard to apply in this area, so the problem addressed is to find an adequate substitute. It should apply not only to quality improvements in Wikipedia articles, but also to scrutiny of existing referencing, and to a reduction of systemic bias in coverage of this area, particularly in relation to traditional attitudes characteristic of the Global South.

What is your solution?

As a contribution to the community, this will be a diversity-tech project that applies multilingual text-mining and Wikibase data retrieval to academic literature on LGBT+ topics. It will break new ground in partnering Wikimedia with a global network of interest groups and communities.

The Wikimedia community can best support the content policy pillars, attribution and neutrality, by close attention to terminology in the LGBT and gender area, and the identification of topical classifications that do better than loaded keywords. This work should be supplemented by identification of cruxes (problematic terms) in texts, and the close reading of proximity of terms. Attribution cannot be a plain matter of bibliographical research looking to find "proof texts": a drive-by approach to verification becomes the "Devil can cite scripture" when context is simply ignored. Or, in short, we need to be aware of and avoid confirmation bias in searching for terms and interpreting them in context.

The project will adopt contemporary technical means from the text and data mining field, as it is currently being applied to Wikimedia content. It will also work with WikiProjects and outside organizations, in particular by involving interest groups representing the Global South. Operationally it will compile a multilingual corpus of papers, and use a text and data mining (TDM) tool and dictionaries in line with earlier ContentMine projects. It will set up a Wikibase platform, where a community can work on the "main subject", terminological issues and contested aspects of supposed facts in the LGBT area. The platform will support language-specific statements (through a Wikibase feature), and a "reliability" property for sources, which may allow for qualified approvals.
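
As a rough illustration of dictionary-based term spotting combined with a proximity check, the following minimal Python sketch may help. It is not the ContentMine tool: the dictionary entries, the function names and the 10-token window are all assumptions made for this example, and real project dictionaries would be language-specific and much larger.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical mini-dictionary: surface term -> concept label (illustration only).
DICTIONARY: Dict[str, str] = {
    "same-sex marriage": "marriage equality",
    "marriage equality": "marriage equality",
    "transgender": "gender identity",
}


def find_terms(text: str, dictionary: Dict[str, str]) -> List[Tuple[int, str, str]]:
    """Return sorted (token_index, matched_term, concept) for each dictionary hit."""
    lowered = text.lower()
    hits = []
    for term, concept in dictionary.items():
        # Match the term only when it is not embedded in a longer word.
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        for match in re.finditer(pattern, lowered):
            # Approximate position: number of whitespace-separated tokens before the hit.
            token_index = len(lowered[: match.start()].split())
            hits.append((token_index, term, concept))
    return sorted(hits)


def near_pairs(hits: List[Tuple[int, str, str]], window: int = 10) -> List[Tuple[str, str]]:
    """Return pairs of hits for different concepts lying within `window` tokens."""
    pairs = []
    for i, (pos_a, term_a, concept_a) in enumerate(hits):
        for pos_b, term_b, concept_b in hits[i + 1:]:
            if concept_a != concept_b and pos_b - pos_a <= window:
                pairs.append((term_a, term_b))
    return pairs


sample = "The court ruling on same-sex marriage also affected transgender applicants."
sample_hits = find_terms(sample, DICTIONARY)
sample_pairs = near_pairs(sample_hits)
```

A pair returned by `near_pairs` is only a prompt for close reading by an editor; it says nothing by itself about how the terms are used in context.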

With matching metadata items created or found on Wikidata, conventional bibliographical support will be available alongside the Divetech findings, and will be generally accessible through a dashboard front end to the Wikibase installation.

Recruitment to the site will be supported by the partner organizations.

Project goals


The project aims to make it easy to discover useful literature for LGBT studies, and also to promote a better understanding of its content. It will undertake the curation of a collection of papers, and also carry out research with and on them. The field requires nuance, which the project will express as precisely as it can, and cultural awareness, where it can play a role as a forum. As the project title suggests, its goal is to make this area of the humanities not only more digital, but less unfamiliar. In concrete terms, it will produce results of interest both to Wikipedia editors and to activists, while preserving the neutrality that defines Wikipedia's approach to reference material.

Specific Goal Description | Wikimedia Project benefit | Wikimedia community benefit
Improve LGBT-related Wikipedia articles by supplying good-quality references to Wikipedia editors in the topic area. | Wikipedias | LGBT WikiProjects
Improve LGBT-related Wikidata information, by the creation of new scholarly article items, for uploads to the Divetech platform and found by means of a census of references in Wikipedia articles. | Wikidata | Wikidata:WikiProject LGBT
Develop a more diverse community of editors, volunteers and international organizations, partnering in the work around this project. | Wikipedias, Wikidata | LGBT WikiProjects

Project impact

How will you know if you have met your goals?

Specific Goal Description | Measurement criteria | Actions taken
Improve LGBT-related Wikipedia articles by supplying good-quality references to Wikipedia editors in the topic area. | 30% improvement in number of Good Articles (or equivalents) in LGBT project monitoring lists | Tracking via pages at en:Wikipedia:WikiProject_LGBT_studies/Assessment#Statistics, es:Wikiproyecto:LGBT#Wikiestrella and pt:Wikipédia:Projetos/Estudos_LGBT/Avaliação#Matriz resumo de avaliações actuais
Improve LGBT-related Wikidata information, by the creation of new scholarly article items, for uploads to the Divetech platform and found by means of a census of references in Wikipedia articles. | 10,000 scholarly article items with main subjects created on Wikidata | Wikidata analytics
Develop a more diverse community of editors, volunteers and international organizations, partnering in the work around this project. | 100 new editors found | WikiProject signups, and identification of editors in the topic area with WikiProject help.
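
The Good Article criterion can be checked with simple arithmetic. The sketch below assumes that the 30% figure is measured relative to the count at the start of the project. The function names and the sample counts (40 and 53) are hypothetical and are not taken from any assessment page.

```python
def relative_improvement(baseline: int, current: int) -> float:
    """Fractional change in the Good Article count relative to the baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be a positive count")
    return (current - baseline) / baseline


def target_met(baseline: int, current: int, target: float = 0.30) -> bool:
    """True when the relative improvement reaches the target (30% by default)."""
    return relative_improvement(baseline, current) >= target


# Hypothetical counts: 40 Good Articles at the start, 53 at the end.
example_change = relative_improvement(40, 53)
example_ok = target_met(40, 53)
```

With these invented numbers the change is 13/40, or 32.5%, which would meet the criterion; an end count of 50 (25%) would not.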
Continuing impact
  • The platform will have content reusable as RDF, and will be a hub for its area.
  • Establishing a viable process for identifying relevant sources that goes beyond English will open the way to greater multilingualism there.
  • The software tools will be open source, and reusable for other projects.
  • The chosen corpus, in several languages and reviewed for quality, will be kept online.
  • The diverse, international community built around the project will take on its own direction.

Goals around participation or content?

Metrics | Numeric target | Tools & documentation
Total participants | Editors: 100, measured by account creation; Workshop attendees: 100, measured by attendance list; Meetup attendees: 50, measured by attendance list; Newsletter circulation: 1000 individuals | Number of accounts; Attendance list; Attendance list; Mailing list
Number of newly registered users | New collaborators/volunteers: 100, measured by account creation | Number of accounts
Number of content pages created or improved across all Wikimedia projects | New pages: 100; Improved pages: 700 | Number of pages improved or referenced

Project plan


Our project scope is to make available useful information extracted from a corpus of 50K scholarly papers on LGBT topics, making their usefulness as references for Wikipedia more transparent. For that we will apply standard text-mining techniques with output into a Wikibase site, and develop a front end as a dashboard, using SPARQL to serve up information from both the Divetech site and Wikidata.
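
To give a sense of the kind of SPARQL query the dashboard might serve, here is a minimal Python sketch that only builds the query text and makes no network call. It assumes the Wikidata Query Service conventions, where P31 is "instance of", Q13442814 is "scholarly article" and P921 is "main subject". The function name is hypothetical, the subject item Q17884 is assumed to be the LGBT topic item and should be checked before use, and the Divetech Wikibase site would have its own property and item identifiers.

```python
def build_main_subject_query(subject_qid: str, limit: int = 50) -> str:
    """Build a SPARQL query listing scholarly articles with a given main subject."""
    if not (subject_qid.startswith("Q") and subject_qid[1:].isdigit()):
        raise ValueError(f"not a Wikidata item id: {subject_qid!r}")
    return f"""
SELECT ?article ?articleLabel WHERE {{
  ?article wdt:P31 wd:Q13442814 ;
           wdt:P921 wd:{subject_qid} .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en,es,pt,ar". }}
}}
LIMIT {int(limit)}
""".strip()


# Assumed identifier for the LGBT topic item; verify on Wikidata before use.
SUBJECT_QID = "Q17884"
example_query = build_main_subject_query(SUBJECT_QID, limit=10)
```

The label service line asks for labels in the four target languages in order of preference, so that an article without an English label can still be displayed.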

We have divided the work into eight work packages. The table below gives the objectives and outputs of each work package, and the Gantt chart that follows shows their duration. A proposed workflow diagram is provided in the file 2018 CM proposal pipeline.png.

WP code | Work package | Objectives | Outputs
WP1 | Corpus | Selection of 50k+ papers related to LGBT topics, to help improve current articles towards FA status. | Annotated corpus on Wikibase site
WP2 | TDM tool and dictionaries | Develop software that can apply TDM techniques to a high volume of papers, with custom dictionaries compiled for the project. | Tool made available under an open license; language-specific dictionaries made available as files, with an account of the compilation process.
WP3 | DTbase platform | Develop the Divetech UI, so that editors can easily add value to the site. | Online data and a dump of the content at the project's end.
WP4 | Quality filtering | Criteria for reliability of references in the area, developed and applied by the Divetech community to tag papers. | Sub-corpus of papers tagged as reliable made available in machine-readable form, e.g. as a Wikidata focus list.
WP5 | Wikidata uploading | Develop an automatic tool for uploading items about suitable corpus papers to Wikidata, and populating them with metadata. | Bot code made available as open source.
WP6 | Dashboard | UI for organizations without wiki knowledge to reuse the information curated by editors on the platform. | Front end available to other Wikibase sites; underlying SPARQL queries made available, e.g. on a Wikidata page.
WP7 | Comms and dissemination | Communicate about the project internally and externally, share our outputs with the wider community, engage with new users and experienced editors. | Archive of newsletters online.
WP8 | Project management | Ensure that the action runs smoothly, that there is excellent communication among all the project participants, volunteers and community, and that high-quality outputs are delivered on time. | Project plan; Progress report; Risk register; Final report.
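
One possible shape for the WP5 upload step is sketched below. Rather than a bot, it emits QuickStatements (version 1) batch text, a swapped-in technique chosen here only because it needs no network code. The property identifiers are the Wikidata ones (P31 instance of, P1476 title, P356 DOI, P921 main subject). The function name, the paper title and the DOI are invented placeholders, and Q17884 is assumed to be the LGBT topic item.

```python
from typing import List


def quickstatements_for_paper(title: str, doi: str, subject_qids: List[str],
                              lang: str = "en") -> str:
    """Return QuickStatements v1 commands that create one scholarly article item."""
    # Double quotes delimit strings in QuickStatements, so replace any in the title.
    safe_title = title.replace('"', "'").strip()
    lines = [
        "CREATE",
        f'LAST\tL{lang}\t"{safe_title}"',        # label in the given language
        "LAST\tP31\tQ13442814",                  # instance of: scholarly article
        f'LAST\tP1476\t{lang}:"{safe_title}"',   # title (monolingual text)
        f'LAST\tP356\t"{doi.upper()}"',          # DOI, upper-cased as on Wikidata
    ]
    lines += [f"LAST\tP921\t{qid}" for qid in subject_qids]  # main subject(s)
    return "\n".join(lines)


# Placeholder metadata, not a real paper.
example_batch = quickstatements_for_paper(
    title='A "hypothetical" study title',
    doi="10.1234/example.5678",
    subject_qids=["Q17884"],
)
```

A production tool would first check whether an item with the same DOI already exists, since the project plans to reuse matching metadata items found on Wikidata.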

Project Gantt chart (months 1 to 12)
Work package | Number of months marked
WP1 Corpus | 6
WP2 TDM tool and dictionaries | 4
WP3 DTbase platform | 7
WP4 Quality filtering | 6
WP5 Wikidata uploading | 8
WP6 Dashboard | 6
WP7 Comms and dissemination | 12
WP8 Project management | 12

Time expenditure

Tech development / Comms plan / PM (contract)
  • ContentMine: WP1 to WP8
  • Lane Rasberry: WP1, WP3, WP7
  • GVSU/McGill: WP2, WP6
Advisors (volunteers)
  • Elisabeth Jay Friedman (Latam studies expert, U. of San Francisco): WP1, WP6
  • Anasuya Sengupta (Whose Knowledge): WP6, WP7
  • Myra Abdallah (Arab Foundation for Freedoms and Equality, AFE): WP6, WP7
  • Jason Moore (Wiki LGBT group): WP1, WP3, WP6, WP7
  • Gonzalo Velasquez (Movilh Chile): WP6, WP7
  • Tania León (Fundacion Arcoiris, Mexico): WP6, WP7
Organizations (volunteers)
  • Movilh (Chile): WP6
  • Fundacion Arcoiris (Mexico): WP6
  • AFE (Middle East): WP6
  • McGill (Canada): WP2, WP6

Project budget

Work Package / Task | USD
WP1: Corpus | $tbc
WP2: TDM tool and dictionaries | $tbc
WP3: DTbase platform | $tbc
WP4: Quality filtering | $tbc
WP5: Wikidata uploading | $tbc
WP6: Dashboard | $tbc
WP7: Comms and dissemination | $tbc
WP8: Project management | $tbc

Achievements at the end of the project

An overall summary of what is intended:

  1. Helped to improve the quality of 300 articles in Wikipedia.
  2. Added at least 20K statements to Wikidata.
  3. Created a platform that can be used by the Wikimedia community to improve Wikipedia content, and by anyone who wishes to explore its content and expand their knowledge of the area.
  4. Formed institutional links between non-Wikimedia organizations related to the project's work, and the Wikimedia world.
  5. Created a dashboard, also powered by Wikidata, to be used by these (and any other) organizations to extract valuable, reusable information about the topic.
  6. Promoted and imported research in languages other than English.
  7. Created a diverse community of organizations and volunteers around this project.
  8. Filled gaps in knowledge about under-represented LGBT communities.
  9. Tied research within academia, carried out by GVSU/McGill, into the project.


Communications and dissemination plan

Our communication and dissemination activities will be supported by Wikimedia UK, Movilh, AFE, Fundacion Arcoiris, Whose Knowledge and other organizations with which we are currently collaborating on this project. Our communication plan is divided into three main activities, as follows:

  • Online and social media presence. We will use monthly newsletters, place articles in relevant publications or via blogs, and make intensive use of Twitter. We will drive traffic to one of three landing pages: the project wiki main page, for participation; a Wikipedia project page, for case studies on LGBT referencing; and a mentoring page on Wikiversity, for support. All pages will contain information on the problem context and be cross-linked for easy navigation.
  • In-person presentations. These will introduce the project workflows in easy steps to lower the barriers to entry for those who would like to participate. We will organise meetups and networking opportunities, edit-a-thons, webinars and hands-on workshops.
  • External media and communication. For audiences outside Wikimedia the goal will be to explain the context of the project, such as the intractability of reviewing the whole open access LGBT literature, and how important such a review is for ensuring that statements on LGBT pages are well referenced on Wikimedia sites in all languages. The main opportunities are via mailing lists and attendance at relevant conferences.

Our key objectives are:

  1. Embed our diversity-tech tools into the workflow of Wikimedian editors and other targeted users, so that usage continues sustainably after the project's end.
  2. Reach a wider audience outside Wikimedia communities, and bring new editors in to join the project.
  3. Disseminate our project's benefits to the broader Wikimedia community and the public.

Engagement plan

We have identified, contacted and engaged with: international organizations dedicated to gender studies and equality from three different regions in the Global South; an experienced team of advisors with a wide network in the field; and two university research groups. They will help us to deliver and promote our project to their audiences. This project was developed in consultation with members of the LGBT Studies WikiProject on the English Wikipedia.

We will also design a social media campaign, centred on Twitter to reach more than 100,000 people. It will build on the above, especially our advisors and organizations supporting the project. For the broad audience, we'll circulate videos, newsletters and documentation. We'll place op-eds in the mainstream media and related publications with a campaigning editorial line.

Experience with past interviews given by our team members suggests that 100K views is a conservative estimate.

Our engagement activities will include:

Activity | Description | Timeframe
Existing and new editors engagement | Use our network to engage new editors from their community as well as existing LGBTwiki editors | M1-M6
Meetup sessions | Organise and deliver a three-monthly meetup (20 people per session) | Every 3 months
Newsletter for Wikimedia community | Deliver a monthly newsletter, reaching Wikimedians (100 people on newsletter list) | 1 every month
Newsletter for general community (multilingual) | Deliver a monthly newsletter, reaching a wider community (1000 people on newsletter list) | 1 every month
Advisory board network | Deliver targeted content for our advisors and their networks (+10K) | M3, M6, M9
Attendance at non-Wikimedia conferences | Present our project progress at 10 conferences | 1 every month
Project webpage | Ensure development work and results are communicated through ContentMine’s own site, wiki page and social media to interested communities at each step in the process | M1
Engagement with social media communities | Develop a social media outreach plan for the project and the organizations supporting it (multilingual) | M6
Press campaigns | Develop and deliver press articles in related journals | M6-M9
Workshops | Deliver 2 workshops during the project, aimed at potential new editors | M6, M9
Webinars | Deliver a project webinar every three months to explain the project progress and increase awareness (in 2 languages) | Every 3 months
Training material | Prepare videos to explain each stage of the project and how new editors and volunteers can contribute | Every month

Get involved


Jenny Molloy

Jenny is a molecular biologist by training and manages ContentMine collaborations and business development. She spoke on synthetic biology at Wikipedia Science Conference 2015 and has been a long-term supporter of open science. She is also a Director of Biomakespace, a non-profit community lab in Cambridge for engineering with biology.

Lane Rasberry

User:Bluerasberry, Wikimedian-in-residence at the Data Science Institute at the University of Virginia. He coordinates projects between the university and Wikipedia, Wikidata, and other Wikimedia projects.

Peter Murray-Rust

Peter has been a Wikimedian since 2006 and delivered keynote talks at Wikimania 2014 and Wikipedia Science Conference 2015, where CM also ran a hands-on workshop. Peter founded ContentMine as a Shuttleworth Foundation Fellow, and is the main software pipeline architect. He received his Doctor of Philosophy from the University of Oxford and has held academic positions at the University of Stirling and the University of Nottingham. His research interests have focused on the automated analysis of data in scientific communities. In addition to his ContentMine role, Peter is also Reader Emeritus in Molecular Informatics at the Unilever Centre, in the Department of Chemistry at the University of Cambridge, and Senior Research Fellow Emeritus of Churchill College in the University of Cambridge. Peter is renowned as a tireless advocate of open science and the principle that the right to read is the right to mine.

Wikimedia UK

Wikimedia chapter based in London, Chief Executive Lucy Crompton-Reid. Their mission is "to support and advocate for the development of open knowledge, working in partnership with volunteers, the cultural and education sectors and other organisations to make knowledge available, usable and reusable online."

Cesar Gomez

Director, ContentMine

Jo Brook

ContentMine contractor


Advisors (volunteer)

  • Elisabeth Jay Friedman: Professor, University of San Francisco, author of Interpreting the Internet: Feminist and Queer Counterpublics in Latin America (University of California Press, 2016)
  • Anasuya Sengupta, founder of Whose Knowledge, Indian poet and activist, authority on representation for marginalized voices on the Internet
  • Myra Abdallah, Middle East and North Africa regional manager of the Women in News program of the World Association of Newspapers and News Publishers (WAN-IFRA) and Director of the Gender and Body rights Media Center of the Arab Foundation for Freedoms and Equality (AFE).
  • Jason Moore of WikiProject LGBT studies
  • Gonzalo Velasquez (Movilh Chile)
  • Tania Yasmín León Vázquez (Fundacion Arcoiris, Mexico)

Organizations expressing interest in re-using the output of the project

  • Whose Knowledge (USA), global campaign to center the knowledge of marginalized communities on the Internet
  • Movilh (Chile), human rights advocacy organization with focus on civil rights and liberties for lesbian, gay, bisexual and transgender citizens

