Grants talk:IdeaLab/Management of most public documentation for clinical research

From Meta, a Wikimedia project coordination wiki

Ambition[edit]

This project is wildly ambitious but it does take small steps. Blue Rasberry (talk) 11:24, 21 April 2014 (UTC)[reply]

Technical needs[edit]

For this project to scale it would need multiple unrelated software tools, and I know nothing about software development. At a small or testing level, everything could be prototyped manually. The major technical help I need is with calling the API from Lilly Pharmaceuticals, just so that I can see the data it returns. Here is a breakdown of the software tools needed to make this work.

Collect identification numbers from a dataset[edit]

Every clinical trial has an identification number. Given database X of clinical trials, the tool should collect the identification number from this database. To start, get these numbers from Lilly's database. In the longer term, these numbers should come from clinicaltrials.gov, the US federal clinical trials registry.

  • Manual way - somehow call the Lilly API and manually copy and paste perhaps 100 identifiers
  • Automated way - a tool grabs all identifiers in the database
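
Since the shape of Lilly's API is still unknown, the collection step can be sketched as a small script that pulls NCT-format identifiers out of whatever text or JSON the API returns. This is only a sketch under an assumption: that the trials carry clinicaltrials.gov-style identifiers ("NCT" plus eight digits). The sample input below is invented.

```python
import re

# NCT identifiers follow the pattern "NCT" + 8 digits (an assumption based on
# the clinicaltrials.gov numbering scheme; the Lilly API may use another format).
NCT_PATTERN = re.compile(r"NCT\d{8}")

def collect_trial_ids(raw_response: str) -> list[str]:
    """Collect unique clinical-trial identifiers from an API response body."""
    seen = []
    for match in NCT_PATTERN.findall(raw_response):
        if match not in seen:  # preserve order, drop duplicates
            seen.append(match)
    return seen

# Invented sample of what an API response might look like:
sample = '{"trials": [{"id": "NCT00434512"}, {"id": "NCT01234567"}, {"id": "NCT00434512"}]}'
print(collect_trial_ids(sample))  # ['NCT00434512', 'NCT01234567']
```

Working on the raw response text rather than parsed JSON means the same script works no matter how the API nests its data, which matters when the API has not been inspected yet.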

Search PubMed with collected identification numbers; collect PMIDs[edit]

If an identification number returns a search hit on PubMed, the US government medical research database, then every hit is an academic paper about the clinical trial matching that identification number. For example, a search for "NCT00434512" returns a page with four results. On the search results page, the "PubMed ID" or PMID for each research paper is displayed as "PMID:24391139". The PMID for each of these papers needs to be collected.

  • Manual way - manually search clinical trial identifier numbers and collect PMIDs. For a test, 10 trials with 3-10 papers about them would be good enough to show a prototype.
  • Automated way - a tool searches for all clinical trial identifiers and collects all associated PMIDs
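
The automated way maps naturally onto NCBI's public E-utilities, which expose the same search used on the PubMed website. A sketch, assuming the esearch endpoint with retmode=json; the sample response below is trimmed and partly invented for illustration, and the network call itself is left out.

```python
import json
from urllib.parse import urlencode

# NCBI's public E-utilities endpoint for searching PubMed.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_search_url(trial_id: str) -> str:
    """Build an esearch query URL for one clinical-trial identifier."""
    params = {"db": "pubmed", "term": trial_id, "retmode": "json"}
    return ESEARCH_URL + "?" + urlencode(params)

def extract_pmids(esearch_json: str) -> list[str]:
    """Pull the PMID list out of an esearch JSON response."""
    return json.loads(esearch_json)["esearchresult"]["idlist"]

# A trimmed-down response of the shape esearch returns with retmode=json:
sample = '{"esearchresult": {"count": "4", "idlist": ["24391139", "23434966"]}}'
print(extract_pmids(sample))  # ['24391139', '23434966']
```

Fetching `build_search_url("NCT00434512")` and feeding the body to `extract_pmids` would give the PMIDs for that trial; looping over all collected trial identifiers gives the full mapping.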

Perform a count of PMIDs per entry to choose best cases to make Wikipedia articles[edit]

For a Wikipedia article to exist, there have to be multiple published sources talking about the subject of the article. For a clinical trial article, those sources could be PubMed-indexed research papers. Most clinical trials will not return any search results in PubMed, because no paper has been published about them. To make articles which are good enough for Wikipedia, some minimum number of papers has to be chosen, perhaps 3-5, to justify the creation of a Wikipedia article on a given clinical trial. A count must be done of the search results from the "collect PMIDs" step, and only those trials which meet the minimum are processed to become Wikipedia articles.

  • Manual way - just count the PMIDs for a given clinical trial in the work database; choose 10 trials which seem to have any papers written about them
  • Automated way - very similar to manual, but do it for all entries
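
The counting step amounts to a simple filter over the work database. A minimal sketch, using an invented work database; the threshold of 3 follows the 3-5 range suggested above, and the exact value is a judgment call.

```python
MINIMUM_PAPERS = 3  # the text suggests 3-5; the exact threshold is a judgment call

def choose_trials(pmids_by_trial: dict[str, list[str]], minimum: int = MINIMUM_PAPERS) -> list[str]:
    """Return the trial identifiers with at least `minimum` associated papers."""
    return [trial for trial, pmids in pmids_by_trial.items() if len(pmids) >= minimum]

# Invented work database mapping trial identifiers to collected PMIDs:
work_db = {
    "NCT00434512": ["24391139", "23434966", "21130927", "20609449"],
    "NCT09999999": [],            # most trials: no papers published
    "NCT01111111": ["30000001"],  # one paper: below the threshold
}
print(choose_trials(work_db))  # ['NCT00434512']
```

The manual and automated ways really are the same operation here; only the size of `work_db` changes.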

Pull more data from original clinical trials dataset[edit]

Now that there is a list of clinical trial identifiers matched to a sufficient number of research papers, the trials which get Wikipedia articles have been decided. The next step is to pull more data from the original clinical trials dataset so that text for a Wikipedia article can be generated. Consider trial NCT00434512 in clinicaltrials.gov. This data is presented:

Identifier: NCT00434512
Name: Dose-ranging Study to Evaluate the Safety & Immunogenicity of a HIV Vaccine 732461 in Healthy HIV Seronegative Volunteers

Sponsor: GSK

Description: This is a single center, observer-blind, randomized, dose-escalating, staggered study with 6 groups: 3 groups of 50 subjects receiving the adjuvanted candidate vaccine, at 3 different doses and 3 groups of 10 subjects receiving the non-adjuvanted candidate vaccine in water for injection, at 3 different doses. The vaccination schedule will be 0-1 month. Blood samples will be collected at 8 visits. The duration of the study will be approximately 14 months for each subject. Rationale for Protocol Posting Amendment: The third vaccination will be cancelled and the visit at Month 7 will be postponed to Month 9. The Protocol Posting has also been updated in order to comply with the FDA Amendment Act, Sep 2007
Study start date: February 2007
Study end date: June 2008
Primary completion date: June 2008
Number of participants: 180

Collect this sort of data, and put it into the work dataset.

  • Manual way - depends on API, but assuming that one can see a dataset, one could copy and paste trial information from it
  • Automated way - copy the structured data from matching clinical trials by matching the desired clinical trial ID numbers to the numbers in database X, which contains everything
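
Whichever way the data arrives, the pull step reduces to copying a fixed set of fields into the work dataset. A sketch; the field names below mirror the labels shown above, but the real API keys will differ and would need mapping once the API is inspected.

```python
def pull_trial_fields(record: dict) -> dict:
    """Copy the fields needed for article generation into the work dataset.

    The field names mirror the labels shown on clinicaltrials.gov; the
    actual API keys will differ and must be mapped once the API is known.
    Missing fields become empty strings rather than raising an error.
    """
    wanted = ["identifier", "name", "sponsor", "description",
              "start_date", "end_date", "primary_completion_date", "enrollment"]
    return {field: record.get(field, "") for field in wanted}

# A record shaped like the NCT00434512 data above (description shortened):
record = {
    "identifier": "NCT00434512",
    "name": "Dose-ranging Study to Evaluate the Safety & Immunogenicity of a HIV Vaccine 732461 in Healthy HIV Seronegative Volunteers",
    "sponsor": "GSK",
    "start_date": "February 2007",
    "end_date": "June 2008",
    "primary_completion_date": "June 2008",
    "enrollment": 180,
    "internal_notes": "not needed",  # extra fields are dropped
}
row = pull_trial_fields(record)
print(row["sponsor"], row["enrollment"])  # GSK 180
```

Keeping the wanted-field list in one place means the later prose-generation step only ever sees a predictable, flat row.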

Convert data fields into prose[edit]

Using a paragraph with variables, the data fields just collected are turned into the sort of readable prose which is put into Wikipedia articles. In the example above, the prose could look like this:

"Dose-ranging Study to Evaluate the Safety & Immunogenicity of a HIV Vaccine 732461 in Healthy HIV Seronegative Volunteers is a clinical trial sponsored by GSK. Its enrollment was 180 participants. It started February 2007, the study ended June 2008, and its primary completion date was June 2008. The study sponsor described this study by saying "This is a single center, observer-blind, randomized, dose-escalating, staggered study with 6 groups: 3 groups of 50 subjects receiving the adjuvanted candidate vaccine, at 3 different doses and 3 groups of 10 subjects receiving the non-adjuvanted candidate vaccine in water for injection, at 3 different doses. The vaccination schedule will be 0-1 month. Blood samples will be collected at 8 visits. The duration of the study will be approximately 14 months for each subject. Rationale for Protocol Posting Amendment: The third vaccination will be cancelled and the visit at Month 7 will be postponed to Month 9. The Protocol Posting has also been updated in order to comply with the FDA Amendment Act, Sep 2007."
  • Manual way - copy and paste the data fields into a paragraph and make 10 articles like this
  • Automated way - perform this for all chosen clinical trials
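
The paragraph-with-variables idea is a fill-in template over the fields collected in the previous step. A minimal sketch, using a shortened invented description for readability; the template text follows the example paragraph above.

```python
# Paragraph template matching the example prose above; each {placeholder}
# is a field name from the work dataset row.
TEMPLATE = (
    "{name} is a clinical trial sponsored by {sponsor}. "
    "Its enrollment was {enrollment} participants. "
    "It started {start_date}, the study ended {end_date}, "
    "and its primary completion date was {primary_completion_date}. "
    'The study sponsor described this study by saying "{description}"'
)

def to_prose(row: dict) -> str:
    """Fill the paragraph template with one trial's data fields."""
    return TEMPLATE.format(**row)

# Invented row with a shortened description:
row = {
    "name": "Example Trial",
    "sponsor": "GSK",
    "enrollment": 180,
    "start_date": "February 2007",
    "end_date": "June 2008",
    "primary_completion_date": "June 2008",
    "description": "This is a single center study.",
}
print(to_prose(row))
```

Because every article uses the same template, improving the wording later means editing one string, not thousands of pages.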

Gather draft Wikipedia articles in series of text files[edit]

All of the data needed to make a Wikipedia article is brought together in an off-line draft and queued for upload. Each offline article contains the following:

  1. An article title, named after the title of the clinical trial
  2. The prose just formatted
  3. A list of all the PMID numbers found in PubMed, and associated with the clinical trial, in order to meet Wikipedia's notability criteria for creating an article
  4. Wikitext to make things formatted nicely, including in-wiki commands to expand the PMIDs to full citations, library cataloging tags, and section headings
  • Manual way - copy and paste entries into the template
  • Automated way - the same, but for everything
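
The four parts listed above can be assembled into one wikitext draft per trial. A sketch; the {{cite pmid}} expansion template and the category tag are assumptions that should be confirmed on-wiki before uploading anything.

```python
def build_draft(title: str, prose: str, pmids: list[str]) -> str:
    """Assemble one off-line draft article as wikitext.

    {{cite pmid}} asks on-wiki tools to expand a PMID into a full citation;
    whether that template is still the right mechanism, and which category
    tags apply, should be confirmed on-wiki.
    """
    references = "\n".join(f"* {{{{cite pmid|{pmid}}}}}" for pmid in pmids)
    return (
        f"Title: {title}\n\n"
        f"{prose}\n\n"
        "== References ==\n"
        f"{references}\n\n"
        "[[Category:Clinical trials]]\n"  # cataloging tag; exact category is an assumption
    )

draft = build_draft(
    "Example Trial",
    "Example Trial is a clinical trial sponsored by GSK.",
    ["24391139", "23434966"],
)
print(draft)
```

Each draft is a plain text file, so the queue for upload is just a directory of these files named by trial identifier.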

Execute upload to Wikipedia[edit]

Given the draft article all formatted to wikicode, the draft can be uploaded to Wikipedia using the title of the clinical trial.

  • Manual way - easy for any Wikipedia user to do
  • Automated way - there is a bot approval process on Wikipedia which requires prototyping and a limited trial first. Doing this manually for the trial run is probably better, especially since a test of about 10 pages would not be a burden for any individual to perform
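
For reference, each upload corresponds to one action=edit call against the MediaWiki API. A sketch that only builds the request parameters; login and CSRF token handling are omitted, and in practice a client library such as mwclient or Pywikibot would handle them, subject to the bot approval process mentioned above.

```python
# The MediaWiki API endpoint an automated upload would POST to.
API_URL = "https://en.wikipedia.org/w/api.php"

def build_edit_payload(title: str, wikitext: str, token: str) -> dict:
    """Build the POST parameters for creating one page via the MediaWiki API.

    A real upload must first log in and fetch a CSRF token
    (action=query&meta=tokens); the token here is a stand-in.
    """
    return {
        "action": "edit",
        "title": title,
        "text": wikitext,
        "summary": "Creating clinical trial article from registry data",
        "createonly": "1",  # fail rather than overwrite an existing page
        "format": "json",
        "token": token,
    }

payload = build_edit_payload("Example Trial", "Article text here.", "dummy+token")
print(payload["action"], payload["title"])  # edit Example Trial
```

The `createonly` flag is the safety valve: if a page with that trial's title already exists, the edit fails instead of clobbering someone's work.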

Seek community feedback[edit]

The output would be an oddity to the Wikipedia community, because these kinds of articles have never existed before. In the larger research community, though, this would likely provoke more discussion.

Wikidata identifier established![edit]

See d:Wikidata:Property_proposal/ClinicalTrials.gov_Identifier. Blue Rasberry (talk) 11:22, 24 August 2016 (UTC)[reply]