Grants talk:IEG/ScalaWiki data processing toolbox

From Meta, a Wikimedia project coordination wiki

Strategic priority[edit]

Can I fill the "strategic priority" field with "encourage innovation"? It's one of the 5 strategic priorities in the Wikimedia Movement Strategic Plan Summary, but the form suggests only 3 of them: "Choices of strategic goals: Increasing Reach (more people access Wikimedia projects), Improving Quality (better quality and quantity of content on Wikimedia projects), Increasing Participation (larger and more diverse groups of people are contributing to Wikimedia projects)" --Ilya (talk) 15:47, 12 April 2016 (UTC)

Hi Ilya! Thanks for this question. In our call earlier, I told you I would look into this question and get back to you. Having taken a look at your project, I would prefer that you stick with the suggested guidelines. My recommendation is that you select "Increasing Reach (more people access Wikimedia projects)." I suggest this because your project seeks to improve access so that certain kinds of data present on Wikimedia projects won't be limited to users who are able to write scripts or tools for data access.

April 12 Proposal Deadline: Reminder to change status to 'proposed'[edit]

@Ilya: This is a final reminder that the deadline for Individual Engagement Grant (IEG) submissions this round is today (April 12th, 2016). To submit your proposal, you must (1) complete the proposal entirely, filling in all empty fields, and (2) change the status from "draft" to "proposed." As soon as you’re ready, you should begin to invite any communities affected by your project to provide feedback on your proposal talkpage. If you have any questions about finishing up or would like to brainstorm with us about your proposal, let me know.

With thanks, --Marti (WMF) (talk) 19:50, 12 April 2016 (UTC)

OpenRefine[edit]

I just skimmed the proposal for now, and I was not able to understand what you're actually proposing. However, a word popped into my mind: OpenRefine. Any resemblance? Nemo 05:31, 13 April 2016 (UTC)

Not going to comment on your "skimming" and "inability to understand", ok? I'm grateful for any suggestions about tools that might be similar or related. I stumbled upon OpenRefine once but did not have the chance to look into it, and then forgot about it. Yes, it seems quite related, and I'm going to evaluate how it or any other software can be used for, or give ideas towards, the goals of this grant request. I found a mention of using OpenRefine for a Wikimedia project and a wikidata-reconcile service using it by Magnus Manske. Do you have other usage examples? It has several open issues regarding Wikimedia projects support: Support Wiki table formats, Malformed input from reconcile APIs freezes OpenRefine, Implement Wikidata reconciliation.
I found reports that it slows down and crashes with 1 GB datasets, unresolved issues that it crashes with an 88 MB dataset and cannot handle forms larger than 1 MB, and finally that it does not support streaming but loads all data into memory instead.
Comment: "The current architecture requires all data to be in memory for processing. Changing that would require a major reworking of things and require GRefine to be backed by a "real" database to do all the processing For the forseable future, you should assume that 1.x N memory is required for an N-size database, where the goal is to keep x small (say 5-20%)."
So you were probably not aware of this? I cannot assume that you are suggesting we process Wikipedia data fully in memory, requiring 5–20% more memory than the dataset size.
Regarding other tools that may be related, I'm going to do some evaluation of pandas, Jupyter (and other PyData projects), R, and Spark. Any other suggestions? I'll ask this question on the Wikimedia tech lists too. --Ilya (talk) 09:10, 13 April 2016 (UTC)
I think I'll have to provide more detailed examples --Ilya (talk) 15:14, 13 April 2016 (UTC)
Yes, examples would be good. Nemo 12:07, 17 April 2016 (UTC)
I am a regular user of OpenRefine. I think it is a valuable tool for cleaning up messy datasets from our GLAM partners. As an example, I just spent days aligning a Lepidoptera catalogue from my local university, which contains outdated names and badly identified taxa, with online resources (namely the Catalogue of Life and the Encyclopedia of Life) and Wikidata, just to be able to properly describe the specimens we photograph. OpenRefine's reconcile capability is great, despite the flaws Ilya pointed out, and Magnus' own Wikidata-reconcile service is rather crude and has its own issues.
--EdouardHue (talk) 12:38, 18 April 2016 (UTC)

Feedback[edit]

Hi Ilya, as I have been in touch with you and have used your WLM tool, I have no doubt that you have the skills needed for these kinds of tasks, as well as the required insider view of the movement. Still, I am having trouble understanding what you are offering to the movement; to me it is a bit too abstract. Proposal: could you clarify which kind of audience this toolset is intended for (any newbie editor in the movement, an experienced editor, a bot owner, a programmer, ...) and list 5 use cases of concrete problems that I could solve with this toolset?

Off-topic: it is good to know that in the coming weeks your WLX jury tool will have an admin user interface so that we can manage our stuff without needing you, and it is surely good for you, as you can then relax :) If some of Wikimedia Ukraine's employees are now familiar with the code of the WLX tool and are even implementing a new UI, would it be possible to share that code so that other chapters can add features to it? E.g. if one jury member out of five accepts a picture, then it is skipped from the others' backlog, as it will make it to the next phase anyhow. Please let me know where we can discuss this topic, or even whether I could discuss it during Berlin's WikiCon next week, as it is not directly linked to this grant but is of high interest to many people. Thank you and best regards, --Poco a poco (talk) 19:07, 15 April 2016 (UTC)

Poco a poco, the jury tool admin UI is done: commons:Commons:WLX_Jury_Tool/AdminUI, and the code is on GitHub.

Eligibility confirmed[edit]


This Individual Engagement Grant proposal is under review!

We've confirmed your proposal is eligible for review and scoring. Please feel free to ask questions and make changes to this proposal as discussions continue during this community comments period (through 2 May 2016).

The committee's formal review begins on 3 May 2016, and grants will be announced 17 June 2016. See the round 1 2016 schedule for more details.

Questions? Contact us at iegrants@wikimedia · org .

--Marti (WMF) (talk) 05:12, 28 April 2016 (UTC)

Aggregated feedback from the committee for ScalaWiki data processing toolbox[edit]

Scoring rubric and scores:

(A) Impact potential: 6.0
  • Does it have the potential to increase gender diversity in Wikimedia projects, either in terms of content, contributors, or both?
  • Does it have the potential for online impact?
  • Can it be sustained, scaled, or adapted elsewhere after the grant ends?
(B) Community engagement: 6.6
  • Does it have a specific target community and plan to engage it often?
  • Does it have community support?
(C) Ability to execute: 6.3
  • Can the scope be accomplished in the proposed timeframe?
  • Is the budget realistic/efficient?
  • Do the participants have the necessary skills/experience?
(D) Measures of success: 3.9
  • Are there both quantitative and qualitative measures of success?
  • Are they realistic?
  • Can they be measured?
Additional comments from the Committee:
  1. As far as I can tell, if implemented, this tool would serve a minority of our community, namely those conducting Wikipedia research or reporting on projects.
  2. The project could fit the Wikimedia strategy, have potential for impact and, if properly implemented (as I understand it), would probably scale.
  3. While I agree that the challenges mentioned (tools dying, wikimetrics issues, etc.) are very real and need solutions, I am unsure what the proposed solution here actually is. It seems to be a set of libraries written in Scala, which itself has multiple problems: 1) The users of the tools mentioned in the problem section may not be able to learn and write Scala code to solve issues, especially not with the (fairly heavyweight) projects mentioned later on in the proposal. 2) For the tool developers themselves, inside the Wikimedia community Python / PHP are much more favored languages, and already have a large set of supporting libraries & communities. I'm unsure where this project fits strategically, what impact it will have, and how it'll be sustained afterwards. The lack of endorsements from actual users looking forward to using this is also a concern.
  4. This could be a huge project with infinite complexity. Building other tools may not solve the initial data sources problem, as users could face the same problem in the future.
  5. As I understand it, the usability of Wikidata is a major concern for a lot of projects that hope to gather quantitative data. If a project were to successfully create a tool to process data on Wikipedia, this could have a pretty huge impact. Also, if successful, this project has the potential to be expanded upon in the future.
  6. Seems to be somewhat innovative in the DAG approach to query configurations, but I am still looking for reasons to develop this project in the first place.
  7. If people were able to easily adopt this tool and do things with it, it would have a great impact. Unfortunately, from the look of things, OpenRefine is already available and seems to fit the described needs.
  8. This project does seem to have measurable goals. It's a big task, but if completed successfully the potential impact is far greater than the risks. The applicant has also written measurable goals of success into the grant.
  9. This project seems doable based on the project plan details. I know that Ilya is a good programmer and able to execute.
  10. The proposal is fairly light on the exact details, but even with what is presented, I am very skeptical that one individual (no matter how brilliant) would be able to build out such a base set of libraries without active user participation in the given time frame.
  11. The approach is not clear, and it presents the same problem for the data source at the root.
  12. The applicant clearly does have necessary skills and experience, but my concern is merely whether or not this could be completed within 6 months.
  13. Not a lot of encouragement or interest, and as a heavy editor who has in the past worked hard on producing chapter reports, these reports always need a human perspective, and I always adjust the measurement methods for them so that they are easy to build. I would not prefer using a one-size-fits-all reporting method.
  14. This proposal has little engagement. It isn't clear if the applicant actually is aware of people using the technology.
  15. There was no solicitation of comments or feedback from the thriving and active tools / gadgets communities from what I can see, and no explicit plans to do so in the future either.
  16. I would recommend that the user reach out to the community beyond the lists mentioned. I think this project could benefit from an agile, iterative development process that takes into account the user community as the tool is being built.
  17. I see a well-presented project, but it may be too ambitious to complete in six months. There are some problems related to measuring success and how to measure it. Also, the project presents some criticism of Global Metrics tools that has not been escalated in Phabricator. A nice idea, but I do not feel comfortable funding it.
  18. I don't see the impact of this project.
  19. The project is interesting but there is the need to do a further evaluation. Scala is a very good language for distributed applications and the problem is real but in my opinion the delivery is limited in comparison with the work involved.
  20. I'd like to see more concrete use cases and negative examples that need resolving before signing off on this project.
  21. The proposal is too ambitious. The tool is not critical enough to warrant full-time employment. I would be more comfortable with a model of an initial 6-month part-time engagement which, depending on progress and need, could be extended another 6 months. There is also mention of participation in several events (Wikimania, CEE, WLM), but is it part of the budget?

-- MJue (WMF) (talk) 17:07, 3 June 2016 (UTC) on behalf of the IEG Committee

Hi MJue (WMF), thank you for the feedback. I'll provide a detailed answer in several days. Can I convert your list from unordered to ordered so I'll be able to reference the points more easily? --Ilya (talk) 19:33, 3 June 2016 (UTC)
MJue (WMF), I'm going to provide detailed answers on each of the statements from these comments by Monday June 13th. Is that ok? --Ilya (talk) 03:24, 11 June 2016 (UTC)
Hi Ilya, apologies for the delay in reply. Yes, please feel free to respond to the committee feedback in the area you have already created below. If you have any further questions, don't hesitate to contact us! Best, MJue (WMF) (talk) 03:29, 11 June 2016 (UTC)

Answers[edit]

  • In progress..

I'll group the answers by major concerns.

Need/Impact[edit]

Comments[edit]
  • 13. "these reports always need a human perspective, and I always adjust the measurement methods for them so that they are easy to build. I would not prefer using a one-size-fits-all reporting method."
  • 1. "would serve a minority of our community, namely those conducting Wikipedia research or reporting on projects."
  • 3. "unsure where this project fits strategically, what impact it will have"
  • 6. "I am still looking for reasons to develop this project in the first place."
  • 18. "I don't see the impact of this project."
  • 19. "the delivery is limited in comparison with the work involved."
  • 20. "I'd like to see more concrete use cases and negative examples"
Answer[edit]
  • 13. Yes, I agree that reports need a human perspective and that we need to adjust measurement methods. That is exactly why I want a tool that allows humans to adjust what to measure and how. It's also important to be able to save, repeat, and share with others the adjustments we make.
  • 1. "Would serve a minority of our community, namely those conducting Wikipedia research or reporting on projects." Only if research or reporting is treated formally, as something done just by those who need to produce research or a report for some reason. Meaningful research is not about getting some numbers to satisfy the need for numbers, but about understanding reality and acting accordingly. In many cases what is interesting is not the numbers themselves, but finding specific subjects (users, articles, categories, areas) that interest us by some criteria, and trying to take action on that.

Others do not see the impact, and upon rereading the grant request I think it is because I was mostly concerned with assuring readers that I'm aware of the technical side and know how to implement it, and naively took for granted that what can be done with the tool would be obvious from those technical details.

Let's find common ground by comparing to another grant: Grants:IEG/WIGI: Wikipedia Gender Index. That grant received 1.9 points more on impact potential, with some good comments on that and without the explicitly stated misunderstanding that is so prominent here. WIGI measures about a dozen fixed parameters on one particular subject (women), so it is a subset of the project I propose, which will provide a UI to select which subject and which parameters you want to query, and how you want to filter, group, aggregate, and present the output. Let us say we want to promote or investigate something different (endangered species, biosphere reserves, architectural monuments, scientists, LGBT people, whatever). Does it make sense to create a separate tool for every subject that might interest us? When one uses a calculator, it does not matter what real-life objects the numbers correspond to.
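To make the "one configurable engine instead of one tool per subject" idea concrete, here is a minimal Scala sketch. All names here (QuerySpec and so on) are hypothetical illustrations of the concept, not the actual ScalaWiki or WIGI APIs:

```scala
// Hypothetical sketch, not the actual ScalaWiki API: a query described as
// plain data, so one engine can serve any subject (women, endangered
// species, monuments, ...) instead of a hard-coded tool per topic.
case class QuerySpec(
  subject: String,              // e.g. a category or a Wikidata class
  filters: Map[String, String], // property -> required value
  groupBy: Seq[String],
  aggregate: String             // e.g. "count"
)

object QueryDemo {
  // Roughly what WIGI hard-codes, expressed as one configuration among many:
  val wigiLike = QuerySpec(
    subject = "humans",
    filters = Map("gender" -> "female"),
    groupBy = Seq("birthYear", "wiki"),
    aggregate = "count"
  )

  // Same query shape, different subject: no separate tool needed.
  val species = wigiLike.copy(subject = "endangered species", filters = Map.empty)
}
```

Because the query is plain data, it can also be saved, repeated, and shared with others, which addresses the "adjust the measurement methods" concern above.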

Examples of how it can be used:

  • Find and improve under-represented areas or content. Categories, articles, sources, or images that are disproportionately scarce, non-existent, or lacking some property in one Wikimedia project compared to another (or in one country/region compared to another, or by some other property, e.g. one category of articles compared to another) or compared to some reference list. Lists of such areas can be used to attract volunteers to improve them through, for example, thematic weeks, edit-a-thons, or wikiprojects. This is a very general description, but it can be applied in many different variations.
  • Ratings of how well institutions (for example museums or universities) are presented on Wikipedia can motivate them to improve their presentation.
  • Find the most active authors of articles on a specific topic, to invite them to participate in such thematic weeks, or to contact them with questions or suggestions on the topic.
  • Usually reports measure only the impact of specific events, without comparing against what happens outside them (no scientific control):
    • Is activity during thematic weeks or contests higher than without them, or do users just switch to writing about the thematic topics?
    • Do events produce long-term growth in activity, on the topics the events cover, among participants or among users in general? Or can events with prizes demotivate and decrease the contributions of users who did not receive them?
    • How do the number and size of articles on a specific topic during an event compare with outside it?
  • For image uploads, interesting questions include the geographic diversity and the estimated travel time and distance of contributors (taking lots of pictures at one time and place, like a big city, vs. travelling to many distant villages).
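As an illustration of the first use case above, here is a self-contained Scala sketch with made-up numbers (not real data, and not a real ScalaWiki API) that flags categories whose coverage in a target wiki lags far behind a reference wiki:

```scala
// Illustrative sketch with invented data: flag categories whose coverage in
// one wiki is disproportionately low compared to another, as candidates for
// a thematic week or edit-a-thon.
object CoverageGap {
  // (category, article count in reference wiki, article count in target wiki)
  val counts = Seq(
    ("Biosphere reserves", 120, 15),
    ("19th-century painters", 400, 350),
    ("Local architects", 80, 4)
  )

  // Under-represented: the target wiki has less than a quarter of the
  // reference wiki's count (the threshold is arbitrary, for illustration).
  def underRepresented: Seq[String] =
    counts.collect { case (cat, ref, target) if target * 4 < ref => cat }
}
```

In the real toolbox the counts would come from the MediaWiki API or database replicas rather than a hard-coded list, and the threshold would be one of the user-adjustable parameters.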

Ability to execute[edit]

  • 4. Could be a huge project with infinite complexity
  • 10. I am very skeptical that one individual (no matter how brilliant) would be able to build out such a base set of libraries without active user participation in the given time frame.
  • 12. My concern is merely whether or not this could be completed within 6 months.
  • 17. May be too ambitious to complete in six months

First of all, I'd like to say thank you for that scepticism; it made me think harder about how to make sure I stay within the time constraints.

Regarding completeness and the boundaries of included functionality: it's not possible to include everything at once. The goal is for the project to become viable and compelling enough, with some core starting functionality, within this timeframe, so that it will be worth sustaining, experimenting with, and evolving without much effort per each piece of functionality added later.

Let us put some measure on the difficulty. There is mw:Extension:ApiSandbox. It is based on work by a student, User:Salil, who did not know MediaWiki or its API before his GSoC project. It was planned that he would spend about 3.5 months part-time, in the evenings; let's call it 1.5 months full-time. I think a UI like this (or, in general, converting metadata about which commands are available into code that takes particular actions based on that metadata) is an essential part of the project. It gives a time estimate, and I can also start from that code and adapt it to a Supler form generator.
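The metadata-to-UI idea can be sketched roughly as follows. The names are invented for illustration and do not match the real ApiSandbox or Supler APIs:

```scala
// Hypothetical sketch of the ApiSandbox idea: API modules are described as
// metadata, and UI fields are generated from that metadata, instead of being
// hand-written for each module.
case class ParamMeta(name: String, typ: String, required: Boolean)
case class ModuleMeta(name: String, params: Seq[ParamMeta])

object FormGen {
  // Metadata for one module (invented subset, for illustration only).
  val queryModule = ModuleMeta("query", Seq(
    ParamMeta("titles", "string", required = false),
    ParamMeta("prop", "enum", required = false)
  ))

  // One generic renderer handles every module the metadata describes.
  def render(m: ModuleMeta): String =
    m.params.map(p => s"<input name='${p.name}' data-type='${p.typ}'/>").mkString("\n")
}
```

Adding a new module then means adding metadata, not writing new UI code, which is what keeps the estimate small.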

On the other components:

  • I've been working on the MediaWiki API, database, and dumps components for about 2 years, so they need polishing and completing rather than implementing from scratch.
  • There are a lot of good tools for Wikidata; it is quite complex, and I do not have enough knowledge and experience to give it the same level of coverage, so I'm not spending that much time on it.
  • I'm going to start with the simplest possible implementations. For example, I can start not with conversion from JSON to SQL but with just plain SQL, and I can start with just a MediaWiki API query string rather than a UI to build it.
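For example, the "plain query string first" starting point might look like this minimal sketch. The parameters shown are real MediaWiki API parameters; the helper itself is a hypothetical illustration, not ScalaWiki code:

```scala
// Minimal illustrative sketch: build a MediaWiki API query string directly,
// before any UI exists to build it for you. The parameter names below are
// real MediaWiki API parameters; the helper itself is hypothetical.
object ApiQuery {
  def queryString(params: Map[String, String]): String =
    params.map { case (k, v) =>
      s"$k=${java.net.URLEncoder.encode(v, "UTF-8")}"
    }.mkString("&")

  // List the members of a category, returned as JSON.
  val example = queryString(Map(
    "action" -> "query",
    "list" -> "categorymembers",
    "cmtitle" -> "Category:Physics",
    "cmlimit" -> "10",
    "format" -> "json"
  ))
}
```

A metadata-driven UI (as in the ApiSandbox discussion above) can be layered on top of exactly this kind of function later.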

Overall I think it's more of an integration project than something radically new and brilliant, and the main problem is not the ability to implement some design, but rather getting ideas and needs from other users to be part of it.

Note that ApiSandbox was developed by experienced MediaWiki developers and mw:Google Summer of Code past projects doesn't list any related past project. The proposer doesn't seem to have contributed any patch. Nemo 16:02, 14 June 2016 (UTC)
Thank you for the comment. From the commits it seems that the extension was developed in less than two weeks by one developer. That feels like a reasonable amount of time for the task. --Ilya (talk) 16:30, 14 June 2016 (UTC)

Communication[edit]

  • Following the comments, I decided to ask users for ideas before showcasing the first version, and not vice versa.
  • 21. "There is also mention of participation in several events (Wikimania, CEE, WLM), but is it part of the budget?"
    I've already been funded for Wikimania 2016; for the CEE Meeting I can be funded by Wikimedia Ukraine. There is no offline WLM event; I will communicate with the teams online.

Sustainability[edit]

  • 3. 1) The users of the tools mentioned in the problem section may not be able to learn and write Scala code to solve issues, especially not with the (fairly heavyweight) projects mentioned later on in the proposal. 2) For the tool developers themselves, inside the Wikimedia community Python / PHP are much more favored languages, and already have a large set of supporting libraries & communities. I'm unsure where this project fits strategically, what impact it will have, and how it'll be sustained afterwards.

Scala: yes, Scala is much less favoured overall than Python. But:

  1. Many projects are in fact mainly developed by a small number of people, so my being familiar and productive with the language is a major factor in the project's success.
  2. Judging by the number of client libraries (Java: 12, Python: 12, JavaScript: 11, PHP: 8), Java is as popular as Python and more popular than PHP. Scala can easily be used in Java projects or by Java developers: it runs on the same virtual machine, uses the same IDEs, package formats, and distribution methods, and works with the same build tools. Scala is visible in the Java world; according to a ZeroTurnaround survey, half of Java developers are interested in using Scala.
  3. Scala is popular among data scientists. According to an O'Reilly survey, 10% of data scientists use Scala, and in particular 24% of data engineers do. This is mainly due to Spark, which is developed in Scala and whose users largely use Scala; it shows that a good tool can increase the popularity of a technology.
  4. A large set of libraries for the same purpose can also be a sign of fragmentation and dissatisfaction with them.

Heavyweight dependencies: I suggested Flink in order to get enough power and functionality out of the box rather than reimplementing things. Maybe for local usage the small Apache Calcite library will be enough and will cover most use cases.


Existing tools[edit]

  • 17. "the project presents some criticism of Global Metrics tools that has not been escalated in Phabricator."
    "Uploading cohort or running a large report fails": reported on Jan 26 2015, has High priority, no assignee, no comments from the Analytics team, and has not been fixed in about 1.5 years.
    If you find that bug important, you could ask a grant to fix it. Nemo 15:56, 14 June 2016 (UTC)

Following up on your answers[edit]

@Ilya:

I spoke with Asaf Bartov yesterday and he pointed me here to your responses to the aggregated feedback from the committee. I am very sorry that there hasn't been any follow-up before now. I inadvertently failed to watch this talk page and I consequently overlooked your responses.

Unfortunately, this proposal was not recommended by enough reviewers to advance past the first phase of committee selection. This means that your responses to the aggregated committee feedback did not impact their decision about whether to fund your project. That should have been clearer in the feedback template and I'll make sure it is revised going forward. That said, the committee is very open to reviewing submissions again in future rounds, and they tend to look favorably on applicants who have taken their feedback seriously and used it to clarify and improve their proposal. If you are still interested in doing this project, you can incorporate your answers into your proposal and resubmit it into the current Project Grants open call. The deadline is August 2.

If you click on that link and scroll down to the "Upcoming events" section, you will see that WMF is hosting a series of proposal clinics via Hangouts. You are welcome to join one of those clinics if you would like individualized support for reworking your proposal for submission.

My apologies again that there wasn't a response here before now. It's clear that you have put care into responding to the committee's feedback and I'm sorry your answers were overlooked. I've sent a message to the IEG committee asking them to follow up with any further feedback in light of your answers, so that you can take it into account if you do decide to resubmit your proposal.

Please let me know if you have any additional questions! --Marti (WMF) (talk) 17:00, 7 July 2016 (UTC)

@Mjohnson (WMF):
Thank you for the explanation. The evaluation process looks broken to me :( But I do appreciate your being frank about it. And I hope that the organization of the process will now be clearer for future applicants, so they won't run into the same situation I did.
What are my steps to resubmit the application? Am I supposed to create a new proposal and insert my answers to the questions from this talk page there, so that people reading the new proposal will get the information they were interested in before? --Ilya (talk) 11:03, 14 July 2016 (UTC)
@Ilya:
I agree with you that the process was broken and I apologize for that oversight. I've corrected this for Project Grants going forward so that next steps will be clearer to future applicants. It's not necessary to transfer your answer to questions on this talkpage to the talkpage of your Project Grant proposal, but you should make sure that the Project Grant proposal itself addresses committee concerns raised here. The committee make-up will be different this time, so there will be different perspectives at work in the next review. If you get further comments from the committee on your Project Grants proposal talkpage, you will definitely want to respond there.
Thank you again for your feedback.
--Marti (WMF) (talk) 20:48, 9 August 2016 (UTC)

Round 1 2016 decision[edit]


This project has not been selected for an Individual Engagement Grant at this time.

We love that you took the chance to creatively improve the Wikimedia movement. The committee has reviewed this proposal and not recommended it for funding, but we hope you'll continue to engage in the program. Please drop by the IdeaLab to share and refine future ideas!


Next steps:

  1. Review the feedback provided on your proposal and ask for any clarifications you need using this talk page.
  2. Visit the IdeaLab to continue developing this idea and share any new ideas you may have.
  3. To reapply with this project in the future, please make updates based on the feedback provided in this round before resubmitting it for review in a new round.
  4. Check the schedule for the next open call to submit proposals - we look forward to helping you apply for a grant in a future round.

Questions? Contact us.