Grants:IEG/Tools for Armenian Wikisource and beyond/Midpoint
Welcome to this project's midpoint report! This report shares progress and learning from the grantee's first 3 months.
- The team was able to find solutions to a few of the most complicated problems.
- ZoomProof works better than expected, most of the time.
- We fell well behind our timeline, and many things changed on Wikisource and off-wiki in that period.
- We've stopped pursuing the idea of involving students as volunteer developers.
- The initial teammates have other priorities now.
- Involving new developers from the local and international Wikimedia community hasn't worked yet, with 1 exception, but getting developers from elsewhere worked well.
- Our development progress so far:
- 2 tools are completed and have been in use on Armenian Wikisource for a few months now: AutoHinter and QuickCleanupTool (the latter wasn't initially planned, and was a request from the community).
- SectionHarvester and ZoomProof are 70% completed.
- IllustrationCropper is 40% completed.
- SectionMarker and SectionGuard are not developed yet, but we know how to make MVPs now.
Methods and activities
Please see the next section for a detailed progress report on each tool.
Below is a description of the progress and state of each tool. I've kept the original tool names, as presented in the original proposal, but we plan to rename a few tools to give them more descriptive and appealing names. A significant part of the code is published on GitHub.
AutoHinter was activated as a gadget on Armenian Wikisource in November 2016 and presented the same day at a teacher gathering organized by Wikimedia Armenia, as teachers often involve students in Wikisource tasks. It is currently the most used gadget on Armenian Wikisource.
The near-term plan is to localize it for language versions other than Armenian. We think this will be particularly useful for languages that lack a proper spell-checker, and for Wikisource versions that host many old texts on which spell-checkers produce too many false positives.
During the presentation of AutoHinter, community members asked me to consider creating a small spin-off of my SAE-Tools (a Swiss-army knife for proofreading and wikification of the Soviet Armenian Encyclopedia) that would be more universal and work on any book. This tool, QuickCleanupTool, wasn't in the initial IEG proposal, so it was created from the community wishlist. It has been enabled as a gadget on Armenian Wikisource since April 2017. In 1 click, the tool removes about 85% of hyphenation (hyphenation rules are complicated in Armenian), fixes some common issues with incorrect Unicode characters used in Armenian, adds paragraphs and removes excessive spaces. I consider this one complete.
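The kind of one-click cleanup described above can be illustrated with a minimal Python sketch. It is not the gadget's actual code: the function name `quick_cleanup`, the choice of hyphen characters, and the rule that a Latin colon directly after an Armenian letter should become the Armenian full stop (U+0589) are all assumptions made for illustration. Paragraph detection is left out.

```python
import re

# Assumption: a line-break hyphen is either "-" or the Armenian hyphen (U+058A).
LINE_BREAK_HYPHEN = re.compile(r"(\w)[-\u058a]\n(\w)")
# Assumption: a Latin colon right after an Armenian letter is a mistyped
# Armenian full stop (U+0589).
COLON_AFTER_ARMENIAN = re.compile(r"(?<=[\u0531-\u0556\u0561-\u0587]):")


def quick_cleanup(text: str) -> str:
    """Apply three simple cleanup passes to OCR text."""
    # 1. Join words that were split by a hyphen at the end of a line.
    #    This also joins genuine compounds that happen to break at a hyphen,
    #    which is one reason such a rule cannot reach 100%.
    text = LINE_BREAK_HYPHEN.sub(r"\1\2", text)
    # 2. Replace the wrong punctuation character.
    text = COLON_AFTER_ARMENIAN.sub("\u0589", text)
    # 3. Collapse runs of spaces or tabs and drop trailing whitespace.
    text = re.sub(r"[ \t]{2,}", " ", text)
    text = re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)
    return text
```

Running the passes in a fixed order keeps each rule simple; a real tool would need language-specific exceptions for each of them.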
SectionHarvester is at the MVP stage, and I've already made a few fixes on Wikisource thanks to its validation rules and output. Early adopters can try it by enabling it per the instructions on GitHub. What's left is a check for duplicate section names, export options and UI revamps. I expect it to be completed this month. A stretch goal would be to add a few more validation rules.
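As an illustration of what a validation rule such as the duplicate-name check could look like, here is a minimal Python sketch. It assumes the standard Labeled Section Transclusion tag syntax (`<section begin="name" />` and `<section end="name" />`); the function name `validate_sections` and the two rules shown are hypothetical and are not taken from SectionHarvester itself.

```python
import re
from collections import Counter

# Matches <section begin="name" /> and <section end="name" />, quoted or not.
SECTION_TAG = re.compile(
    r'<section\s+(begin|end)\s*=\s*"?([^"/>]+?)"?\s*/>', re.IGNORECASE
)


def validate_sections(wikitext: str) -> list:
    """Return human-readable problems found in one page's section tags."""
    begins, ends = Counter(), Counter()
    for kind, name in SECTION_TAG.findall(wikitext):
        target = begins if kind.lower() == "begin" else ends
        target[name.strip()] += 1

    problems = []
    for name in sorted(set(begins) | set(ends)):
        if begins[name] > 1:
            problems.append(f"duplicate section name: {name}")
        if begins[name] != ends[name]:
            problems.append(f"unbalanced begin/end: {name}")
    return problems
```

A repeated section name is not always an error in LST, so a rule like this is better treated as a warning for a human to review than as an automatic fix.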
Back when I submitted the proposal for this project, there was no on-wiki tool to crop an image from a DjVu or PDF file and upload it to Commons. You had to go to Commons, download the image, edit it on your desktop, and upload it to Commons. It is quite a tedious task, especially for well-illustrated books.
In the past 3 years things have changed, and CropTool now supports cropping images out of DjVu and PDF files. As CropTool is a well-established tool, I'm considering using it as the backend for the IllustrationCropper frontend UI I've developed. The main author of CropTool, User:Danmichaelo, agreed with this vision, and we still need to discuss and sort out a few technical questions with him. Some features we had in mind are not yet supported in CropTool, so we'll hopefully be able to add a few extra tricks to CropTool too, which would also be useful outside of Wikisource.
We've built several versions of ZoomProof. While all of them are already useful in real life, every method has its weak points. For example, many have difficulties with hyphenated words if the hyphen is removed in Wikisource; some are very reliable, but only until you make major changes compared to the OCR output; almost all methods are confused when wiki syntax (tags, templates) is added; and short 1-2 letter words are still a challenge. In my latest test we're within the stated goal (correct highlighting in 75% of cases and correct area of the page in 95% of cases), but to me as a user that 75% doesn't feel satisfying enough. As it's practically impossible to build automated tests, I'm currently comparing methods running side by side and making tweaks in the process.
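To make the matching problem concrete, here is a minimal Python sketch of one possible approach: aligning the OCR word list with the proofread word list using the standard library's `difflib.SequenceMatcher`. This is not one of the methods actually built for ZoomProof; the function name `map_edited_to_ocr` is hypothetical, and the sketch is only meant to show why a removed hyphen breaks a simple alignment.

```python
from difflib import SequenceMatcher


def map_edited_to_ocr(ocr_words, edited_words):
    """Return {edited_index: ocr_index} for words that can still be aligned."""
    matcher = SequenceMatcher(a=ocr_words, b=edited_words, autojunk=False)
    mapping = {}
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        same_length = (i2 - i1) == (j2 - j1)
        # Unchanged words map directly; a corrected word (for example a fixed
        # OCR typo) maps only when the replaced runs have equal length.
        if tag == "equal" or (tag == "replace" and same_length):
            for offset in range(i2 - i1):
                mapping[j1 + offset] = i1 + offset
    return mapping
```

With OCR words `["hyphen-", "ated", "word"]` and proofread words `["hyphenated", "word"]`, only `"word"` is mapped: the joined word has no one-to-one counterpart, which is the hyphenation weakness mentioned above.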
The server-side component, which downloads DjVu files and extracts and processes word coordinates from them, was completed quite recently, at the end of May 2017. It runs on a ToolLabs server and uses the djvutoxml utility from the djvulibre package, Flask, Celery and Redis. We estimated that it would take 5-10 minutes to extract the needed information from a DjVu file on the first run. Real-life usage showed that it takes considerably longer, with a 700-page book taking 20-30 minutes on average. We plan small performance adjustments that move some processing currently done on the client side to the server side, so that things are calculated just once.
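The extraction step can be sketched in Python as follows. This is a simplified illustration, not the project's backend: it assumes the XML produced by djvutoxml contains `OBJECT` elements (one per page) with nested `WORD` elements carrying a `coords` attribute of comma-separated integers. The order of the four values, the `SAMPLE` document and the function name `extract_words` are assumptions, and the Flask, Celery and Redis parts are omitted.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed sample of djvutoxml-style output.
SAMPLE = """<DjVuXML><BODY>
<OBJECT height="3000" width="2000"><HIDDENTEXT><PAGECOLUMN><REGION><PARAGRAPH>
<LINE>
<WORD coords="100,250,300,200">Հայկական</WORD>
<WORD coords="320,250,480,200">հանրագիտարան</WORD>
</LINE>
</PARAGRAPH></REGION></PAGECOLUMN></HIDDENTEXT></OBJECT>
</BODY></DjVuXML>"""


def extract_words(xml_text: str) -> list:
    """Return one list per page, each holding {'text', 'box'} dicts."""
    root = ET.fromstring(xml_text)
    pages = []
    for page in root.iter("OBJECT"):
        words = []
        for word in page.iter("WORD"):
            raw = word.get("coords", "")
            coords = [int(v) for v in raw.split(",") if v.strip()]
            # Keep only the first four values; skip words without a full box.
            if len(coords) >= 4:
                words.append({"text": (word.text or "").strip(),
                              "box": coords[:4]})
        pages.append(words)
    return pages
```

Storing the result per page on the server means the client only has to fetch a small, ready-to-use structure instead of repeating the parsing.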
Another current challenge is DjVu files that provide coordinate information only per line, not per word. This will require some approximation.
A stretch goal here is to support not only DjVu files but also PDFs. However, PDF issues and pitfalls are not yet well researched, so for now this is more of a wish than a promise.
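One simple way to approximate word boxes from a line box is to split the line's width in proportion to each word's character count. The Python sketch below shows that idea under stated assumptions: the box is given as `(x_min, y_min, x_max, y_max)`, every character and every inter-word space is treated as equally wide, and the function name `split_line_box` is hypothetical. It is not necessarily the approximation the project will use.

```python
def split_line_box(line_text, box):
    """Estimate a box for each word from the box of the whole line."""
    x_min, y_min, x_max, y_max = box
    words = line_text.split()
    if not words:
        return []

    # One unit per character, plus one unit for each space between words.
    total_units = sum(len(w) for w in words) + (len(words) - 1)
    unit = (x_max - x_min) / total_units

    result = []
    cursor = x_min
    for w in words:
        width = len(w) * unit
        result.append((w, (round(cursor), y_min, round(cursor + width), y_max)))
        cursor += width + unit  # skip over the space that follows the word
    return result
```

Proportional fonts and justified lines will make the estimate drift, so it is better suited to highlighting the rough area of a word than to exact outlines.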
We haven't worked on SectionMarker yet. As it's supposed to be a more universal version of our tool that already works well on a specific subset of books, we are optimistic about this one.
There was a lot of brainstorming about SectionGuard with Wikisource and Wikipedia editors and peer developers, but no prototypes were built, as my initial concept of the tool had flaws. The initial vision expected the tool to be fully automatic and to eliminate the problem almost completely. As a reminder, we're talking about cases where LST sections are renamed on a book page but not in the other Wikisource pages that request (transclude) them, which effectively blanks those pages without anyone noticing.
After longer evaluation and discussion, this approach seems both too intrusive and too conservative, as it just tries to keep existing connections, even if they're incorrect. We can make it a bit smarter, but that's not enough for a complete solution, given the creativity of our editors and the many complicated scenarios in which an edit either fixes something that was wrong or breaks something that was correct.
Consider, for example, a case where it's not just one section on one book page being transcluded by one Wikisource page, but the same section name used across a range of book pages and transcluded by many pages in different places, with the section name changed on only one of the book pages. Without human input, the best a bot can do is try to revert changes to section names. This approach is only acceptable if we're sure we want to keep things unchanged: when a book is completely proofread and wikified, and all transclusions have been done without any mistakes or room for improvement.
We are currently considering the following scenarios:
- Give up on this tool for now. This is the easiest option, but the problem the tool aimed to solve still exists, and this mistake is still made regularly, even by quite experienced editors.
- Instead of trying to fix anything, the tool just reports issues in a log file or web interface, or notifies users via their Talk page. This is the safest option: it helps spot such cases, but the whole burden of fixing things (often with complicated searches, and on multiple pages) stays with users.
- We add a way to specify particular Indexes or pages that the tool should "protect", no matter what. Some really careful users can be whitelisted. All other cases are just reported, so that editors can act on them manually.
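The report-only and "protect" scenarios above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a prototype of SectionGuard: it assumes the standard `<section begin="name" />` syntax, and the `requested_by` mapping (section name to the pages that transclude it) is a hypothetical input that a real tool would have to build by querying the wiki.

```python
import re

BEGIN_TAG = re.compile(r'<section\s+begin\s*=\s*"?([^"/>]+?)"?\s*/>', re.IGNORECASE)


def section_names(wikitext):
    """Return the set of section names opened on a page."""
    return {name.strip() for name in BEGIN_TAG.findall(wikitext)}


def broken_transclusions(old_text, new_text, requested_by, protected=False):
    """Compare two revisions and list transclusions the edit would break.

    Returns (action, report): report maps each removed or renamed section
    name to the pages still requesting it; action is "revert" only for
    protected pages with a non-empty report, otherwise "report".
    """
    removed = section_names(old_text) - section_names(new_text)
    report = {name: requested_by[name]
              for name in sorted(removed) if requested_by.get(name)}
    action = "revert" if (protected and report) else "report"
    return action, report
```

Because section names are compared as exact strings, even a change of letter case counts as a rename, which matches how the transcluding pages would see it.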
As an alternative solution, we thought of a small gadget that reminds users to be careful when editing section names, but a known problem with notifications is that our target audience will most probably ignore them. It also won't work for bots. Still, it's an extra piece we may develop, as educating and preventing is better than fixing and reverting.
We also looked into using AbuseFilter for the initial stage of spotting potential cases, but its capabilities seem too limited even for that.
As we dove into the analysis of scenarios and potential pitfalls, estimating the time and cost of each tool quickly started to resemble magic, so we as a team decided to leave the money where it is until the end of the project, when it will be much easier to evaluate objectively, in retrospect, the amount of work each task required. So I expect most transfers for Research and Development to be made near the end of the project. Another reason we haven't made many transfers so far is that, after the series of bumps we ran into, we wanted to keep the option of withdrawing and returning all funds to WMF as open as we could. In fact, a week ago I learned that the separate account I opened to hold the IEG funds was in a passive state because of no activity for many months.
|1||June 2014||Bank fees for receiving the first disbursement||$15|
|2||January 2015||Volunteer coordination (for a series of meetings with students, November 2014-January 2015)||$200|
|3||January 2015||Project coordination||$400|
|4||May 2017||Research and Development (ZoomProof backend v.1 $250 + $6.90 PayPal fees)||$256.89|
As we're not going to try to involve students this time, I've requested a change to the budget (a decrease of $250 intended for Promotional materials, and transferring funds from Volunteer coordination into the Research and Development line).
The best thing about trying something new is that you learn from it. We want to follow in your footsteps and learn along with you, and we want to know that you are taking enough risks to learn something really interesting! Please use the below sections to describe what is working and what you plan to change for the second half of your project.
What are the challenges
The main challenge was how many unexpected things happened in our lives. Still, I think I over-projected how we were going to organize and manage the project. More parties, phases and dependencies between them were involved than this project needed, and not enough backup plans and people were in place for cases when a piece of the scheme became unavailable. So each time we ran into issues here and there, we lost momentum. Keeping things simpler and more straightforward would probably have been a better idea. One example would be not trying to involve students at that stage.
I stuck to the original plan for too long. Later I tried to involve local and international Wikimedian developers, but people were either too busy or lacked the required skills. What changed, and made things really move forward, was switching from a "make it in the most optimal way, involving more peer Wikimedians at any cost" mode to a "just get it done" approach. For example, outsourcing one subtask to a freelancer on UpWork in May 2017 was definitely the best thing I did for this project in the past year. I had a new teammate within 12 hours, and I'm quite happy with the results he produced.
- The data in DjVu files and the quality of their OCR turned out to be more diverse than we had found before applying for the grant.
- The variety of things that can happen during proofreading was also underestimated (how much a page's structure can change compared to the OCR output).
- Expected, but still underrated, was how big the tree of possible scenarios is when it comes to user behavior in free-form text changes.
- Some limitations on ToolLabs or bugs on Wikisource.
As the core part of most of the tools is now done, and what remains is to polish, document and fine-tune things, I don't expect major bumps anymore.
What is working well
The original team did good research and solved the most complicated parts. The tools are producing useful results, which makes me quite happy.
Learning pattern I've decided to share, based on our experience:
I've also learned that IRL meetups with laptops were very, very productive.
Next steps and opportunities
- Developing MVP versions of SectionGuard and SectionMarker.
- Connecting IllustrationCropper with CropTool and assisting in enhancing the latter.
- Enhancing the accuracy of ZoomProof and making it more fail-proof.
- Making tools localizable.
- Inviting participants from non-Armenian Wikisources to test and localize the tools they find useful for their projects, and helping them in the process.
- Enhancements and fixes after the first wave of feedback from the community.
- UI/UX enhancements.
- Documenting all tools.
And trying to reach a few other stretch goals, here and there.
The report instructions suggest sharing one thing that surprised me. I can definitely say that I was surprised by the amount of support I've felt from WMF, and especially from Siko and Marti. Without their support, morale boosts and extreme understanding, we would have given up 2.5 years ago. I'm no stranger to either NGOs or foundations, and this approach of WMF is really unique and special.
Another reflection is that as a Wikimedian you're almost always just a volunteer, with no obligations. At least de jure, you do the things you want to do, when you can, and as much as you feel like doing. Becoming a grantee changes that. Something you do for the Wikimedia movement becomes an obligation. As long as your project goes well, it's all awesome. But failing to deliver something as a grantee feels much worse than failing to do your share of RC patrolling, take part in an edit-a-thon or resolve a conflict on Wikipedia when you're expected to, no matter how much support and understanding you get from WMF. So don't overproject and underplan. :)