Grants talk:Project/Heritage GLAM 2020

Alternative suppliers

I see that you plan to digitise books for a total of about 10k pages, with a budget of 1500 $ for equipment and 2400 $ for wages.

Have you considered how expensive it would be to outsource the scanning to a professional supplier, for instance the Internet Archive scanning services? Their default rate is about 75 % less than what you plan to spend, and the quality achieved with their professional equipment is vastly superior to what you can get with some makeshift home process. In File:ਮਹਾਨ ਕੋਸ਼ ਭਾਗ 1.pdf for instance I see missing corners, imperfect cropping, bad lighting, skewed lines, no OCR, inconsistent page sizes. Such output is commendable and useful when made by volunteers (it's hard work!), but it's not worth 0,4 $/page.

Producing bad scans also leaves a lot of work for Wikisource users downstream, who will need to fix the images with ScanTailor and re-create the PDF or DjVu with proper software (Adobe Acrobat is not adequate, in addition to the ethical issues of using proprietary software). Unsurprisingly, s:pa:ਇੰਡੈਕਸ:ਮਹਾਨ_ਕੋਸ਼_ਭਾਗ_1.pdf currently has nearly zero transcribed pages. If you continue using this ad hoc process, I recommend investing in some training on the usage of a proper workflow. Nemo 14:20, 23 February 2020 (UTC)

Hello Nemo! Thank you for your invaluable advice. You make a very good point and I am happy to share some insights from our experience of last year.

The default rates of the Internet Archive service you mention (which is not present in our region) are $0.12 USD per image, $0.30 USD per image for folios and $2.40 USD each for foldouts, which does not quite fit the statement ”Their default rate is about 75 % less than what you plan to spend”. I have increased the number of pages by 2000, but considering that we are completing this work in half the time compared to last year, this number seems very high.
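
As a rough check of the figures quoted in this thread, the per-page cost of the proposal can be compared with the flat per-image rate. This is only a sketch under stated assumptions: one scanned page is treated as one image, folio and foldout surcharges are ignored, and the function names are illustrative rather than taken from any tool.

```python
# Figures quoted in the discussion (USD). Assumption: one page = one image,
# and no folio or foldout surcharges apply.
EQUIPMENT = 1500
WAGES = 2400
IA_RATE_PER_IMAGE = 0.12


def cost_per_page(pages: int) -> float:
    """Proposal budget (equipment + wages) divided by the number of pages."""
    return (EQUIPMENT + WAGES) / pages


def saving_vs_flat_rate(pages: int) -> float:
    """Fraction by which the flat per-image rate undercuts the proposal's per-page cost."""
    return 1 - IA_RATE_PER_IMAGE / cost_per_page(pages)


if __name__ == "__main__":
    for pages in (10_000, 12_000):
        print(
            f"{pages} pages: {cost_per_page(pages):.3f} $/page, "
            f"flat rate is {saving_vs_flat_rate(pages):.0%} lower"
        )
```

Under these assumptions the flat rate comes out roughly 69 % lower at 10000 pages and roughly 63 % lower at 12000 pages, before any on-site handling, preparation or post-processing labor is counted.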

It is not simply a matter of shipping books from the institution to any place or agency that would scan the books, manuscripts and archives. The core process is painfully long and very complicated. We are not allowed to take the works outside these GLAM institutions, as per their laws and regulations. That is one of the things that makes the process harder. Secondly, after physically checking each book for copyright, which is in itself very complex work because sources are lacking, we have to check the condition of the books for digitization. Some manuscripts and books we found were too fragile and old, and too far gone to be saved. Most of the works are so old that they require professional care to handle and maintain them in order to digitize them. After digitization, the scans have to be checked for faults, such as missing or torn pages, and then go through post-scanning work.

And all of this has to be done in institutions that are not fit for humans to work in. One of the reasons the books are in this situation is the lack of cleanliness, which stems from a lack of investment from the government. The libraries have no fans, let alone air conditioning, in this hot weather and have not been cleaned for more than a decade. As a result, the books are covered by over 4 inches of highly allergenic dust, which has caused our staff skin issues and severe fungal infections, along with a burning sensation that does not go away even hours after taking medication and a bath. These skin problems cannot heal unless contact with the agents is cut off for months during the medication process. So, it is a very big risk for the people working on the digitization task. I personally experienced that in 2018, when I was the sole person working on digitization among other tasks.

I am adding the images to attest to my words, as painful as it is to look at them. The treasure trove these institutions carry includes first editions, rare manuscripts and rare works that date back to the 14th century and have not been found anywhere else in the world. For us, in India, these knowledge sources are no less than priceless treasures that are all but extinct for our language.

No professionals would work at extremely low cost in institutions that are in such conditions. The cost of digitization we mentioned is well below the market prices in our city, considering the institutions' working hours, the allergenic and unhygienic environment, the delicacy of the works, the additional hours taken to clean, handle and save some books before they are fit for digitization, and the post-processing done after scanning, which also takes a long time. Last year, we miscalculated the amount of work a single individual can do, and a lot of volunteer effort from our end, more than double the hours (which led to serious burnout), went into ensuring we completed the task, which was about 15000 pages of these works in twelve months. This time we will be working on 12000 pages in six months. The task is not any easier, but we now have a better understanding of what could be improved further in the process, including the post-processing aspect you mention. That is exactly why we are hiring better professionals than last time for better post-processing quality, even though we improved a lot in the second half. Also, the prices you mention do not say anything about the type of work itself. Even in some alternate scenario where books could be shipped, which they can’t, additional labor is required for folios, foldouts, single- or double-sided pages, manuscripts and torn books, along with the consequent prep work. Even the agencies charge extra for that labor. Considering everything, our prices are indeed below the market rate here.

For OCR, we do not use the traditional OCR methods used in western Wikisources. Instead we use a very innovative tool, Indic OCR, created by our volunteer developer Jay Prakash and integrated in all the Indic Wikisources. The editor can OCR one page at a time, when they start proofreading it, which saves a lot of labor and time for each individual. As for transcribing the mentioned book, it is true that no one has proofread File:ਮਹਾਨ ਕੋਸ਼ ਭਾਗ 1.pdf. But our target was to cover 3000 of these 15000 pages from different works, and we have proofread over 6500 pages, far exceeding our target. We achieved this through consistent efforts in community training, meetups, workshops and campaigns, along with retention of a community that we have been growing from scratch since last year. We have become one of the top 4 Wikisource projects of South Asia in terms of the number of proofread works, up from ninth position last year, and the fifth fastest growing community among all Wikimedia projects. When the pilot of this GLAM Wiki project started in October 2018, we had over 5000 pages on Punjabi Wikisource and only 1260 pages integrated via proofreading. After 16 months of persistent work, we have reached over 20873 pages integrated via proofreading, an overall growth of 1655% in Punjabi Wikisource. This has been a focussed effort to grow and protect our language with the much needed sources for its revival and survival on the internet; although spoken by 120 million people in the world, it has little content online. The book you mentioned has double-lined content inside a single page, which makes it hard to OCR correctly, especially for Indic languages. We have been trying to figure out ways to do that, and we concluded that we will need to crop each page in two and upload the book in that form for transcription. We welcome your suggestions for alternative solutions for this task.

This book is one of the most important lexicography books of Punjabi Sikh literature, and we have created a special workflow for its integration. It would be a welcome and useful addition of references for new and old articles in Wikipedia and entries in Wiktionary, both of which would be worked on by the community via campaigns and the education program.

We have been working systematically with a workflow for integrating these works in Wikisource, Wikipedia and Wikidata through innovative campaigns such as 1Lib1Ref, proofreading contests and campaigns of the month, as explained in detail on the project page itself. This year, we will be training the community in advanced techniques such as categorizing, transcribing the books in final formats, interface adminship, etc., via the events and wikicamps we will be organizing for the work. In other words, the entire project is about community growth, planning events, maintaining GLAM partnerships, running campaigns, and developing tools and much needed documentation for the whole process.

Hope this helps! Thank you so much for taking the time to read our proposal and giving this valuable input. We will surely be expanding and improving the process from last year with the changes we have made in the current proposal. We would love to have you as an advisor for our project; we can use your vast experience and learnings to ensure that we improve and succeed in this project. Thank you and regards. Wikilover90 (talk) 07:40, 24 February 2020 (UTC)

Hello and thank you for this answer and additional statistics. I think you know that I'm a long-time vocal supporter of funding for Wikisource and I love what the Punjabi Wikisource is doing. I admire a lot what you've achieved because I know how hard it is, having tried myself in conditions which aren't nearly as hard as yours. I'm happy if you think you can scan 12k pages instead of 10k but I'm not here to stress you on the decimals.
Precisely because I like your work, I want to be sure that we consider all the options: you should be able to get the best tools possible for your work and if necessary the Wikimedia Foundation should spend more on it. It's not necessarily Wikimedia's goal to fund digitisation programs worldwide, but it's definitely within the Wikimedia Foundation duties to support Wikisource users so that they can spend their time more effectively when contributing to Wikisource.
So, for instance, you have mentioned local suppliers being more expensive: even if you don't have a formal quote from one of them, I would suggest including an estimate of how much it would cost to outsource part of this work locally, given this number of pages etc. Or maybe the institute has commissioned scans in the past and can share how much they spent. I think any comparison to commercial providers will show the good value for money that your project has.
I'm sure you know about the Servants of Knowledge group in Bangalore, which got some Scribe scanners to scan at the Indian Academy of Sciences. They are quite a fortunate case for various reasons because it's not easy at all to do this (the last shipment of scanners was sent back due to customs/shipper problems...), but maybe there's something we can learn from it: how long it would take to order such equipment, whether it fits the requirements of your location, how much faster workers would be with such equipment, at what volume it would be more cost effective than your current system, etc.
Ultimately only you can decide what's feasible and suitable in your context; I'm not going to judge that. However, I want to be sure that we (as in international Wikimedia) don't force volunteers to reinvent the wheel just for fear of some upfront cost. I hope that the Wikimedia Foundation would confirm that they're happy to fund an additional upfront expense which is going to pool resources with the Internet Archive, increase impact and make volunteers less stressed; but they can only do so if the proposal is on the table. Nemo 08:05, 25 February 2020 (UTC)
Hello Nemo! Thank you so very much for this valuable insight. It is really good to learn new things from our more experienced community members. I am highly appreciative of your advice and will certainly take it into account. We don’t have professional agencies that do this kind of digitization in Patiala, which is a small town in Punjab, India. But I acknowledge that it can be a good practice to have professionals working with advanced apparatus. I am very grateful to learn about the Scribe scanner; it looks very advanced and high-tech. With our last grant we also bought a DIY scanner from Latin America, which is more or less similar to this model at a much cheaper price. The issue is that it arrived really late, after we had started scanning the works with our other scanner, and you are right, the quality is very different. There is also the issue of post-scanning work, which is a very technical and time-consuming task. That is one of the things even a top-of-the-line scanner cannot correct: deteriorated and damaged books leave a lot of work for post-scanning, and admittedly, our professional from last year left a lot to be desired in terms of post-scanning quality. That is why, as part of our learning experience, this year we will hire people who have decent expertise in this particular work and prior experience with delicate materials, such as palm leaf manuscripts, damaged works, etc.

Thank you for your kind words and support, they mean a lot to our community! ^^ Wikilover90 (talk) 14:16, 25 February 2020 (UTC)

Eligibility confirmed, Round 1 2020

This Project Grants proposal is under review!

We've confirmed your proposal is eligible for Round 1 2020 review. Please feel free to ask questions and make changes to this proposal as discussions continue during the community comments period, through March 16, 2020.

The Project Grant committee's formal review for Round 1 2020 will occur March 17 - April 8, 2020. We ask that you refrain from making changes to your proposal during the committee review period, so we can be sure that all committee members are seeing the same version of the proposal.

Grantees will be announced Friday, May 15, 2020.

Any changes to the review calendar will be posted on the Round 1 2020 schedule.

Questions? Contact us at projectgrants (_AT_) wikimedia  · org.

I JethroBT (WMF) (talk) 16:57, 27 February 2020 (UTC)

Project seems to be in line with developing best practices in local contexts

I wanted to come here and endorse this project from the perspective of GLAM and other community programs: the team has a track record of managing and implementing challenging digitization and community activation projects in their own context. It's important to have model projects and workflows developing, with very locally defined strategies and approaches to content work. In particular, the focus on Wikisource, transcription, building an audience for participation in the project, and doing digitization seems appropriate for the kinds of community and partnerships they have been developing. I also wanted to highlight that the staffing plans seem appropriate for the level of complexity and need for professional support that partnerships and digitization require. It's exciting to see this community continuing to grow the impact of Wikisource through this work. Astinson (WMF) (talk) 16:47, 5 March 2020 (UTC)

Please complete survey for your Project Grants proposal

Dear Charan Gill, Open Heritage Foundation, Wikilover90, Yann, VIGNERON, and KCVelaga,

We have sent you a survey link to the email address you provided for this Project Grants proposal. We need you to open the email and fill out the survey as soon as possible. We have emailed you twice without response (on March 20 and March 23), and we are not sure if you still wish for your proposal to be reviewed. If we do not receive your survey response by March 31, 2020, we will mark your proposal withdrawn.

We hope to hear from you!

Warm regards,

--Marti (WMF) (talk) 02:57, 28 March 2020 (UTC)

Hi, my proposal is suspended for now. Regards, Yann (talk) 21:14, 28 March 2020 (UTC)
Hello Marti (WMF), we have submitted the form and will soon be making the changes to convert the project into a virtual one. Thanks. Wikilover90 (talk) 21:36, 30 March 2020 (UTC)

Updated Proposal and Final Report

An update for the Project Grants team and committee: with the ongoing COVID-19 pandemic and its possible long-term impact, our team has converted the proposal into a virtual one, with the exception of the digitization work by the professional in the GLAM institution, which is proposed for October to December, provided the pandemic is under control by that time period. (If it is not, we plan to return the funds proposed for the digitization professional or delay the work, as advised by the Grants team.) The changes have been made after discussion with the GLAM and education program institution and partners. Our Final report was published last month. It can be read here. Regards Wikilover90 (talk) 14:01, 7 April 2020 (UTC)