Grants:Project/Hydriz/Balchivist 2.0

From Meta, a Wikimedia project coordination wiki


status: selected
Balchivist 2.0
summary: Improvements and enhancements to the archiving and searchability of datasets provided by the Wikimedia Foundation, including but not limited to the Wikimedia database dumps and other analytics data files.
target: All Wikimedia projects
amount: USD 16,000
grantee: Hydriz
advisor: ArielGlenn
contact: hydriz(_AT_)jorked.com
created on: 05:36, 19 February 2021 (UTC)


Project idea

What is the problem you're trying to solve?

What problem are you trying to solve by doing this project? This problem should be small enough that you expect it to be completely or mostly resolved by the end of this project. Remember to review the tutorial for tips on how to answer this question.

Every month, the Wikimedia Foundation generates almost 10 TB of data in the form of database dumps, analytics files and other related dumps. These dumps serve as a partial backup of the wikis hosted by Wikimedia, but also provide a valuable alternative to the API for content and metadata retrieval. Researchers depend on these files, yet apart from a few dump status files, there is no consistent and simple way to search for recent and historical dumps.
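For context, discovering completed dumps today typically means fetching the per-run dumpstatus.json file for a given wiki and date and inspecting its job statuses. A simplified sketch follows; the sample JSON is abbreviated and illustrative, as the real status files list many more jobs and fields:

```python
import json

# Abbreviated example of a per-run dumpstatus.json, as published at
# https://dumps.wikimedia.org/<wiki>/<date>/dumpstatus.json
# (illustrative only; real files contain many more jobs and fields).
sample_status = """
{
  "jobs": {
    "articlesdump": {
      "status": "done",
      "files": {
        "enwiki-20210301-pages-articles.xml.bz2": {"size": 18000000000}
      }
    },
    "xmlstubsdump": {"status": "in-progress", "files": {}}
  }
}
"""

def finished_files(status_json):
    """Return the file names of all dump jobs that have completed."""
    status = json.loads(status_json)
    done = []
    for job in status["jobs"].values():
        if job.get("status") == "done":
            done.extend(job.get("files", {}).keys())
    return done

print(finished_files(sample_status))
```

Each status file covers only one wiki and one run, which is why researchers must write their own brittle discovery code today; a central searchable index avoids this per-run parsing entirely.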

Additionally, requests for historical dumps are common, but for economic reasons it is not possible for the Wikimedia Foundation to host every dump snapshot. Hence, there is a need for a reliable way to archive these files to the Internet Archive, where they can be preserved for posterity. The current Balchivist 1.0, as well as previous implementations, has been archiving the Wikimedia database dumps since 2012 and has uploaded almost 400 TB of data to date. However, due to a poor implementation and the growing size of the database dumps, reliability is increasingly lacking. The number and size of the dumps have grown significantly since then, and despite best efforts to manage this growth, the number of issues has grown beyond what volunteers with limited resources can handle.

What is your solution to this problem?

For the problem you identified in the previous section, briefly describe how you would like to address this problem. We recognize that there are many ways to solve a problem. We’d like to understand why you chose this particular solution, and why you think it is worth pursuing. Remember to review the tutorial for tips on how to answer this question.

To improve the archiving infrastructure and to provide an interface for researchers to download dumps, I propose to create Balchivist 2.0: a web application hosted on Wikimedia Cloud VPS that allows users to easily search for both current and historical datasets (not limited to database dumps). A RESTful API will also be developed so that users can programmatically retrieve metadata about datasets and download the datasets themselves. Finally, I intend to create a watchlist feature that notifies interested users when new datasets of the types they have subscribed to become available.
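The exact API design is out of scope for this proposal. Purely as an illustration of the kind of programmatic access intended, a metadata query against a hypothetical endpoint might be assembled like this; the base URL, path and parameter names are invented and do not reflect a final design:

```python
from urllib.parse import urlencode, urljoin

# Hypothetical sketch of querying a Balchivist 2.0 metadata API.
# The base URL, endpoint path and parameter names are illustrative only.
BASE_URL = "https://balchivist.example.org/api/v1/"

def build_dataset_query(dataset_type, wiki=None, after=None, before=None):
    """Build a metadata-search URL for datasets of a given type."""
    params = {"type": dataset_type}
    if wiki:
        params["wiki"] = wiki
    if after:
        params["after"] = after
    if before:
        params["before"] = before
    return urljoin(BASE_URL, "datasets") + "?" + urlencode(params)

print(build_dataset_query("xmldump", wiki="enwiki", after="2020-01-01"))
```

A watchlist notification would conceptually be the push counterpart of the same query: instead of polling this URL, subscribed users are told when a new result matching their filter appears.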

This new software is a rewrite of Balchivist 1.0 with better reliability and support for adding new datasets, improving the guarantee that the Wikimedia datasets will be uploaded to the Internet Archive for researchers and preserved for posterity. The archiving process will continue to run entirely on Wikimedia Cloud VPS, as it has for the past 9 years.

Project goals

What are your goals for this project? Your goals should describe the top two or three benefits that will come out of your project. These should be benefits to the Wikimedia projects or Wikimedia communities. They should not be benefits to you individually. Remember to review the tutorial for tips on how to answer this question.

The main objective of this proposed project is to improve the searchability of the datasets published by Wikimedia, so that it is easier for new and existing users to find and download Wikimedia datasets. Wikimedia users will also benefit from an improved API for working with the datasets, compared to the existing system, which is limited to the database dumps, overall streamlining the process of using Wikimedia datasets. Finally, improved reliability in the archiving infrastructure provides researchers with the assurance that the datasets will be archived promptly and in full.

Project impact

How will you know if you have met your goals?

For each of your goals, we’d like you to answer the following questions:

  1. During your project, what will you do to achieve this goal? (These are your outputs.)
  2. Once your project is over, how will it continue to positively impact the Wikimedia community or projects? (These are your outcomes.)

For each of your answers, think about how you will capture this information. Will you capture it with a survey? With a story? Will you measure it with a number? Remember, if you plan to measure a number, you will need to set a numeric target in your proposal (i.e. 45 people, 10 articles, 100 scanned documents). Remember to review the tutorial for tips on how to answer this question.

During the project, the code for the web application as well as the archiving processes will be written, tested and subsequently deployed. The following are some metrics that will be used to measure the progress of this project:

  1. All critical tasks completed: All tasks relating to the basic functionalities of the project will be completed before the final deliverable of the project.
  2. 50% code coverage: This metric indicates the percentage of the code that is exercised by automated tests and serves as an indicator of the robustness of the code written. Well-tested code is cleaner and helps ensure that the project remains reliable in the long run.

After the project, users will be able to access the web interface for finding datasets. The archiving processes will also run in the background to upload the datasets to the Internet Archive. The following are some metrics that will be used to measure the success of this project:

  1. Reduction of the incident rate to <1%: An incident is defined as an error in the archiving process that needs to be manually resolved. As lowering the incident rate is one of the main objectives of this project, successfully lowering this figure indicates that the project has been successful.
  2. Include at least 5 new datasets: Currently, only the main database dumps, Wikidata entity dumps and the CirrusSearch dumps are being tracked and archived. The new system should expand to other types of datasets and eventually cover most of the datasets that Wikimedia provides; adding at least 5 new datasets during the project indicates success on this front.
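As an illustration of how the first metric could be evaluated, the incident rate might be computed from archiving run records as below. The record format and field name are hypothetical, used only to make the <1% target concrete:

```python
# Illustrative check of the <1% incident-rate target: count archiving
# runs that needed manual intervention against total runs.
# The run-record format and "manual_fix_needed" field are hypothetical.
def incident_rate(runs):
    """Return the fraction of runs flagged as needing manual resolution."""
    if not runs:
        return 0.0
    incidents = sum(1 for run in runs if run.get("manual_fix_needed"))
    return incidents / len(runs)

# 1 incident out of 200 archiving runs: 0.5%, within the target.
runs = [{"manual_fix_needed": False}] * 199 + [{"manual_fix_needed": True}]
rate = incident_rate(runs)
print(f"{rate:.2%}")
print(rate < 0.01)
```

Tracking incidents as structured records rather than ad hoc notes would also let the metric be reported automatically in the grant's progress updates.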

Do you have any goals around participation or content?

Are any of your goals related to increasing participation within the Wikimedia movement, or increasing/improving the content on Wikimedia projects? If so, we ask that you look through these three metrics, and include any that are relevant to your project. Please set a numeric target against the metrics, if applicable.

This section is not applicable to this project.

Project plan

Activities

Tell us how you'll carry out your project. What will you and other organizers spend your time doing? What will you have done at the end of your project? How will you follow up with people that are involved with your project?

This project is mainly focused on the software development aspect, but some community outreach will also be done to users of the datasets to get them to use this new product. More specifically:

  1. Tasks definition and prioritization
    • Create tasks on Phabricator for the various parts of the project that need to be done
    • Prioritize tasks based on importance for the final product and its dependencies on other tasks
    • Create milestones to mark various checkpoints of the project
  2. Software development
    • Write code for the project
    • Report progress onto relevant tasks on Phabricator
    • Write and improve tests
  3. Limited testing
    • Begin testing of the web interface as well as the archiving scripts for the main database dumps
    • Expand to other datasets as stability of the product improves
  4. Community outreach, beta testing and going live
    • Let users of Xmldatadumps-l, Analytics, Wiki-research-l and Cloud know about this new product
    • Let users of Wikimedia-l and Wikitech-l know about this new product
    • Add a link to the dumps.wikimedia.org homepage that points to this new product
  5. Reporting
    • Provide timely reports for the grant proposal

During any phase of the project, users are free to create new tasks and feature requests, but tasks that do not address the main objectives of the project will be given lower priority and may only be completed after the project has ended.

Budget

How will you use the funds you are requesting? List bullet points for each expense. (You can create a table later if needed.) Don’t forget to include a total amount, and update this amount in the Probox at the top of your page too!

This grant request is based on 20 hours per week over about 4 to 6 months. As per the estimate given on Grants:Project/Timeless/Post-deployment support#Budget, the hourly rate is set at USD 40, which gives a total base amount of USD 12,600 for 315 hours. Due to the inherent unpredictability of software development and the potential for other expenses, the number of hours is rounded up to 400, giving a total budget of USD 16,000 over a span of 20 weeks.
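The arithmetic behind these figures, spelled out (all numbers are taken from the budget table):

```python
# Budget arithmetic for this proposal, restated from the prose above.
hourly_rate = 40          # USD per hour
base_hours = 315          # 10 + 5 + 190 + 100 + 10
rounded_hours = 400       # padded for the unpredictability of development

base_amount = base_hours * hourly_rate        # USD 12,600
total_budget = rounded_hours * hourly_rate    # USD 16,000

print(base_amount, total_budget)
```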

Task description                                        Number of hours
Tasks definition and prioritization
  Creating and documenting tasks on Phabricator                      10
  Prioritization of tasks and creation of milestones                  5
Software development and testing
  Writing code                                                      190
  Writing tests, limited and beta testing                           100
Community outreach and reporting                                     10
Total
  Base hours                                                        315
  Rounded hours                                                     400
  Total budget                                                USD 16,000

Notes:

  • Because the Dumps project already exists on Wikimedia Cloud VPS, there are no additional hardware costs for running the archiving scripts or hosting the web application.
  • The project will be worked on part-time, because testing against the large datasets takes a long time to complete. Running the project over 6 months allows for more extensive testing, as it gives 6 full runs and 6 partial runs of the database dumps to test against.

Community engagement

How will you let others in your community know about your project? Why are you targeting a specific audience? How will you engage the community you’re aiming to serve at various points during your project? Community input and participation helps make projects successful.
Users of the database dumps and other datasets provided by Wikimedia are likely already subscribed to the Xmldatadumps-l and Analytics mailing lists, so announcements will be made there. As the project nears its conclusion, pages about downloading Wikimedia datasets will be updated to point to this new product: for instance, the dumps.wikimedia.org website and the database download project pages on some of the largest wikis.

Community input is possible anytime during the project, as tasks will be tracked and triaged on Phabricator. Tasks will also be prioritized on the platform, allowing for a transparent view of the project plan and progress.

Get involved

Participants

Please use this section to tell us more about who is working on this project. For each member of the team, please describe any project-related skills, experience, or other background you have that might help contribute to making this idea a success.

  • Hydriz is a volunteer developer and the main developer for this new product. He has been archiving the Wikimedia datasets since 2012 and was the main developer for Balchivist 1.0.
  • ArielGlenn (advisor) is a senior software engineer on the Site Reliability Engineering team at the Wikimedia Foundation. They are mainly in charge of producing the Wikimedia datasets and making them available for download. They will provide assistance and product-level expertise on working with the Wikimedia datasets to be integrated into this new product.

Community notification

You are responsible for notifying relevant communities of your proposal, so that they can help you! Depending on your project, notification may be most appropriate on a Village Pump, talk page, mailing list, etc. Please paste links below to where relevant communities have been notified of your proposal, and to any other relevant community discussions. Need notification tips?

Endorsements

Do you think this project should be selected for a Project Grant? Please add your name and rationale for endorsing this project below! (Other constructive feedback is welcome on the discussion page).

  • Legoktm (talk) 05:59, 15 March 2021 (UTC), have some questions, but overall very excited to see this happen! Dumps are very critical to supporting the right to fork.
  • I think it is a problem that the community can solve and it is necessary for keeping Wikimedia alive, fast, and new. Samu marti (talk) 08:15, 15 March 2021 (UTC)
  • Amazing idea. I have spent a lot of time writing custom parsing of the current data directories, using string pattern recognition to look for the right kind of files for the correct dates. It will always be a brittle solution until we have a proper API. What a good idea. Fully support. Maximilianklein (talk) 18:37, 15 March 2021 (UTC)