China Judgments Online Preservation Program
This page records the staged program on Chinese Wikisource to preserve a large corpus of Chinese court judgments that have already been published online but are at ongoing risk of removal or restricted access.
Local community discussion on Chinese Wikisource has already occurred and the program has been approved.
The scope of this program is Chinese Wikisource only; Wikidata item creation is explicitly out of scope unless a separate process later approves a curated subset.
Why Chinese Wikisource needs this corpus
Chinese court judgments are not only legal texts. In practice they are primary-source records of how disputes are handled, how evidence is assessed, what reasoning is given, and how laws are applied across regions and courts. For researchers, journalists, civil-society groups, and ordinary citizens, judgments are often the most detailed public description of a real-world event that exists.
The key motivation for this project, and the reason the Chinese Wikisource community has engaged with it seriously, is that access to these judgments is being rolled back in ways that are widely understood as politically constrained.
China Judgments Online (中国裁判文书网) was introduced as a major transparency initiative, and for years it enabled large-scale public access to court decisions. Since around 2021, and especially across 2023-2024, multiple independent investigations and major news organizations have reported systematic removal of previously public judgments, with removals disproportionately affecting politically sensitive cases or cases that reflect poorly on local authorities or procedures. Reporting has also described policy changes and plans to reduce what remains publicly accessible, including moving more judgments into court-internal systems. This is not a hypothetical risk: the public corpus has already been shrinking, and further shrinkage is plausible.
For context (and so reviewers can evaluate the factual basis directly), here are several representative reports on the rollback of access to Chinese court rulings:
- Yang, Zeyi (2023-12-20). "China’s judicial system is becoming even more secretive". MIT Technology Review. Retrieved 2025-12-06.
- Gu, Ting (2023-12-14). "China to limit access to court judgment searches to internal use". Radio Free Asia. Retrieved 2025-12-06.
- Ma, Josephine (2023-12-22). "China to cut back access to court rulings, sparking transparency concerns". South China Morning Post. Retrieved 2025-12-06.
- Chen, Laurie (2024-01-22). "China vows judicial disclosure after outcry over plan to curb access to rulings". Reuters. Retrieved 2025-12-06.
In that context, preserving judgments is not merely "mirroring a website." It is a preservation response to an ongoing loss of public records. Mirrors exist, but many are commercialized behind paywalls, fragmented, or may comply with additional takedown practices that reduce completeness. A Wikimedia-hosted corpus would not prevent any government from changing its own publication practices, but it can preserve what has already been made public, with transparent governance and community oversight.
Chinese Wikisource is the movement's best-fitting venue for this preservation work because it is designed for maintaining large bodies of text with long-term stewardship. Unlike a static mirror, Wikisource offers durable hosting, transparent revision history, community-led formatting and navigation improvements, and consistent metadata for search and reuse. If this can be done within WMF's privacy, legal, and operational constraints, it provides a credible, movement-aligned way to prevent ongoing censorship and access rollback from quietly erasing a historically important public corpus.
Existing preparation on Chinese Wikisource
Chinese Wikisource already has a dedicated metadata template, wikisource:zh:Template:Header/裁判文书.
This template provides a standardized place to store key fields (court, case number, date, document type, etc.), which supports on-wiki navigation and also enables the movement to evaluate, later and separately, whether limited structured-data export is useful.
For reviewers who want to inspect formatting and Header usage, here are three example pages:
- wikisource:zh:贺某某、哈某某等民间借贷纠纷民事二审民事判决书
- wikisource:zh:阿某某、陈某侵权责任纠纷民事二审民事判决书
- wikisource:zh:李某某、包头市某某矿业有限公司民事一审民事判决书
Community resolution: approved on January 28, 2026.
Program design principles
This program is intentionally conservative.
First, progression is gated. Every expansion of scale requires an explicit decision to proceed. The program can stop permanently at any completed stage.
Second, privacy risks are treated as inevitable. Even if a public authority has published a document, redaction mistakes occur in real-world datasets. Moreover, hosting on Wikimedia can increase discoverability by search engines and internal search. Therefore this program includes scanning, sampling, a published incident workflow, and a hard stop condition.
Third, this is not a Wikidata mass-ingestion program. The default and initial scope is Chinese Wikisource only.
Provenance and verifiability
Each imported page will carry a simple, consistent provenance record: the docId from the archival dataset, and a source statement that the text is imported from an archival dataset derived from CJO HTML.
To make this concrete, the intended on-page source statement is:
本文文本导入自 caseopen.org 存档数据集(HTML 上网稿);参见[https://wenshu.court.gov.cn/website/wenshu/181107ANFZ0BXSK4/index.html?docId={{{docid|}}} 中国裁判文书网原始页面](需登录检视)。
English translation:
"The text of this page is imported from the caseopen.org archival dataset (HTML 'online publication' version). See the original China Judgments Online page (login required to view)."
Where helpful for readers, the docId can be used to reconstruct the corresponding CJO URL pattern. However, CJO is now login-walled, and the import workflow will not attempt to fetch, validate, or re-check live availability of individual pages. Accordingly, the project does not claim that the original remains publicly accessible at import time, nor that on-demand revalidation against CJO is possible at scale.
To keep provenance display maintainable at scale, the source statement will be rendered via the existing metadata template (wikisource:zh:Template:Header/裁判文书), so the display format can be adjusted later without mass-editing pages.
Edits will use a consistent edit tag (e.g., "CJOPP") and an edit summary that includes the docId, so that imports can be monitored and sampled.
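The provenance pieces above can be sketched together. This is an illustrative sketch, not the actual bot: the Header parameter name (`docid`) follows the template parameter shown in the source statement, the docId value is a made-up placeholder, and the edit-summary wording is an assumption consistent with the tagging scheme described above.

```python
# Sketch: assembling the provenance wikitext and edit summary for one
# imported judgment. The real field mapping lives in /Ingestion pipeline.
CJO_URL = ("https://wenshu.court.gov.cn/website/wenshu/"
           "181107ANFZ0BXSK4/index.html?docId={doc_id}")

def build_page(doc_id: str, body_wikitext: str) -> tuple[str, str]:
    """Return (page_wikitext, edit_summary) for a single judgment."""
    # The Header template renders the source statement, so the display
    # format can be changed later without mass-editing pages.
    header = "{{Header/裁判文书|docid=%s}}" % doc_id
    page = header + "\n" + body_wikitext
    # Consistent summary carrying the docId so imports can be monitored
    # and sampled; the "CJOPP" tag is applied via the edit-tag mechanism.
    summary = f"CJOPP import; docId={doc_id}; source: caseopen.org archival dataset"
    return page, summary

# Usage with a placeholder docId; the CJO URL can be reconstructed from
# the same docId where helpful for readers (login required to view).
page, summary = build_page("1234abcd", "某某判决书正文……")
reconstructed_url = CJO_URL.format(doc_id="1234abcd")
```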
Privacy and incident response
Even when a document has been published online, large corpora can contain redaction mistakes, and hosting on Wikimedia can make information easier to discover. For that reason, the program includes preventative checks, sampling, and a clear response path.
Before publishing any batch, candidate texts will be scanned for common high-risk personal data patterns (phone numbers, ID-like strings, bank-account-like strings, and address-like strings using heuristics). In the early stages, pages will still be reviewed before saving. As scale grows, each stage will publish a sampling plan and sampling results before proceeding.
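The pattern scan described above can be sketched minimally. The regular expressions below are simplified heuristics for illustration only, not the production rule set; real scanning would need broader coverage (address-like strings in particular resist simple regexes) and measured false-negative rates.

```python
import re

# Heuristic sketch of the pre-publication personal-data scan.
PATTERNS = {
    "phone":   re.compile(r"1[3-9]\d{9}"),    # mainland mobile numbers
    "id_card": re.compile(r"\d{17}[\dXx]"),   # 18-character resident ID numbers
    "bank":    re.compile(r"\d{16,19}"),      # bank-card-like digit runs
}

def scan(text: str) -> dict[str, list[str]]:
    """Return pattern name -> matches; any hit flags the page for review."""
    hits = {name: pat.findall(text) for name, pat in PATTERNS.items()}
    return {name: found for name, found in hits.items() if found}

# A flagged page is held back for manual review instead of being saved.
flags = scan("联系电话13912345678,判决如下……")
```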
A dedicated on-wiki page will describe how to report unredacted personal data and how requests will be triaged and handled (draft: /Privacy request workflow).
The bot will stop immediately if a credible privacy issue is reported, if audits show repeated misses above an agreed threshold, or if a pause is requested for operational reasons.
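The stop conditions above reduce to a simple check. The threshold value below is a placeholder to be agreed on-wiki, not a decided number.

```python
# Sketch of the bot's stop-condition check; 0.5% is an illustrative
# placeholder for the "agreed threshold" of audit misses.
MISS_RATE_THRESHOLD = 0.005

def should_stop(reported_privacy_issue: bool,
                sampled: int, misses: int,
                pause_requested: bool) -> bool:
    # Credible privacy reports and operational pause requests stop the
    # bot unconditionally; otherwise compare the audited miss rate.
    if reported_privacy_issue or pause_requested:
        return True
    return sampled > 0 and misses / sampled > MISS_RATE_THRESHOLD
```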
Scope
To reduce complexity and risk, the scope is limited to 判决书 (judgment). Other document types (裁定书 (ruling/order), 决定书 (decision), 通知书 (notice), etc.) are explicitly deferred until the pipeline is proven and separately approved on Chinese Wikisource.
Wikidata and structured data
The default scope of this program is Chinese Wikisource only.
If, in the future, there is interest in structured-data integration, it should be handled through a separate discussion, which could consider one of two paths:
- a curated subset on Wikidata with Wikidata community approval, or
- a dedicated Wikibase instance for the full graph with selective synchronization later.
Nothing in this program requests or authorizes automatic Wikidata item creation. The program will coordinate with the Wikidata bot currently configured on Chinese Wikisource to avoid creating new items for pages created through this program.
Note: if Wikidata later becomes part of the scope, the approaches of wikidata:Wikidata:WikiProject Sweden/Swedish Riksdag documents could serve as a reference.
Staged program
The program is structured so that no step presumes the next.
Stage 0: Design review (no mass edits)
The following materials document the program design:
- /Ingestion pipeline: HTML to wikitext conversion approach, Header field mapping, category strategy
- #Provenance and verifiability: Provenance and source statement
- /Privacy request workflow
- #Existing preparation on Chinese Wikisource
Stage 1: Micro-pilot (50-200 pages; fully reviewed)
This stage is deliberately small and slow. The purpose is to validate formatting, metadata, and the privacy workflow in real conditions.
Stage report: describes what was edited, what issues were found, what categories were created, and whether any privacy incidents occurred.
Stage 2: Small bot test (200 pages; strongly throttled)
This is the first stage that runs the pipeline end-to-end as an automated bot task. The import is limited to 200 new pages from a defined slice, with a strong throttle (roughly 10 seconds per page) and continuous monitoring.
Stage report: includes the scanner summary, manual sampling results, and operational observations (recent changes load, job queue symptoms, category growth characteristics).
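The throttled batch described for this stage can be sketched as a simple loop. `save_page` is a placeholder for the real bot's save call (e.g. a pywikibot page save), and the 10-second figure matches the rough throttle mentioned above.

```python
import time

THROTTLE_SECONDS = 10  # rough per-page throttle for Stage 2

def run_batch(pages, save_page, limit=200, throttle=THROTTLE_SECONDS):
    """Save up to `limit` pages, sleeping `throttle` seconds between edits.

    `pages` yields (title, wikitext, edit_summary) triples;
    `save_page` is the (placeholder) save function.
    """
    saved = 0
    for title, wikitext, summary in pages:
        if saved >= limit:            # hard cap for this stage
            break
        save_page(title, wikitext, summary)
        saved += 1
        time.sleep(throttle)          # strong throttle between edits
    return saved
```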
Stage 3: Medium batch (up to 10,000 pages; still throttled)
This is the largest scale in the initial program scope. The purpose is to test "scale realism" without approaching hundreds of thousands of pages.
This stage includes increased sampling (for example, 1-2% manual checks) and a published report on privacy outcomes, provenance resolution rate, and search/categorization impact.
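Selecting the manual-check sample can be sketched as follows. Publishing a fixed seed alongside the sampling plan, so reviewers can re-derive the exact sample, is an assumed practice here, not something the program text specifies.

```python
import random

def pick_sample(page_titles, rate=0.02, seed=20260128):
    """Return a reproducible random sample of roughly `rate` of the titles."""
    rng = random.Random(seed)              # fixed seed -> reproducible sample
    k = max(1, round(len(page_titles) * rate))
    return sorted(rng.sample(list(page_titles), k))

# e.g. a 1% manual-check sample from a 10,000-page batch
titles = [f"判决书{i}" for i in range(10_000)]
sample = pick_sample(titles, rate=0.01)
```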
If the program is to grow beyond this point, a separate decision is required, with site reliability consultation, before any larger batch is considered.
Stage 4: Scaled import of remaining 判决书 (judgments), up to the full corpus
Local community consensus on Chinese Wikisource supports the eventual import of all 判决书 from the archival dataset, provided the workflow is stable and privacy risks are handled. Accordingly, the end stage of this program is the staged import of the remaining 判决书 pages, with community-reviewed stage gates before each expansion in volume or speed.
A practical structure (aligned with the original local batch plan) is:
- Stage 4A: 2024-10 judgments (up to 269,052 pages), with an agreed throttle and published audit results
- Stage 4B: remaining 2024 judgments (up to about 2.5 million pages), contingent on community review after Stage 4A reporting
- Stage 4C: remaining judgments in the dataset (up to about 30 million pages), contingent on community review after Stage 4B reporting
At each sub-stage, the review should consider technical feasibility (including database growth, job queue and search impact), privacy incident rates and response capacity, and whether additional guardrails or pauses are needed.
Program scope and boundaries
The following are explicitly out of scope for this program:
- Importing additional document types beyond 判决书; the Chinese Wikisource community has not yet decided whether to import them.
- Large-scale Wikidata item creation.
- Scaling beyond Stage 3 without a separate community discussion first.
Safety comes first, and pausing or stopping at any stage is acceptable.
Notes
This page will be updated with stage reports, audit methods, and any incident handling outcomes.