
China Judgments Online Preservation Program

From Meta, a Wikimedia project coordination wiki
Shortcut:
CJOPP

This is a record of the staged program on Chinese Wikisource to preserve a large corpus of Chinese court judgments which have already been published online but are at ongoing risk of disappearance or restricted access.

Local community discussion on Chinese Wikisource has already occurred and the program has been approved.

The scope of this program is Chinese Wikisource only; Wikidata item creation is explicitly out of scope unless a separate process later approves a curated subset.

Why Chinese Wikisource needs this corpus


Chinese court judgments are not only legal texts. In practice they are primary-source records of how disputes are handled, how evidence is assessed, what reasoning is given, and how laws are applied across regions and courts. For researchers, journalists, civil-society groups, and ordinary citizens, judgments are often the most detailed public description of a real-world event that exists.

The key motivation for this project, and the reason the Chinese Wikisource community has engaged with it seriously, is that access to these judgments is being rolled back in ways that are widely understood as politically constrained.

China Judgments Online (中国裁判文书网) was introduced as a major transparency initiative, and for years it enabled large-scale public access to court decisions. Since around 2021, and especially across 2023-2024, multiple independent investigations and major news organizations have reported systematic removal of previously public judgments, with removals disproportionately affecting politically sensitive cases or cases that reflect poorly on local authorities or procedures. Reporting has also described policy changes and plans to reduce what remains publicly accessible, including moving more judgments into court-internal systems. This is not a hypothetical risk: the public corpus has already been shrinking, and further shrinkage is plausible.

For context (and so reviewers can evaluate the factual basis directly), here are several representative reports on the rollback of access to Chinese court rulings:

In that context, preserving judgments is not merely "mirroring a website." It is a preservation response to an ongoing loss of public records. Mirrors exist, but many are commercialized behind paywalls, fragmented, or may comply with additional takedown practices that reduce completeness. A Wikimedia-hosted corpus would not prevent any government from changing its own publication practices, but it can preserve what has already been made public, with transparent governance and community oversight.

Chinese Wikisource is the movement's best-fitting venue for this preservation work because it is designed for maintaining large bodies of text with long-term stewardship. Unlike a static mirror, Wikisource offers durable hosting, transparent revision history, community-led formatting and navigation improvements, and consistent metadata for search and reuse. If this can be done within WMF's privacy, legal, and operational constraints, it provides a credible, movement-aligned way to prevent ongoing censorship and access rollback from quietly erasing a historically important public corpus.

Existing preparation on Chinese Wikisource


Chinese Wikisource already has a dedicated metadata structure: wikisource:zh:Template:Header/裁判文书.

This template provides a standardized place to store key fields (court, case number, date, document type, etc.) which supports on-wiki navigation and also enables the movement to evaluate, later and separately, whether limited structured-data export is useful.

For reviewers who want to inspect formatting and Header usage, here are three example pages:

Community resolution: Approval on January 28, 2026.

Program design principles


This program is intentionally conservative.

First, progression is gated. Every expansion of scale requires an explicit decision to proceed. The program can stop permanently at any completed stage.

Second, privacy risks are treated as inevitable. Even if a public authority has published a document, redaction mistakes occur in real-world datasets. Moreover, hosting on Wikimedia can increase discoverability by search engines and internal search. Therefore this program includes scanning, sampling, a published incident workflow, and a hard stop condition.

Third, this is not a Wikidata mass-ingestion program. The default and initial scope is Chinese Wikisource only.

Provenance and verifiability


Each imported page will carry a simple, consistent provenance record: the docId from the archival dataset, and a source statement that the text is imported from an archival dataset derived from CJO HTML.

To make this concrete, the intended on-page source statement is:

本文文本导入自 caseopen.org 存档数据集(HTML 上网稿);参见[https://wenshu.court.gov.cn/website/wenshu/181107ANFZ0BXSK4/index.html?docId={{{docid|}}} 中国裁判文书网原始页面](需登录检视)。

English translation:

"The text of this page is imported from the caseopen.org archival dataset (HTML 'online publication' version). See the original China Judgments Online page (login required to view)."

Where helpful for readers, the docId can be used to reconstruct the corresponding CJO URL pattern. However, CJO is now login-walled, and the import workflow will not attempt to fetch, validate, or re-check live availability of individual pages. Accordingly, the project does not claim that the original remains publicly accessible at import time, nor that on-demand revalidation against CJO is possible at scale.
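As a minimal sketch of the URL reconstruction described above: the path segment comes from the source-statement template earlier in this section, and the function simply substitutes the docId. Since CJO is login-walled, the result is for provenance display only and is never fetched or validated by the import workflow.

```python
# Sketch only: reconstruct a CJO URL from a docId for provenance display.
# The path segment "181107ANFZ0BXSK4" is taken from the source statement
# above; the workflow does not fetch or validate this URL.
CJO_URL_TEMPLATE = (
    "https://wenshu.court.gov.cn/website/wenshu/"
    "181107ANFZ0BXSK4/index.html?docId={doc_id}"
)

def cjo_url(doc_id: str) -> str:
    """Return the reconstructed China Judgments Online URL for a docId."""
    return CJO_URL_TEMPLATE.format(doc_id=doc_id)
```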

To keep provenance display maintainable at scale, the source statement will be rendered via the existing metadata template (wikisource:zh:Template:Header/裁判文书), so the display format can be adjusted later without mass-editing pages.

Edits will use a consistent edit tag (e.g., "CJOPP") and an edit summary that includes the docId, so that imports can be monitored and sampled.
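A sketch of the edit metadata just described. Only the "CJOPP" tag and the inclusion of the docId come from the program design; the exact summary wording here is an assumption for illustration.

```python
# The "CJOPP" tag and docId-in-summary convention come from the program
# design above; the summary wording itself is a hypothetical example.
EDIT_TAG = "CJOPP"

def edit_summary(doc_id: str) -> str:
    """Build an edit summary carrying the docId for monitoring and sampling."""
    return f"[{EDIT_TAG}] Import judgment text (docId={doc_id})"
```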

Privacy and incident response


Even when a document has been published online, large corpora can contain redaction mistakes, and hosting on Wikimedia can make information easier to discover. For that reason, the program includes preventative checks, sampling, and a clear response path.

Before publishing any batch, candidate texts will be scanned for common high-risk personal data patterns (phone numbers, ID-like strings, bank-account-like strings, and address-like strings using heuristics). In the early stages, pages will still be reviewed before saving. As scale grows, each stage will publish a sampling plan and sampling results before proceeding.
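The pattern scan above can be sketched as follows. The regular expressions here are illustrative assumptions, not the program's published heuristics: they flag mainland-mobile-shaped numbers, 18-character resident-ID-shaped strings, and bank-card-length digit runs, and a match only marks a candidate for human review.

```python
import re

# Illustrative heuristics only; the program's actual patterns are not
# specified here. Matches flag candidates for manual review, nothing more.
PATTERNS = {
    "phone": re.compile(r"1[3-9]\d{9}"),    # mainland-mobile-shaped numbers
    "id_like": re.compile(r"\d{17}[\dXx]"), # 18-char resident-ID shape
    "bank_like": re.compile(r"\d{16,19}"),  # bank-card-length digit runs
}

def scan(text: str) -> dict:
    """Return every heuristic match found in text, keyed by pattern name."""
    return {name: pattern.findall(text) for name, pattern in PATTERNS.items()}
```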

A dedicated on-wiki page will describe how to report unredacted personal data and how requests will be triaged and handled (draft: /Privacy request workflow).

The bot will stop immediately if a credible privacy issue is reported, if audits show repeated misses above an agreed threshold, or if a pause is requested for operational reasons.

Scope


To reduce complexity and risk, the scope is limited to 判决书 (judgment). Other document types (裁定书 (ruling/order), 决定书 (decision), 通知书 (notice), etc.) are explicitly deferred until the pipeline is proven and separately approved on Chinese Wikisource.

Wikidata and structured data


The default scope of this program is Chinese Wikisource only.

If, in the future, there is interest in structured-data integration, it should be handled through a separate discussion, which could consider one of two paths:

  • a curated subset on Wikidata with Wikidata community approval, or
  • a dedicated Wikibase instance for the full graph with selective synchronization later.

Nothing in this program requests or authorizes automatic Wikidata item creation. The program will coordinate with the Wikidata bot currently configured on Chinese Wikisource to avoid creating new items for pages created through this program.

Note: if Wikidata later becomes part of the scope, the approach of wikidata:Wikidata:WikiProject Sweden/Swedish Riksdag documents could serve as a reference.

Staged program


The program is structured so that no step presumes the next.

Stage 0: Design review (no mass edits)


The following materials document the program design:

Stage 1: Micro-pilot (50-200 pages; fully reviewed)


This stage is deliberately small and slow. The purpose is to validate formatting, metadata, and the privacy workflow in real conditions.

Stage report: a report describing what was edited, what issues were found, what categories were created, and whether any privacy incidents occurred.

50 pages were created in three batches. Categories were added following existing naming norms. No privacy issues were found. No format issues were found besides [sic].
  1. wikisource:zh:王某某、巨某某等民事一审民事判决书
  2. wikisource:zh:中国平安财产保险股份有限公司揭阳市普宁支公司、黄某娜等机动车交通事故责任纠纷民事二审民事判决书
  3. wikisource:zh:屈xx、xx公司民事一审民事判决书
  4. wikisource:zh:曹某、曹某某等民事一审民事判决书
  5. wikisource:zh:中国某某财产保险股份有限公司佛山中心支公司、覃某锦等机动车交通事故责任纠纷民事二审民事判决书
  6. wikisource:zh:祁某某、杨某等民事一审民事判决书
  7. wikisource:zh:岳某、陈某1等民事一审民事判决书
  8. wikisource:zh:黎亚妹、海南禾瑞辰实业有限公司等房屋租赁合同纠纷民事一审民事判决书
  9. wikisource:zh:高某、闫某林民事一审民事判决书
  10. wikisource:zh:陈孝杰开设赌场罪刑事一审刑事判决书
  11. wikisource:zh:潘从贵走私、贩卖等刑事一审刑事判决书
  12. wikisource:zh:董某某、李某某民事一审民事判决书
  13. wikisource:zh:周某某、某某养老保险股份有限公司内蒙古分公司民事一审民事判决书
  14. wikisource:zh:内蒙古某公司、谭某民事一审民事判决书
  15. wikisource:zh:李彪、李德均等偷越国(边)境罪、偷越国(边)境罪刑事一审刑事判决书
  16. wikisource:zh:蒙商银行股份有限公司呼和浩特新苑支行、张美玲等民事一审民事判决书
  17. wikisource:zh:敖汉旗某有限公司、康某民事一审民事判决书
  18. wikisource:zh:王大春、易俊名民事一审民事判决书
  19. wikisource:zh:黑某某、郭某某民事一审民事判决书
  20. wikisource:zh:陈某某甲、邹某某等与吴某某提供劳务者受害责任纠纷一审民事判决书
  21. wikisource:zh:绵阳某某广场有限公司与绵阳某某商业管理有限公司一审民事判决书
  22. wikisource:zh:菅某某、宝某某民事一审民事判决书
  23. wikisource:zh:李光贤、尹良等民事一审民事判决书
  24. wikisource:zh:曹建兴、鄂尔多斯市博源小额贷款有限责任公司等小额借款合同纠纷民事一审民事判决书
  25. wikisource:zh:国某某、王某某等民事一审民事判决书
  26. wikisource:zh:贺某某、陈某某盗窃罪、盗窃罪刑事一审刑事判决书
  27. wikisource:zh:胡友康、唐天虎生命权、健康权等民事一审民事判决书
  28. wikisource:zh:李某、史某某民事一审民事判决书
  29. wikisource:zh:陈其超、四川虎蚁会务会展服务有限公司等提供劳务者受害责任纠纷民事一审民事判决书
  30. wikisource:zh:中国农业银行股份有限公司呼和浩特迎宾支行、图雅民事一审民事判决书
  31. wikisource:zh:岳池付伟门业经营部、刘合斌买卖合同纠纷民事一审民事判决书
  32. wikisource:zh:杜某某与冯某某、罗某某劳务合同纠纷一审民事判决书
  33. wikisource:zh:四川某某广告传媒有限公司与成都某某置业有限公司广告合同纠纷一审民事判决书
  34. wikisource:zh:梁某、谢某1等民事一审民事判决书
  35. wikisource:zh:内蒙古某公司、乔某民事一审民事判决书
  36. wikisource:zh:柳州某某电器有限公司与中国某某财产保险股份有限公司柳州市分公司财产保险合同纠纷一审民事判决书
  37. wikisource:zh:万某某、常某某等民事一审民事判决书
  38. wikisource:zh:喻远成、陈仁清等国有资产行政管理(国资)行政一审行政判决书
  39. wikisource:zh:杨连杰盗窃罪、盗窃罪刑事一审刑事判决书
  40. wikisource:zh:通辽市某房地产发展总公司、柴某贵民事一审民事判决书
  41. wikisource:zh:兴业银行股份有限公司呼和浩特分行、李飞喜民事一审民事判决书
  42. wikisource:zh:韩某某、张某某民事一审民事判决书
  43. wikisource:zh:庞某、徐某民事一审民事判决书
  44. wikisource:zh:甄某、张某民事一审民事判决书
  45. wikisource:zh:邓仕全、绵阳市公安局高新技术产业开发区分局司法行政管理(司法行政)行政二审行政判决书
  46. wikisource:zh:吴某交通肇事一审刑事判决书
  47. wikisource:zh:孙伟、胡科虎民事一审民事判决书
  48. wikisource:zh:胡某吉与四川省某食品科技有限公司买卖合同纠纷一审民事判决书
  49. wikisource:zh:崔xx、巴彦淖尔市xx房地产开发有限责任公司民事一审民事判决书
  50. wikisource:zh:罗某与谢某瑞买卖合同纠纷一审民事判决书

Stage 2: Small bot test (200 pages; strongly throttled)


This is the first stage that runs the import end-to-end under a bot account. The import is limited to 200 new pages from a defined slice, with a strong throttle (e.g., roughly 10 seconds per page) and continuous monitoring.
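A minimal sketch of the throttling just described. The bot framework and its own rate limits are not specified here; this generator simply guarantees a floor of one page per interval, with the 10-second default mirroring the figure above.

```python
import time

# Illustrative throttle only: guarantees at most one page per interval,
# independent of whatever rate limiting the bot framework adds on top.
def throttled(pages, delay_seconds: float = 10.0):
    """Yield pages no faster than one every delay_seconds."""
    for page in pages:
        started = time.monotonic()
        yield page
        remaining = delay_seconds - (time.monotonic() - started)
        if remaining > 0:
            time.sleep(remaining)
```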

Stage report: a report including scanner summary, manual sampling results, and operational observations (recent changes load, job queue symptoms, category growth characteristics).

Stage 3: Medium batch (up to 10,000 pages; still throttled)


This is the largest scale in the initial program scope. The purpose is to test "scale realism" without approaching hundreds of thousands of pages.

This stage includes increased sampling (for example, 1-2% manual checks) and a published report on privacy outcomes, provenance resolution rate, and search/categorization impact.
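The sampling plan above could be made reproducible along these lines. The fixed seed is an assumption on my part: it lets the published sampling results be re-derived by reviewers, since the same seed always selects the same pages.

```python
import random

# Illustrative sketch of a reproducible sampling plan. The fixed seed is
# an assumption: it lets reviewers re-derive exactly which pages were
# selected for the 1-2% manual checks published in the stage report.
def sample_for_review(page_titles, rate: float = 0.02, seed: int = 0) -> list:
    """Select roughly rate * len(page_titles) pages for manual review."""
    rng = random.Random(seed)
    return [title for title in page_titles if rng.random() < rate]
```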

If the program is to grow beyond this point, a separate decision is required, with site reliability consultation, before any larger batch is considered.

Stage 4: Scaled import of remaining 判决书 (judgments), up to the full corpus


Local community consensus on Chinese Wikisource supports the eventual import of all 判决书 from the archival dataset, provided the workflow is stable and privacy risks are handled. Accordingly, the end stage of this program is the staged import of the remaining 判决书 pages, with community-reviewed stage gates before each expansion in volume or speed.

A practical structure (aligned with the original local batch plan) is:

  • Stage 4A: 2024-10 judgments (up to 269,052 pages), with an agreed throttle and published audit results
  • Stage 4B: remaining 2024 judgments (up to about 2.5 million pages), contingent on community review after Stage 4A reporting
  • Stage 4C: remaining judgments in the dataset (up to about 30 million pages), contingent on community review after Stage 4B reporting

At each sub-stage, the review should consider technical feasibility (including database growth, job queue and search impact), privacy incident rates and response capacity, and whether additional guardrails or pauses are needed.

Program scope and boundaries


The following are explicitly out of scope for this program:

  • Importing additional document types beyond 判决书. The Chinese Wikisource community has not yet decided whether to import them.
  • Large-scale Wikidata item creation.
  • Scaling beyond Stage 3 without a separate community discussion first.

Safety comes first, and pausing or stopping at any stage is acceptable.

Notes


This page will be updated with stage reports, audit methods, and any incident handling outcomes.