
Abstract Wikipedia/Tools/abstract-wiki-architect

From Meta, a Wikimedia project coordination wiki

⚠️ Updates available at https://github.com/Rejean-McCormick/SemantiK_Architect/wiki following refusal of WMF / Abstract Wiki to collaborate and consolidate efforts with me. Reason given: "Not what we had in mind" (no other details offered).

⚠️ Disclaimer: This project is the sole independent work of Réjean McCormick. It is not affiliated with, endorsed by, or representing the Wikimedia Foundation.

Abstract Wiki Architect
  • Initiator: Réjean McCormick
  • Status: Active (Hybrid Factory)
  • License: MIT
  • Language: Python / GF
  • Repository: GitHub

Abstract Wiki Architect is an industrial-scale Natural Language Generation (NLG) engine designed to scale Abstract Wikipedia to 300+ languages.

It addresses the "N+1 problem" of multilingual generation—where adding a new language typically requires writing a new renderer from scratch—by implementing a Hybrid Factory architecture. Instead of maintaining 300 separate scripts, the system relies on shared "Family Engines" for logic and automated fallback grammars for coverage.

Documentation

  • Architecture & Internals: Deep dive into the Waterfall Priority Logic, the Family Engine design, and the separation of semantics from morphology.
  • Contributor Guide: Technical manual on how to add a new language, "graduate" a language from Factory to Contrib status, and manage the `grammar_factory.py` script.
  • Build System & Data Map: Comprehensive reference for the Data Centralization files (Source of Truth), the Builder Core modules (Orchestrator, Factory, Compiler), and the Verification Utilities.
  • AI Integration & Services: Documentation of the Lexicographer (Data Generation), Surgeon (Self-Healing Code), and Judge (Quality Assurance) agents.

Core Concept: Consoles and Cartridges


The system's scalability model is easiest to explain with a gaming analogy that separates logic from data.

  • The Standard Approach: Building a separate renderer for every language (300 distinct consoles).
  • The Architect Approach: Building ~15 Universal Consoles (Family Engines) and plugging in 300 Cartridges (Language Cards).

For example, the Romance Engine implements the logic for gender, pluralization, and article elision common to the entire family. The Italian Card (a JSON file) simply provides the specific data (e.g., suffixes `-o`/`-i`) required to parameterize that engine for Italian. This allows the system to support a new Romance language (like Catalan or Galician) simply by adding a new JSON card, without writing new Python code.
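The "console and cartridge" split above can be sketched in a few lines of Python. This is an illustrative mock, not the project's actual API: the class name `RomanceEngine` and the card fields (`plural_suffixes`, `definite_articles`) are assumptions chosen to mirror the Italian example (`-o`/`-i` suffixes).

```python
import json

# A "Cartridge": pure data, one JSON card per language. (Field names
# are hypothetical; the real card schema may differ.)
ITALIAN_CARD = json.loads("""
{
    "language": "ita",
    "plural_suffixes": {"o": "i", "a": "e", "e": "i"},
    "definite_articles": {"masc": "il", "fem": "la"}
}
""")

class RomanceEngine:
    """A "Console": shared Romance-family logic, parameterized by a card."""

    def __init__(self, card):
        self.card = card

    def pluralize(self, noun):
        # Swap the final vowel per the card's suffix table (e.g. -o -> -i).
        final = noun[-1]
        return noun[:-1] + self.card["plural_suffixes"].get(final, final)

    def with_article(self, noun, gender):
        return f"{self.card['definite_articles'][gender]} {noun}"

engine = RomanceEngine(ITALIAN_CARD)
print(engine.pluralize("libro"))           # libri
print(engine.with_article("casa", "fem"))  # la casa
```

Supporting Catalan or Galician would then mean writing a new JSON card with that language's suffix and article tables, while `RomanceEngine` itself stays untouched.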

The Hybrid Factory


The system guarantees 100% API availability for all target languages using a Waterfall Priority System.

When a generation request is made for a specific language code (e.g., `zul`), the orchestrator (`build_300.py`) resolves the grammar source in the following strict order:

  • 🥇 Priority 1 (Official RGL): Uses expert academic grammars from the Grammatical Framework Resource Grammar Library where available (approx. 40 languages). This is the "High Road" for quality.
  • 🥈 Priority 2 (Contrib): Uses manual overrides or community-written grammars stored in `gf/contrib`. This allows users to "patch" specific languages without waiting for upstream RGL updates.
  • 🥉 Priority 3 (Factory): Uses the Language Factory (`grammar_factory.py`) to automatically generate a simplified "Pidgin" grammar (string concatenation) for any language not covered by Tier 1 or 2.

This architecture ensures the system never fails due to a missing language; it simply utilizes the best available generation strategy, allowing languages to "graduate" from Tier 3 to Tier 2 over time.
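The waterfall above reduces to a short lookup chain. The sketch below is a simplified stand-in for what `build_300.py` does, using in-memory sets instead of the real RGL and `gf/contrib` checks; the function name and set contents are assumptions for illustration.

```python
def resolve_grammar_source(lang_code, rgl_langs, contrib_langs):
    """Resolve a language code to a grammar tier, in strict priority order."""
    if lang_code in rgl_langs:       # Priority 1: official RGL grammar
        return "rgl"
    if lang_code in contrib_langs:   # Priority 2: manual/community override
        return "contrib"
    return "factory"                 # Priority 3: auto-generated pidgin grammar

# Toy inventories (illustrative, not the project's real coverage lists).
RGL = {"eng", "fra", "ita"}
CONTRIB = {"zul"}

print(resolve_grammar_source("fra", RGL, CONTRIB))  # rgl
print(resolve_grammar_source("zul", RGL, CONTRIB))  # contrib
print(resolve_grammar_source("xho", RGL, CONTRIB))  # factory
```

Because the final branch is unconditional, every language code resolves to *some* tier, which is what backs the "never fails due to a missing language" guarantee; "graduating" a language is then just moving its code from the factory default into the contrib set.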

AI-Powered Automation


The system integrates Large Language Models (Gemini) via a dedicated `ai_services` layer to handle tasks that require semantic understanding or fuzzy logic. This is divided into three specialized agents:

  • The Lexicographer: Automatically generates morphological dictionaries and seed lexicons for new languages.
  • The Surgeon: A self-healing mechanism that reads compiler error logs and surgically patches broken grammar source code during the build pipeline.
  • The Judge: A Quality Assurance agent that generates "Gold Standard" reference sentences and evaluates the linguistic naturalness of the engine's output.
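The Surgeon's compile-patch-retry loop can be sketched as below. This is a hypothetical reconstruction: `compile_grammar` and `llm_patch` are stand-in callables (the real pipeline would invoke the GF compiler and the Gemini-backed `ai_services` layer), and the simulated "bug" is contrived for the demo.

```python
def build_with_surgeon(source, compile_grammar, llm_patch, max_attempts=3):
    """Compile a grammar; on failure, let the LLM patch it and retry."""
    for _ in range(max_attempts):
        ok, log = compile_grammar(source)
        if ok:
            return source
        source = llm_patch(source, log)  # patch guided by the error log
    raise RuntimeError(f"Surgeon gave up after {max_attempts} attempts")

# Simulated stand-ins: "compilation" succeeds once the missing semicolon
# has been patched in.
def fake_compile(src):
    return (src.endswith(";"), "syntax error: missing ;")

def fake_patch(src, log):
    return src + ";"

print(build_with_surgeon('lin greet = "hello"', fake_compile, fake_patch))
```

Bounding the retries matters in practice: an LLM patch is not guaranteed to converge, so the loop must fail loudly rather than spin forever inside the build pipeline.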

Tool Ecosystem


The engine is supported by a frontend dashboard that manages the pipeline and visualization:

  • The Everything Matrix: A central dashboard tracking the maturity and build status of all 300+ languages.
  • The Status Page: Allows users to select active languages for generation.
  • The Editor: A visual tool for defining abstract Semantic Frames (e.g., `BioFrame`).
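A semantic frame like the `BioFrame` the Editor manipulates can be pictured as a small typed record plus a realization function. The field names and the English template below are assumptions for illustration; a Tier 3 factory language would realize the same frame with exactly this kind of plain string concatenation.

```python
from dataclasses import dataclass

@dataclass
class BioFrame:
    """A minimal abstract frame: language-neutral meaning, no surface text."""
    subject: str
    profession: str
    nationality: str

def render_eng(frame):
    # A factory-tier "pidgin" realization: simple string concatenation.
    return f"{frame.subject} is a {frame.nationality} {frame.profession}."

print(render_eng(BioFrame("Ada Lovelace", "mathematician", "British")))
```

The point of the frame is that the Editor only ever edits the dataclass-style fields; which renderer turns those fields into text is decided later by the waterfall.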

Diagrams


Where AWA (Architect) is used in the kOA digital ecosystem (Konnaxion + Orgo) — with SenTient + Kristal


Context (full disclosure): As mentioned in the "SenTient" topic on this page, I’m building and using AWA together with other tools in a broader personal project I call kOA Digital Ecosystem or kOA Sociotechnical Operating System. AWA remains a “gift” here and I’m not speaking on behalf of WMF; I’m sharing this only to clarify integration context.

The kOA framing in one paragraph


kOA treats knowledge as infrastructure: it compiles raw sources into Kristals (validated, structured, provenance-bound, portable semantic artifacts), then routes them through public coordination (Konnaxion) and/or closed-loop execution (Orgo). The goal is a full loop: knowledge → deliberation → decisions → execution → institutional memory.

Where AWA fits (high-level)


In kOA, AWA is the multilingual realization engine: given structured inputs (frames / functions / AST-like intent), it produces correct surface text through GF-based grammars. In other words:

  • Kristal = the portable “meaning package”
  • AWA = the deterministic multilingual renderer for that meaning
  • SenTient = upstream deconstruction / extraction tooling that helps turn messy sources into structured artifacts (often paired with deterministic tools like OpenRefine/OpenTapioca)
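The division of labor above can be sketched as a tiny pipeline: one Kristal in, one text per language out. Everything here is an illustrative assumption, not the actual Kristal schema or AWA API; in particular the per-language lambdas are toy templates standing in for GF-based realization.

```python
# A Kristal as a plain dict: meaning plus provenance, no surface text.
kristal = {
    "frame": "Bio",
    "args": {"subject": "Marie Curie", "profession": "physicist"},
    "provenance": "https://example.org/source",  # travels with the artifact
}

RENDERERS = {
    "eng": lambda a: f"{a['subject']} is a {a['profession']}.",
    # Toy template only; real AWA would lexicalize via GF grammars,
    # not reuse the English word inside a French sentence.
    "fra": lambda a: f"{a['subject']} est {a['profession']}.",
}

def render_all(kristal, renderers):
    """One meaning artifact, many language outputs."""
    return {lang: r(kristal["args"]) for lang, r in renderers.items()}

print(render_all(kristal, RENDERERS))
```

This also shows why the "stable reference" claim holds: Konnaxion can cite the Kristal itself, while the per-language strings are derived, regenerable views of it.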

Two deployment contexts: Konnaxion vs Orgo


1) Konnaxion (public / open layer)

  • AWA is used to publish/read compiled knowledge (Kristals) as multilingual text (and later, potentially as assembled paragraphs/pages).
  • The public commons can keep stable references: “one meaning artifact, many language outputs”.
  • This relates directly to common AWA questions like: where do translations live, what gets pre-built vs on-demand, and how to preserve provenance at the claim/sentence level.

2) Orgo (closed-loop / “inside the walls” execution layer)

  • Orgo is designed for offline-capable, auditable execution (routing, escalation, closure).
  • In that setting, AWA is used to generate multilingual briefs, SOPs, task text, and “operational” documents from the same Kristal-backed semantic substrate — using local runtime packs when disconnected.


(Again: Konnaxion and Orgo are not a WMF proposal; I’m only describing where AWA is used in my broader stack so the integration boundaries stay clear.)

Réjean McCormick (talk) 17:40, 30 January 2026 (UTC)