Abstract Wiki Architect/Architecture
⚠️ Disclaimer: This project is the sole independent work of Réjean McCormick. It is not affiliated with, endorsed by, or representing the Wikimedia Foundation.
System Architecture
The Abstract Wiki Architect replaces the traditional monolithic renderer with a modular pipeline designed for reuse, scalability, and testability.
The Generation Pipeline
The system follows a strict flow from abstract meaning to surface text. The pipeline is composed of five distinct layers:
- Router (`router.py`): The entry point. It receives a language code (e.g., `it`, `zu`) and resolves the appropriate Family Engine (e.g., `Romance`, `Bantu`) and configuration files.
- Semantics & Discourse: Builds strictly typed, language-independent frames (e.g., `BioFrame`) and tracks discourse context (salience, topic) to decide how an entity is referred to (e.g., the pronoun "She" vs. the full name "Marie Curie").
- Constructions: Selects the abstract clause pattern (e.g., "X is a Y", "X has Y") without yet knowing the specific language morphology.
- Family Engine: Realizes the abstract syntax into concrete words and structures using the shared family matrix and the specific language card.
- Morphology & Lexicon: Handles the final surface realization, including inflection, agreement, and phonology, utilizing the decoupled lexicon subsystem.
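The first three layers can be sketched as follows. This is only an illustration under stated assumptions: the language-to-family table, the `BioFrame` fields, the `Discourse` class, and the `X_IS_A_Y` pattern name are hypothetical and are not taken from the project's code.

```python
from dataclasses import dataclass, field

# Hypothetical language-to-family table; the real router.py presumably
# reads this from configuration files.
FAMILY_BY_LANGUAGE = {"it": "Romance", "fr": "Romance", "zu": "Bantu"}


def route(language_code: str) -> str:
    """Layer 1: resolve a language code to the name of its family engine."""
    code = language_code.strip().lower()
    if code not in FAMILY_BY_LANGUAGE:
        raise LookupError(f"no family engine registered for {code!r}")
    return FAMILY_BY_LANGUAGE[code]


@dataclass(frozen=True)
class BioFrame:
    """Layer 2: a typed, language-independent frame (fields are assumed)."""
    subject: str
    profession: str


@dataclass
class Discourse:
    """Layer 2: remembers which entities have already been mentioned."""
    mentioned: set = field(default_factory=set)

    def refer(self, entity: str) -> tuple:
        # First mention uses the full name; later mentions may use a pronoun.
        kind = "PRONOUN" if entity in self.mentioned else "NAME"
        self.mentioned.add(entity)
        return (kind, entity)


def select_construction(frame: BioFrame, discourse: Discourse) -> dict:
    """Layer 3: choose the abstract "X is a Y" pattern, with no morphology yet."""
    return {
        "pattern": "X_IS_A_Y",
        "x": discourse.refer(frame.subject),
        "y": frame.profession,
    }
```

Calling `select_construction` twice with the same frame and the same `Discourse` object yields a `NAME` reference the first time and a `PRONOUN` reference the second, which is the decision the Semantics & Discourse layer hands down to the language-specific layers.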
The Hybrid Factory (Waterfall Logic)
To solve the "Long Tail" problem (supporting 300+ languages), the build system (`build_300.py`) employs a Waterfall Priority System to determine which grammar source to use for any given language.
| Priority and source | Strategy | Directory | Description |
|---|---|---|---|
| 1. Official RGL | High Road | `gf/gf-rgl/` | High-quality, expert-written grammars from the standard Resource Grammar Library (approx. 40 languages). This is the default if available. |
| 2. Contrib | Manual Override | `gf/contrib/` | If a community member writes a grammar for a missing language (e.g., Quechua), it is placed here. It automatically overrides the Factory version, allowing for incremental improvement. |
| 3. Factory | Automated Fallback | `gf/generated/` | The Language Factory (`grammar_factory.py`) automatically generates a "Pidgin" grammar (simple SVO concatenation) for any ISO code not found in Tier 1 or 2, so every requested language code resolves to at least a basic grammar. |
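The waterfall in the table can be sketched as a first-match search over the tiers. This is a minimal illustration, not the logic of `build_300.py`: the function name `choose_source` and the injected `has_grammar` predicate are assumptions, and how the real build script detects an existing grammar in each directory is not specified here.

```python
from typing import Callable, Tuple

# Tiers that are checked in priority order, using the directories from the table.
TIERS = (
    ("rgl", "gf/gf-rgl/"),
    ("contrib", "gf/contrib/"),
)
# Tier 3 needs no check: the factory generates a grammar on demand.
FALLBACK = ("factory", "gf/generated/")


def choose_source(
    iso_code: str, has_grammar: Callable[[str, str], bool]
) -> Tuple[str, str]:
    """Return (tier_name, directory) for the first tier that covers iso_code.

    has_grammar(directory, iso_code) is a caller-supplied predicate
    (hypothetical) that reports whether a grammar exists in that directory.
    """
    for tier, directory in TIERS:
        if has_grammar(directory, iso_code):
            return (tier, directory)
    return FALLBACK
```

Passing the existence check in as a predicate keeps the priority order testable without a real `gf/` checkout.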
Key Design Decisions
The architecture is defined by three foundational decisions that separate it from standard renderers.
1. Family Engines vs. Language Engines
Instead of implementing 300 separate engines (one per language), the system implements ~15 family-level engines (Romance, Germanic, Slavic, Bantu, etc.).
- Rationale: Languages within a family share core morphosyntactic patterns (e.g., Bantu noun classes, Romance gender/plural logic).
- Benefit: Reduces code duplication. A single fix in the `RomanceEngine` benefits French, Italian, Spanish, and Portuguese at once.
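The idea of one engine serving several languages can be sketched like this. The `RomanceEngine` name comes from the text above, but its constructor, the `copula_clause` method, and the card keys are hypothetical, and the cards are deliberately toy-sized (no elision, no article allomorphy).

```python
class RomanceEngine:
    """One engine for the whole family; per-language data arrives as a card."""

    def __init__(self, card: dict):
        self.card = card

    def copula_clause(self, subject: str, noun: str, gender: str) -> str:
        """Realize "X is a Y" using the card's copula and indefinite article."""
        article = self.card["indefinite_article"][gender]
        return f"{subject} {self.card['copula_3sg']} {article} {noun}."


# Toy cards (assumed shape): a real card would carry far more information.
CARDS = {
    "fr": {"copula_3sg": "est", "indefinite_article": {"m": "un", "f": "une"}},
    "it": {"copula_3sg": "è", "indefinite_article": {"m": "un", "f": "una"}},
    "es": {"copula_3sg": "es", "indefinite_article": {"m": "un", "f": "una"}},
}

# The same class is instantiated once per language.
ENGINES = {code: RomanceEngine(card) for code, card in CARDS.items()}
```

Because `copula_clause` is written once, a change to it (for example, different punctuation handling) takes effect for every language that has a card, which is the reuse argument made in the rationale above.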
2. Data-Driven Morphology
Morphological rules are strictly separated from code. They are stored in JSON matrices (`data/morphology_configs/*.json`) and Language Cards.
- Rationale: Allows non-programmers (linguists) to edit language behavior without touching Python code.
- Benefit: Enables the "crowdsourcing" of Language Cards (e.g., "Here is the suffix table for Catalan") which can be plugged into the existing Romance Engine.
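A minimal sketch of how a Language Card might plug into a family matrix is shown below. The JSON schema (`plural_rules` as ordered ending/replacement pairs) and the function names are assumptions made for illustration, and the Catalan rules are heavily simplified.

```python
import json

# Hypothetical family-level defaults, as they might appear in a shared matrix.
FAMILY_MATRIX_JSON = '{"plural_rules": [["", "s"]]}'

# Hypothetical Catalan card: ordered (ending, replacement) pairs, first match wins.
# Simplified on purpose; real Catalan plural formation has more cases.
CATALAN_CARD_JSON = '{"plural_rules": [["a", "es"], ["", "s"]]}'


def load_rules(matrix_json: str, card_json: str) -> dict:
    """Start from the family defaults, then let the language card override them."""
    rules = json.loads(matrix_json)
    rules.update(json.loads(card_json))
    return rules


def pluralize(noun: str, rules: dict) -> str:
    """Apply the first suffix rule whose ending matches the noun."""
    for ending, replacement in rules["plural_rules"]:
        if noun.endswith(ending):
            stem = noun[: len(noun) - len(ending)]
            return stem + replacement
    return noun
```

With these rules, `pluralize("casa", ...)` replaces the final `a` with `es`, while `pluralize("gat", ...)` falls through to the default and appends `s`. Changing the behavior means editing the JSON, not the Python functions.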
3. Decoupled Lexicon Subsystem
The lexicon is not hidden inside the engines. It exists as a standalone package (`lexicon/*`) that interfaces with Wikidata.
- Rationale: Lexical data (words) should be distinct from Grammatical data (rules).
- Benefit: Allows for offline lexicon management, large-scale coverage reports, and reuse of the lexicon across different engines and constructions.
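The sketch below shows what a standalone lexicon interface with offline lookup and a coverage report could look like. The `Lexicon` class, its methods, and the shard format (a JSON object keyed by what are assumed to be Wikidata item IDs) are hypothetical; only the idea of per-language JSON shards comes from this page.

```python
import json
from pathlib import Path


class Lexicon:
    """Reads per-language JSON shards from disk; no engine code is involved."""

    def __init__(self, root):
        self.root = Path(root)
        self._cache = {}

    def _load(self, language: str) -> dict:
        # Merge every shard for the language once, then serve from memory.
        if language not in self._cache:
            entries = {}
            for shard in sorted((self.root / language).glob("*.json")):
                entries.update(json.loads(shard.read_text(encoding="utf-8")))
            self._cache[language] = entries
        return self._cache[language]

    def lookup(self, language: str, concept_id: str):
        """Return the entry for a concept, or None if the lexicon lacks it."""
        return self._load(language).get(concept_id)

    def coverage(self, language: str, concept_ids) -> float:
        """Fraction of the requested concepts that have an entry."""
        ids = list(concept_ids)
        if not ids:
            return 0.0
        known = self._load(language)
        return sum(1 for concept in ids if concept in known) / len(ids)
```

Because the class depends only on files under a root directory, the same instance can be shared by any engine or construction, and `coverage` can be run in bulk without starting the generation pipeline.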
Directory Structure
The project organizes assets to clearly separate Official, Manual, and Generated sources:
/abstract-wiki-architect/
├── architect_http_api/        # The Python Backend (FastAPI/Flask)
├── gf/
│   ├── gf-rgl/                # [Tier 1] Official RGL (Git Submodule)
│   ├── contrib/               # [Tier 2] Manual Overrides (Community contributions)
│   ├── generated/             # [Tier 3] Factory Output (Auto-wiped & Re-generated)
│   ├── build_300.py           # The Orchestrator Script (Compiles Wiki.pgf)
│   └── Wiki.pgf               # The Compiled Binary containing all 300+ languages
├── data/
│   ├── morphology_configs/    # Shared Family Matrices (e.g., romance_matrix.json)
│   └── lexicon/               # Sharded JSON Lexicons (e.g., /fr/people.json)
└── utils/
    └── grammar_factory.py     # The Language Factory Script (Generates Tier 3)