Abstract Wiki Architect/Architecture

From Meta, a Wikimedia project coordination wiki

⚠️ Disclaimer: This project is the sole independent work of Réjean McCormick. It is not affiliated with, endorsed by, or representing the Wikimedia Foundation.

System Architecture

The Abstract Wiki Architect replaces the traditional monolithic renderer with a modular pipeline designed for reuse, scalability, and testability.

The Generation Pipeline

The system follows a strict flow from abstract meaning to surface text. The pipeline is composed of five distinct layers:

  1. Router (`router.py`): The entry point. It receives a language code (e.g., `it`, `zu`) and resolves the appropriate Family Engine (e.g., `Romance`, `Bantu`) and configuration files.
  2. Semantics & Discourse: Builds strictly typed language-independent frames (e.g., `BioFrame`) and tracks discourse context (salience, topic) to determine pronoun usage (e.g., "She" vs. "Marie Curie").
  3. Constructions: Selects the abstract clause pattern (e.g., "X is a Y", "X has Y") without yet knowing the specific language morphology.
  4. Family Engine: Realizes the abstract syntax into concrete words and structures using the shared family matrix and the specific language card.
  5. Morphology & Lexicon: Handles the final surface realization, including inflection, agreement, and phonology, utilizing the decoupled lexicon subsystem.
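
The first three layers can be sketched as plain functions that turn a language code and a frame into a language-neutral sentence plan. This is only an illustration under stated assumptions: the `FAMILY_BY_LANGUAGE` table, the `BioFrame` field names, and the functions `route`, `choose_reference`, `select_construction`, and `plan_sentence` are hypothetical and are not taken from `router.py` or the actual codebase. Layers 4 and 5 (family engine and morphology) are deliberately left out.

```python
from dataclasses import dataclass

# Hypothetical routing table; the real router.py may resolve families differently.
FAMILY_BY_LANGUAGE = {"it": "Romance", "zu": "Bantu"}


@dataclass(frozen=True)
class BioFrame:
    """Language-independent frame (field names are illustrative assumptions)."""
    name: str
    profession: str
    gender: str


def route(lang_code: str) -> str:
    """Layer 1: resolve a language code to the name of its family engine."""
    try:
        return FAMILY_BY_LANGUAGE[lang_code]
    except KeyError:
        raise ValueError(f"no family engine registered for {lang_code!r}") from None


def choose_reference(already_mentioned: bool) -> str:
    """Layer 2: refer by pronoun once the entity is salient, otherwise by name."""
    return "pronoun" if already_mentioned else "name"


def select_construction(frame: BioFrame) -> str:
    """Layer 3: pick an abstract clause pattern; no morphology is known yet."""
    return "X is a Y"


def plan_sentence(lang_code: str, frame: BioFrame, already_mentioned: bool = False) -> dict:
    """Combine layers 1-3 into a plan that a family engine could realize."""
    return {
        "engine": route(lang_code),
        "reference": choose_reference(already_mentioned),
        "construction": select_construction(frame),
        "frame": frame,
    }
```

For example, `plan_sentence("it", BioFrame("Marie Curie", "physicist", "female"))` selects the `Romance` engine and a name reference, while passing `already_mentioned=True` switches the reference to a pronoun without changing the construction.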

The Hybrid Factory (Waterfall Logic)

To solve the "Long Tail" problem (supporting 300+ languages), the build system (`build_300.py`) employs a Waterfall Priority System to determine which grammar source to use for any given language.

| Priority | Source | Directory | Description |
|---|---|---|---|
| 1. Official RGL | High Road | `gf/gf-rgl/` | High-quality, expert-written grammars from the standard Resource Grammar Library (approx. 40 languages). This is the default if available. |
| 2. Contrib | Manual Override | `gf/contrib/` | If a community member writes a grammar for a missing language (e.g., Quechua), it is placed here. It automatically overrides the Factory version, allowing for incremental improvement. |
| 3. Factory | Automated Fallback | `gf/generated/` | The Language Factory (`grammar_factory.py`) automatically generates a "Pidgin" grammar (simple SVO concatenation) for any ISO code not found in Tier 1 or 2. This guarantees 100% API coverage. |
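
The priority order in the table above can be sketched as a small lookup function. This is a minimal sketch, not the logic of `build_300.py`: the function name `resolve_grammar_source` and the `available` argument (a mapping from tier name to the set of ISO codes that have a grammar in that tier) are assumptions made for illustration. How the real build script detects which grammars exist is not described here.

```python
# Tier order follows the table: official RGL first, then contrib overrides.
# Directory names come from the table; the tier keys are illustrative.
HAND_WRITTEN_TIERS = [
    ("rgl", "gf/gf-rgl/"),
    ("contrib", "gf/contrib/"),
]
FALLBACK_TIER = ("generated", "gf/generated/")


def resolve_grammar_source(iso_code, available):
    """Return (tier, directory) for the highest-priority grammar source.

    `available` maps a tier name to the set of ISO codes that have a
    hand-written grammar in that tier (an assumed input format).
    """
    for tier, directory in HAND_WRITTEN_TIERS:
        if iso_code in available.get(tier, set()):
            return tier, directory
    # No hand-written grammar exists: fall back to the generated one,
    # so every ISO code resolves to some source.
    return FALLBACK_TIER
```

Because the fallback is unconditional, adding a grammar to `gf/contrib/` changes the result for that language without any other code change, which is the incremental-improvement property described in the table.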

Key Design Decisions

The architecture is defined by three foundational decisions that separate it from standard renderers.

1. Family Engines vs. Language Engines

Instead of implementing 300 separate engines (one per language), the system implements ~15 family-level engines (Romance, Germanic, Slavic, Bantu, etc.).

  • Rationale: Languages within a family share core morphosyntactic patterns (e.g., Bantu noun classes, Romance gender/plural logic).
  • Benefit: Reduces code duplication: a fix in the `RomanceEngine` benefits French, Italian, Spanish, and Portuguese at once.
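
To make the sharing concrete, here is a minimal sketch of one family engine serving several languages. Only the name `RomanceEngine` appears in the text above; the `FamilyEngine` base class, the `noun_phrase` method, and the layout of the `CARDS` dictionary are hypothetical. The article tables are deliberately simplified (Italian `uno` and `un'`, for instance, are ignored).

```python
class FamilyEngine:
    """Base class: holds the per-language data, knows no family logic."""

    def __init__(self, lang_code, card):
        self.lang_code = lang_code
        self.card = card  # language-specific data, e.g. article forms


class RomanceEngine(FamilyEngine):
    """Family logic written once: articles agree with the noun's gender."""

    def noun_phrase(self, noun, gender):
        article = self.card["indefinite_article"][gender]
        return f"{article} {noun}"


# Simplified, illustrative language cards (real cards would hold far more).
CARDS = {
    "fr": {"indefinite_article": {"m": "un", "f": "une"}},
    "es": {"indefinite_article": {"m": "un", "f": "una"}},
    "it": {"indefinite_article": {"m": "un", "f": "una"}},
}

engines = {code: RomanceEngine(code, card) for code, card in CARDS.items()}
```

Here `engines["fr"].noun_phrase("physicienne", "f")` and `engines["es"].noun_phrase("física", "f")` run the same method with different data, so a correction to `noun_phrase` reaches every language that has a card.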

2. Data-Driven Morphology

Morphological rules are strictly separated from code. They are stored in JSON matrices (`data/morphology_configs/*.json`) and Language Cards.

  • Rationale: Allows non-programmers (linguists) to edit language behavior without touching Python code.
  • Benefit: Enables the "crowdsourcing" of Language Cards (e.g., "Here is the suffix table for Catalan") which can be plugged into the existing Romance Engine.
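
As an illustration of rules living in data rather than code, the sketch below loads a suffix table from JSON and applies it with a generic function. The JSON schema (`plural_suffixes`, `ending`, `replace_with`) and the function `pluralize` are assumptions invented for this example; the actual format of the files in `data/morphology_configs/` is not shown in this document. The Italian rules are a simplification that ignores irregular nouns.

```python
import json

# Hypothetical language card; a linguist could edit this without touching Python.
CARD_JSON = """
{
  "language": "it",
  "family": "romance",
  "plural_suffixes": [
    {"ending": "o", "replace_with": "i"},
    {"ending": "a", "replace_with": "e"},
    {"ending": "e", "replace_with": "i"}
  ]
}
"""


def pluralize(noun, card):
    """Apply the first suffix rule whose ending matches the noun."""
    for rule in card["plural_suffixes"]:
        ending = rule["ending"]
        if noun.endswith(ending):
            return noun[: len(noun) - len(ending)] + rule["replace_with"]
    # No rule matched: treat the noun as invariant.
    return noun


card = json.loads(CARD_JSON)
```

With this card, `pluralize("libro", card)` yields `libri` and `pluralize("casa", card)` yields `case`; a noun such as `città`, which matches no rule, is returned unchanged. Supporting another Romance language would mean supplying a different table, not a different function.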

3. Decoupled Lexicon Subsystem

The lexicon is not hidden inside the engines. It exists as a standalone package (`lexicon/*`) that interfaces with Wikidata.

  • Rationale: Lexical data (words) should be kept separate from grammatical data (rules).
  • Benefit: Allows for offline lexicon management, large-scale coverage reports, and reuse of the lexicon across different engines and constructions.
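
A standalone lexicon of this kind might expose lookup and coverage reporting behind a small interface, as in the sketch below. The `Lexicon` class, its methods, and the concept keys are hypothetical; the in-memory `shards` dictionary stands in for sharded JSON files, and the Wikidata integration is not modelled at all.

```python
class Lexicon:
    """Engine-independent word store: {lang: {shard_name: {concept: lemma}}}."""

    def __init__(self, shards):
        self._shards = shards

    def lookup(self, lang, concept):
        """Return the lemma for a concept, or None if no shard has it."""
        for entries in self._shards.get(lang, {}).values():
            if concept in entries:
                return entries[concept]
        return None

    def coverage(self, lang, concepts):
        """Fraction of the given concepts that have a lemma in `lang`."""
        concepts = list(concepts)
        if not concepts:
            return 1.0
        found = sum(1 for c in concepts if self.lookup(lang, c) is not None)
        return found / len(concepts)


# Illustrative data; the concept keys are placeholders, not real identifiers.
lexicon = Lexicon({"fr": {"people": {"physicist": "physicien", "chemist": "chimiste"}}})
```

Because nothing here depends on an engine, the same object can answer `lexicon.lookup("fr", "physicist")` for a renderer and `lexicon.coverage("fr", ["physicist", "chemist", "astronaut"])` for an offline report.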

Directory Structure

The project organizes assets to clearly separate Official, Manual, and Generated sources:

/abstract-wiki-architect/
├── architect_http_api/    # The Python Backend (FastAPI/Flask)
├── gf/
│   ├── gf-rgl/            # [Tier 1] Official RGL (Git Submodule)
│   ├── contrib/           # [Tier 2] Manual Overrides (Community contributions)
│   ├── generated/         # [Tier 3] Factory Output (Auto-wiped & Re-generated)
│   ├── build_300.py       # The Orchestrator Script (Compiles Wiki.pgf)
│   └── Wiki.pgf           # The Compiled Binary containing all 300+ languages
├── data/
│   ├── morphology_configs/# Shared Family Matrices (e.g., romance_matrix.json)
│   └── lexicon/           # Sharded JSON Lexicons (e.g., /fr/people.json)
└── utils/
    └── grammar_factory.py # The Language Factory Script (Generates Tier 3)