Abstract Wikipedia/Function Evaluation for Wikifunctions


This document describes the three repositories involved in function evaluation for Wikifunctions. It calls for a function-schemata library repository, which will provide a language-agnostic representation of Abstract_Wikipedia/ZObjects along with hooks in all supported programming languages. Validation, testing, and interfaces for function-evaluator and function-orchestrator will all consult this single representation, allowing all supported languages to understand the same request/response formats. The function-evaluator and function-orchestrator repositories will depend on function-schemata. function-evaluator will provide a sandboxed service for code execution in every supported language. function-orchestrator, a user-facing service, will ingest requests for function execution and marshal all required resources, like Wikidata entities, stored function implementations, etc.

function-schemata

Purpose

This repository will provide common validation code used by both function-orchestrator and function-evaluator. This validation will allow all supported languages to understand and operate on ZObjects in a universal JSON format.

The schemata in this repository have several uses. Some Wikifunctions functionality will need to be implemented in multiple programming languages; schemata can help test that these overlapping implementations honor the same API contract. Serialization, deserialization, validation, normalization, and canonicalization are all performed in (at least) the PHP WikiLambda extension and the function-orchestrator.

Similarly, the UI can avoid ad hoc heuristics for type inference, relying instead on schema validation. Both in testing and at runtime, a central source of truth allows us to dispense with incidental validation checks: if an object validates against the expected schema, it can safely be assumed that that object satisfies the API or function contract.

Structure

The core of this repository, its beating heart, will be a set of .yaml files following the OpenAPI 3.0 specification. These files will describe the formats of various objects, e.g. function calls, canonicalized ZObjects, and normalized ZObjects. For example, the function evaluator call documented here can be described by an OpenAPI spec resembling this one. Note that many of the enum fields here are only partial. In particular, argument and return types are necessarily incomplete, being semi-open classes which will include community-defined types. We will eventually need a mechanism to keep these enums up to date.

This specification will allow for easy interoperability with Swagger UIs, which are already well-supported within MediaWiki services. The orchestrator and evaluator repositories' Swagger specification files can simply import the function schema spec. Additionally, the function schema spec will improve maintainability in the dependent orchestrator and evaluator repositories by formalizing the contract between these two services.

For each supported language, we will write (or find among reputable open-source repositories) an OpenAPI validator. The validators will be responsible for checking the syntactic and semantic correctness of requests and responses passed between the orchestrator and the evaluator.

In practice, validators will be initializable by ZID and expose a single method, "validate." Example JavaScript usage:

let factory = SchemaFactory.NORMAL();
let validatorForZ10 = factory.create('Z10');
// returns a Boolean; logs helpful messages if validation fails
validatorForZ10.validate(zobject);

Alternatives

The alternatives below have been discussed and rejected but are included for completeness.

We could instead use something like protocol buffers or Apache Thrift to maintain a language-agnostic representation of ZObject. This would reduce the amount of per-language code we need to write, since both of these formats natively support (among other things) JSON serialization.

However, because these two formats do not natively interoperate with OpenAPI, we would have to write a rather hairy piece of code that uses protocol buffer introspection/reflection to generate a .yaml specification if we want Swagger interfaces. Additionally, this route would introduce a very heavyweight dependency where such gravity is probably not called for.

The Future

This repository could also provide a mapping from readable aliases to ZObject-style identifiers. This can be accomplished easily with a .json config mapping human-readable aliases in various languages to ZObject identifiers, e.g.

{
    "es": {
        "ZID_TIPO": "Z1K1",
        ...
    },
    "en": {
        "ZID_TYPE": "Z1K1",
        ...
    },
    "de": {
        "ZKNG_TYP": "Z1K1",
        ...
    }
}

along with code to labelize a JSON object by replacing its keys with human-readable equivalents. These labels can automatically be extracted from the wiki as the community adds aliases.
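
As a rough sketch, such a labelize helper could walk a ZObject and replace its keys using the alias map above (the function shape and names here are illustrative, not a settled API):

// A minimal sketch of key labelization, assuming an alias map shaped like the
// JSON above ({ language: { ALIAS: ZID } }). All names are illustrative.
function labelize(zobject, aliasMap, language = 'en') {
    // Invert { ALIAS: ZID } into { ZID: ALIAS } for the requested language.
    const zidToAlias = {};
    for (const [alias, zid] of Object.entries(aliasMap[language] || {})) {
        zidToAlias[zid] = alias;
    }
    if (Array.isArray(zobject)) {
        return zobject.map((element) => labelize(element, aliasMap, language));
    }
    if (zobject !== null && typeof zobject === 'object') {
        const labelized = {};
        for (const [key, value] of Object.entries(zobject)) {
            labelized[zidToAlias[key] || key] = labelize(value, aliasMap, language);
        }
        return labelized;
    }
    return zobject;
}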

This is not strictly necessary for code maintenance or performance, but it will improve the readability of function-orchestrator and function-evaluator code. It will be much easier to maintain and reason about code like

zobject[ZID_ARGUMENTS][0][ZID_ARGUMENT_TYPE] == ZID_POSITIVE_INTEGER

than code like

zobject["Z8K1"][0]["Z17K1"] == "Z70"

Following the above example, this functionality could easily be part of the schema validator:

let validator = new SchemaValidator();
// ingests function schema spec
validator.parse("function-schema.yaml");
// returns a Boolean; logs helpful messages if validation fails
validator.validate(zobject);
console.log(zobject[ZID_TYPE]);  // > Z7; ZID_TYPE is equal to "Z1K1"
console.log(labelize(zobject[ZID_TYPE]));  // > ZID_FUNCTION_CALL

It is not clear how much of this is feasible--in particular, it will be difficult to support the constants described above in more than one language. Discussion of whether--and how much--to implement i18n-friendly aliasing within the code base itself can be deferred until well after launch.

function-evaluator

Purpose

Provide sandboxed execution environments which process requests from the function-orchestrator.

Containerization and Security

Execution engines will be capable of running arbitrary code. This can be done as simply as executing code at runtime (e.g. with "eval" or "exec"), with all the ceremony of running a subprocess, by running higher-level code within lower-level C bindings (e.g. V8, embedded Python, a Lua sandbox), or even with a series of subprocesses for languages which require a compile step. These options are discussed in more depth below.

No matter what, the execution environment will be able to run more-or-less arbitrary code, so various mechanisms will be necessary to isolate it and avert security catastrophes.

Execution engines should run in isolated environments. Containerization with Docker seems a natural choice. This is not a line of defense on its own given the prevalence of container escape bugs, but containers will facilitate the implementation of defensive layers.

AppArmor, seccomp, Firejail, and other technologies will be crucial for minimizing the attack surface of the execution engines. Together, these technologies can eliminate file system access, prohibit network requests, and automatically kill processes which violate constraints on resource use or run duration. We will whitelist only the system calls required by specific programming languages and prohibit all others by default.

Function Execution

MVP

Initially, the execution engine will consist of a single REST service which ingests a function call request; extracts the code and inputs from the request; runs the code; and returns the result. function-schemata will be responsible for validating both the incoming request and the response as ZObjects.

The service itself is a lightweight wrapper requiring very little overhead functionality or resources: it will not be able to access databases or call external services, for example. It can be implemented with a minimal framework (e.g. Flask) to expose the single REST endpoint needed to run code snippets and return the results.
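
The paragraph above suggests Flask for a Python implementation; purely as an illustration of how thin this wrapper can be, an equivalent sketch using Node's built-in http module might look like the following (the endpoint path, port, and payload shape are assumptions):

// A minimal sketch of the evaluator's single endpoint. runInSubprocess is
// sketched below; the path, port, and payload shape are illustrative.
const http = require('http');

const server = http.createServer((request, response) => {
    if (request.method !== 'POST' || request.url !== '/evaluate') {
        response.writeHead(404);
        response.end();
        return;
    }
    let body = '';
    request.on('data', (chunk) => { body += chunk; });
    request.on('end', () => {
        // The body is a validated function call request containing code and inputs.
        const { code, inputs } = JSON.parse(body);
        runInSubprocess(code, inputs).then((result) => {
            response.writeHead(200, { 'Content-Type': 'application/json' });
            response.end(JSON.stringify(result));
        });
    });
});

server.listen(6254);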

In order to execute code, the execution engine will create a forked subprocess. The subprocess will wrap the code provided by the orchestrator so that the inputs and outputs can be shared between the calling service and the subprocess. If some file operations can be permitted, the subprocess can read arguments and write return values to a file; if not, the subprocess will interact with the calling service via piped stdin/stdout.

We specify that the subprocess will be forked in order to facilitate monitoring: a forked subprocess will have its own PID and can thus be independently killed if it exceeds memory or runtime constraints.
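
A rough sketch of that workflow, assuming a Node implementation of the service and the piped-stdin/stdout variant (the wrapper script name, interpreter, and time limit are all illustrative):

// Run the supplied code in a separate subprocess, passing inputs over stdin
// and reading the serialized result from stdout. 'python3 wrapper.py' stands
// in for the interpreter and wrapper appropriate to the requested language.
const { spawn } = require('child_process');

function runInSubprocess(code, inputs, timeoutMs = 5000) {
    return new Promise((resolve, reject) => {
        const subprocess = spawn('python3', ['wrapper.py']);

        // The subprocess has its own PID, so it can be killed independently if
        // it exceeds its run-duration constraint.
        const timer = setTimeout(() => subprocess.kill('SIGKILL'), timeoutMs);

        let stdout = '';
        subprocess.stdout.on('data', (chunk) => { stdout += chunk; });
        subprocess.on('close', (exitCode) => {
            clearTimeout(timer);
            if (exitCode !== 0) {
                reject(new Error(`subprocess exited with code ${exitCode}`));
                return;
            }
            resolve(JSON.parse(stdout));
        });

        // The wrapper reads { code, inputs }, runs the code, and prints the result.
        subprocess.stdin.write(JSON.stringify({ code, inputs }));
        subprocess.stdin.end();
    });
}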

The Future

Continuously spinning up subprocesses may introduce prohibitively high resource requirements or latency. If that becomes an issue, we can later replace the subprocess-based workflow described above with one involving low-level bindings. Technologies such as sandboxed Lua, V8, and embedded Python would allow a single service, implemented in C/C++, to execute code in any of the supported languages without the temporal and spatial overhead of invoking other binaries. Alternatively, if we want to optimize for complex function compositions (instead of very fast execution), we may consider executing code within Jupyter kernels. See "spurious reentrance" below for a thorough treatment of a full-featured function composition architecture.

Function Composition

Milestone 1

The first iteration of function composition will impose the constraint that only built-in functions may process the results of other functions. Built-in functions will not require access to the ZObject database or any other external resource; instead, they will execute as server-side, in-process JavaScript code. They will begin life as part of the function-orchestrator library, but they will eventually need to move to the function-evaluator when/if it becomes possible for arbitrary Wikifunctions to call each other.

Since the inputs to a function call can themselves be function calls, this model allows for limited function composition. Arbitrary recursion will not be possible, since all nested calls will be specified as part of the initial function call request.

In this iteration, a complex function call will be a tree. Evaluation will consist of a post-order traversal, where the return values of child nodes become the inputs to superordinate ones.
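
A rough sketch of that traversal, assuming a complex call is represented as nested nodes of the form { function, args } (the shapes and names here are illustrative, not the real ZObject model):

// Milestone 1 evaluation: a post-order traversal in which only built-in
// functions may process the results of other calls.
const builtins = {
    // Built-ins execute as in-process JavaScript functions.
    Z20: (left, right) => left + right,  // a hypothetical 'addition' built-in
};

function evaluateCall(node) {
    // Leaves that are not function calls are concrete values already.
    if (node === null || typeof node !== 'object') {
        return node;
    }
    // Post-order: resolve each argument (possibly itself a call) first.
    const resolvedArgs = node.args.map(evaluateCall);
    // The return values of child nodes become the inputs to this node.
    return builtins[node.function](...resolvedArgs);
}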

Milestone 2

From there, it will be straightforward to add limited support for non-built-in functions (i.e., functions implemented in supported languages and executed by the function-evaluator). Non-built-ins will not initially be able to invoke nested calls to other Wikifunctions. Thus, a complex function call will still be a tree whose non-terminal nodes are constrained to be built-in functions. However, in this iteration, leaves may be arbitrary functions. Each leaf function (if it is not a built-in) will occasion a request to the function-evaluator, retrieving a return value and propagating it up as input to the superordinate function.
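
A sketch of that extension: leaves that are not built-ins trigger a request to the function-evaluator (the evaluator URL and payload shape below are assumptions, and the Milestone 1 traversal is otherwise unchanged):

// Leaf evaluation in Milestone 2: built-ins still run in-process, while other
// leaves are sent to the function-evaluator. URL and payload are illustrative.
async function evaluateLeaf(node) {
    if (builtins[node.function]) {
        return builtins[node.function](...node.args);
    }
    // Non-built-in leaf: ask the function-evaluator to run the stored
    // implementation, then propagate the result to the superordinate call.
    const response = await fetch('http://function-evaluator/evaluate', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ function: node.function, args: node.args }),
    });
    return response.json();
}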

The Future

It may well be desirable to support arbitrarily nested non-built-in function calls in future. There are several ways to achieve this, discussed below. The forthcoming examples set forth the attributes of various implementation paths with a particular eye on polyglot mutual recursion, i.e. a situation in which functions written in different programming languages depend on each other's results, as this is the most difficult kind of arbitrary function composition to support.

Example 1: Easy Mode

Let's say a community member wants to do a "pure" Wikifunctions implementation of the Pythagorean theorem.

Dramatis Personae

Z1414 is the 'square root' function. Z20 is the 'addition' function. Z2 is the 'square' function. Z10 is the 'Pythagorean theorem' function. Z10K0 is Z10's return value; Z10K1 and Z10K2 are its arguments.

Note that these ZIDs are fabricated to protect the participants' identities.

In ZPseudocode™ (or Python, or JS, or …), the Pythagorean theorem then looks like

Z10K0 = Z1414( Z20( Z2( Z10K1 ), Z2( Z10K2 ) ) )

or, more legibly,

result = square_root( add( square( arg_1 ), square( arg_2 ) ) )

Obviously, this case is a bit contrived--people will probably just use each implementation language's native math capabilities for many of these operations--but the reader is entreated to have faith that community members will, in fact, desire some kind of function composition in less trivial cases. Note that these functions may all be implemented in different programming languages: perhaps the "square" function is implemented in JavaScript, the "add" function in Python, Z10 itself in PHP, etc.

The function orchestrator needs to determine in what order to call the functions. Here, assuming appropriate metadata is supplied to the function orchestrator, static analysis would reveal that the calls can be resolved in a bottom-up fashion. Thus, since this example doesn't involve anything tricky, the bottom-up composition model described above largely still works: for each function call, the function orchestrator sends an implementation and some inputs to the function evaluator, then replaces the statically analyzed function call with its return value.
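
To make the bottom-up order concrete, the call above can be pictured as a tree of nested calls (the { function, args } shape below is purely illustrative):

// The Pythagorean call, expressed as data the orchestrator can resolve
// bottom-up; this is not the real ZObject representation.
const pythagoreanCall = {
    function: 'Z1414',  // square root
    args: [{
        function: 'Z20',  // addition
        args: [
            { function: 'Z2', args: ['Z10K1'] },  // square of the first argument
            { function: 'Z2', args: ['Z10K2'] },  // square of the second argument
        ],
    }],
};
// Static analysis shows the two Z2 calls can be evaluated first, then Z20,
// then Z1414, each return value replacing the call that produced it.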

Interlude

The example above works when (and only when) the inputs of "lower" function calls can be known a priori, i.e., when they do not depend at all on data defined within the calling function. Something as simple as

Z10K0 = Z2( Z10K1 + 1 )

could not work with the execution order described above, since the addition is defined within the calling function, meaning that the called function (Z2) could not be executed beforehand.

A solution to this problem should also address more complicated situations like (mutual) recursion.

This could be addressed in a number of ways. The major tradeoff here will be between feature completeness and ease of maintainability: flexibility in the user-facing system necessitates complexity of implementation. In the below possibilities, concepts like returning a value to the caller are intentionally treated as an implementation detail. Depending on the constraints imposed by the sandboxing environment (e.g., presence of a file system), "returning a value" might look like printing a serialized value to stdout for the subprocess's originator to capture, writing a value to a file, etc.
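
For instance, under the stdout convention, "returning a value" might amount to nothing more than the following (a sketch, assuming the calling service captures the subprocess's output stream):

// The subprocess prints a serialized return value for its originator to capture.
process.stdout.write(JSON.stringify({ Z2K0: '4' }));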

Writing and Importing Source Files

The easiest solution here is to dump each function to a source file (or all functions to the same file), then use the implementation language's own native import/require functionality. This would allow absolute freedom with respect to function composition. Even the static analysis step described in Example 1 would be unnecessary. All this comes at the steep cost of precluding multi-language function calls, making this path undesirable.

Piping and Tail Calls

Given that function calls will be executed within subprocesses, it is natural to think of a function call as akin to a Linux process, and of function input and output as akin to Linux pipes. Specifically, a function could make arbitrary (even mutually recursive) calls to other functions if and only if the recursive call is the final statement in the function.

As an example, assume that Z2 is the "square" function. Z2K0 is its return value and Z2K1 is its argument. Z2 might have an implementation like

Z2K0 = Z2K1 * Z2K1

and might naively return to the caller something like

{ "Z2K0": "4" }

when Z2K1 is 2. Assume further that Z10 adds one to its argument, then squares the result; its implementation might look like

Z10K0 = Z2( Z10K1 + 1 )

and return to its caller

{ "Z10K0": "9" }

when Z10K1 is 2. Adopting the piping/tail call mechanism, return values would look a little different. A simple return value (like that of Z2) might look like

{ "Z2K0": { "return_value": "4" } }

while a tail call, like Z10's return value, might look like

{ "Z10K0": { "tail_call": { "function": } "4" } }

This approach has the benefit of being relatively simple to implement. A small amount of static analysis would be needed to recognize nested function calls and propagate informative errors. The disadvantage here is that allowing only tail calls imposes a somewhat esoteric constraint on community implementers. While one can prove that any problem involving function composition can be solved with only tail calls, actually writing functions in an intuitive way under that constraint is another matter.
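
As a sketch of how the calling service might drive this, assuming the wrapped return shapes above and a hypothetical runImplementation helper that runs a stored implementation in a subprocess:

// A "trampoline" in the calling service: keep dispatching tail calls until a
// concrete return value appears. runImplementation is hypothetical, and the
// key names follow the examples above.
async function evaluateWithTailCalls(functionId, args) {
    let wrapped = await runImplementation(functionId, args);
    // Unwrap the single return-value key (e.g. "Z2K0" or "Z10K0").
    let result = Object.values(wrapped)[0];
    while (result.tail_call) {
        wrapped = await runImplementation(result.tail_call.function, result.tail_call.args);
        result = Object.values(wrapped)[0];
    }
    return result.return_value;
}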

Spurious Reentrance (Alternate Titles: "Faux Asynchrony," "Broken Promises")

The most flexible solution from the community perspective is, of course, the most intricate and fraught to implement. In this case, when a function sees a nested call, it freezes its own execution while waiting for a value to be populated.

There is a way to do this without making every function a coroutine. Recall that a function call will be operationalized as a subprocess. Let the "subprocess wrangler" be the top-level process which manages the subprocesses in which implementations run. The subprocesses receive inputs from the wrangler via piping of stdin or by reading from a file, and, conversely, they produce outputs by piping to stdout or writing to a file. For clarity, the below description will assume file reads/writes as the method of inter-process communication, but the stdin/stdout solution is also possible.

Consider again the functions Z2 and Z10 as described above. Assume that Z10 is implemented in Python and Z2 in JavaScript. Static analysis (or information from the function orchestrator) can tell the execution engine that the code for Z10 will need an implementation for Z2. Using only Z2's signature (which is defined in the call from the orchestrator and requires no static analysis of code), a stub implementation can be generated in Python. The stub might look something like this:

import json
import os
import time

# generate_uuid, send_signal, and unique_file_name are supplied by the
# wrangler-generated boilerplate described below.
def Z2(Z2K1):
    other_file_name = generate_uuid()
    requested_call = {
        'function_name': 'Z2',
        'args': [Z2K1],
        'place_result': other_file_name}
    # Serialize the requested call where the wrangler can find it.
    with open(unique_file_name, 'w') as outp:
        json.dump(requested_call, outp)
    send_signal()
    # Spin-wait until the wrangler writes the result.
    while True:
        time.sleep(1)
        if os.path.exists(other_file_name):
            with open(other_file_name, 'r') as inp:
                result = json.load(inp)
            os.remove(other_file_name)
            return result

In other words, the wrangler will generate a unique file name ("unique_file_name") (or embedded unique IDs if using stdin/stdout) for each subprocess. These unique IDs will allow for inter-process communication. Here, when the Python implementation of Z10 sees something like Z2( Z10K1 + 1), the stub implementation ensures that the Python subprocess will spin-wait while polling for information in "other_file_name".

Likewise, "send_signal" will send a signal to the wrangler (or, if this is infeasible, the wrangler can simply poll for changes to "unique_file_name"). This will cue the wrangler to spin up another subprocess for Z2, using the information serialized in "unique_file_name." Once the wrangler receives a result from Z2, "place_result" will ensure that the result goes in the location expected by Z10's subprocess. Z10 will then continue execution. If you squint, this looks kind of like an asynchronous call or a JS promise, whence the proposed nomenclature.

The disadvantage of this route is that it relies on some unpleasant polling and spin-locking. The workflow described above could probably be refined (in particular, there are probably more sophisticated ways to manage inter-process communication), but this amount of context-switching is bound to introduce latency. This route also involves a fair amount of boilerplate in each function call, meaning that certain names (in the above example, Z2) will be defined invisibly to the user and certain modules (here, Python's "time") will be included/required/imported. Every supported programming language will need to know how to generate this boilerplate and, possibly, how to implement inter-process communication.

The advantage, of course, is complete freedom to implement nested function calls. This implementation path does not set arbitrary limits on contributor-defined code, as would happen with tail calls; nor does it constrain composition to take place only within the same programming language, as importing/including would.