Architecture guidelines

This document describes architecture guidelines that all MediaWiki developers should consider when they make changes to the MediaWiki codebase, especially code meant to run on Wikimedia sites. It is a work in progress. Its current form is the result of discussions among developers on IRC, in person, and on the wikitech-l mailing list.

This guide is more concrete than the high-level Architecture Principles, and ties together the performance guidelines, design style guidelines, and security guidelines.

This page reflects MediaWiki as it is now and does not get into the details of how the overall architecture of MediaWiki can be improved; for that, see phab:T96903.

Process for implementing architecture changes

Incremental change

MediaWiki core changes incrementally. Third-party users' time is important, as is backwards compatibility. Be careful about taking on complete rewrites of large sections of code: they are usually harder than they appear, and are often abandoned.

The design principles listed below are meant to last a long time. To discuss changing them or adding more, see phab:T96903.

Good example: The transition from the cur/old tables to page/revision/text, which put an object-oriented intermediary interface in front of revisions. This enabled extension to sharded external storage, and various funky compression techniques, with relatively little change to the rest of the code. That’s a case where polymorphic objects were chosen over hooks, and it worked out well.[1] The storage layer may now be further decoupled into a separate service, which feels like it won’t be too disruptive, thanks to good design choices ten years ago.

The transition from cur/old tables to page/revision/text happened around 2004–2005. It was probably the first really successful big refactor. The horror of the pre-refactored code helped make it go well, since it vividly demonstrated all the potential pitfalls of tight coupling. The new code structure then abstracted away most of the actual table access, which turned out to be a very extensible solution.
Brion Vibber did most of the work on the main code, while Tim Starling worked on the compression abstraction that was later extended to external storage; the work took a few months. The actual data table conversion took a few days on the biggest wikis.

Bad example: The authentication/authorization interface was put together without a clear idea of the requirements. It worked acceptably for an initial version of CentralAuth, which went live around 2008, but had weaknesses for LDAP and other uses: the interface lacked flexibility and made assumptions about data flow. That poor interface has made subsequent authentication fixes harder.

Introduction of new language features

Features introduced into the PHP language that impact architecture should be discussed (see examples below), and consensus reached, before they are widely adopted in the code base.

Rationale: Features added to PHP aren't necessarily suitable for large-scale applications in general, or for MediaWiki specifically. In the past, features have arrived with caveats or pitfalls that required workarounds (for example, late static binding, which is usually better avoided through a different design; see the sketch below). Experienced MediaWiki developers are in a good position to critically review new PHP features.
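
For illustration, here is a minimal, hypothetical sketch of the late static binding pitfall and a design that sidesteps it (the class names are invented for this example):

    // Late static binding: "static" resolves to the class the method was
    // called on, not the class where the method is defined.
    class Model {
        public static function create() {
            return new static(); // Article::create() returns an Article
        }
    }
    class Article extends Model {
    }

    $article = Article::create();

    // A design that avoids the feature: an explicit factory makes
    // construction visible, injectable, and easy to replace in tests.
    class ArticleFactory {
        public function newArticle() {
            return new Article();
        }
    }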

Examples of PHP features that aren't widely used yet in MediaWiki, but could be adopted (or rejected):

  • Method chaining (enabled by PHP 5)
  • __get() magic method
  • Late static binding (PHP 5.3)
  • Namespaces (PHP 5.3); see phab:T166010
  • Traits (PHP 5.4)
  • Generators (PHP 5.5)

Interface changes

An interface is the means by which modules communicate with each other. For example, two modules written in PHP may have an interface consisting of a set of classes, their public method names, and the definitions of their parameters and return values.
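
For instance, a hedged sketch of such an interface (the class and method names here are hypothetical, not actual MediaWiki code):

    /**
     * A module boundary expressed as a PHP interface: the method names,
     * parameter definitions and return values form the contract that
     * other modules depend on.
     */
    interface PageLookup {
        /**
         * @param string $title Page title in text form
         * @return array|null Associative page record, or null if no such page
         */
        public function getPage( $title );
    }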

Changes to interfaces that are known to be used by extensions should retain backwards compatibility where feasible. Where it is not, they must follow the deprecation policy. The rationale is:

  • To reduce the maintenance burden on extensions. Many extensions are unmaintained, so a break in backwards compatibility can cause the useful lifetime of the extension to end.
  • Many extensions are developed privately, outside Wikimedia Foundation's Gerrit, so fixing all extensions in the mediawiki/extensions/* tree does not necessarily eliminate the maintenance burden of a change.
  • Some extension authors have a policy of allowing a single codebase to work with multiple versions of MediaWiki. Such policies may become more common now that there are Long Term Support (LTS) releases.
  • MediaWiki's extension distribution framework contains no core version compatibility metadata, so a breaking change to a core interface typically results in a PHP fatal error, which is not especially user-friendly.
  • WMF's deployment system has only rudimentary support for a simultaneous code update in core and extensions.
  • When creating hooks, try to keep the interfaces very narrow. Exposing a '$this' object as a hook parameter is poor practice, and has caused trouble as code has moved from being UI-centric into separate API modules and the like (see the sketch below).
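
As an illustrative sketch (the hook name and variables are hypothetical, using the older Hooks::run() calling style):

    // Poor practice: handlers can call anything on $this, coupling them to
    // the internals of the calling class.
    Hooks::run( 'ArticleSavedExample', [ $this ] );

    // Better: a narrow interface passing only the values handlers need,
    // so the caller can be refactored without breaking extensions.
    Hooks::run( 'ArticleSavedExample', [ $title, $user, $summary ] );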

Good examples:

  • File storage: When the file storage system was redone, most of it was abstracted away so that front-end code never had to touch storage directly. Of course, some code did have to actually touch files, and it was updated to use the new storage system bit by bit: first primary images and thumbnails, then things like math image generation.
  • Notifications: Notifications (formerly Echo) is being dropped in as a supplementary layer, without fully replacing the user talk page notification system. Eventually the old bits will probably be dropped and the two merged fully. (It would be even better to systematically notify users of the old system about the changeover, with a public comment, migration, and revision period.)
  • ResourceLoader: When ResourceLoader was added, some scripts were initially still loaded in the legacy fashion, and there was a fairly long transition period during which site scripts and gadgets were fixed up to play better with ResourceLoader.

Requests for comment (RfC)

An RfC is a request for review of a proposal or idea to change the basic structure of MediaWiki. RfCs are reviewed by the community of MediaWiki developers. Final decisions on RfC status are made by the Wikimedia Technical Committee.

Filing an RfC is strongly recommended before embarking on a major core refactoring project.

Data-driven change

Understand the parts of the overall Wikimedia infrastructure that your change would touch.

Do your homework before suggesting a change, so other people can check your math. And after you've made the change, repeat your benchmarks to check whether you've succeeded.

Design principles

Cf. the "end result" principles. MediaWiki and related code are intended to meet the following principles.

Secure

The privacy of users' data is important; see Security for developers/Architecture.

Efficient

Users should be able to perform most operations within two seconds. Please see the performance guidelines.

Multilingual

The software should empower people speaking, reading, and writing all human languages. New MediaWiki code must be internationalised and localisable; see Localisation for how to do this.
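
For example, user-visible text should go through the message system rather than being hard-coded (the message keys below are hypothetical):

    // Fetch a translatable message by key; translators supply the
    // per-language text via the localisation system.
    $label = wfMessage( 'myextension-summary-label' )->text();

    // Substitute a parameter into the localised message.
    $greeting = wfMessage( 'myextension-greeting' )->params( $userName )->text();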

Separation of concerns — UI and business logic

It is generally agreed that separation of concerns is essential for a readable, maintainable, testable, high-quality codebase. However, opinions vary widely on the exact degree to which concerns should be separated, and on which lines the application should be split.

MediaWiki began in 2002 with a very terse style in which "business logic" and UI code were freely intermixed. This style produced a functional wiki engine with a small outlay of time and only 14,000 lines of code. Although the MediaWiki core now weighs in at some 235,000 lines, the marks of the original style can still be seen in important areas of the code base. That design is clearly untenable as the core of a large and complex project.

Many features have three user interfaces:

  • Server-generated HTML
  • The HTTP API, i.e. api.php. This is used both as a UI in itself (action=help etc.) and as an interface for client-side UIs.
  • Maintenance scripts

Currently, these three user interfaces are supported by means of one of the following:

  • A pure-PHP back end library
  • Having one UI wrap another UI (internal requests to the HTTP API)
  • Code duplication

The preferred way is to construct pure-PHP backend libraries that model application logic independently of access mechanism or representation. While the idea of making internal API requests via FauxRequest has some support among engineers, the general consensus is that the disadvantages outweigh the benefits (see the RfC discussion at phab:T169266).
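
A minimal sketch of this shape (all names here are hypothetical): the application logic lives in a plain PHP service, and each access mechanism is a thin adapter over it.

    // Pure-PHP backend: knows nothing about HTML, HTTP requests, or the
    // command line, so all three UIs can share it.
    class PageRenameService {
        public function rename( $oldTitle, $newTitle, $reason ) {
            // ... validate, update storage, log ...
            return true;
        }
    }

    // Thin adapters over the same service:
    // - a special page calls $service->rename() and renders HTML;
    // - an API module calls it and serialises the result to JSON;
    // - a maintenance script calls it from the command line.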

Advantages of wrapping the HTTP API:

  • Certain kinds of batching are naturally supported. Some existing pure-PHP interfaces suffer from a lack of batching, for example, Revision. The necessary refactoring would require a non-trivial amount of resources.
  • The HTTP API provides a boundary for splitting across different servers or across daemons running different languages. This provides a migration path away from PHP, if this is desirable.

Disadvantages of wrapping the HTTP API:

  • Loss of generality in the interface, due to the need to serve both internal and external clients. For example, it is not possible to pass PHP objects or closures.
  • The inability to pass objects across the interface has various implications for architecture. For example, in-process caches may have a higher access latency.
  • Depending on implementation, there may be serialisation overhead. This is certainly the case with the idea of replacing internal FauxRequest-style calls with remote calls.
  • The command line interface is inherently unauthenticated, so it is difficult to implement it in terms of calls to functions which implement authentication. Similarly, some extensions may wish to have access to unauthenticated functions, after implementing their own authentication scheme.
  • More verbose calling code.
  • The resulting code is not idiomatic: it defies static analysis and makes it impossible to use the available PHP tooling. IDEs cannot be used to find function calls or to do automatic refactoring.

Consequently, internal API calls are considered technical debt in production code. They are, however, useful (and necessary) for testing, and may be used for prototyping.

Separation of concerns — encapsulation versus value objects

It has been proposed that service classes (with complex logic and external dependencies) be separated from value classes (which are lightweight and easily constructed). It is said that this would improve testability. The extent to which this should be done is controversial. The traditional position, most commonly followed in existing MediaWiki code, is that code should be associated with the data it operates on, i.e. encapsulation.

Disadvantages of encapsulation

  • The association of code with a single unit of data tends to limit batching. Thus, performance and the integrity of DB transactions are compromised.
  • For some classes, the number of actions which can be done on/with an object is very large or not enumerable. For example, very many things can be done with a Title, and it is not practical to put them all in the Title class. This leads to an inelegant separation between code which is in the main class and code which isn't.
  • The use of smart objects that are nevertheless constructed directly with the new operator tends to lead to singletons and global variables for request context. This makes unit testing more awkward and fragile. It also loses flexibility, since the relevant context cannot easily be overridden by callers.

Whether or not it is used in new code, it is likely that encapsulation will continue to be a feature of code incrementally developed from the current code base. The following best practices should help to limit the negative impacts of traditional encapsulation.

Encapsulation best practices

  • Where there is I/O or network access, provide repository classes with interfaces that support batching.
  • A global singleton manager should be introduced, to simplify control of request-lifetime state, especially for the benefit of unit tests. This should replace global, class-static and local-static object variables.
  • Limit the code size of "smart object" classes by splitting out service modules, which are called like $service->action( $data ) instead of $data->action() (see the sketch below).
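
A minimal sketch combining these practices (all names hypothetical): a repository interface that supports batching, and a service module with injected dependencies, called as $service->action( $data ).

    // Repository with a batch-friendly interface: one call (and ideally
    // one query) for many IDs, rather than one query per object.
    interface RevisionRepository {
        /**
         * @param int[] $ids
         * @return array Map of revision ID => revision record
         */
        public function getRevisionsById( array $ids );
    }

    // Service module: logic split out of the "smart object", with the
    // repository injected so that unit tests can substitute a fake.
    class RevisionRenderer {
        private $repository;

        public function __construct( RevisionRepository $repository ) {
            $this->repository = $repository;
        }

        // Called as $renderer->renderBatch( $ids ), not $revision->render().
        public function renderBatch( array $ids ) {
            $records = $this->repository->getRevisionsById( $ids );
            // ... render each record ...
            return $records;
        }
    }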

You aren't gonna need it

Do an inventory of currently available abstractions before you add more complexity.

Do not introduce abstraction in advance of need unless the probability that you will need the flexibility thus provided is very high.

This is a widely accepted principle. Even Robert C. Martin, whose "single responsibility principle" tends to lead to especially verbose and well-abstracted code, stated in the book Agile Principles, Patterns, and Practices in C#:

If, on the other hand, the application is not changing in ways that cause the two responsibilities to change at different times, then there is no need to separate them. Indeed, separating them would smell of Needless Complexity.
There is a corollary here. An axis of change is only an axis of change if the changes actually occur. It is not wise to apply the SRP, or any other principle for that matter, if there is no symptom.

An abstraction provides a degree of freedom, but it also increases code size. When a new feature is added which needs an unanticipated degree of freedom, the difficulty of implementing that feature tends to be proportional to the number of layers of abstraction that the feature needs to cross.

Thus, abstraction makes code more flexible in anticipated directions, but less flexible in unanticipated directions. Developers tend to be very poor at guessing what abstractions will be needed in the future. When abstraction is implemented in advance of need, the abstraction is often permanently unused. Thus, the costs are borne, and no benefit is seen.

References

  1. (The hooks may have come along later.)

See also