User:Jeblad/Technical solutions to dynamic imports

Technical solutions to dynamic imports is a page (later on an RfC) that describes alternative technical solutions for dynamic imports from external sites. There are several types of imports, of which this page only discusses some, and then only in the context of specific projects.

The reason why a technical solution is necessary in some cases is that the external content is dynamic in nature, and because of that we want a solution that does not imply continuous maintenance. That also implies that the sites disseminating information to our sites must be regarded as "safe" and "dependable": safe so we can be reasonably assured that other third parties can't interfere with the transfer of information, and dependable so we can be reasonably assured that they will in fact deliver the identified content timely, correctly and as promised.

Types of content

There are some identified types of content that seem to be important:

  1. Fragments of descriptive text – typical example is descriptive weather statistics generated as free text
  2. Updating photo streams – typical example is a photo stream from a wild animal shot by an automatic camera
  3. Evolving graphs – typical example is a graph showing amount of water in a river system
  4. Evolving maps – typical example is a map showing a flooded area with estimates and a now-state
  5. Evolving data – typical example is the actual now-value for the amount of water in the river system

Technical solutions

It should be clear from the start of the page that this is about continuously changing dynamic content. If the data is static or very slowly changing (over years) it is possible to upload it to Commons or Wikidata. Such an upload can even be done manually.

Upload to Commons

Upload to Commons usually implies use of bots or the GLAM toolkit. It should be possible to do automatic checking of the source if one is provided, but the source is often a link to the page where the image is shown and not to the image itself. That makes it necessary to identify the correct image before it can be automatically updated.

An automatic process could be implemented that uses semantic data from the file page, and if the source site is on a whitelist the file can be updated. The file must somehow be marked for updating, as files can be edited locally and should then not be overwritten. It could be possible to compare the sources, and if the last source is the external site, an automatic update can be allowed to proceed.
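
A rough sketch of how such a process could look, built on pywikibot and assuming a hypothetical opt-in marker template and whitelist (all names and URLs below are made up):

  import pywikibot

  ALLOWED_SOURCES = ('https://www.environment.no/',)   # hypothetical whitelist
  MARKER_TEMPLATE = 'Automatic update'                  # hypothetical opt-in template

  site = pywikibot.Site('commons', 'commons')
  page = pywikibot.FilePage(site, 'File:Example graph.png')   # illustrative file name

  # Only touch files that explicitly opt in through the marker template.
  has_marker = any(t.title(with_ns=False) == MARKER_TEMPLATE
                   for t in page.itertemplates())

  # The source URL would be read from structured data on the file page;
  # here it is hard-coded for the sake of the example.
  source_url = 'https://www.environment.no/example-graph.png'

  if has_marker and source_url.startswith(ALLOWED_SOURCES):
      # Overwrite the cached copy with the current version from the source.
      # (Upload by URL requires that the wiki permits it.)
      page.upload(source_url,
                  comment='Automatic update from whitelisted external source',
                  ignore_warnings=['exists'])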

Upload to Wikidata

Upload to Wikidata usually implies use of bots, but at some point it should be possible to reference resources or materialize data from such resources. Until resources can be referenced, only bots can do automatic updating.
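
Until such referencing exists, a bot doing the updating could look roughly like the sketch below (pywikibot, with a made-up item ID and an illustrative quantity property):

  import pywikibot

  site = pywikibot.Site('wikidata', 'wikidata')
  repo = site.data_repository()

  item = pywikibot.ItemPage(repo, 'Q1234567')      # made-up item
  item.get()

  PROPERTY = 'P1234'                               # illustrative quantity property
  new_value = pywikibot.WbQuantity(amount=123.4, site=repo)

  claims = item.claims.get(PROPERTY, [])
  if claims:
      claims[0].changeTarget(new_value)            # update the existing statement
  else:
      claim = pywikibot.Claim(repo, PROPERTY)
      claim.setTarget(new_value)
      item.addClaim(claim)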

Caching serverside

This is content and files that have been cached server side. There are basically two different solutions, one that works for media files and one that works for other content (text and data). The latter can be stored as files or as entries in the database, whatever is more efficient and secure. Media files will typically be kept as ordinary files, and they will also have a separate page where they can be annotated.

Import of external content and files based upon server-side caching will not create a privacy problem for the client.

Caching media files

There is an existing integration layer in MediaWiki called mw:InstantCommons that can be used for integration against external sites exporting media files. (There is some documentation at mw:Manual:$wgUseInstantCommons and mw:Manual:$wgForeignFileRepos.) This can be used to supply files from a separate media store, given that the store can supply files, thumbnails and descriptions. It should probably also supply a history list, but if not, a local history should be kept on the local site. Such a history can be built locally by detecting changes to the cached media files. That can be done by hashing the file as is, or with an invariant hashing scheme like w:en:PhotoDNA.
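
A minimal sketch of such change detection, using a plain content hash (a perceptual scheme like PhotoDNA could be substituted for invariant matching), could look like this:

  import hashlib

  def file_digest(path, chunk_size=1 << 20):
      """Content hash of a cached media file."""
      digest = hashlib.sha256()
      with open(path, 'rb') as stream:
          for chunk in iter(lambda: stream.read(chunk_size), b''):
              digest.update(chunk)
      return digest.hexdigest()

  def maybe_record_version(path, history):
      """Append a new history entry only when the cached file actually changed."""
      new_hash = file_digest(path)
      if not history or history[-1] != new_hash:
          history.append(new_hash)
      return history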

There are two situations that must be handled separately; both have to do with thumbnailing. The simplest situation is when the downloaded media file can be thumbnailed as is. That is usually the case for graphs and figures. A more complex situation emerges when the downloaded file is a map. In those cases scaling involves information hiding: decreasing the resolution implies removing details from the map. Because there is no information about how to do that locally, the scaling must be done at the external site. In some cases it could be enough to ask for a single representative image that can be scaled locally, and have a high-resolution image that is used for all other situations.
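
A sketch of how the two situations could be told apart, assuming a hypothetical external endpoint that renders maps at a requested width (the URL parameter is made up):

  import io
  import requests
  from PIL import Image

  def get_thumbnail(local_path, width, scalable_locally, remote_thumb_url=None):
      # Graphs and figures can simply be downscaled from the cached copy.
      if scalable_locally:
          image = Image.open(local_path)
          image.thumbnail((width, width))
          return image
      # Maps must be re-rendered by the external site so that details are
      # dropped, not just resampled; the endpoint and parameter are hypothetical.
      response = requests.get(remote_thumb_url, params={'width': width}, timeout=10)
      response.raise_for_status()
      return Image.open(io.BytesIO(response.content))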

It could be interesting to have a separate map server/functionality that handles maps. That is, however, outside the scope of this project.

Caching content

There is no existing integration layer for external content, either text or data.

For data it could perhaps be possible to use something like Wikibase; a version for Commons is now in the works and will probably do much of the same. Unfortunately, Wikibase does not have any mechanism for importing external data. An alternative could be to have a local data store (triple store / quad store) of some kind and import external data into it, but that implies recreating Wikibase in yet another tool. It seems like the best way forward is to add some kind of integration layer to Wikibase, but this is not a small task.
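
As an illustration of the local-store alternative, a minimal sketch using rdflib as the triple store (the namespace and property names are made up):

  from rdflib import Graph, Literal, Namespace

  EX = Namespace('https://example.org/external-data/')   # made-up namespace

  graph = Graph()

  # Materialize one external observation, e.g. a river gauge reading.
  station = EX['station/123']
  graph.add((station, EX.waterLevel, Literal(4.2)))
  graph.add((station, EX.observedAt, Literal('2014-05-01T12:00:00Z')))

  # The store can then be queried locally instead of hitting the external site.
  for _, _, level in graph.triples((station, EX.waterLevel, None)):
      print('cached water level:', level)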

Caching proxy

While delivering data it can be interesting to use the server as a caching proxy, thereby hiding the client's real IP address from the external provider. In this scenario we can't allow any code to be forwarded, as that would allow the external provider to make a call back to itself, thereby leaking information from the client. For this to be secure, the JavaScript code utilizing the data should run as a resource with its own security level.
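
A minimal sketch of such a caching proxy, assuming Flask and a made-up whitelist of data providers; only plain data formats are forwarded, never anything that could carry code:

  import time

  import requests
  from flask import Flask, Response, abort

  app = Flask(__name__)

  ALLOWED_HOSTS = {'data.example.org'}               # made-up whitelist of providers
  SAFE_TYPES = ('application/json', 'text/plain')    # data only, never scripts
  CACHE_TTL = 300                                    # seconds
  _cache = {}                                        # url -> (timestamp, body, content type)

  @app.route('/proxy/<host>/<path:resource>')
  def proxy(host, resource):
      if host not in ALLOWED_HOSTS:
          abort(403)
      url = f'https://{host}/{resource}'
      cached = _cache.get(url)
      if cached and time.time() - cached[0] < CACHE_TTL:
          return Response(cached[1], content_type=cached[2])
      # The external provider only ever sees the server's IP address.
      upstream = requests.get(url, timeout=10)
      content_type = upstream.headers.get('Content-Type', '')
      if not content_type.startswith(SAFE_TYPES):
          abort(415)                                 # refuse anything that could carry code
      _cache[url] = (time.time(), upstream.content, content_type)
      return Response(upstream.content, content_type=content_type)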

Mashup clientside

This is content and files that are loaded dynamically in the client. There are several different solutions here: hotlinking images, embedding frames and objects, and allowing scripting from external sites. The server load that can be created is a serious issue; Wikipedia at its worst can create more traffic than most sites can handle. If we hotlink in any way to an external site we can easily take that site down. It can be very difficult to avoid this effect if we create mashups.

These methods provide varying degrees of isolation of the external content, and all have varying degrees of privacy issues. To give some level of security it is crucial to use Content Security Policies, and to verify that all parts comply with the given security levels.[1] If an external provider can't meet our minimum security level we should not use the provider at all.
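
As an illustration of what such a minimum policy could look like, expressed as a response header the server could attach (the provider host below is a placeholder):

  # Illustrative Content Security Policy; the provider host is a placeholder.
  SECURITY_HEADERS = {
      'Content-Security-Policy': (
          "default-src 'self'; "
          "img-src 'self' https://trusted-provider.example; "
          "connect-src 'self' https://trusted-provider.example; "
          "script-src 'self'"
      ),
  }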

Note that even if we can set a sufficient security level we will still leak IP addresses. To some of our users that alone is unacceptable, even if they have no problem surfing other sites with Flash, Java and JavaScript enabled.

Hotlinking images

Only static images would be available. It should be possible to define a sufficient security level.

Embedding frames and objects

Technically this is a webpage within the webpage, with some limitations on what the inner page is allowed to do. The inner page would come from the external site. Newer browsers have clear boundaries on what the inner page is allowed to do, while older browsers can have serious loopholes in the security model.

By embedding frames and objects we can allow JavaScript and thereby create interactive content. That can be interesting, especially if the content allows analysis and drilling down into other aspects of the content. It could for example be possible to interactively manipulate a rain model to see how that changes the flooding further down a river system. Even if that is possible, it quickly raises the question of what a site is for: do we do dissemination of knowledge or do we do teaching? Some such interactive content could be interesting on Wikipedia, while other content could be better suited for Wikiversity.

Allow external data

It is possible to allow Cross-Origin Resource Sharing and thereby make it possible to collect and present data from external sites.[2] Without introducing external scripting, the data provider must use well-defined formats that the digester can parse. Examples of such formats are JSON-stat, JSON-LD and GeoJSON, but XML-based formats are also possible.
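
A minimal sketch of consuming such data without any external scripting, assuming a hypothetical JSON-stat 2.0 endpoint (the URL is made up):

  import requests

  # Hypothetical JSON-stat endpoint; providers like SSB.no expose real ones,
  # but the URL here is made up.
  URL = 'https://data.example.org/api/population.json'

  dataset = requests.get(URL, timeout=10).json()

  # JSON-stat 2.0 keeps dimension names in "id", their lengths in "size", and
  # the observations as a flat "value" array in row-major order.
  # (This assumes a dense value array; JSON-stat also allows a sparse object form.)
  dims = dataset['id']
  sizes = dataset['size']
  values = dataset['value']

  def cell(indices):
      """Map per-dimension indices to the position in the flat value array."""
      pos = 0
      for size, index in zip(sizes, indices):
          pos = pos * size + index
      return values[pos]

  print(dims, sizes, cell([0] * len(dims)))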

Instead of delivering the data directly to the client, it is possible to redirect through a proxy on the server. This hides the client's IP address from the external provider and also allows for caching.

Allow external scripting

By allowing external scripting outside embedded frames we essentially say that anything on the page can be altered. That makes it very difficult to verify where our content ends and the external site's content starts. Unless we can make sure our content remains the same, and the external content does what it says it should do, this is not a solution we should opt for.

[Do we need a description of the inherent problems and how we can counteract some of them?]

Licensing problems

The licensing problems can basically be divided into three groups:

  1. Free content with free licenses; still, the licenses might be different from what is used on our sites
  2. Open content with non-free licenses; still, the content is free to use (free as in beer)
  3. Closed content with non-free licenses; the content often comes with limiting end-user license agreements

It is our opinion that all content used should be openly available; if not, it is difficult for a reader to verify the correctness of the content. If content is not openly available, its main characteristics should be easily available elsewhere.

Free content with free licenses

Reuse of the content should not be a problem.

Open content with non-free licenses

If use of the content can be covered by an Exemption Doctrine Policy it could be possible to use the content. That usually means that either the site owners allow the use of the content within a specific context, usually within a Wikipedia article, or that both local law and US law allow use of the content as a citation.

Closed content with non-free licenses

Such content should not be used, as it will be difficult for a reader to verify its correctness.

Example sites

This list is for example purposes only; it is not at all complete.

Norwegian Environment Agency

Environment.no[3] (Miljøstatus.no[4]) is a website of the Norwegian Environment Agency, a subsidiary of the Ministry of the Environment. It has updated figures, maps and data about the present environmental status of Norway.

They have contacted the Norwegian (bokmål) Wikipedia about providing such content.

Norwegian Meteorological Institute

Yr.no[5] is a joint project between the Norwegian Broadcasting Corporation and the Norwegian Meteorological Institute. It has updated text, figures, maps and data about the present weather situation in Norway and large parts of the world. Farmers in Africa are known to use weather forecasts from this site.

There were some discussions some years back about using data from this site on Wikipedia and Wikinews.

Norwegian Water Resources and Energy Directorate

NVE.no[6] is the website of the Norwegian Water Resources and Energy Directorate. It has updated maps, figures and data about the present water resources and energy situation in Norway. They are also involved in risk analysis for areas with a high risk of avalanches, mudslides, etc.

Statistics Norway

SSB.no[7] is the website of Statistics Norway. It has updated analyses, figures, maps and data about present and past statistics for Norway, as well as predictions about future trends. Some of the data is not readily available, as a rather convoluted interface must be navigated to reach the correct data tables. There is an ongoing effort to simplify the interface, and some of the data is available through an interface that uses JSON-stat.

Data from the site is used on Wikipedia but mostly through manual download and updating.

Norwegian Social Science Data Services

NSD.no[8] is the website of the Norwegian Social Science Data Services. It has updated data about present and past elections, mostly for Norway.

Brønnøysund Register Centre

Data.brreg.no[9] is a website of the Brønnøysund Register Centre. It has updated data about identity, ownership and financial matters for Norwegian companies and entities.

It is also possible to get access to the European Business Register, which provides information about business enterprises in fourteen European countries, including Norway. Unfortunately that information requires credentials to access, but registration and some use is free.

Data from the site is used on Wikipedia but mostly through manual download and updating.

References