Grants:IdeaLab/Import and visualize census data

Import and visualize census data

A simple import scheme for standardized data from census bureaus that keeps the imported values up to date. The method uses the Wikidata project as a store for some identifying values, but accesses the data provider directly.

idea creator: Jeblad

created on 17:32, 2 July 2015 (UTC)

Project idea

What is the problem you're trying to solve?

Importing and visualizing census data, or statistical data in general, by hand is very laborious. Because of this, the data is often imported and adapted once and never updated. This is very unfortunate.

There are at least three very common scenarios for articles about administrative areas:

Editor wants a single value, usually for an inline phrase
Editor wants a sequence of values, often a time series, presented verbatim inline, or as a list, or as part of a table
Editor wants a sequence of values, often a time series, visualized in some kind of graph

What is your solution?

By storing the minimum data necessary to identify the described entity at the data provider, we can automate much of the process and create once and publish everywhere.

Many census bureaus have started to use JSON-stat for publishing their statistics. One of them is Statistics Norway (SSB); others are the UK’s Office for National Statistics, Statistics Sweden, Statistics Denmark, Instituto Gallego de Estadística, and the Central Statistics Office of Ireland. SSB provides a list of preconfigured datasets, and in addition to those it is also possible to create customized statistics. From the list we can drill down to "Population changes. Municipalities, latest quarter", and there we find the data as JSON-stat.

Revamping this data as entity-specific (our context) information for Wikidata is not simple, but it is possible. The contents tab in the example has entries, called variables here, that somehow map to properties. In the given example the entries "Folketallet1" and "Folketallet11" can be mapped to d:Property:P1082 (population) with a qualifier for d:Property:P585 (point in time). Usually only a small number of variables map directly to properties on Wikidata, and it would be too cumbersome if editors on Wikipedia had to ask the Wikidata community to add a property for a variable before they could write their articles. So we must make some simplifications.

Note that variables in the example are instances of dimensions in JSON-stat. They are slightly easier to interpret as variables, but dimensions are more powerful than that.
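To make the relation between variables, dimensions and values concrete, here is a minimal sketch in Python. It assumes the JSON-stat 2.0 layout, where the dataset lists its dimension ids and sizes and stores all values in one flat, row-major array. The dataset below is hand-made for illustration (the region codes and numbers are not real SSB data), and the names `lookup`, `category_position` and `VARIABLE_TO_PROPERTY` are hypothetical.

```python
# Hand-made dataset in the JSON-stat 2.0 layout (illustrative values only).
dataset = {
    "version": "2.0",
    "class": "dataset",
    "id": ["Region", "ContentsCode", "Tid"],  # dimension order
    "size": [2, 2, 1],                        # categories per dimension
    "dimension": {
        "Region": {"category": {"index": {"0101": 0, "0104": 1}}},
        "ContentsCode": {
            "category": {"index": {"Folketallet1": 0, "Folketallet11": 1}}
        },
        "Tid": {"category": {"index": {"2015K1": 0}}},
    },
    # Flat, row-major: the last dimension varies fastest.
    "value": [30000, 30100, 31000, 31200],
}

# The mapping suggested in the text: both variables become population (P1082).
VARIABLE_TO_PROPERTY = {"Folketallet1": "P1082", "Folketallet11": "P1082"}


def category_position(dataset, dim_id, category_id):
    """Position of a category within its dimension (index may be dict or list)."""
    index = dataset["dimension"][dim_id]["category"]["index"]
    if isinstance(index, list):
        return index.index(category_id)
    return index[category_id]


def lookup(dataset, coords):
    """Return the single value addressed by one category per dimension."""
    offset = 0
    for dim_id, size in zip(dataset["id"], dataset["size"]):
        offset = offset * size + category_position(dataset, dim_id, coords[dim_id])
    return dataset["value"][offset]


population = lookup(
    dataset, {"Region": "0104", "ContentsCode": "Folketallet11", "Tid": "2015K1"}
)
```

Seen this way, a "variable" is just one category of the contents dimension, which is why dimensions are the more general concept.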

Stored on Wikidata

A statistic is assumed to be a table (dataset) or a collection of tables (bundle of datasets), and to be described on a Wikidata item. The item is given a label and description that reflect the statistic, if possible by importing labels and notes from one or more languages. The label and the description make it possible to localize the generated table or graph when it is reused.

This would probably be done by a bot.

Caching URL

The statistic can be connected to an external page through an ordinary URL or a caching URL. If a URL is used, then the access must be done from the clients or the browsers. If a caching URL is used, then the referenced resource can be cached somewhere. A caching URL can provide better security for the user. Necessary cache maintenance will not be visible from the outside: a client asks for access to the resource through the repo and is then provided a copy, which may or may not be fresh. Usually the resource will be cached on the first save of a claim, and later refreshed according to the cache parameters.

If no caching parameters are given by the external service, then some sane defaults must be used. Note that we should only trigger a cache refresh downwards if there are changes in the JSON-stat.
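One way to picture this behaviour is the small Python sketch below. It is not an existing Wikibase feature: the class `StatCache`, its `fetch` callback (returning the payload and an optional lifetime in seconds) and the one-day default lifetime are all assumptions made for illustration. A content hash decides whether anything downstream needs to be refreshed.

```python
import hashlib
import json


class StatCache:
    """Hypothetical cache for JSON-stat resources behind a caching URL."""

    def __init__(self, fetch, default_ttl=86400):
        self.fetch = fetch              # callable: url -> (payload, ttl or None)
        self.default_ttl = default_ttl  # assumed fallback lifetime in seconds
        self.entries = {}               # url -> (digest, payload, expires_at)

    def get(self, url, now):
        """Return (payload, changed); changed means downstream should refresh."""
        entry = self.entries.get(url)
        if entry is not None and now < entry[2]:
            return entry[1], False      # still fresh, no request is made
        payload, ttl = self.fetch(url)
        serialized = json.dumps(payload, sort_keys=True).encode("utf-8")
        digest = hashlib.sha256(serialized).hexdigest()
        changed = entry is None or entry[0] != digest
        lifetime = ttl if ttl is not None else self.default_ttl
        self.entries[url] = (digest, payload, now + lifetime)
        return payload, changed
```

A caller would only regenerate tables or graphs when `changed` is true, so an expired but identical resource causes a re-fetch and nothing more.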

JSON-stat URL: A possible property for identifying the external page. Its data type could be either URL or Caching URL.

Selector

Specific values can be overridden as necessary to extract data or to localize data. Each override is identified by a selector statement with a qualifier set to the new value. A selector is just a string value with an added interpretation, usually nothing more than JSON dotted notation to access a value. It might be that a selector should itself be defined as a qualifier. It seems that most selectors should only operate inside the dimension substructure.

JSON-stat selector: A possible property for identification (qualification) of an additional or overridden value, could be defined as a String value.
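As an illustration of what "JSON dotted notation" could mean in practice, here is a minimal Python sketch. The functions `resolve_selector` and `apply_override` are hypothetical, and the sketch assumes that keys never contain a dot themselves, which a real implementation would have to handle.

```python
def resolve_selector(document, selector):
    """Walk a dotted path such as 'dimension.Tid.label' through nested JSON."""
    node = document
    for part in selector.split("."):
        if isinstance(node, list):
            node = node[int(part)]  # numeric parts address list positions
        else:
            node = node[part]
    return node


def apply_override(document, selector, new_value):
    """Replace the value a selector points at (modifies document in place)."""
    parent_path, _, last = selector.rpartition(".")
    parent = resolve_selector(document, parent_path) if parent_path else document
    if isinstance(parent, list):
        parent[int(last)] = new_value
    else:
        parent[last] = new_value


stat = {"dimension": {"Tid": {"label": "tid"}}}
apply_override(stat, "dimension.Tid.label", "time")
```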

Slice

One type of statement with a selector will be a slice. Such an overload will typically identify a dimension in the external table that is connected to a value set in a specific type of statement, that is, a statement using a specific property. When the statistic is linked from another item (the context), that value will be used, which implies that the cube is reduced to the values stored in statements intersecting that specific property value. There can be several trackers defined, and only some of them might apply to a specific use case.

JSON-stat slice: A possible property for statements to hold references to properties used to identify additional or overridden values. Must be used together with a selector.
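The reduction of the cube can be sketched as follows in Python. This is only one possible reading of the idea: the function `slice_dataset`, the placeholder property id `P_REGION_CODE` and the sample numbers are assumptions, and the dataset again follows the flat, row-major JSON-stat 2.0 layout.

```python
from itertools import product


def slice_dataset(dataset, dim_id, category_id):
    """Keep only the values whose coordinate in dim_id equals category_id."""
    ids, sizes = dataset["id"], dataset["size"]
    axis = ids.index(dim_id)
    index = dataset["dimension"][dim_id]["category"]["index"]
    wanted = index.index(category_id) if isinstance(index, list) else index[category_id]
    # product() varies the last axis fastest, matching the row-major value order.
    all_coords = product(*(range(n) for n in sizes))
    kept = [v for coords, v in zip(all_coords, dataset["value"]) if coords[axis] == wanted]
    new_dims = dict(dataset["dimension"])
    new_dims[dim_id] = {"category": {"index": {category_id: 0}}}
    new_sizes = list(sizes)
    new_sizes[axis] = 1
    return {**dataset, "size": new_sizes, "dimension": new_dims, "value": kept}


cube = {
    "id": ["Region", "Tid"],
    "size": [2, 3],
    "dimension": {
        "Region": {"category": {"index": {"0101": 0, "0104": 1}}},
        "Tid": {"category": {"index": ["2013", "2014", "2015"]}},
    },
    "value": [10, 11, 12, 20, 21, 22],
}

# Hypothetical claim on the context item; the slice statement says which
# dimension this property value selects from.
context_claims = {"P_REGION_CODE": "0104"}
reduced = slice_dataset(cube, "Region", context_claims["P_REGION_CODE"])
```

After slicing, only the time dimension has more than one category left, so the result is a plain vector for the linked item.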

Index

Another type of statement with a selector will be an index. This will simply bind the indexes from the categories to properties used as qualifiers. That will make it possible to fold the remaining dimensions into something that can be listed in the connected item. If the selector does not identify a specific index, then it holds for all contained indexes.

JSON-stat index: A possible property for statements to hold references to properties used to qualify values. Must be used together with a selector.

Label

Inside the statistics there will be labels that need localization. Note that this is not the label for the statistic as such, but labels inside the statistic. Usually this will only be localization, but it could also be specialization due to actions by a tracker.

JSON-stat label: A possible property for statements to hold localized labels. Must be used together with a selector.
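A label statement could then be applied roughly like this. The sketch is in Python, the function `localized_label` and the shape of `overrides` (selector, then language code, then text) are assumptions, and the fallback to the provider's own label is a design choice rather than something the proposal specifies.

```python
def localized_label(dataset, selector, language, overrides):
    """Prefer a label stored on the statistics item, else the provider's label."""
    by_language = overrides.get(selector, {})
    if language in by_language:
        return by_language[language]
    node = dataset
    for part in selector.split("."):  # plain dotted path into the JSON-stat
        node = node[part]
    return node


source = {"dimension": {"Tid": {"label": "tid"}}}
# Hypothetical label statements: selector -> language code -> text.
overrides = {"dimension.Tid.label": {"en": "time"}}
english = localized_label(source, "dimension.Tid.label", "en", overrides)
german = localized_label(source, "dimension.Tid.label", "de", overrides)
```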

Prepare extracts for Wikidata

At this point we have a blob that is a JSON-stat that comes from an external source, and we have an item that describes this blob in a reusable way. The statements in the statistics item will act as connectors between linked items on Wikidata and the dimensions in the JSON-stat, and allow us to extract specific values from the JSON-stat (hyper)cube and import them into Wikidata.

Those values can be used as-is and inserted as statements when provided as single values, or in some cases as lists of qualified statements when provided as vectors. When provided as matrices it will probably only make sense if collections are added to Wikidata.

It seems that the remaining dimensions can be translated into qualifiers; for example, a time dimension can be folded into point in time (P585). It is not completely clear how this should be done.
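One possible way to do the folding is sketched below in Python. It assumes the cube has already been reduced so that only dimensions bound by an index statement have more than one category. The function `fold_into_statements`, the dictionary shape of a statement and the sample values are hypothetical; P1082 and P585 are the properties named earlier, and the conversion of a category such as "2014" into a proper Wikidata time value is left out.

```python
from itertools import product


def categories_in_order(dimension):
    """Category ids of one dimension, in their positional order."""
    index = dimension["category"]["index"]
    if isinstance(index, list):
        return list(index)
    return sorted(index, key=index.get)


def fold_into_statements(dataset, value_property, bindings):
    """Turn each value into a statement; bindings maps dimension id to qualifier."""
    axes = [categories_in_order(dataset["dimension"][d]) for d in dataset["id"]]
    statements = []
    for coords, value in zip(product(*axes), dataset["value"]):
        qualifiers = {
            bindings[dim_id]: category
            for dim_id, category in zip(dataset["id"], coords)
            if dim_id in bindings
        }
        statements.append(
            {"property": value_property, "value": value, "qualifiers": qualifiers}
        )
    return statements


sliced = {
    "id": ["Region", "Tid"],
    "size": [1, 2],
    "dimension": {
        "Region": {"category": {"index": {"0104": 0}}},
        "Tid": {"category": {"index": {"2014": 0, "2015": 1}}},
    },
    "value": [30000, 30500],
}
statements = fold_into_statements(sliced, "P1082", {"Tid": "P585"})
```

A single value yields one statement, and a vector yields a list of qualified statements, matching the two cases described above.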

Recreation of JSON-stat as Vega-graph

Vectors and matrices can be visualized, and that is perhaps the most interesting use for Wikipedia. If the extract from the JSON-stat blobs can be transformed somehow, it is possible to imagine a new JSON format that renders a Vega-based graph, which can then be transformed into input for mw:Extension:Graph. The important point is that we transform one known form into another known form, and we want this transformation to be defined in such a way that as much of the work as possible can be reused. The dimensions of the extract should stay the same, as should the localization of the strings.

There must be some method to configure the remaining parts of the transform, or filter process. This needs further investigation, but one option could be to pipe the processed JSON-stat through a Lua script.
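To give a feel for such a transform, here is a rough Python sketch that turns an extracted time series into a Vega-style specification. The text suggests Lua for the real pipeline; Python is used here only for illustration. The function `to_vega_timeseries` is hypothetical, and the key names are loosely modelled on the Vega 2 grammar as an assumption. They differ between Vega versions and would need to be checked against the version that mw:Extension:Graph actually runs.

```python
import json


def to_vega_timeseries(points, title):
    """Build a minimal line-chart spec from (label, value) pairs."""
    values = [{"x": x, "y": y} for x, y in points]
    return {
        "width": 400,
        "height": 200,
        "data": [{"name": "table", "values": values}],
        "scales": [
            {"name": "x", "type": "ordinal", "points": True, "range": "width",
             "domain": {"data": "table", "field": "x"}},
            {"name": "y", "type": "linear", "nice": True, "range": "height",
             "domain": {"data": "table", "field": "y"}},
        ],
        "axes": [
            {"type": "x", "scale": "x", "title": title},  # localized label
            {"type": "y", "scale": "y"},
        ],
        "marks": [{
            "type": "line",
            "from": {"data": "table"},
            "properties": {"enter": {
                "x": {"scale": "x", "field": "x"},
                "y": {"scale": "y", "field": "y"},
            }},
        }],
    }


spec = to_vega_timeseries([("2014", 30000), ("2015", 30500)], "time")
graph_json = json.dumps(spec)  # what a graph tag or parser function would receive
```

Only the `data` values and the localized title depend on the statistic; the rest is a reusable template, which is the kind of reuse the transform is meant to allow.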

A coarse example of the simplest possible call to a graph function could look like this:

{{#graph:stat=Q123
  | format=timeseries
}}

Where stat=Q123 identifies the statistics to use, and the page itself defines the item used as the context.

Because the whole pipeline can be strung together and run automatically, we can recreate the graphic whenever the data source changes. This makes it possible to keep the final graphic up to date without wasting man-hours on manual and tedious work.

Goals

Get Involved

Participants

Endorsements

This will make it easier to keep demographic data from statistical agencies updated. H@r@ld (talk) 15:50, 24 August 2015 (UTC)
