Talk:Wikidata/Development/Representing values


Fractions

There are going to be values that are natively represented as fractions. Obviously 1/2 is going to convert to decimals very easily. Not so with 1/3 or 13/99. How should those be handled? This is also Sven Manguard (talk) 20:05, 17 December 2012 (UTC)[reply]

We won't. Those numbers appear, but rather rarely I'd guess - can you give examples? --Denny Vrandečić (WMDE) (talk) 20:52, 17 December 2012 (UTC)[reply]
Given enough time, I'm sure I could find some. I know fractions have much more historical use than they do present use (we can thank cheap, common calculators for that), but that they are still used in recipe-like instructions, including in some corners of modern medicine. Is there a good reason to write off fractions? This is also Sven Manguard (talk) 15:37, 18 December 2012 (UTC)[reply]
Rational numbers are very common in several disciplines, but we can probably do without them initially. They are very common in conversions between units, and they could be important in length/angle calculations. Another thing is that rational numbers often describe a fraction of a transcendental number, and those are more or less meaningless given as real numbers. Still, I think we can do without them initially. — Jeblad 05:03, 20 December 2012 (UTC)[reply]
These are very common when converting between units, for example, from m s-1 to km h-1. CC0 (talk) 20:56, 27 October 2016 (UTC)[reply]
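
A hedged, purely illustrative sketch of the point being discussed, using Python's standard fractions and decimal modules: an exact rational like 13/99 or the m/s-to-km/h factor survives arithmetic unchanged, whereas a decimal representation has to commit to a precision up front. The module choice is only an illustration, not a proposal for the Wikibase implementation.

```python
from fractions import Fraction
from decimal import Decimal, getcontext

# 13/99 has no finite decimal expansion, but the rational itself is exact.
exact = Fraction(13, 99)
print(exact)            # 13/99
print(float(exact))     # ~0.1313..., rounded to binary double precision

# The conversion factor from m/s to km/h is the exact rational 18/5.
ms_to_kmh = Fraction(18, 5)
speed_ms = Fraction(10)          # 10 m/s
print(speed_ms * ms_to_kmh)      # 36 (km/h), with no rounding error

getcontext().prec = 10
print(Decimal(13) / Decimal(99)) # 0.1313131313 -- precision must be chosen in advance
```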

Strings vs. entities

What about indicating that a value is a simple string, i.e. that it shouldn't correspond to a Wikidata entity? An example would be a person's birth name, or a country's motto. Yaron K. (talk) 01:29, 18 December 2012 (UTC)[reply]

Oh, good point. The list on the page is incomplete -- there are also datatypes for items, strings, etc. as discussed in the full data model. I should at least point to them here. I discussed here those that are immediate and unclear to me. Thanks, changing here. --Denny Vrandečić (WMDE) (talk) 10:24, 18 December 2012 (UTC)[reply]

Uncertainties, confidence

A few thoughts on uncertainties and confidence:

  1. Some numbers and quantities do not come with uncertainties, because these are unknown (in many cases) or zero (e.g. for some simple counts). Can both the lower and upper uncertainties be left blank? Will zeroes cause any problems?
  2. The confidence will also sometimes be unknown. Will it be possible to leave this blank?
  3. I'm glad to hear that the data model allows for differing upper and lower uncertainties.
  4. I'm concerned about the statement that "the quantity value should not have a higher resolution than the uncertainty". That might often be a sensible guideline for presenting values, although even there it can sometimes be useful to present more detail than is completely meaningful. But when recording data, we should strive to capture as much detail as is available. This extra detail is often important when we need to do something more than merely parroting back the data values (e.g. graphing or further analysis).

--Avenue (talk) 03:34, 18 December 2012 (UTC)[reply]

Thanks, Avenue.

  1. 0 will not be a problem. I was assuming we would default to 0 instead of nothing if no value is given (or actually use a sensible default when the number is entered, i.e. based on the number of non-zero leading digits). But maybe you are right and we should explicitly allow nothing as a value.
  2. I was thinking of defaulting it to one standard deviation, i.e. 0.68. But maybe this should also be nothing instead.
  3. Glad you like it.
  4. It is just a 'should'. Nothing in the software will enforce it.

--Denny Vrandečić (WMDE) (talk) 10:44, 18 December 2012 (UTC)[reply]

I recommend you default confidence to blank instead of one standard deviation. By filling in a value automatically, you create the incorrect assumption that the original source measured, determined, and is willing to put their name/credibility behind a piece of information that they may not have measured or determined, and certainly did not decide to put their name/credibility behind. It also makes it impossible to tell whether one standard deviation is the actual value from the source or if it is the default. Finally, a lot of fields don't deal in standard deviations, they deal in non-numerical/non-mathematical confidence levels like "about", which can translate into anything from "+/- 140 years" to "I think that it's pretty close to this number, but don't want to go on the record with something that specific because that's a very bold statement to make in my field and I'm not tenured yet". This is also Sven Manguard (talk) 15:51, 18 December 2012 (UTC)[reply]
  1. It's good to hear zeroes won't be a problem. I think there is a meaningful distinction between zero and unknown uncertainties, worth maintaining if it's not too difficult. And maybe I'm reading too much into your bit about a "sensible default", but I think it would be better if this was only applied at the time of data output, and did not overwrite a blank/zero uncertainty entered as part of the data. I do agree that the number of non-zero leading digits would provide an adequate default for presentation purposes, if no other data was available.
  2. I think Sven has a good point about making blanks the default for confidence.
  3. I'm still quite concerned about the "should", especially as it could be read as applying to the data to be entered. I worry that this would lead some people to enter less detailed values than they have available. I also think this unqualified "should" is too strong even if it's only applied to how the data should be presented. (I can give an example where I believe this would be problematic, if that's useful.) Something like "The quantity value should not usually be presented with a higher resolution than the uncertainty" would be better, especially if something was added to make clear this does not apply to how data should be entered. I am glad to hear nothing in the software will enforce it.
--Avenue (talk) 16:25, 18 December 2012 (UTC)[reply]
I don't think an uncertainty/error on nothing makes sense… It makes sense on a zero value, but not on a non-existing value. A range value might have an unknown center value, but a range should probably not have a center value at all. If given a center value, it must have a uniform distribution, and that creates a whole bunch of problems.
It should be possible to enter a value without giving any reference to intervals, std.dev., or whatever, as it could be impossible to get qualified numbers, and such additions could be interpreted as some kind of implicit qualification (or disqualification). The norm should be "these are the numbers extracted from the source and any other numbers are due to the editors' own (original) research". That is, I'm still very much for pushing everything smelling of errors, intervals, std.dev., and similar into one or more additional object(s) specialized for the purpose.
The mean value, which is usually not part of the summary, is what will be displayed in most cases. This value can be given very accurately even if the measurement system is much less accurate. A well-known example is the accuracy (error probability) of a short-term GPS measurement compared to the mean of a long-term measurement. In my opinion the precision should be given implicitly by the entered number of digits, or explicitly by overriding this (that is, increasing it). — Jeblad 03:17, 21 December 2012 (UTC)[reply]
  1. The definition of the uncertainty interval values could perhaps be clarified; at the moment it is possible to interpret "upper uncertainty: a decimal number" and "lower uncertainty: a decimal number" as either fractional or additive uncertainty (i.e. 10 +/- 0.75 could be 9.25 to 10.75 or 2.5 to 17.5).
  2. I would prefer to avoid the term "uncertainty" because it limits us to one form of confidence interval. However, "upper interval endpoint" / "lower interval endpoint" are somewhat clumsy. Can some statistician with insight or guidance comment? I find it already interesting how the perspective in http://en.wikipedia.org/wiki/Uncertainty#Measurements differs from http://en.wikipedia.org/wiki/Measurement_uncertainty . Perhaps "lower interval value" / "upper interval value" would be less clumsy. Whether the interval describes dispersion (single values) or a confidence interval for a mean is then left open.
  3. You seem to be interpreting "confidence" as the significance level of the confidence interval from upper uncertainty to lower uncertainty. I believe using "confidence" here as a value is somewhat confusing.
  4. I strongly believe the default for confidence (in whatever form expressed) should be "unknown"/"nothing". While in engineering 68.3% ("one sigma"), 95.4% ("two sigma"), or 99.7% ("three sigma") confidence intervals may be common, in science the CI is expected to follow the common scientific experimental significance levels, i.e. CIs are expected to be 95%, 99% or 99.9%.

--Gregor Hagedorn (talk) 16:38, 18 December 2012 (UTC)[reply]

On Gregor's point #2, I'd agree that "confidence" is probably too restrictive a term. In contrast, "uncertainty" has a pretty wide scope; see e.g. w:Uncertainty#Applications, and could stretch to cover most applications here. I don't like the term for another reason - I think of it as a general concept, not a specific measure.
Confidence intervals are just one kind of w:interval estimate, derived from a particular way of modelling the uncertainty in the measurement process. Statisticians use other methods too; Bayesian credible intervals are increasingly common.
Intervals are also used as descriptive statistics, to summarise the spread of values in a distribution. Perhaps the simplest example is the maximum and minimum values seen. Upper and lower quartiles are also widely used in some fields.
And intervals can be used without any formal statistical underpinning, e.g. to indicate the extent of uncertainty in an expert's opinion.
Ideally we would be able to record not just the interval given by a source, but the type of interval it is (where this is specified).
--Avenue (talk) 23:59, 18 December 2012 (UTC)[reply]

We must not confuse a number with an interval. A number has an uncertainty which may be expressed as an upper and lower limit. An interval has a start and a finish; these are two numbers each of which may have an uncertainty. For example there are lots of people whose birth dates have a big uncertainty but whose death dates are well known. Filceolaire (talk) 22:00, 10 January 2013 (UTC)[reply]
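
To make this distinction concrete, here is a minimal, hypothetical sketch (the class and field names are mine, not from the data model): a number carries its own uncertainty bounds, and an interval is simply two such numbers.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class UncertainNumber:
    """A single number with optional lower/upper uncertainty bounds."""
    value: float
    lower: Optional[float] = None
    upper: Optional[float] = None

@dataclass
class Interval:
    """A span between two numbers, each of which may itself be uncertain."""
    start: UncertainNumber
    end: UncertainNumber

# Birth date known only to within a decade, death date known exactly (years used here for brevity).
lifespan = Interval(
    start=UncertainNumber(value=1845, lower=1840, upper=1850),
    end=UncertainNumber(value=1913),
)
```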

Missing central value

First, presently it is defined as: "A number is represented by its quantity value, together with the uncertainties, a confidence, and an optional unit of measurement". That means it is not possible to express values where no central value is recorded or meaningful. Examples of such values might be:

  • The population of a given city in a century (12200 to 302000 inhabitants in the 18th century), where a mean value has no meaning, and data may be insufficient - the midpoint is not the mean of all years in the century.
  • The length of many biological objects (leaf or insect length, tree height, etc.) is often only recorded as an interval quantity, see e.g. http://en.wikipedia.org/wiki/Carrion_Crow "48–52 cm or 18 to 21 inches in length"

From memory (but I read more biology than other topics) I believe that ranges without a central value occur more frequently than value plus/minus measurement error.

I believe that this can be easily fixed by making the set of attributes more flexible, allowing an interval range to be given without requiring a central value.

I also believe this ties in with your question about the date/time model. Allowing intervals solves the problem of "We cannot enter a value like 2nd-5th century AD (we could enter 1st millennium AD, which would be a loss of precision)."

--Gregor Hagedorn (talk) 16:38, 18 December 2012 (UTC)[reply]

I support this idea. In general I think missing values should be allowed for all fields unless there is a very strong reason not to. Prompting the user to confirm that they meant to leave a field blank would usually be sensible, though. Alternatively we could require that some specific code (e.g. NA) is entered when the user means that no value is available. --Avenue (talk) 00:22, 19 December 2012 (UTC)[reply]
For now I think the plan is to support singular data entries, even if they have some error bounds. Later on it could be interesting to have range values. I think some of the problems with the present model come from singular entries being somewhat extended to be range-ish, and that creates a lot of problems. — Jeblad 07:20, 20 December 2012 (UTC)[reply]

Completely missing values

It would also be useful to be able to record completely missing values (i.e. with no central value, and no uncertainty values), when this reflects the source. w:Missing values can arise for several different reasons, including inaction, confidentiality, and impossibility. For instance, international time series may be missing values for certain countries in certain years because the underlying surveys were not conducted there in those years. Recording such missing values will save users from having to check if their omission from Wikidata was intentional or not. Missing values also need to be treated properly when further analysis or summaries are made, and it is hard to do this if they are not recorded. Ideally we should be able to record not just that a value is missing, but the reason why (when this is known). --Avenue (talk) 00:22, 19 December 2012 (UTC)[reply]

Having read some more, I see that missing values for which no value really exists can be indicated by a PropertyNoValueSnak, and that a PropertySomeValueSnak can indicate that a real value is missing. If that could be coupled with a source in a Statement, to say that the value is just missing from that source, then that would go a long way towards addressing my concerns above. However, the description of PropertySomeValueSnaks in the data model says that they should only be used for values missing entirely from Wikidata. They also don't seem to allow the reason that the value is missing to be recorded. --Avenue (talk) 04:34, 19 December 2012 (UTC)[reply]
A reason for both novalue and somevalue seems like a good idea, but I'm not sure how this fits in. It would be like an additional summary field. — Jeblad 06:17, 20 December 2012 (UTC)[reply]
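
For readers following along, a rough illustrative sketch of the three snak kinds mentioned above; the optional reason field is the extension suggested in this thread, not part of the current data model, and everything else here is an assumption for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Union

@dataclass
class PropertyValueSnak:
    """The property has this concrete value."""
    property_id: str
    value: object

@dataclass
class PropertySomeValueSnak:
    """Some value exists, but it is unknown or not recorded."""
    property_id: str
    reason: Optional[str] = None   # suggested extension, e.g. "survey not conducted that year"

@dataclass
class PropertyNoValueSnak:
    """It is known that there is no value for this property."""
    property_id: str
    reason: Optional[str] = None   # suggested extension

Snak = Union[PropertyValueSnak, PropertySomeValueSnak, PropertyNoValueSnak]
```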

Minimum and Maximum

If we could add a minimum and maximum value to the values, we would have the full power that is typically required for scientific values. In cases where dispersion is broad, minimum and maximum are often given in addition to a confidence or dispersal interval, such as "(2-) 5-10 (-12) mm"

I propose to use a model of:

  • value: central value, decimal number
  • minimum
  • maximum
  • lower interval value
  • upper interval value
  • interval type (e.g. "precision", "percentile", "confidence interval", "unspecified")
  • interval significance (0.5 for first to third quartile, 0.8 for 10% to 90% percentile, 0.95 for e.g. 95% CI)
  • unit prefix (E T G M k h c m n p f etc., for exa to femto) - necessary to preserve the desired unit precision, e.g. "100 nm" rather than "1 × 10⁻⁷ m", or to know that the circumference of the Earth is appropriately reported as 40075 km and neither as 4 × 10⁷ m nor as 40 Mm.
  • significant digits, necessary to preserve the desired "1.20 nm" as distinct from "1.2 nm"
  • unit: a Wikidata item (e.g. Q11573 for Metre)

in such a way that each of the attributes may be missing (see the sketch below). A similar model (with the omission of interval significance) has been in use in biology since 1980 (Dallwitz, M. J. 1980. A general system for coding taxonomic descriptions. Taxon 29: 41–6) and has proven to be very useful and flexible. --Gregor Hagedorn (talk) 16:38, 18 December 2012 (UTC)[reply]
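
A sketch of the proposed attribute set as a record in which every field may be missing; the field names mirror the list above, while the types, string storage of values, and example values are illustrative assumptions only.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class QuantityValue:
    value: Optional[str] = None            # central value, kept as a string so "1.20" stays distinct from "1.2"
    minimum: Optional[str] = None          # e.g. "2" in "(2-) 5-10 (-12) mm"
    maximum: Optional[str] = None          # e.g. "12"
    lower_interval: Optional[str] = None   # e.g. "5"
    upper_interval: Optional[str] = None   # e.g. "10"
    interval_type: Optional[str] = None    # "precision", "percentile", "confidence interval", "unspecified"
    interval_significance: Optional[float] = None  # 0.5, 0.8, 0.95, ...
    unit_prefix: Optional[str] = None      # "n" for nano, "k" for kilo, ...
    significant_digits: Optional[int] = None
    unit: Optional[str] = None             # a Wikidata item, e.g. "Q11573" for metre

# "(2-) 5-10 (-12) mm": an interval plus observed extremes, with no central value at all.
leaf_length = QuantityValue(minimum="2", maximum="12",
                            lower_interval="5", upper_interval="10",
                            interval_type="unspecified",
                            unit_prefix="m", unit="Q11573")
```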

There are alternative summaries that include more than one interval, e.g. w:Seven-number summary. I'm not sure that allowing multiple intervals per value is worth the added complexity. Couldn't they just be entered in another Statement? But if we do allow multiple intervals, I'd hope we would allow for different kinds of intervals to be stored (e.g. arbitrary quantiles, not just max and min). --Avenue (talk) 00:31, 19 December 2012 (UTC)[reply]
I know, but in practice the w:Five-number summary model (with added flexibility in the choice of central tendency (median, mean, midpoint) and lower/upper interval choice (precision, quartile, other percentiles [including the 16-84 = ± 1 S.D. percentile mentioned in the current data model draft by Denny], CI, etc.)) is one of the sweet spots of versatility. The reason for this is not so much that it is common to have the full set of all five, but that various combinations: central tendency alone, central tendency with CI, central tendency with other dispersion measures (such as +/- s.d.), central tendency with min and max, min and max alone, uncertainty interval alone, uncertainty interval plus min, uncertainty interval plus max, uncertainty interval plus min and max, etc. cover the vast majority of expressions of quantity. Whether a "3-number summary" (as far as I know not an official term) as proposed in the current data model is a better sweet spot probably depends on whether you want to cover scientific data or data explicitly from statistical agencies. At any rate, I have never seen a w:Seven-number summary in my field. Whether it is meaningful to split them into 2 statements depends in part on whether Wikidata can support outputting them in one string, such as "(2-) 12-15-16.5 (-22) m". --G.Hagedorn (talk) 12:33, 19 December 2012 (UTC)[reply]
I am certainly not sure myself what the best model is. Basically every model is a compromise between versatility, quick learning, adaptability, precision of information transmission, and performance considerations for large datasets. In my area (biodiversity), DELTA, used since the 1980s, uses a five-point model; Lucid omits the central value from it and people are still happy; xper2 uses 5 points again; SDD (an XML exchange standard) uses a maximally flexible model in which all kinds of descriptive statistics (including variance, sample size, skewness, kurtosis) may or may not be used. This could be done in Wikidata through the qualifier system. The problem with implementing the fully flexible system is that the central value and intervals are relevant for searching and ordering, whereas other statistics (variance, s.d., s.e., sample size, CV) are not, or at least not as directly. --G.Hagedorn (talk) 12:33, 19 December 2012 (UTC)[reply]
Some comments in random order. — Jeblad 07:00, 20 December 2012 (UTC)[reply]
I would prefer to have the number as observed as the default, whatever that number should be, and then use an extended or additional object to represent errors and/or summaries like the Seven-number summary, Five-number summary or Three-point estimation. In other disciplines these numbers can be represented in very different ways, and I think it would be more consistent (and simpler) to always split this off from the central topic at hand. Saying something useful about the distribution of values (or errors) requires that the user knows something about the analyzed dataset, which they are highly unlikely to have in most cases.
Unit prefixes have implications for precision and should be preserved somehow. Also, unit prefixes can be implicit in some units and can seriously mess up presentation. A couple of weird examples would be saying that something weighs 1 Kkg or 1 Mg. The correct way to phrase it is 1000 kg, and an acceptable form is 1 t! This could be very difficult to get right.
A unit relates to a specification (standard document) from a standardization body (entity). A "meter" is not just a meter; it is, most of the time, a meter defined by a specific standardization body in a specific standard document. So there will not be one "meter"; there will be several. That means all of them must be defined as separate items. Each of these must again relate to the item for the standardization body. I'm not sure if all Wikipedias will describe each meter definition separately, but at least it is possible as an idea. Perhaps the meter as a concept could be used as an item to connect the Wikipedia articles. The problem is somewhat bigger with feet and some other length measurements.
Some units are unnamed derived units. Those could be spelled out as strings, but the number of variations in normal use is quite large. I guess one of the better-known examples is momentum/impulse, or kg m s⁻¹. It is possible to spell out the formula, but I'm a bit unsure if this form should identify an entity.

Altitude

I believe you do not mean http://en.wikipedia.org/wiki/Altitude but http://en.wikipedia.org/wiki/Elevation --Gregor Hagedorn (talk) 16:38, 18 December 2012 (UTC)[reply]

Time as intervals?

I think that for time values we'd probably not just want to specify instants down to near-arbitrary precision, but also intervals. This means that one temporal "field" can refer to an instant, an approximation, or a range between two instants or approximations. For example, a person might be described as living "1840s to 1953-05-20T04:24:45".

I would strongly suggest re-using existing patterns for this from the LOD world - in particular, the excellent placetime.com by Ian Davis, which was extended into the date/time intervals URI set created by Stuart Williams as part of my team's work for data.gov.uk's RDF.

In particular as well as "regular" intervals such as the year 1434, the month of March 2045, or the 45th second of the 12th minute of the 1st hour of 3 February 1030, you can define arbitrary date ranges like the period of 10 years and 9 seconds from 12:30 UTC on the day that Wikipedia was founded or the first three months of my life, assuming starting at 00:00.

If we were to use it for Wikidata, we'd want to extend it a little - to cover non-Gregorian calendars (theoretically supported already), to cover some more standard time units that are interesting for Wikipedia but were out of scope for the UK Government at the time (weeks, decades, centuries, millennia, æons), to cover the AD/BC split, and possibly to cover untethered concepts of lengths of time ("a period of 7 minutes"). I'd be happy to talk to the people involved in creating it for their advice, and of course, help out any way I can.

James F. (talk) 12:39, 19 December 2012 (UTC)[reply]
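
A hedged sketch of the underlying idea (not the placetime.com / data.gov.uk URI scheme itself): one temporal field holding either a single instant at some precision, or a range between two such instants. All names and the precision vocabulary are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TimePoint:
    """An instant, or an approximation of one, at a given precision."""
    iso: str          # e.g. "1953-05-20T04:24:45" or just "1840"
    precision: str    # e.g. "second", "year", "decade", "century"

@dataclass
class TimeValue:
    """Either a single point (end is None) or an interval between two points."""
    start: TimePoint
    end: Optional[TimePoint] = None

# "1840s to 1953-05-20T04:24:45"
lifetime = TimeValue(
    start=TimePoint(iso="1840", precision="decade"),
    end=TimePoint(iso="1953-05-20T04:24:45", precision="second"),
)
```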

Geolocation

1) Positional accuracy

"an uncertainty (decimal, representing degrees of distance, defaults to 0)" ... it is not suitable to represent distances in units of degrees, because you can't really measure a distance in degrees at all. Also dimensions as arc lengths of degrees (on a great circle) are not very much understandable to humans (also not applicable to an ellipsoid). Positional accuracy should be given in meters on ground instead.

Thanks for the hint. I agree that it was a bit confused and unclear. I added a section explaining it a bit; I hope it is now acceptable. The reason why I am not using meters is that they only make sense on the Earth. --Denny Vrandečić (WMDE) (talk) 15:58, 27 December 2012 (UTC)[reply]
Does that mean it is intended to store celestial coordinates with this datatype as well? I don't think it's a good idea to mix celestial coordinates with geocoordinates. By the way, "geo" means Earth, so a geolocation is by definition a location on Earth. I don't know a common word that describes lat/lon on Earth (geocoordinates) as well as lat/lon on the Moon (selenographic coordinates).
The problem with degree distances (besides the fact that they don't work on ellipsoids) is that a normal user assumes they can simply add or subtract this value from their lat/lon values. But a "distance" of 1° applied to 12° E, 52° N doesn't depict a frame of 11-13° E, 51-53° N. --Alexrk2 (talk) 22:45, 28 December 2012 (UTC)[reply]
2) Accuracy vs. Uncertainty

I think there could be a mixing of the concepts of accuracy and uncertainty. IMO uncertainty means: I don't know the exact location of an object (e.g. the assumed location of an old gravesite). On the other hand, accuracy means that the value is defective (e.g. due to the method of measurement). For instance, if I take a geolocation from Google Maps, these values are accurate only to around 10 meters. Do we need both pieces of information - accuracy and uncertainty?

There is also a third thing: precision. A value of 52.4° is not equal to 52.400°. So if the user enters 52.4, then the database should also store 52.4 (not 52.400000).

In this case it is about precision, not about accuracy or uncertainty. I have therefore renamed it to precision. Thanks. --Denny Vrandečić (WMDE) (talk) 16:27, 27 December 2012 (UTC)[reply]
3) Altitude

en:Altitude usually means height above ground (in aviation). In geography the term elevation is preferred to describe the height of a point on ground (above a reference surface).

Dropped. Is better handled through a property of its own with a quantity value datatype. --Denny Vrandečić (WMDE) (talk) 16:27, 27 December 2012 (UTC)[reply]

--Alexrk2 (talk) 15:52, 19 December 2012 (UTC)[reply]

A few comments. — Jeblad 05:33, 20 December 2012 (UTC)[reply]
1) Positional accuracy can be given in several forms, and I think we should as a minimum use a circle together with a percentage probability that the position lies within that circle. That is approximately the same as CEP, and can be supplied by some GPS receivers. Others are RMS, 2DRMS, R80, R95, and a bunch of others. In those cases where an angle is more meaningful I would prefer to use an alternate error model. That would probably be for other celestial objects.
2) Don't mix accuracy and uncertainty; they are dissimilar. I think what's described on the subject page is accuracy, while an uncertain value is "somevalue". As such, the common form is an error ellipsoid, possibly limited to an altitude plane. Usually this can be simplified to an error disc and called CEP or similar. I would prefer the accuracy to be represented in a separate object, as there can be several error functions.
2.1) Precision should be implicitly given by the number of digits, but could be explicitly given if necessary. It must be verified whether a geopos can be stored as a float or whether it should be stored as a string. Precision could be much higher than the accuracy indicates; that is, the mean value of the error disc can be known to a high precision even if the standard deviation is large. (See the sketch after this list.)
3) Altitude (usually elevation) is measured in several ways: altitude above sea level, altitude above ground, altitude of the ground above a reference surface, and altitude of a given point above the reference surface. In aviation the altitude is often given by pressure and does not relate to a real height but to a barometric isoplane. Elevation is usually the correct term in the context of geographic information. The only way to get this right is to identify the reference globe, and that is definitely not "earth".
4) The entity should not refer to "earth" (Q2) but to the reference model (WGS84/EGM96/EGM2008) describing the celestial object - for example WGS84 or one of several others that describe what we call Earth. That is, the reference model references the globe; the globe is not the reference model. We should not support recalculation from one reference model to another unless we really want to put effort into heavy calculations.
5) The parser should support gon, UTM and MGRS in addition to lat-lon pairs.
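
To make the accuracy/precision distinction above concrete, a small illustrative sketch of an error-disc model similar to CEP; all names, fields, and the example coordinates are assumptions of mine, not the proposed datatype.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional

@dataclass
class GeoValue:
    latitude: Decimal               # Decimal keeps "52.4" distinct from "52.400000"
    longitude: Decimal
    precision: Optional[Decimal] = None    # resolution of the entered digits, e.g. Decimal("0.1")
    accuracy_m: Optional[float] = None     # radius of an error disc on the ground, in metres
    accuracy_prob: Optional[float] = None  # probability the true point lies in the disc (CEP ~ 0.5)

# A point read off a map service: entered with three decimals, accurate to roughly 10 m.
p = GeoValue(latitude=Decimal("52.516"), longitude=Decimal("13.378"),
             precision=Decimal("0.001"), accuracy_m=10.0, accuracy_prob=0.5)
```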

Numbers and quantities

Numbers consist of several subclasses, but only one is important in a first implementation – real numbers. In my opinion they should contain the following:

  • value – often called central value or observed value. In our context it is the reported value. Note that some values might be too large to save as ints, or have too high a precision to be saved as ints. This can be implemented as objects.
  • summary – often called error or bounds. Implemented in a separate object. Initially as an upper and lower bound, that is probably sufficient for most cases.
  • precision – often also called resolution. This could be implemented as a digit count, but could also deviate from this. It does not have to be an integer number.

Ver 1: Quantities have an additional unit object

  • prefix – scaling of the value. Can be empty. Must follow a vocabulary defined by the standard doc.
  • unit – can be any kind of unit, also an unnamed derived unit. Names according to established vocabulary.
  • standard doc – link to doc given by standard body. That is an entity.

Ver 2: Quantities have an additional unit object

  • prefix – scaling of the value. Can be empty. Must follow a vocabulary defined by the later standard doc.
  • unit – list of decomposed units given as entities. The entities must point to a standard doc that is given by a standard body. All decomposed units must use the same standard doc.

Note that some quantities form subdomains that can't be converted between each other. An example is weight or length measurements where no artifact has survived to modern times, but where it is still possible to convert between known related old units. — Jeblad 04:44, 21 December 2012 (UTC)[reply]
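
A sketch of the split described above, with the reported value kept separate from an optional summary (bounds) object and, for quantities, a unit object pointing at a standard document ("Ver 1"); all names, including the standard-document identifier, are illustrative assumptions, not actual items.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Summary:
    """Error/bounds object kept separate from the reported value."""
    lower: str
    upper: str

@dataclass
class Unit:
    prefix: Optional[str]   # e.g. "k"; must come from the vocabulary of the standard doc
    symbol: str             # e.g. "m", or an unnamed derived unit such as "kg m s-1"
    standard_doc: str       # item for the defining standard document; "Q-SI-brochure" below is hypothetical

@dataclass
class Quantity:
    value: str                         # reported value, kept as a string to avoid int/float limits
    summary: Optional[Summary] = None
    precision: Optional[float] = None  # resolution; need not be an integer digit count
    unit: Optional[Unit] = None

q = Quantity(value="40075", precision=1.0,
             unit=Unit(prefix="k", symbol="m", standard_doc="Q-SI-brochure"))
```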

Oh, and the present implementation does not support parsing of floats. — Jeblad 07:16, 21 December 2012 (UTC)[reply]

The more I'm thinking about it, the more sure I am that we should not implement storage of numbers and quantities as finite-precision numbers. We should store whatever the user inputs. If those values are small enough to be recalculated (cast) as single or double floats (there will be JS problems), any necessary recalculation can be done by converting to one of those; otherwise the number should be represented as a string and operated upon with the BC Math library. Precision should then be according to the scale property of the library.

It could be possible to do the casting automatically, but I believe it would be less error-prone to explicitly specify this when requesting properties. I'm not quite sure how we should handle casting to high precision, perhaps by giving a cast limit - that is, do not try to make a number out of a string above a set limit.

Most users probably won't notice any difference, but it could have implications on the usefulness of the stored values. — Jeblad 01:26, 24 December 2012 (UTC)[reply]
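
Python's decimal module plays a role roughly analogous to the BC Math library mentioned above, so a sketch of "store the string, cast only on request" might look like the following. This is an illustration under the assumption that exact preservation of the input is the goal; the cast limit is an arbitrary choice.

```python
from decimal import Decimal, getcontext

stored = "40075.017"          # keep exactly what the user typed

def as_decimal(s: str, scale: int = 28) -> Decimal:
    """Cast the stored string to an arbitrary-precision decimal on request."""
    getcontext().prec = scale
    return Decimal(s)

def as_float(s: str, cast_limit: int = 15) -> float:
    """Cast to a binary double only when the digit count is small enough to survive it."""
    if len(s.replace(".", "").replace("-", "")) > cast_limit:
        raise ValueError("too many digits to cast to float without loss")
    return float(s)

print(as_decimal(stored) * 1000)   # exact scaling: 40075017.000
print(as_float(stored))            # the same value as an IEEE 754 double
```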

Some suggestions about strict scalar value representation

Upper and lower tolerances are not the only things that can characterize precision. In fact, to be strict, we should attach to a magnitude an item field giving its distribution law, and that may require some parameters, given as a named list of values. Different laws have different required values; e.g. a Gaussian distribution will have a mean and a dispersion (both with the same dimension as the value), while some distributions have more parameters, with dimensions different from each other, and may not have these ones. The distribution law "a value from a fixed set", while very simple to understand, can have an unlimited number of parameters. One popular law seems to be "a value of unknown (maybe roughly Gaussian or t-like) distribution with an unknown but seemingly high probability of lying between X and Y", and another "a value roughly equal to X according to source Sx and between Y and Z according to Syz" (where X is not in [Y;Z], and both sources are notable and reliable). Also, we should somehow handle values like "distance between the Earth and the Moon": it is not constant, it changes periodically and drifts slowly on a large timescale, but users will want a simple, not very precise value of roughly the right order. Also, for some cases it may be necessary to describe something by a TeX formula (a language-independent string type, seemingly), which has a list of parameters (themselves formulae) with comments on what they are, either as multilingual text or as items.

And don't forget that values may be tabulated by time, geolocation, etc. And don't forget that not all values are scalars... Ignatus (talk) 08:26, 15 March 2013 (UTC)[reply]
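
A rough sketch of the "distribution law plus named parameters" idea; the law names, parameter keys, and the approximate Earth-Moon figures are illustrative only, not a proposed vocabulary.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class DistributedValue:
    magnitude: float                     # the headline value users will usually see
    unit: Optional[str] = None           # e.g. an item for kilometre
    law: Optional[str] = None            # "normal", "uniform", "fixed-set", "unknown", ...
    parameters: Dict[str, float] = field(default_factory=dict)  # law-specific named parameters

# Earth-Moon distance: a simple, not very precise headline value, with the periodic
# variation pushed into the parameters (approximate figures, uniform law chosen only as an example).
earth_moon = DistributedValue(
    magnitude=384400, unit="km", law="uniform",
    parameters={"min": 356500, "max": 406700},
)
```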

Geolocation

The uncertainty values should be removed and an error model/summary should be added. That could reimplement the error disc now proposed, but it could also use other models.

The elevation value could refer to alternative models to handle above/below surface, above/below geoid, at some specific barometric pressure (normal pressure), or at an adjusted barometric pressure (i.e. above geoid). The last two are there to handle altitude correctly. I tend to think that elevation is either an object that can be replaced, or even be undefined, or that there are several different types of geolocations.

One abstraction could be to use a geoobject that encapsulates a list of other objects, some of which are simply geolocations. I think that in the initial version single geopoints are sufficient, but at some point we should be able to define geoareas and also spatiotemporal objects. Such objects could later on create animated sequences on top of maps, for example the Titanic's sailing route or tectonic movement.

Also some notes in another thread on the page. — Jeblad 07:43, 21 December 2012 (UTC)[reply]

Single precision would give arcminute resolution slightly higher than the present noise floor (accuracy) of the Earth's movements, and double-precision floats also make sense as long as Earth has the highest-precision numbers for geolocations and we want to store the values as decimal degrees. Note that PHP makes no assurance of using either single or double precision for floats.[1] That could produce some very nasty bugs. We should probably have a test case for this. JavaScript represents numbers using the 64-bit floating-point format defined by the IEEE 754 standard, which means it uses doubles.[2] Standard-compliant implementations may (later, still at proposals:decimal) convert doubles into decimals (IEEE 754r) during application of operators, and the conversion may be lossy. (See also proposals:numbers)

If we later on store geolocations for real celestial bodies very much larger than Earth, we might run into problems. We could also run into problems with very large virtual worlds. For now I don't think any of the points merits higher precision. — Jeblad 02:25, 24 December 2012 (UTC)[reply]
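
One quick way to check such resolution claims is to look at the spacing of representable values at coordinate magnitudes; a small sketch using NumPy (the metres-per-degree factor is a rough equatorial figure, and the exact output depends on the platform, though float32/float64 spacing is fixed by IEEE 754).

```python
import numpy as np

for dtype in (np.float32, np.float64):
    # Spacing between adjacent representable values near 180 degrees of longitude.
    step_deg = float(np.spacing(dtype(180.0)))
    step_m = step_deg * 111_320          # rough metres per degree at the equator
    print(dtype.__name__, f"{step_deg:.3e} deg", f"~{step_m:.2e} m")
```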

Geographic shapes

Many geographic features have a shape which cannot be represented by a diameter:

  • linear features such as roads
  • areas such as country boundaries or lakes
  • branching linear features such as rivers, railway lines.

Can we subcontract the description of these to OpenStreetMap by referring to objects in the OSM database for this data? Or to objects which are imported from OSM into the wikiatlas? Filceolaire (talk) 21:52, 10 January 2013 (UTC)[reply]

UCUM - A recommended standard for Units of Measure

The Unified Code for Units of Measure (the UCUM) is a system of codes for unambiguously representing measurement units to both humans and machines.

The Unified Code for Units of Measure is a code system intended to include all units of measure being contemporarily used in international science, engineering, and business. The purpose is to facilitate unambiguous electronic communication of quantities together with their units. The focus is on electronic communication, as opposed to communication between humans. A typical application of The Unified Code for Units of Measure is electronic data interchange (EDI) protocols, but there is nothing that prevents it from being used in other types of machine communication. How does it relate to existing standards?

The Unified Code for Units of Measure is inspired by and heavily based on ISO 2955-1983, ANSI X3.50-1986, and HL7's extensions called ISO+.

UCUM Homepage

--Linforest (talk) 15:05, 14 May 2013 (UTC)[reply]