From Meta, a Wikimedia project coordination wiki

Most of the languages get their coordinate data updates from database dumps. These, however, have been infrequent and unreliable in the past. I'm testing a new way of extracting the data directly from the database. The raw data needs a bit of postprocessing.

Postprocessing is language-dependent, and I need your help, speakers of other languages!

Below is a summary of the goals of postprocessing. In short: it makes the map less cluttered by abbreviating the label text or excluding bad labels altogether.


Articles about towns, townships, cities and villages often include a county, state or country in their title (Paris). These are redundant, as the position on the map should be information enough. A set of regular expressions is used to shorten these labels.


Some list articles contain a huge set of coordinates. If the name parameter of the coordinate template is not used, all these coordinates show up under the name of the list article, generating lots of identical labels - not very informative. A set of regular expressions is used to blacklist these list articles.



Add blacklists for more languages

$blacklist['en'] = "List of |History of |Geography of |Latitude and longitude of cities|Former toponyms of places in |Second Happy Time|Finnish exonyms for places in |Abbeys and priories in England|Wide Area Augmentation System|Australian places named by ";
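As a rough illustration of how such a blacklist might be applied, here is a minimal Perl sketch. It is not the actual extraction script: the `%blacklist` hash, the `is_blacklisted` helper, and the decision to anchor the pattern at the start of the title are all assumptions made for this example, and the English pattern is abridged.

```perl
use strict;
use warnings;

# Hypothetical per-language table. Only the (abridged) 'en' entry is taken
# from the page; storing the patterns in a hash is an assumption.
my %blacklist = (
    en => "List of |History of |Geography of |Latitude and longitude of cities",
);

# Returns 1 if the title starts with one of the blacklisted alternatives.
# Assumption: the alternatives are meant to match at the beginning of a title.
sub is_blacklisted {
    my ($lang, $title) = @_;
    my $pattern = $blacklist{$lang};
    return 0 unless defined $pattern;    # no blacklist for this language yet
    return $title =~ /^(?:$pattern)/ ? 1 : 0;
}

for my $title ("List of lighthouses in Scotland", "Paris, Texas") {
    printf "%-35s %s\n", $title, is_blacklisted('en', $title) ? "skip" : "keep";
}
```

A new language would then only need one more entry in the table, for example a German pattern starting with "Liste ".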


Variables used in the regular expressions below.

$usstates = "Alabama|Alaska|Arizona|Arkansas|California|Colorado|Connecticut|Delaware|Florida|Georgia|Hawaii|Idaho|Illinois|Indiana|Iowa|Kansas|Kentucky|Louisiana|Maine|Maryland|Massachusetts|Michigan|Minnesota|Mississippi|Missouri|Montana|Nebraska|Nevada|New Hampshire|New Jersey|New Mexico|New York|North Carolina|North Dakota|Ohio|Oklahoma|Oregon|Pennsylvania|Rhode Island|South Carolina|South Dakota|Tennessee|Texas|Utah|Vermont|Virginia|Washington|West Virginia|Wisconsin|Wyoming";
$auterritories = "New South Wales|Victoria|South Australia|Queensland|Western Australia|Northern Territory";
$cdnterrprov = "Ontario|Quebec|Nova Scotia|New Brunswick|Manitoba|British Columbia|Prince Edward Island|Saskatchewan|Alberta|Newfoundland and Labrador|Northwest Territories|Yukon";
$country = "Democratic Republic of the Congo";


Regular expressions to do the filtering

 if    ( $title2 =~ /^(.*) Township, .* County, ($usstates)$/ )         { $title2 = "$1 Twp."; }
 elsif ( $title2 =~ /^(.*) Township, ($usstates)$/ )                    { $title2 = "$1 Twp."; }
 elsif ( $title2 =~ /^(.*), ($usstates|$auterritories|$cdnterrprov|$country)$/ ) { $title2 = "$1"; }
 elsif ( $title2 =~ /^(.*) \(($usstates)\)$/ )                          { $title2 = "$1"; }
 elsif ( $title2 =~ /^(.*) \(i.* County, ($usstates)\)$/ )              { $title2 = "$1"; }
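To try the rules on a few titles, they can be wrapped in a small function. The following is only a sketch: the `shorten_label` name is made up for this example, the state and province lists are abridged, and the last rule from the list above is left out.

```perl
use strict;
use warnings;

# Abridged lists for illustration only; the full lists are given above.
my $usstates    = "Michigan|Ohio|Texas|New York";
my $cdnterrprov = "Ontario|Quebec";

# Hypothetical helper: applies the shortening rules in order and returns
# the shortened label. The first matching rule wins.
sub shorten_label {
    my ($title) = @_;
    if    ( $title =~ /^(.*) Township, .* County, (?:$usstates)$/ ) { return "$1 Twp."; }
    elsif ( $title =~ /^(.*) Township, (?:$usstates)$/ )            { return "$1 Twp."; }
    elsif ( $title =~ /^(.*), (?:$usstates|$cdnterrprov)$/ )        { return $1; }
    elsif ( $title =~ /^(.*) \((?:$usstates)\)$/ )                  { return $1; }
    return $title;    # no rule matched: keep the title unchanged
}

for my $title ("Paris, Texas", "Ada Township, Kent County, Michigan", "Paris") {
    print "$title => ", shorten_label($title), "\n";
}
```

The order of the rules matters: the township rules have to come before the generic ", State" rule, otherwise "Ada Township, Kent County, Michigan" would be shortened only to "Ada Township, Kent County".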