Research talk:Are the bots really fighting/Work log/2017-03-07

From Meta, a Wikimedia project coordination wiki

Wednesday, March 7, 2017

Exploring the paper's dataset on figshare to see whether redirect pages are included. The dataset is a compressed, processed version of the revision history for each page. Each title is on a line by itself, followed by one line per revision, each starting with "^^^_". Each revision line has four fields: timestamp, is_revert, a sequential integer identifying a unique SHA1 revision, and username. There is no field that marks a page as a redirect. First ten lines of the enwiki dataset:

Herbert_Art_Gallery_and_Museum   
^^^_2011-10-28T12:51:21Z 0 2 MystBot
^^^_2011-10-28T12:28:07Z 0 1 Rock_drum
Waiting_to_Exhale   
^^^_2011-10-28T08:32:39Z 0 7 DJDunsie
^^^_2011-10-28T08:32:21Z 0 6 DJDunsie
^^^_2011-10-28T08:31:56Z 0 5 DJDunsie
^^^_2011-10-28T08:30:30Z 0 4 DJDunsie
^^^_2011-10-28T08:29:46Z 0 3 DJDunsie
^^^_2011-10-28T06:51:15Z 0 2 Luckas-bot
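
The format described above can be read with a few lines of awk. This is only a sketch, not the tooling used for the paper: the function name summarize_revisions is hypothetical, and it assumes that every revision line starts with "^^^_", that any other non-blank line is a title, and that fields are separated by whitespace as in the sample. It prints one tab-separated line per title: title, number of revisions, number of revisions flagged is_revert.

```shell
# Hypothetical helper (not from the paper): count revisions and reverts per title.
# Assumes revision lines start with "^^^_" and carry the fields
# timestamp, is_revert, SHA1 sequence number, username.
summarize_revisions() {
  awk '
    # Revision line: $1 is "^^^_<timestamp>", $2 is is_revert.
    index($0, "^^^_") == 1 {
      revisions[title]++
      if ($2 == 1) reverts[title]++
      next
    }
    # Any other non-blank line is a page title; strip trailing whitespace.
    NF > 0 {
      title = $0
      sub(/[ \t]+$/, "", title)
      if (!(title in seen)) { seen[title] = 1; order[++n] = title }
    }
    END {
      for (i = 1; i <= n; i++)
        printf "%s\t%d\t%d\n", order[i], revisions[order[i]], reverts[order[i]]
    }
  ' "$@"
}
```

It reads from standard input or from file arguments, for example `head -n 10 en_wiki.txt | summarize_revisions` on an unpacked copy of the file.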

I've been writing in detail about one particular case, the en::Japan–United_States_relations redirects, so I went looking for those articles in the dataset.

wget https://ndownloader.figshare.com/files/7442404 -O f.zip
mkdir data
mv f.zip data
unzip -j "data/f.zip" "all/ld_en_wiki.zip" -d "data/"
zipgrep -in Japan data/ld_en_wiki.zip | grep relations | grep -E "United_States|US|U\.S\.|United_states|American"

Results:

en_wiki.txt:18610348:Japan-United_States_relations
en_wiki.txt:18630262:United_States–Japan_relations
en_wiki.txt:26769703:U.S.-Japan_relations
en_wiki.txt:33589633:Japanese-American_relations
en_wiki.txt:97577821:Japan-United_states_relations
en_wiki.txt:97626835:Japan-US_relations
en_wiki.txt:97626870:US-Japan_relations
en_wiki.txt:97626885:American-Japanese_relations
en_wiki.txt:97628349:United_States-Japan_relations
en_wiki.txt:97629501:Japan_–_United_States_relations

The redirects are in the dataset.
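
To look at the revisions behind one of these titles, something like the following could be used. It is a sketch under assumptions: page_history is a hypothetical helper name, the member name en_wiki.txt is taken from the zipgrep output above, and titles are assumed to contain no backslashes (awk's -v option would interpret them as escapes).

```shell
# Hypothetical helper: print the title line and revision lines for one page.
# Usage: page_history TITLE [FILE...]   (reads standard input if no FILE is given)
page_history() {
  want=$1
  shift
  awk -v want="$want" '
    # A line that does not start with "^^^_" is a title (or blank):
    # switch printing on only when it matches the requested title.
    index($0, "^^^_") != 1 {
      t = $0
      sub(/[ \t]+$/, "", t)
      printing = (t == want)
    }
    printing
  ' "$@"
}
```

For example, `unzip -p data/ld_en_wiki.zip en_wiki.txt | page_history Japan-United_States_relations` streams the file out of the archive without unpacking it to disk.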

Staeiou (talk) 20:13, 7 March 2017 (UTC)