WSoR datasets/user first msg

From Meta, a Wikimedia project coordination wiki

Explain what the dataset is and what it is useful for. Make sure to *at least* include what you used it for.

Location[edit]

db42:staeiou.user_first_msg

mysql> describe staeiou.user_first_msg_final; select * from staeiou.user_first_msg_final limit 1000,6;
+-----------------------------+-----------------+------+-----+---------+-------+
| Field                       | Type            | Null | Key | Default | Extra |
+-----------------------------+-----------------+------+-----+---------+-------+
| talk_page_id                | int(6) signed | yes  | PRI | 0       |       |
| talk_page_title             | varbinary(143)  | NO   |     |         |       |
| recipient_user_id           | int(5) unsigned | YES  |     | 0       |       |
| recipient_name              | varbinary(143)  | YES  |     | NULL    |       |
| rev_comment                 | varbinary(143)  | YES  |     | NULL    |       |
| sender_user_id              | int(5) unsigned | YES  |     | 0       |       |
| sender_user_name            | varbinary(143)  | YES  |     |         |       |
| rev_timestamp               | datetime        | YES  |     | NULL    |       |
| recipient_user_registration | datetime        | YES  |     | NULL    |       |
| datediff_msg_reg            | int(7)          | YES  |     | NULL    |       |
| timediff_msg_reg            | time            | YES  |     | NULL    |       |
| user_reg_ymd                | varchar(10)     | YES  |     | NULL    |       |
| first_msg_ymd               | varchar(10)     | YES  |     | NULL    |       |
| bot                         | int(1)          | YES  |     | NULL    |       |
| self                        | int(1)          | YES  |     | NULL    |       |
| msg_type                    | varbinary(20)   | YES  |     | NULL    |       |
| msg_type_sub                | varbinary(20)   | YES  |     | NULL    |       |
| tool                        | varbinary(50)   | YES  |     | NULL    |       |
| recipient_edit_count        | int(11)         | YES  |     | NULL    |       |
| first_msg_year              | int(4)          | YES  |     | NULL    |       |
| block_date                  | datetime        | NO  |     | NULL    |       |
| is_anon                     | int(1)          | YES  |     | NULL    |       |
| edits_after_msg             | int(11)         | YES  |     | NULL    |       |
| edits_before_msg            | int(11)         | YES  |     | NULL    |       |
+-----------------------------+-----------------+------+-----+---------+-------+
24 rows in set (0.00 sec)

+--------------+-----------------+-------------------+----------------+---------------+----------------+------------------+---------------------+-----------------------------+------------------+------------------+--------------+---------------+------+------+----------+--------------+------+----------------------+----------------+------------+---------+-----------------+------------------+
| talk_page_id | talk_page_title | recipient_user_id | recipient_name | rev_comment   | sender_user_id | sender_user_name | rev_timestamp       | recipient_user_registration | datediff_msg_reg | timediff_msg_reg | user_reg_ymd | first_msg_ymd | bot  | self | msg_type | msg_type_sub | tool | recipient_edit_count | first_msg_year | block_date | is_anon | edits_after_msg | edits_before_msg |
+--------------+-----------------+-------------------+----------------+---------------+----------------+------------------+---------------------+-----------------------------+------------------+------------------+--------------+---------------+------+------+----------+--------------+------+----------------------+----------------+------------+---------+-----------------+------------------+
|       152291 | Wishu Burnstar  |              5814 | Wishu Burnstar | Greetings!    |             62 | Mav              | 2002-11-27 07:45:18 | 2002-11-27 04:47:55         |                0 | 02:57:23         | 2002-11-27   | 2002-11-27    |    0 |    0 | other    | NULL         | NULL |                   19 |           2002 | NULL       |       0 |            NULL |             NULL |
|       152730 | Talshiarr       |              5363 | Talshiarr      | Welcome!      |           1078 | Toby Bartels     | 2002-11-28 18:59:30 | 2002-11-09 08:34:11         |               19 | 466:25:19        | 2002-11-09   | 2002-11-28    |    0 |    0 | welcome  | NULL         | NULL |                  256 |           2002 | NULL       |       0 |            NULL |             NULL |
|       152927 | MartinHarper    |              5862 | MartinHarper   | experimenting |           5862 | MartinHarper     | 2002-11-29 10:56:11 | 2002-11-29 10:38:06         |                0 | 00:18:05         | 2002-11-29   | 2002-11-29    |    0 |    1 | other    | NULL         | self |                26089 |           2002 | NULL       |       0 |            NULL |             NULL |
+--------------+-----------------+-------------------+----------------+---------------+----------------+------------------+---------------------+-----------------------------+------------------+------------------+--------------+---------------+------+------+----------+--------------+------+----------------------+----------------+------------+---------+-----------------+------------------+
3 rows in set (0.00 sec)

Each row represents a user who has at least one live revision to their user talk page. There is a row in this table for the first message sent to each registered user.

  • talk_page_id: enwiki.page.page_id of the receiving user's talk page
  • talk_page_title: enwiki.page.page_title of the receiving user's talk page (is close to, often exactly the user's user_name)
  • recipient_user_id: enwiki.revision.rev_user of the editor who made the first edit to the user's talk page
  • recipient_name: enwiki.revision.rev_user_text of the editor who made the first edit to the user's talk page
  • rev_comment: enwiki.revision.rev_comment of the first edit to the user's talk page
  • sender_user_id: enwiki.revision.rev_user of the user who made the talk page edit
  • sender_user_name: enwiki.revision.rev_user_text of the user who made the talk page edit
  • rev_timestamp: enwiki.revision.rev_timestamp of the first edit to the user's talk page
  • recipient_user_registration: enwiki.user.user_registration date of the user whose talk page was edited
  • datediff_msg_reg: days between registration and first message, using DATEDIFF(rev_timestamp,user_registration); ~5000 are negative due to bad registration data pre-2004
  • timediff_msg_reg: time between registration and first message, using TIMEDIFF(rev_timestamp,user_registration); due to the variable type, values are out of bound for > 838 hours
  • user_reg_ymd: text field containing the year, month, and day of the user's registration (used for indexing)
  • first_msg_ymd: text field containing the year, month, and day of the user's first message (used for indexing)
  • bot: boolean value as to whether the user who left the first message was a bot (from halfak.bots?)
  • self: boolean value as to whether the first user_id who left the first message was the user_id who received the message (edits own talk page)
  • msg_type: message type; can be deletion, notification, talkback, warning, welcome
  • msg_type_sub: sub message type; can be blocked, csd, listing, prop, tag, level1, level2, level3, level4
  • tool: bot or automated tool used
  • recipient_edit_count: enwiki.user.user_editcount (approximate edit count of the user who received the message as of July 2011)
  • first_msg_year: year of the first message sent (for indexing)
  • block_date: enwiki.logging.log_timestamp of the first time the user who received the message was blocked (NULL if never been blocked)
  • is_anon: whether the username is an IP address
  • edits_after_msg: number of edits to all namespaces before the message was received
  • edits_before_msg: number of edits to all namespaces after the message was received

Reproduction[edit]

Pseudocode (will be expanded):

  1. Get every distinct user_name from enwiki.user
  2. Join enwiki.user.user_name with enwiki.page.page_title to get all user talk pages
  3. Join enwiki.revision.rev_page with enwiki.page.page_id to get all the first revisions to a user talk page
  4. Parse rev_comment to determine tool and message type (see section below)
  5. Parse rev_user and user_id to determine metadata

Parsing rev_comment for tool/bot used[edit]

  

UPDATE user_first_msg as ufm, halfak.bot_20110711 as b SET ufm.bot = b.user_id WHERE ufm.sender_user_id = b.user_id;  
UPDATE user_first_msg SET bot = 1 WHERE bot > 0;

UPDATE rev_table SET tool = 'huggle' WHERE rev_comment LIKE '%WP:HG%' OR rev_comment LIKE '%WP:HUGGLE%' OR rev_comment RLIKE "(Reverted ([0-9]+ )?edits by \[\[Special:Contributions/[^\|]+\|[^]]+\]\] \(\[\[User talk:[^\|]+\|talk\]\]\) to last version by .+)|(Message re\. \[\[[^]]+\]\])|(Level [0-9]+ warning re\. \[\[[^]]+\]\])"
UPDATE rev_table SET tool = 'twinkle' WHERE rev_comment LIKE '%WP:TW%';
UPDATE rev_table SET tool = 'friendly' WHERE rev_comment LIKE '%WP:FRIENDLY%' OR rev_comment LIKE '%WP:Friendly%';
UPDATE rev_table SET tool = 'vandalproof' WHERE rev_comment LIKE '%WP:VPRF%' OR rev_comment LIKE '%WP:VandalProof%' OR rev_comment LIKE '%VandalProof|VandalProof%' OR rev_comment LIKE '%WP:VP2%' OR rev_comment LIKE '%WP:VandalProof%';
UPDATE rev_table SET tool = 'stiki' WHERE rev_comment LIKE '%|STiki]]%';
UPDATE rev_table SET tool = 'npwatcher' WHERE rev_comment LIKE '%|NPWatcher%';
UPDATE rev_table SET tool = 'vandalsniper' WHERE rev_comment LIKE '%|VandalSniper%';
UPDATE rev_table SET tool = 'wikimonitor' WHERE rev_comment LIKE '%m:WikiMonitor%';
UPDATE rev_table SET tool = 'mwt' WHERE rev_comment LIKE '%MWT|MWT]]%';
UPDATE rev_table SET tool = 'awb' WHERE rev_comment LIKE '%AWB|AWB]]%' OR rev_comment LIKE '%AutoWikiBrowser%';

UPDATE rev_table SET tool = 'cluebot' WHERE lower(cast(rev_user_text as CHAR)) LOWER LIKE 'cluebot%';
UPDATE rev_table SET tool = 'antivandalbot' WHERE rev_user_text = 'AntiVandalBot';
UPDATE rev_table SET tool = 'orphanbot' WHERE rev_user_text = 'OrphanBot'; 
UPDATE rev_table SET tool = 'pseudobot' WHERE rev_user_text = 'PseudoBot';
UPDATE rev_table SET tool = 'voabot' WHERE rev_user_text LIKE 'VoABot%';
UPDATE rev_table SET tool = 'martinbot' WHERE rev_user_text = 'MartinBot';
UPDATE rev_table SET tool = 'stbot' WHERE rev_user_text LIKE 'STBot%';
UPDATE rev_table SET tool = 'sqlbot' WHERE rev_user_text = 'SQLbot';
UPDATE rev_table SET tool = 'tawkerbot' WHERE rev_user_text LIKE 'Tawkerbot%';
UPDATE rev_table SET tool = 'sinebot' WHERE rev_user_text = 'SineBot';
UPDATE rev_table SET tool = 'csdwarnbot' WHERE rev_user_text = 'CSDWarnBot';
UPDATE rev_table SET tool = 'antispambot' WHERE rev_user_text = 'AntiSpamBot';
UPDATE rev_table SET tool = 'imagetaggingbot' WHERE rev_user_text = 'ImageTaggingBot';
UPDATE rev_table SET tool = 'dashbot' WHERE rev_user_text = 'DASHBot';
UPDATE rev_table SET tool = 'bjbot' WHERE rev_user_text = 'BJBot';

Parsing rev_comment for message type[edit]

UPDATE rev_table SET msg_type = 'warning', msg_type_sub='level1' WHERE rev_comment LIKE 'Message re.%' AND tool = 'huggle';
UPDATE rev_table SET msg_type = 'warning', msg_type_sub='level2' WHERE rev_comment LIKE 'Level 2 warning%' AND tool = 'huggle';
UPDATE rev_table SET msg_type = 'warning', msg_type_sub='level3' WHERE rev_comment LIKE 'Level 3 warning%' AND tool = 'huggle';
UPDATE rev_table SET msg_type = 'warning', msg_type_sub='level4' WHERE rev_comment LIKE 'Level 4 warning%' AND tool = 'huggle';
 
UPDATE rev_table SET msg_type = 'warning', msg_type_sub='level1' WHERE rev_comment LIKE 'General note:%' AND tool = 'twinkle';
UPDATE rev_table SET msg_type = 'warning', msg_type_sub='level2' WHERE rev_comment LIKE 'Caution:%' AND tool = 'twinkle';
UPDATE rev_table SET msg_type = 'warning', msg_type_sub='level3' WHERE rev_comment LIKE 'Warning:%' AND tool = 'twinkle';
UPDATE rev_table SET msg_type = 'warning', msg_type_sub='level4' WHERE rev_comment LIKE 'Final warning:%' AND tool = 'twinkle';
 
UPDATE rev_table SET msg_type = 'deletion', msg_type_sub = 'csd' WHERE rev_comment LIKE 'Notification: speedy deletion%' AND tool = 'twinkle';
UPDATE rev_table SET msg_type = 'deletion', msg_type_sub = 'tag' WHERE rev_comment LIKE 'Notification: tagging for deletion%' AND tool = 'twinkle';
UPDATE rev_table SET msg_type = 'deletion', msg_type_sub = 'prop' WHERE rev_comment LIKE 'Notification: proposed deletion%' AND tool = 'twinkle';
UPDATE rev_table SET msg_type = 'notification', msg_type_sub = 'listing' WHERE rev_comment LIKE 'Notification: listing at%' AND tool = 'twinkle';
UPDATE rev_table SET msg_type = 'talkback' WHERE rev_comment LIKE 'Talkback%' AND tool = 'twinkle';
UPDATE rev_table SET msg_type = 'notification', msg_type_sub = 'blocked' WHERE rev_comment LIKE 'You have been%' AND tool = 'twinkle';
UPDATE rev_table SET msg_type = 'notification', msg_type_sub = 'blocked' WHERE rev_comment LIKE 'Your IP address has been blocked%' AND tool = 'twinkle';
UPDATE rev_table SET msg_type = 'welcome' WHERE rev_comment LIKE 'Added welcome%' AND tool = 'twinkle';
 
UPDATE rev_table SET msg_type = 'welcome' WHERE rev_comment LIKE 'Welcoming%' AND tool = 'huggle';
UPDATE rev_table SET msg_type = 'deletion', msg_type_sub = 'csd' WHERE rev_comment LIKE 'Notification: Speedy deletion%' AND tool = 'huggle';
UPDATE rev_table SET msg_type = 'deletion', msg_type_sub = 'prop' WHERE rev_comment LIKE 'Notification: Proposed deletion%' AND tool = 'huggle';
UPDATE rev_table SET msg_type = 'notification', msg_type_sub = 'blocked' WHERE rev_comment LIKE 'Notification: Blocked%' AND tool = 'huggle';

Notes[edit]

Is the dataset incomplete? Are there known issues? What kind of things should someone who uses this dataset be aware of?