
Research:MediaWiki events: a generalized public event datasource

From Meta, a Wikimedia project coordination wiki

This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.


Conceptual diagram of an event processing system for MediaWiki.

Wiki-tool builders and researchers rely on various sources of information about what has happened, and what is currently happening, in Wikipedia. These data sources tend to be structured differently and contain incomplete or poorly structured information. Some datasources are queryable but make it complex to "listen" to ongoing events, while others are intended only for "listening" to current events. In this project, we'll describe a common structure for public events in MediaWiki that mimics recentchanges but also contains historical information. We'll also explore means of implementing this functionality on top of existing datasources and propose infrastructure changes that would improve the efficiency and completeness of the data.

Events


Available datasources

  • API
    • list=recentchanges -- Returns a joined set of revision and logging events and does some parsing of event metadata.
  • MySQL db
    • recentchanges -- Sequences both revision and logging events.
    • revision -- Revision and page creation events.
    • logging -- All events other than revision and page creation events.
  • RCStream -- see https://wikitech.wikimedia.org/wiki/RCStream
  • IRC Stream -- see Research:Data#IRC_Feeds
  • EventLogging -- see mw:Extension:EventLogging
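
As a rough illustration of the API datasource, the sketch below builds a list=recentchanges request URL without sending it. The helper name `recentchanges_url` and the choice of the English Wikipedia endpoint are assumptions made for this example; the `rc*` parameters are ordinary query parameters of the MediaWiki action API.

```python
from urllib.parse import urlencode


def recentchanges_url(api_root, start, limit=50):
    """Build a list=recentchanges query URL (hypothetical helper)."""
    params = {
        "action": "query",
        "list": "recentchanges",
        "rcstart": start,   # timestamp to start enumerating from
        "rcdir": "newer",   # walk forward in time from rcstart
        "rcprop": "ids|title|timestamp|user|comment|loginfo",
        "rclimit": limit,
        "format": "json",
    }
    return api_root + "?" + urlencode(params)


url = recentchanges_url("https://en.wikipedia.org/w/api.php", "20140729000000")
print(url)
```

Asking for `rcdir=newer` makes the results run forward in time, which is the order a listener would want to process them in.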

Relevant events

  • RevisionSaved
fields
  • timestamp -- revision.rev_timestamp
  • user
    • id -- revision.rev_user
    • text -- revision.rev_user_text
  • comment -- revision.rev_comment
  • revision
    • rev_id -- revision.rev_id
    • parent_id -- revision.rev_parent_id
    • bytes -- revision.rev_len
    • sha1 -- revision.rev_sha1
    • page_id -- revision.rev_page
    • minor -- revision.rev_minor
    • text -- ...
  • RevisionsDeleted
fields
  • timestamp -- logging.log_timestamp
  • user
    • id -- logging.log_user
    • text -- logging.log_user_text
  • comment -- logging.log_comment
  • revision
    • rev_ids -- parse logging.log_params
  • PageCreated
fields
  • timestamp -- revision.rev_timestamp
  • user
    • id -- revision.rev_user
    • text -- revision.rev_user_text
  • comment -- revision.rev_comment
  • page
    • id -- page.page_id
    • namespace -- page.page_namespace
    • title -- page.page_title
  • PageMoved
fields
  • timestamp -- logging.log_timestamp
  • user
    • id -- logging.log_user
    • text -- logging.log_user_text
  • comment -- logging.log_comment
  • action -- logging.log_action ("move", "move_redir")
  • old
    • id -- logging.log_page (currently set to the wrong page_id, see bug 57084)
    • namespace -- logging.log_namespace
    • title -- logging.log_title
  • new
    • id -- logging.log_page (currently set to the wrong page_id, see bug 57084)
    • namespace -- parse logging.log_params
    • title -- parse logging.log_params
  • PageDeleted
fields
  • timestamp -- logging.log_timestamp
  • user
    • id -- logging.log_user
    • text -- logging.log_user_text
  • comment -- logging.log_comment
  • page
    • id -- logging.log_page (currently always set to zero. see bug 26122)
    • namespace -- logging.log_namespace
    • title -- logging.log_title
  • PageRestored
fields
  • timestamp -- logging.log_timestamp
  • user
    • id -- logging.log_user
    • text -- logging.log_user_text
  • comment -- logging.log_comment
  • old_page_id -- ???
  • page
    • id -- logging.log_page
    • namespace -- logging.log_namespace
    • title -- logging.log_title
  • PageProtectionModified
fields
  • timestamp -- logging.log_timestamp
  • user
    • id -- logging.log_user
    • text -- logging.log_user_text
  • comment -- logging.log_comment
  • page
    • id -- logging.log_page
    • namespace -- logging.log_namespace
    • title -- logging.log_title
  • action -- logging.log_action ("protect", "modify", "unprotect")
  • protection
    • action -- parse logging.log_params
    • group -- parse logging.log_params
    • expiration -- parse logging.log_params
  • UserRegistered
fields
  • timestamp -- logging.log_timestamp
  • user
    • id -- logging.log_user
    • text -- logging.log_user_text
  • comment -- logging.log_comment
  • action -- logging.log_action ("newusers", "create", "create2", "byemail", "autocreate")
  • newuser
    • id -- parse logging.log_params
    • text -- parse logging.log_title
  • UserRenamed
fields
  • timestamp -- logging.log_timestamp
  • user
    • id -- logging.log_user
    • text -- logging.log_user_text
  • comment -- logging.log_comment
  • old
    • id -- not available in log
    • text -- parse logging.log_params
  • new
    • id -- not available in log
    • text -- parse logging.log_params
  • UserRightsModified
fields
  • timestamp -- logging.log_timestamp
  • user
    • id -- logging.log_user
    • text -- logging.log_user_text
  • comment -- logging.log_comment
  • modified
    • id -- not available in log
    • text -- logging.log_title
  • old -- parse logging.log_params
  • new -- parse logging.log_params
  • UserBlocked
fields
  • timestamp -- logging.log_timestamp
  • user
    • id -- logging.log_user
    • text -- logging.log_user_text
  • comment -- logging.log_comment
  • block
    • flags -- parse logging.log_params
    • duration -- parse logging.log_params
    • expiration -- parse logging.log_params and infer from current timestamp (how does the API do it?)
  • UserUnblocked
fields
  • timestamp -- logging.log_timestamp
  • user
    • id -- logging.log_user
    • text -- logging.log_user_text
  • comment -- logging.log_comment
  • unblocked
    • id -- not available in log
    • name -- parse logging.log_title
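
To make the field mappings above more concrete, here is a minimal sketch of how a RevisionSaved event might be built from a row of the revision table. The class layout, the `from_row` helper and the sample row values are assumptions for illustration, not an existing implementation; the revision text field is left out because its source is not specified above.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class User:
    id: int
    text: str


@dataclass
class Revision:
    rev_id: int
    parent_id: int
    bytes: int
    sha1: str
    page_id: int
    minor: bool


@dataclass
class RevisionSaved:
    timestamp: str
    user: User
    comment: str
    revision: Revision

    @classmethod
    def from_row(cls, row):
        """Map a revision table row (given as a dict) onto the event fields."""
        return cls(
            timestamp=row["rev_timestamp"],
            user=User(id=row["rev_user"], text=row["rev_user_text"]),
            comment=row["rev_comment"],
            revision=Revision(
                rev_id=row["rev_id"],
                parent_id=row["rev_parent_id"],
                bytes=row["rev_len"],
                sha1=row["rev_sha1"],
                page_id=row["rev_page"],
                minor=bool(row["rev_minor"]),
            ),
        )


# Made-up row values, for illustration only
row = {
    "rev_timestamp": "20140729000000", "rev_user": 42,
    "rev_user_text": "ExampleUser", "rev_comment": "fix typo",
    "rev_id": 1001, "rev_parent_id": 1000, "rev_len": 2048,
    "rev_sha1": "examplesha1", "rev_page": 7, "rev_minor": 1,
}
event = RevisionSaved.from_row(row)
print(json.dumps(asdict(event), sort_keys=True))
```

The log-based events would need an extra step to parse logging.log_params, which this sketch does not attempt.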

Desired functionality


Listening

for event in mw_events.listen(start="20140729000000"):
    # dispatch on the type of each incoming event
    if isinstance(event, RevisionSaved):
        revision_saved = event
        # do thing with revision_saved
    elif isinstance(event, RevisionsDeleted):
        revisions_deleted = event
        # do thing with revisions_deleted
    else:
        # ignore all other event types
        pass
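
One way the listening behaviour could be approximated on top of a queryable datasource is by polling. The sketch below is only that: `fetch` is a hypothetical callable that returns (timestamp, event) pairs strictly newer than the timestamp it is given, oldest first, and the in-memory `log` stands in for a real datasource.

```python
import time
from itertools import islice


def listen(fetch, start, poll_interval=1.0, sleep=time.sleep):
    """Yield events indefinitely, resuming after the last timestamp seen.

    Assumes 14-digit MediaWiki timestamps, which sort correctly as strings.
    A real implementation would need a tie-breaker (for example a row id)
    for events that share a timestamp.
    """
    last_seen = start
    while True:
        batch = fetch(last_seen)
        for timestamp, event in batch:
            last_seen = timestamp
            yield event
        if not batch:
            # nothing new yet: wait before polling again
            sleep(poll_interval)


# Stand-in datasource with made-up events
log = [("20140729000001", "first event"), ("20140729000005", "second event")]


def fetch(after):
    return [(ts, ev) for ts, ev in log if ts > after]


first_two = list(islice(listen(fetch, start="20140729000000"), 2))
print(first_two)
```

Because the generator never ends on its own, the example takes only the first two events with `islice`.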

Querying

events = mw_events.query(start="20140729000000", end="20140731000000", types={RevisionSaved})
for revision_saved in events:
    # do thing with revision_saved
    pass
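
The intended semantics of such a query can be sketched as a filter over an iterable of events. This is an assumption-laden illustration: the stand-in event classes are defined only for this example, `start` is treated as inclusive and `end` as exclusive, and timestamps are compared as 14-digit strings.

```python
from dataclasses import dataclass


@dataclass
class Event:
    timestamp: str


class RevisionSaved(Event):
    pass


class UserRegistered(Event):
    pass


def query(events, start, end, types=None):
    """Yield events with start <= timestamp < end, optionally limited by type."""
    for event in events:
        if not (start <= event.timestamp < end):
            continue
        if types is not None and not isinstance(event, tuple(types)):
            continue
        yield event


# Made-up events, for illustration only
history = [
    RevisionSaved("20140728235959"),
    RevisionSaved("20140729120000"),
    UserRegistered("20140730000000"),
    RevisionSaved("20140731000000"),
]
matches = list(
    query(history, "20140729000000", "20140731000000", types={RevisionSaved})
)
print(matches)
```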

Dumps

mw_event_reader = MWEventReader("event_dump.enwiki.1.json.7z")
for user_registered in mw_event_reader.filter(types={UserRegistered}):
    # do thing with user_registered
    pass
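
The dump format is not specified here, so the following sketch assumes one JSON document per line with a `type` key naming the event. It reads from any iterable of text lines and leaves 7z decompression aside, since that needs a third-party tool; `read_events` is a hypothetical name.

```python
import io
import json


def read_events(lines, types=None):
    """Yield decoded event documents, optionally limited to the given types."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        doc = json.loads(line)
        if types is None or doc.get("type") in types:
            yield doc


# Made-up dump content, for illustration only
dump = io.StringIO(
    '{"type": "UserRegistered", "timestamp": "20140729000000"}\n'
    '{"type": "RevisionSaved", "timestamp": "20140729000001"}\n'
)
registered = list(read_events(dump, types={"UserRegistered"}))
print(registered)
```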

Relevant bugs

  • T28122 No way to get the ID of a deleted page from deletion logs
  • T59084 Store the page_id of the moved page in log_page
  • T71005 Add a list=recentchanges result property for title without namespace

Standardization

MediaWiki events
  • consolidates domain knowledge and wiki archaeology
  • hides complexity -- produces standardized data structures
  • reads from the MySQL database and api.php; extensible to new formats
  • produces JSON
  • provides a special Unavailable datatype to flag critical data that is not currently available

Support needed

  • DBAs at the Wikimedia Foundation to explore means of publishing EventLogging infrastructure
  • Developers working in languages other than Python to discuss cross-language API similarities




