Wikimania05/Workshop-AE1
- One of the few workshops with full papers!
Workshop - Python Wikipediabot Framework
- Leader(s): Andre Engels
- License: GFDL
- Language: English
- Slides: None
About the leader(s): Andre Engels comes from the Netherlands. He holds an MA in mathematics and a PhD in computer science, but is nevertheless jobless. He was one of the early birds in Wikipedia, starting work there in March 2001 and never getting rid of it. At one time he was the second-most-active Wikipedian (after Daniel Mayer), and when all languages are added up he might still come close. He is also a programmer for the Python Wikipediabot framework and operator of the bot Robbot, which easily holds the record for having main-namespace edits in the largest number of Wikipedia languages.
Abstract: The Python Wikipediabot Framework is a collection of tools that can be used to edit Wikimedia pages automatically or semi-automatically. It is written in the language Python. The workshop starts with a short overview of bots in general and the advantages and disadvantages of using them, then zooms in on the Python Wikipediabot Framework, its history and some hints of its inner workings. The main emphasis, however, will be on actually using the bot: it will be shown what is needed to get the bot to work.
Bots
Bots have been active in the wiki community for some time now, making various edits to wiki pages. In this article we will show some things about the Python Wikipediabot Framework [1].
The word 'bot' is derived from 'robot'. Just as a robot is a machine that looks like a human, a bot, in the meaning I want to use here, is a program that looks like a human visitor to a site. More technically, I would define a bot as anything that contacts a website (in this case a wiki) over the internet using HTTP, but is not a browser.
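To make this concrete: even a few lines of Python already form a bot in this sense. The following sketch (the URL is only an example) fetches the raw contents of a wiki page over HTTP, just as a browser would, without being one.

# fetch_sketch.py - a trivial 'bot' in the above sense (read-only)
import urllib

# Any program that speaks HTTP to the wiki counts as a bot here.
f = urllib.urlopen('http://en.wikipedia.org/wiki/House')
print f.read()   # the raw HTML, as a browser would receive it
f.close()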
Such bots can be divided into two groups. The first group consists of bots that only receive material from the website. They usually read HTML and do something with it: analyzing it, saving it, or both. The most famous of these are undoubtedly the bots of the search engines, like GoogleBot [2] for Google. They load large numbers of sites and pages, and put (information from) them into a database, which is then used for searching purposes.
The rest of this article will deal with bots of the second kind: bots that not only read from the site, but also make changes to wiki pages. The best known of these, and the one that will mainly be discussed in this article, is the Python Wikipediabot Framework. As the name says, it is not a single bot but a framework, that is, a set of bots partly sharing the same code, written in Python [3,4,5]. It was originally written for Wikipedia [6], but can nowadays be used on any MediaWiki-based [7] wiki, with only a small specialized file needed.
Functions
Bots have several advantages over normal human editing alone, but there are some disadvantages too. We will first mention the advantages:
- Speed. The most important advantage of bots is that they work much faster than humans. Checking all interwiki links on the English Wikipedia by hand could easily cost a year of full-time work. With the bots, the author did so in two weeks, and for most of that time it was just his computer doing the work; the human effort amounted to only a few hours.
- Accuracy. A bot does not make typos and will not make copying errors. Bots are therefore better at copying and straightforwardly manipulating data.
- Boring and repetitive tasks. Much work on a wiki, for example resolving disambiguations, is repetitive and monotonous. Other work, like writing new articles, is intellectually much more stimulating. As luck would have it, it is exactly the monotonous tasks that bots are strong at. So why not give those to the bots and let the 'real' users go on to the more interesting tasks?
- Uniformity. When bots are used to do the same thing multiple times, they will always do it in the same way, and thus give more uniformity.
In short, bots enable us to carry out, faster and with at least as much accuracy, various tasks that would otherwise be done only partly because they are tedious and cost too much time.
Still, the usage of bots also has a number of disadvantages. Personally I think the advantages well outweigh the disadvantages, but it is nevertheless important to know the latter as well, so that we can take steps to avoid them or diminish their effects.
- Programming errors. The largest problem is probably programming errors. A bot with a programming error might do more harm than good to the pages it edits, up to deleting large chunks of text. Of course, human editors are fallible too, but bots can make many more edits in the same time, and thus repeat the same error over and over again. Also, bots are generally assumed to make small, inconsequential changes, so their mistakes may not be noticed as easily. On the other hand, bots that make errors usually do so in a predictable way, so once an error has been noticed, it can be corrected relatively fast.
- Clueless operators. Starting to use a bot is in some ways like starting to be an editor: one is full of good intentions, but also quite likely to make mistakes. Again, the main problem is that a bot makes it possible to do anything faster and more efficiently - including making errors.
- Vandalbots and spambots. Bots can be used by anyone, and that does not only include well-intentioned people; vandals might use bots too. In fact, there have been a few large-scale vandalism attacks whose scale, speed and uniformity point towards the use of bots by vandals or spammers. As far as is known, the framework has not been abused for this, but with the code out in the open it is not unlikely that that will happen one day.
- Clutter. Bots are at their best when they have to edit or create many pages in a similar way. However, this also means that a bot makes many edits, and thus could take up a large portion of the recent changes. This has been lessened by marking edits as minor, and especially by using bot accounts (whose edits do not show up on recent changes except on specific request), but the problem still exists, because similar cluttering of watchlists has not been resolved.
Bots can have various functions. In fact, one can do anything with a bot that one can do with a normal, browser-aided, human edit. Still, there are some things that they are good at, and some things they are not good at.
Things bots are good at:
- Making the same or similar changes to a large number of pages
- Importing data from structured data sets to create a uniform set of pages
Things that bots are not as good at:
- Changes that have to be done only once to a relatively small set of pages. Writing a bot for these cases is usually more work than making the same change by hand
Things that a bot cannot do without aid:
- Importing non-structured data for which direct copying would be a copyright violation
- Anything that requires grammatical rather than purely syntactical analysis
More specifically, the Python Wikipediabot Framework has been used for the following purposes, without striving for completeness:
- Adding and updating interwiki links
- Solving disambiguation problems
- Changing HTML syntax to Wiki-syntax
- Adding and reworking categories
- Finding and uploading images
- Resolving double and broken redirects
- Replacing pieces of text
- Spell-checking
The Python Wikipediabot Framework
The word 'framework' in the name 'Python Wikipediabot Framework' has not been chosen without reason. Rather than a single bot, or a simple collection of bots, there is a network of files and programs: some of them are bots, others are called by the bots for various functions. It is programmed in Python.
The heart of the framework is formed by a library file called wikipedia.py (the ".py" says that it is a Python program; the name "wikipedia" dates back to the time when the bots were used only on Wikipedia). It loads the content of wiki pages (using the edit page to do so), keeps them in memory so that they do not need to be loaded again if their content is checked twice, and saves pages. It also has functionality to load many pages at once, using the MediaWiki export option [8], can analyze wiki text to find interwiki links, categories and more, and reads special pages such as Allpages. With the help of the classes, methods and functions from wikipedia.py, building a new bot can be reduced to an issue of text editing.
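To give an impression, a minimal bot built on wikipedia.py could look roughly like the sketch below. It is only a sketch: names such as Page, getSite(), get(), put() and stopme() are assumptions based on how the framework's code is commonly used, and may differ in the version you download.

# minimal_bot.py - an illustrative sketch, not part of the framework
import wikipedia   # the framework's core library, not the website

def main():
    # A Page object for a page on the home wiki (names assumed, see above)
    page = wikipedia.Page(wikipedia.getSite(), u'Sandbox')
    text = page.get()                      # load the current wikitext
    text = text + u'\n<!-- test edit -->'  # change it in some way
    page.put(text, u'Bot: test edit')      # save it with an edit summary

try:
    main()
finally:
    wikipedia.stopme()   # assumed cleanup call used by the framework's bots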
For more specialized functions, there are two more libraries: catlib.py deals with categories, and lib_images.py can upload images. Another important module is family.py, which ensures that project- or language-dependent settings are used correctly.
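As a rough, hypothetical illustration of the kind of task catlib.py serves (the Category class and its articles() method are assumptions for the sake of the example; the real interface has varied between versions):

# category_sketch.py - hypothetical use of catlib, for illustration only
import wikipedia
import catlib

site = wikipedia.getSite()                        # assumed helper, see above
cat = catlib.Category(site, u'Category:Physics')  # assumed class name
for page in cat.articles():                       # assumed: pages in the category
    print page.title()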
History
The first work on the framework was done by Rob Hooft, who wrote the first versions of the interwiki bot in the summer of 2003. Later that year, around August, the current author joined as the second user/coder, and in September solve_disambiguation.py was included as a second bot.
In October 2003, the project became an open-source project on SourceForge, after some deliberation over the advantages (making the bots more useful for other interested parties, both users and coders) and disadvantages (the framework might be used for vandalbots). New coders soon joined; for example, Tom K (TomK32@de) wrote table2wiki.py, a bot to translate HTML tables into wiki tables, in December.
Since then, the project has been growing and blooming, with new code appearing all the time, which can roughly be divided into the following categories:
- new bots
- added features to existing bots
- bugfixes
- adaptations to changes in the MediaWiki code and the Wikimedia sites
- extended object orientation
Some of the main highlights are:
- Usage of the export feature to load several pages at once (December 2003)
- Addition of code to use the bot on Wikitravel and more generally on any MediaWiki site (April 2004)
- Addition of the possibility to edit on several languages at once (July 2005)
Although Rob Hooft originally started the project, work and family have since caused him to be less active on Wikipedia and the bot; the main coders of the project at the moment are Daniel Herding (Head from de:) and Andre Engels. Several more coders help out in smaller ways, of whom the following deserve mention: Leonardo Gregianin (LeonardoG@pt), Yuri Astrakhan (Yurik@ru), Ashar Voultoiz (Hashar@fr), Gerrit Holl (gerritholl@nl, and within the pywikipediabot team the Python expert), Thomas R. Koll (TomK32@de) and Rob Hooft himself.
There is also a much larger group of users who do not code, or only code privately. Across the various Wikipedia languages this now amounts to a few dozen people, several of whom help improve the framework by issuing bug reports, feature requests and the like.
Using the Framework
Preparations
Now we finally get to the main subject of this article, namely the user guide for the bot.
To use the bot, you will need two sets of programs:
- Python
- The framework itself
Python can be downloaded at http://www.python.org/. The bot is written mostly in Python 2.3, but it should work in all versions from 2.1 onwards. Download Python and install it. The bot itself can be found at http://sourceforge.net/projects/pywikipediabot/. The packaged release there might be outdated, however; a more recent version (usually about 24 hours old) can be found at http://cvs.sourceforge.net/viewcvs.py/pywikipediabot/pywikipedia/, although this has the disadvantage that each file has to be downloaded separately. Once you have unpacked the bot files into some directory, you're ready.
The next thing to do is to create a bot account on your local wiki. Then, in the same directory as the bot files, create a file user-config.py with the following contents:
mylang = 'xx'
usernames['family']['xx'] = 'MyBot'
Here xx is the language code for your language, 'family' is your family - usually this will be 'wikipedia', but it can also be 'wiktionary', 'wikitravel', 'wikiquote' etcetera - and MyBot is the user name of your bot. If you are on another family, you will also need to add a line:
family = 'family'
You can also change some other settings in your user-config.py. The one you are most likely to want to change is put_throttle. This is the minimum amount of time between two edits, so that your bot will not clutter the recent changes. The default value is 60 seconds, but it can be less, in particular if you have a registered bot. You can change it by adding a line like
put_throttle = 20
to your user-config.py (the number is the minimum number of seconds of waiting time).
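Putting the pieces together, a complete user-config.py for a bot called MyBot on, say, the Dutch Wikipedia might look like this (the values are examples only):

mylang = 'nl'
family = 'wikipedia'   # may be omitted, as 'wikipedia' is the default
usernames['wikipedia']['nl'] = 'MyBot'
put_throttle = 20      # minimum number of seconds between two saves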
We are almost there; all that needs to be done now is to log in your bot. It might be possible to edit without being logged in, but even if it is, it is considered bad style to edit anonymously with a bot. To log in, open a command window. In Windows XP this is done by choosing 'Run' in the start menu and then entering "cmd.exe" as the program to open. Using cd (change directory), go to the directory where you have downloaded the bot. Then type
login.py
You will be asked to provide a password. Type in the bot's password as you defined it previously. If you did it correctly, the bot will answer 'Should be logged in now.' If so, your bot is ready to go!
Luckily, most of the steps above will not have to be done often. Python, once it has been installed, need not be installed again; your user-config.py has to be edited only rarely; and you will normally remain logged in (although there are exceptions). However, you might need to re-download the bot to get the latest updates, bugfixes and adaptations to changes in the MediaWiki software or the Wikimedia sites.
The Interwiki Bot
Getting started
The oldest, best known and most intensively developed bot is the interwiki bot, so it is this bot that we will discuss first. Its main function is to complete the interwiki links on pages, but as we shall see, it can also be used on pages that do not have any interwiki links yet.
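For reference, interwiki links are ordinary wiki links with a language prefix, normally placed at the bottom of the page source. The English page [[House]], for example, might end with lines like these (the titles are only illustrative):

[[de:Haus]]
[[fr:Maison]]
[[nl:Huis]]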
To start the bot, open a command window and go to the bot directory again (see the previous chapter). Now type
interwiki.py
You will be asked Which page to check:
Fill in the title of a page on your home wiki, and press enter.
Now the bot will start working. It will load the page you chose, read it and find its interwiki links. Those linked pages will also be loaded and checked for interwiki links, etcetera. You will see that the bot is getting one page from one wiki after the other, maybe also more than one page from the same wiki at once, until it has found all directly and indirectly linked pages, or until there is some exception we will discuss later. When a page is a redirect, the bot will notice it, and check the page that is being redirected to. When there are no more pages to check, one of the following is the case:
- The bot has found exactly the pages that already are on the page you were checking. The program will end, but before it does so, it shows you which links are missing, if any, from the other pages it found.
- The bot has found some pages in languages that were not yet linked to from the starting page, or it has found that some page has been moved. The bot will then automatically update the page to include the new links.
- The bot has found that some of the interwiki links on the page are to pages that do not exist. It will ask whether it should remove those links. If you say yes, it will do so, if you say no, the program will end without changing the page.
- There is an 'interwiki conflict', that is, the bot found more than one link to at least one other language. It will now ask you which one to choose. First, it gives the languages with more than one link. For each page, you see the title of the page and which pages are linking to it. For each language with more than one page found, you can choose between:
- A number: Choose that option to link to. Often there are only two choices, but there have been, and perhaps still are, convoluted cases where up to about ten different alternatives were offered.
- N (none): Declare that none of the options is correct
- G (give up): If you don't know what to do any more, you can choose this option. This will end the questioning and leave the page unchanged.
- After the languages with more than one link have been dealt with, the languages with one link follow. Here the options are:
- A (accept): Include this link
- R (reject): Do not include this link
- G (give up): Same meaning as before
- L (all): Accept this link and all links that follow it.
If you do get an interwiki conflict, it might be a good idea to check where the error is (which page or pages contain an incorrect link), and correct the page in that language or those languages, so that those who come after you do not have to deal with the problem again.
The exception I mentioned before is that the bot checks whether each page is a disambiguation page or not. This is determined by the presence or absence of (language-dependent) disambiguation templates. If there is a link from a disambiguation page to a non-disambiguation page or vice versa, the bot will ask whether it should include that page. If a disambiguation page is found from a non-disambiguation page, the option to 'add a hint' is also given; if this is chosen, you can type in the name of another page to check - one of the disambiguation hints.
Before going on to the more complicated options, I want to note that you can specify the page you want to work on directly with the command, like this:
interwiki.py pagename
Now you will not be asked for a page to work on; the bot will work on the page [[pagename]].
Working on several pages at once
The bot has the option to load several pages from the same language at once, up to 60 pages. To put that to use, one would want to work on more than one page at a time. There are several possibilities for this.
The most usual one is to go through the pages from Special:Allpages. This is done with the option -start. If you give the command:
interwiki.py -start:Pagename
then the bot will use Special:Allpages to get pages starting at "Pagename", loading pages from your home wiki in groups until it has at least 100 of them, and then work on all of these at once. When a page from the set is finished, it is dealt with as described above. If at some point the group of pages being worked on drops below 100, a new set of pages is loaded from the home language. If you want to go through all pages, choose the page name ! - that is the first character we know of in the MySQL alphabetical order.
Running the interwiki bot on a complete language will take a lot of time - not minutes or hours, but many days. This means you will probably want to do it in parts. For this purpose, every time the bot is stopped, whether through a crash of the software or through control-C, it creates a list of the pages it was working on. When such a list has been made, you can continue working the next day with:
interwiki.py -continue
This will run the bot first on the pages it was working on last time, and then continue from the last page of the list. A similar option is:
interwiki.py -restore
This will run the bot on the pages it was working on, and then end.
You might want to run the bot faster still, perhaps in the background while you are doing something else (like working on the wiki yourself). For this there is the option -autonomous. If you run the bot with the option -autonomous, it will not ask you when it encounters a problem (removing pages or interwiki links, disambiguation/non-disambiguation links), but will skip those pages. A file autonomous-problems.dat will be created in which all such problematic cases are listed with the nature of their problems, so you can handle them separately later if you like.
Another way of doing more pages at once is the option -file. By typing:
interwiki.py -file:pages.txt
the interwiki bot will be run on the pages given in the file pages.txt. The file should look like:
[[Pagename]]
[[Another page]]
[[This page too]]
[[Page]]
Etcetera. Such a file can also be made from an HTML page using the bot extract_wikilinks.py. This way you can for example work with the pages on Special:Newpages. Save that page as pages.htm, and then run
extract_wikilinks.py pages.htm > pages.txt
Of course that's mostly useful when you use the hints that are explained in the next section.
Using hints
We have seen how the bot can be used to find extra interwiki links from the existing ones. But what can you do if there are no interwiki links yet? Then you can use so-called hints. Hints are given in the following way:
interwiki.py House -hint:fr:Maison -hint:de:Haus
This means that the bot will work on the page [[House]], and will check the pages [[fr:Maison]] and [[de:Haus]] as if they were interwiki links on that page. But there are extra possibilities. Apart from the 'normal' languages, there are also extra hints, which give a number of languages rather than just one:
- A number of languages separated by commas
- 10 or 20 or 30 or 50: That number of the largest languages
- all: All languages with at least ca. 100 pages, which at the moment is about 100 languages on Wikipedia
- cyril: Languages written in Cyrillic
Note that these hints, except the first type, only work on Wikipedia and Wiktionary.
Furthermore, if the hinted page has the same title as the page itself, you can leave the title out of the hint. Thus, to go through almost the whole of Wikipedia looking for pages on Albert Einstein, you can use:
interwiki.py Albert Einstein -hint:all
If you are working on several pages, such a hint is of course not very useful - it is rarely the case that all those pages can take the same hint. But there are possibilities for these cases too:
- -askhints
- Ask for hints on all pages
- -untranslated
- Ask for hints only on pages that do not have interwiki yet
- -untranslatedonly
- Like -untranslated, but pages with interwiki are skipped completely
If you use one of these, you are asked to give one or more hints for each page you get. These are just like the hints after -hint, except that you need to include the final : when giving an 'empty hint'. Thus, you can give hints such as:
- en,fr:Accepte
- 50:London
- de,id,csb:
Instead of giving a hint, you can also type "?", which shows you the beginning of the text of the page. Typing "?" repeatedly gives ever longer pieces of the text.
Other options
If in your user-config.py you have specified the login name for more than one language, the bot will edit in all those languages, not just your home language.
Apart from the options already mentioned above, there are several more command line options:
- -always
- Do the edit with every little change, not just when a link is added/removed/changed, but also if there's a small change in the layout
- -array:#
- (# is a number) - work on # pages at once; the default is 100
- -confirm
- Always ask for confirmation before doing a change, not just if there is a problem
- -days
- Run on the 365 dates of the year.
- -force
- Do not ask when making 'controversial' changes (such as removing a link), but just do it
- -name
- Like -same (below), but the last word is UPPERCASE on eo:
- -neverlink:xx
- (xx is a language code) - skip all links to language xx:
- -noauto
- Automatic translation is done for years and dates. This option switches off that automatic translation.
- -nobacklink
- Do not give warnings for missing links on other wikis
- -noredirect
- Do not follow redirects, but treat redirects similar to non-existing pages
- -noshownew
- Do not write on the screen when a new link has been found
- -number:#
- (# is a number) - check exactly # pages, then stop.
- -same
- Outdated equivalent of -hint:all:
- -select
- Always ask of each page whether it should be included, not just if there is some interwiki conflict
- -shownew
- When asked for hints, show the beginning of the page as if "?" has already been selected once
- -skipfile:filename.txt
- On a -start run, skip all links in the file filename.txt
- -wiktionary
- Meant for use on Wiktionary: only link to exactly the same word. It also handles capitalized/uncapitalized differences more precisely. It has an implicit "-hint:all", but uncapitalized words are only hinted to uncapitalized wiktionaries
- -years
- Run on all years in numerical order. Default is to run from 1 to 2050. You can start at another year with -years:# where # is a number. Negative numbers are also possible.
An example of a useful combination of options would be:
interwiki.py -start:! -untranslatedonly -array:20 -select
The -untranslatedonly option here is the reason for the -array:20 and the -select: if only untranslated pages are chosen, and you had to give hints for 100 pages before the bot starts, you could easily spend your whole working time giving hints, so you work on fewer pages at once. And since a hint is easily given wrongly, you choose -select so that you can check whether the pages found are really correct before the interwiki links are created.
Disambiguation
Another bot is solve_disambiguation.py. It is used to resolve links to disambiguation pages.
Again we get a command window and go to the bot directory. Now we type:
solve_disambiguation.py
The bot will ask which page we want to check. Give the name of a disambiguation page on your wiki. The bot will load that page and find the links on it; it notes these down as the 'possible disambiguations'. Next, the bot will check which pages link to the disambiguation page. It might ignore one or more of these because they are pages that often have 'correct' links to disambiguation pages, and then loads the rest of the linking pages (in groups of 20).
For each page, you get to see the title of the page and the text around the occurrence of the link to the disambiguation page. The bot then gives you the following options:
- A number: This corresponds to the numbers in the list of possible disambiguations. Disambiguate using that disambiguation (that is, let the text link to that specific page instead of the disambiguation page)
- R followed by a number: Change not only the link, but also the link text, to the title of the chosen page. I have found this useful, for example, on the English disambiguation page 'Columbia', which is sometimes linked when 'Colombia' is meant.
- A: Add new: Give a new possible disambiguation
- E: Edit the page by hand: A simple editor will pop up, on which you can edit the page just like a normal wiki page
- U: Unlink: Make the text non-linking.
- S: Skip link: Do not change the link
- N: Next page: Do not change this page at all
- Q: Quit: Stop working
- M: More context: Give more of the text from the page around the link
- L: List: Show the list of possible disambiguations again.
You can also work on a series of pages, using the -file option. However, this time the file should look like:
- Pagename
- Another page
- This page too
- Page
If you want to get them from a web page, use the bot extract_names.py rather than extract_wikilinks.py. It would be logical to build such a file from a list or category of disambiguation pages. Note that if you are using the -file option, the choice "Q" will move on to the next disambiguation page in the list rather than stop the program completely.
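For example, assuming extract_names.py takes the same arguments as extract_wikilinks.py shown earlier, you could save a category page such as [[Category:Disambiguation]] as disambig.htm and run:

extract_names.py disambig.htm > pages.txt
solve_disambiguation.py -file:pages.txt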
solve_disambiguation.py can also be used if you want pages that link to a redirect to link to the correct page directly. In this case, add the option "-redir", and give the redirecting page as the page you are working on. Other options for this bot are:
- -always:x
- Do not ask the user, but always choose action "x". A dangerous option in most cases!
- -file:xxx.txt
- Work on all pages in the file xxx.txt. Pages are given by simple names. You can get them from (for example) [[Category:Disambiguation]] using extract_names.py.
- -main
- Only work on pages in the main namespace, not in other namespaces
- -pos:xxxx
- Add 'xxxx' as a possible disambiguation
- -just
- Use only the disambiguations given through -pos (or with the 'add new' option during the run), not the links on the disambiguation page. If you want to use a non-existing page instead of the disambiguation page, the -just option is obligatory.
- -primary
- For primary topic disambiguation, that is, the case where there is a [[Subject (disambiguation)]] page. Give 'Subject' as the page you are working on. The bot will remember which pages you have been working on, so you will not get asked again if you run the bot with this option again. "-primary -just -pos:XXX" can be abbreviated to "-primary:XXX"
Other bots
There are many more bots with various functions, and you might also want to build your own. These subjects are not discussed in this talk, but can be brought up during the questions, or by asking me either during the conference or later by email (andreengels@gmail.com), ICQ (#6260644), Skype (a_engels) or IRC (engels, if I happen to be on).
References
1. Engels, Andre, and others: Using the python wikipediabot. http://meta.wikimedia.org/wiki/Using_the_python_wikipediabot
2. Google: Googlebot: Google's Web Crawler. http://www.google.com/bot.html
3. Rossum, Guido van, and Drake, Fred L., Jr.: The Python Language Reference Manual. Bristol: Network Theory Ltd. (2003)
4. Rossum, Guido van, and Drake, Fred L., Jr.: An Introduction to Python. Bristol: Network Theory Ltd. (2003)
5. Martelli, Alex: Python in a Nutshell. Sebastopol, CA: O'Reilly (2003)
6. Wikipedia. http://www.wikipedia.org
7. Wikipedia: MediaWiki. http://en.wikipedia.org/wiki/MediaWiki
8. Patrick, and Vibber, Brion: Export and import. http://meta.wikimedia.org/wiki/Help:Export_and_import