Talk:List of Wikipedias by sample of articles/Source code (original)

From Meta, a Wikimedia project coordination wiki

Modified source

Suggestions:

  • include .encode('cp437','replace') whenever printing to console to avoid errors
  • optimize by caching English pages
  • remove interwiki text for article length calculation
  • weight text length
  • color code score

--MarsRover 11:05, 2 December 2007 (UTC)
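
A minimal sketch of the first suggestion above, assuming the aim is to keep console output from raising encoding errors. The helper name `console_safe` and the sample title are hypothetical. Under Python 2 the suggestion applies directly as `print s.encode('cp437','replace')`; the version below decodes the bytes again so that it also runs under Python 3.

```python
import sys

def console_safe(text, encoding='cp437'):
    """Replace characters that `encoding` cannot represent with '?'."""
    return text.encode(encoding, 'replace').decode(encoding)

# Hypothetical article title mixing Latin, punctuation and CJK characters.
title = u'Volap\u00fck \u2013 \u4e2d\u6587'

# cp437 has the u-umlaut but no en dash or CJK characters, so those become '?'.
cp437_title = console_safe(title)

# Use whatever encoding the current console reports, or ASCII if it is unknown.
console_encoding = getattr(sys.stdout, 'encoding', None) or 'ascii'
print(console_safe(title, console_encoding))
```

Hard-coding `cp437` only suits the classic Windows console; asking `sys.stdout` for its encoding, as in the last two lines, is one way to avoid that assumption.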

Modifying source

I was looking at modifying this program for my own use (namely, directing it towards a different page: for example, Vital Articles / Extended, a specific wikiproject's topic list, or a specific topic outline's list). Who would be the right person to ask about doing this? Almafeta 05:50, 1 October 2009 (UTC)

Smeira is the original author, but he has been missing for a couple of years. I could probably help. I've been working on code to create an extended article list (see below). It may need some tweaking for your needs, but it can read from the lists you mentioned. --MarsRover 07:36, 1 October 2009 (UTC)
I've been working on that (apparently my installation of Python had... issues), and finally have it working with the groups I'm interested in. Thank you. =)
Also, it's too bad Smeira's gone... it occurs to me that the original was probably the most significant piece of code ever written in Volapük. Almafeta 16:49, 26 October 2009 (UTC)

GetExtendedArticleList.py

# -*- coding: utf_8 -*-
import sys
 
sys.path.append('./pywikipedia')
 
import wikipedia
import pagegenerators
import re
 
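# A list entry: list markers (group 1), optional bold/italic quotes, a [[link]]
# (group 4) and, optionally, a second [[link]] (group 7), possibly in parentheses.
# Note: inside [...] the '|' is a literal character, not alternation.
entry_re = re.compile(r"([\*|#]+)(\s*)('*)\[\[([^\]]+)\]\](\s*)\(?(\[\[([^\]]+)\]\])?\)?")
# A link target: optional ':lang:' prefix (group 2), page name (group 3),
# optional '|alias' (group 5).
link_re  = re.compile(r'(:?([a-z\-]+):)?([^\]\|:]+)(\|([^\]]+))?')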
 
def parseEntry(line):
    m = entry_re.search(line)
    if m:
        return {'name':m.group(4),'sibling':m.group(7),'indent':len(m.group(1)),'span':m.span()}
 
def parseLink(link, wiki_name):
    m = link_re.search(link)
    if m:
        if m.group(2):
            linkWiki = m.group(2)
        else:
            linkWiki = wiki_name
        return {'wiki':linkWiki,'name':m.group(3),'alias':m.group(5)}
 
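def findAll(text, parseFunction):
    # Repeatedly apply parseFunction to the not-yet-parsed remainder of text.
    # parseFunction must return None or a dict with a 'span' key (as parseEntry does).
    return_list = []
    pos  = 0
    item = parseFunction(text)
    while item:
        # 'span' is relative to the slice searched, so add it to the running offset.
        pos = pos + item['span'][1]
        item['pos'] = pos
        del item['span']
        return_list.append(item)
        item = parseFunction(text[pos:])
    return return_list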
 
def getArticle(wiki_name, wiki_family, article_name):
    print "reading %s" % (article_name)
    wiki         = wikipedia.Site(wiki_name, wiki_family)
    page         = wikipedia.Page(wiki, article_name)
    article_text = page.get(get_redirect=False)
    return {'text':article_text}
 
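def getArticleList(wiki_name, wiki_family, article_name):
    article = getArticle(wiki_name, wiki_family, article_name)['text']
    arts = []
    for art in findAll(article, parseEntry):
        art['link'] = parseLink(art['name'], wiki_name)
        # parseLink returns None when the link text cannot be parsed; skip such
        # entries so the merge step can rely on art['link']['name'].
        if art['link']:
            arts.append(art)
    return arts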
 
print "working..."
 
lists = {}
lists[':en:Wikipedia:Vital articles/Expanded/People']                     = getArticleList('en',  'wikipedia','Wikipedia:Vital articles/Expanded/People')
lists[':en:Wikipedia:Vital articles/Expanded/History']                    = getArticleList('en',  'wikipedia','Wikipedia:Vital articles/Expanded/History')
lists[':en:Wikipedia:Vital articles/Expanded/Geography']                  = getArticleList('en',  'wikipedia','Wikipedia:Vital articles/Expanded/Geography')
lists[':en:Wikipedia:Vital articles/Expanded/Arts']                       = getArticleList('en',  'wikipedia','Wikipedia:Vital articles/Expanded/Arts')
lists[':en:Wikipedia:Vital articles/Expanded/Philosophy and religion']    = getArticleList('en',  'wikipedia','Wikipedia:Vital articles/Expanded/Philosophy and religion')
lists[':en:Wikipedia:Vital articles/Expanded/Everyday life']              = getArticleList('en',  'wikipedia','Wikipedia:Vital articles/Expanded/Everyday life')
lists[':en:Wikipedia:Vital articles/Expanded/Society and social sciences']= getArticleList('en',  'wikipedia','Wikipedia:Vital articles/Expanded/Society and social sciences')
lists[':en:Wikipedia:Vital articles/Expanded/Health and medicine']        = getArticleList('en',  'wikipedia','Wikipedia:Vital articles/Expanded/Health and medicine')
lists[':en:Wikipedia:Vital articles/Expanded/Science']                    = getArticleList('en',  'wikipedia','Wikipedia:Vital articles/Expanded/Science')
lists[':en:Wikipedia:Vital articles/Expanded/Technology']                 = getArticleList('en',  'wikipedia','Wikipedia:Vital articles/Expanded/Technology')
lists[':en:Wikipedia:Vital articles/Expanded/Mathematics']                = getArticleList('en',  'wikipedia','Wikipedia:Vital articles/Expanded/Mathematics')
lists[':en:Wikipedia:Vital articles/Expanded/Measurement']                = getArticleList('en',  'wikipedia','Wikipedia:Vital articles/Expanded/Measurement')
 
lists[':m:List of articles every Wikipedia should have/Version 1.1'] = getArticleList('meta','meta',     'List of articles every Wikipedia should have/Version 1.1')
lists[':en:Films considered the greatest ever']                      = getArticleList('en',  'wikipedia','Films considered the greatest ever')
lists[':en:Outline of biology']                                      = getArticleList('en',  'wikipedia','Outline of biology')
 
print "merge lists..."
 
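fullList = []
seen = set()  # lower-cased names already added to fullList
for listName in lists:
    for entry in lists[listName]:
        name = entry['link']['name']
        key = name.lower()
        # Keep the first spelling seen; skip case-insensitive duplicates.
        if key not in seen:
            seen.add(key)
            fullList.append(name)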
 
print len(fullList)
 
print "sorting..."
# Case-insensitive sort; key= gives the same order as the old cmp-based call.
sortedFullList = sorted(fullList, key=lambda name: name.lower())
 
for i in sortedFullList:
    print i
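
The parsing and merging steps can be tried offline. The following is a minimal sketch, not the script itself: it reuses the two regular expressions unchanged, but swaps in `re.finditer` for `findAll`, and feeds in invented sample wikitext instead of fetching pages through pywikipedia. The helper names `parse_entries` and `merge_names` and both sample strings are hypothetical.

```python
import re

# Same patterns as in GetExtendedArticleList.py.
entry_re = re.compile(r"([\*|#]+)(\s*)('*)\[\[([^\]]+)\]\](\s*)\(?(\[\[([^\]]+)\]\])?\)?")
link_re = re.compile(r'(:?([a-z\-]+):)?([^\]\|:]+)(\|([^\]]+))?')

def parse_entries(text, default_wiki):
    """Yield one dict per list entry found in text."""
    for m in entry_re.finditer(text):
        link = link_re.search(m.group(4))
        if link is None:
            continue  # link text could not be parsed
        yield {
            'indent': len(m.group(1)),             # number of list markers
            'wiki': link.group(2) or default_wiki,  # ':de:' prefix or default
            'name': link.group(3),
            'alias': link.group(5),                 # text after '|', if any
        }

def merge_names(entry_lists):
    """Merge names case-insensitively, keeping the first spelling seen."""
    seen = set()
    merged = []
    for entries in entry_lists:
        for entry in entries:
            key = entry['name'].lower()
            if key not in seen:
                seen.add(key)
                merged.append(entry['name'])
    return sorted(merged, key=lambda name: name.lower())

# Invented wikitext standing in for two fetched list pages.
sample_a = u"""
* [[Isaac Newton]]
** '''[[Albert Einstein|Einstein]]'''
# [[:de:Goethe]]
"""
sample_b = u"""
* [[isaac newton]]
* [[Marie Curie]]
"""

entries_a = list(parse_entries(sample_a, 'en'))
entries_b = list(parse_entries(sample_b, 'en'))
merged = merge_names([entries_a, entries_b])
print(merged)
```

Because `link_re` stops the page name at the first colon, a namespaced title such as `Wikipedia:Vital articles` would come back as just `Wikipedia`; that is worth checking before pointing the script at lists that link outside the article namespace.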