Talk:List of Wikipedias by sample of articles/Source code (original)
Modified source
Suggestions:
- include .encode('cp437','replace') whenever printing to console to avoid errors
- optimize by caching English pages
- remove interwiki text for article length calculation
- weight text length
- color code score
--MarsRover 11:05, 2 December 2007 (UTC)
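The first suggestion can be sketched as a small helper. This is only an illustration, written for Python 3 rather than the Python 2 of the script on this page; the helper name console_safe is hypothetical, and cp437 is simply the code page named in the suggestion (other consoles may use a different one).

```python
def console_safe(text, encoding='cp437'):
    """Return text with every character the console encoding
    cannot represent replaced by '?'.

    Assumption: the console uses the given code page (cp437 here,
    as in the suggestion above).
    """
    # 'replace' substitutes '?' for unencodable characters instead of raising.
    encoded = text.encode(encoding, 'replace')
    # Decode again so the caller gets a printable string back.
    return encoded.decode(encoding)
```

Passing each title through such a helper before printing should avoid a UnicodeEncodeError on a cp437 console, at the cost of showing '?' for characters outside that code page.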
Modifying source
I was looking at modifying this program for my own use (namely, directing it towards a different page: for example, Vital Articles / Extended, a specific wikiproject's topic list, or a specific topic outline's list). Who would be the right person to ask about doing so? Almafeta 05:50, 1 October 2009 (UTC)
- Smeira is the original author, but he has been missing for a couple of years. I could probably help. I've been working on code to create an extended article list (see below). It may need some tweaking for your needs, but it can read from the lists you mentioned. --MarsRover 07:36, 1 October 2009 (UTC)
- I've been working on that (apparently my installation of Python had... issues), and finally have it working with the groups I'm interested in. Thank you. =)
- Also, it's too bad Smeira's gone... it occurs to me that the original was probably the most significant piece of code ever written in Volapük. Almafeta 16:49, 26 October 2009 (UTC)
GetExtendedArticleList.py
# -*- coding: utf_8 -*-
import sys
sys.path.append('./pywikipedia')

import wikipedia
import pagegenerators
import re

# A list entry: bullet/number markers, optional bold/italic quotes, a [[link]],
# and optionally a second [[link]] in parentheses.
entry_re = re.compile(r"([\*|#]+)(\s*)('*)\[\[([^\]]+)\]\](\s*)\(?(\[\[([^\]]+)\]\])?\)?")
# A link target: optional wiki prefix, the page name, and an optional |alias.
link_re = re.compile(r'(:?([a-z\-]+):)?([^\]\|:]+)(\|([^\]]+))?')

def parseEntry(line):
    m = entry_re.search(line)
    if m:
        return {'name':m.group(4),'sibling':m.group(7),'indent':len(m.group(1)),'span':m.span()}

def parseLink(link, wiki_name):
    m = link_re.search(link)
    if m:
        linkWiki = m.group(2) or wiki_name
        return {'wiki':linkWiki,'name':m.group(3),'alias':m.group(5)}

def findAll(text, parseFunction):
    # Apply parseFunction repeatedly, each time to the text after the previous match.
    return_list = []
    pos = 0
    item = parseFunction(text)
    while item:
        pos += item['span'][1]
        item['pos'] = pos
        del item['span']
        return_list.append(item)
        item = parseFunction(text[pos:])
    return return_list

def getArticle(wiki_name, wiki_family, article_name):
    print "reading %s" % (article_name)
    wiki = wikipedia.Site(wiki_name, wiki_family)
    page = wikipedia.Page(wiki, article_name)
    article_text = page.get(get_redirect=False)
    return {'text':article_text}

def getArticleList(wiki_name, wiki_family, article_name):
    article = getArticle(wiki_name, wiki_family, article_name)['text']
    arts = findAll(article, parseEntry)
    for art in arts:
        art['link'] = parseLink(art['name'], wiki_name)
    return arts

print "working..."

lists = {}
lists[':en:Wikipedia:Vital articles/Expanded/People'] = getArticleList('en', 'wikipedia','Wikipedia:Vital articles/Expanded/People')
lists[':en:Wikipedia:Vital articles/Expanded/History'] = getArticleList('en', 'wikipedia','Wikipedia:Vital articles/Expanded/History')
lists[':en:Wikipedia:Vital articles/Expanded/Geography'] = getArticleList('en', 'wikipedia','Wikipedia:Vital articles/Expanded/Geography')
lists[':en:Wikipedia:Vital articles/Expanded/Arts'] = getArticleList('en', 'wikipedia','Wikipedia:Vital articles/Expanded/Arts')
lists[':en:Wikipedia:Vital articles/Expanded/Philosophy and religion'] = getArticleList('en', 'wikipedia','Wikipedia:Vital articles/Expanded/Philosophy and religion')
lists[':en:Wikipedia:Vital articles/Expanded/Everyday life'] = getArticleList('en', 'wikipedia','Wikipedia:Vital articles/Expanded/Everyday life')
lists[':en:Wikipedia:Vital articles/Expanded/Society and social sciences'] = getArticleList('en', 'wikipedia','Wikipedia:Vital articles/Expanded/Society and social sciences')
lists[':en:Wikipedia:Vital articles/Expanded/Health and medicine'] = getArticleList('en', 'wikipedia','Wikipedia:Vital articles/Expanded/Health and medicine')
lists[':en:Wikipedia:Vital articles/Expanded/Science'] = getArticleList('en', 'wikipedia','Wikipedia:Vital articles/Expanded/Science')
lists[':en:Wikipedia:Vital articles/Expanded/Technology'] = getArticleList('en', 'wikipedia','Wikipedia:Vital articles/Expanded/Technology')
lists[':en:Wikipedia:Vital articles/Expanded/Mathematics'] = getArticleList('en', 'wikipedia','Wikipedia:Vital articles/Expanded/Mathematics')
lists[':en:Wikipedia:Vital articles/Expanded/Measurement'] = getArticleList('en', 'wikipedia','Wikipedia:Vital articles/Expanded/Measurement')
lists[':m:List of articles every Wikipedia should have/Version 1.1'] = getArticleList('meta','meta', 'List of articles every Wikipedia should have/Version 1.1')
lists[':en:Films considered the greatest ever'] = getArticleList('en', 'wikipedia','Films considered the greatest ever')
lists[':en:Outline of biology'] = getArticleList('en', 'wikipedia','Outline of biology')

print "merge lists..."
# Keep one entry per article name, compared case-insensitively.
fullList = {}
for x in lists.values():
    for i in x:
        if i['link']['name'].lower() not in fullList:
            fullList[i['link']['name'].lower()] = i['link']['name']
print len(fullList)

print "sorting..."
# Call .lower() on each name so the key works for both str and unicode values;
# in Python 2, str.lower raises TypeError when given a unicode object.
sortedFullList = sorted(fullList.values(), key=lambda name: name.lower())
for i in sortedFullList:
    print i
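To see what the two regular expressions extract without fetching any pages, they can be tried on a sample line. The sketch below copies the patterns from the script but is written for Python 3 and leaves out the 'span' bookkeeping; the function names parse_entry and parse_link and the sample line are made up for illustration.

```python
import re

# Patterns copied from GetExtendedArticleList.py.
entry_re = re.compile(r"([\*|#]+)(\s*)('*)\[\[([^\]]+)\]\](\s*)\(?(\[\[([^\]]+)\]\])?\)?")
link_re = re.compile(r'(:?([a-z\-]+):)?([^\]\|:]+)(\|([^\]]+))?')

def parse_entry(line):
    """Return the first list entry found in line, or None."""
    m = entry_re.search(line)
    if m:
        return {'name': m.group(4), 'sibling': m.group(7), 'indent': len(m.group(1))}
    return None

def parse_link(link, wiki_name):
    """Split a link target into wiki prefix, page name and alias."""
    m = link_re.search(link)
    if m:
        return {'wiki': m.group(2) or wiki_name, 'name': m.group(3), 'alias': m.group(5)}
    return None

# Hypothetical list line: two levels deep, piped link, second link in parentheses.
sample = "## [[Leonardo da Vinci|Leonardo]] ([[Mona Lisa]])"
entry = parse_entry(sample)
link = parse_link(entry['name'], 'en')
```

Here entry['indent'] is the number of list markers, entry['name'] still contains the piped alias, and parse_link separates the page name from the alias. A lowercase prefix such as fr: is read as the wiki; otherwise the wiki passed in is used.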
Perl version?
Has anybody implemented this in Perl? I've been working on a similar routine, just for my own amusement, looking only at articles in my home WP (= Latin), and I can't get the sizes to come out right. Grab the page, take out the inter-wiki links, take out the comments, see how many characters you've got, and multiply by the language weight -- how hard can it be? I'm wondering if I've run up against some Perl-ish oddity about Unicode (which I thought I was handling correctly), or just made some fluff-ball error. A. Mahoney 18:09, 8 November 2011 (UTC)
- I think you might be the first to try Perl. Yeah, I think you're right. It is probably related to Unicode. Make sure the length() function returns the number of characters and not the number of bytes (http://stackoverflow.com/questions/1326539/how-do-i-find-the-length-of-a-unicode-string-in-perl). --MarsRover 22:53, 8 November 2011 (UTC)
- What I ended up having to do was to trim trailing white space; Unicodeity was OK. The numbers are still a little off but close enough for planning purposes. If anybody else wants to use Perl, the MediaWiki::Bot package is the way to go; it's quite straightforward. A. Mahoney 17:36, 13 January 2012 (UTC)
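For comparison, the steps described in this thread (remove comments and inter-wiki links, trim trailing white space, count characters, multiply by the language weight) might look like the following in Python 3. This is a rough sketch and not the code the original program uses: the two patterns, the function name weighted_length, the sample page and the weight 1.5 are all assumptions made for illustration.

```python
import re

# Assumed pattern for HTML comments in wikitext.
COMMENT_RE = re.compile(r'<!--.*?-->', re.DOTALL)
# Assumed pattern for inter-wiki links: [[xx:Title]] with a lowercase prefix.
# It would also match other lowercase prefixes, so treat it as approximate.
INTERWIKI_RE = re.compile(r'\[\[[a-z][a-z\-]{1,11}:[^\]\|]+\]\]')

def weighted_length(wikitext, language_weight):
    """Character count after cleanup, multiplied by the language weight."""
    text = COMMENT_RE.sub('', wikitext)
    text = INTERWIKI_RE.sub('', text)
    text = text.rstrip()  # trailing white space is not counted
    # len() on a Python 3 str counts characters, not bytes.
    return len(text) * language_weight

# Hypothetical page text and weight.
page = "Roma est urbs.<!-- nota -->\n[[en:Rome]]\n[[fr:Rome]]\n"
score = weighted_length(page, 1.5)

# Characters versus bytes: 'ā' is one character but two bytes in UTF-8.
chars = len('Gāius')
octets = len('Gāius'.encode('utf-8'))
```

The last two lines show the character/byte difference mentioned above: counting the encoded bytes instead of the characters inflates the size of any text with non-ASCII letters.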