Wikimania05/Paper-RB2

This page is part of the Proceedings of Wikimania 2005, Frankfurt, Germany.


Network Analysis for Wikipedia

  • Author(s): Francesco Bellomi and Roberto Bonato
  • License: cc-by-sa



Abstract

Network analysis is concerned with properties related to connectivity and distances in graphs, and has diverse applications such as citation indexing and information retrieval on the Web. HITS (Hyperlink-Induced Topic Search [1]) is a network analysis algorithm that has been successfully used to rank web pages related to a common topic according to their potential relevance. HITS is based on the notions of hub and authority: a good hub is a page that points to several good authorities; a good authority is a page that is pointed at by several good hubs. HITS relies exclusively on the hyperlink relations among the pages to define these two mutually reinforcing measures. It can be proved that for each page the two weights converge to fixed points, the actual hub and authority values for the page. Authority is used to rank the pages returned by a given query (and thus potentially related to a given topic) in order of relevance.
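As a concrete illustration, the sketch below implements the basic HITS power iteration on a toy link graph. It is not the authors' implementation: the dict-of-outlinks representation, the fixed iteration count, and the function name are assumptions made for this example.

```python
def hits(graph, iterations=50):
    """Minimal HITS sketch. graph maps each page to the pages it links to.
    (Illustrative only; representation and iteration count are assumed.)"""
    nodes = set(graph) | {v for targets in graph.values() for v in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: sum of the hub scores of the pages linking in.
        auth = {n: 0.0 for n in nodes}
        for u, targets in graph.items():
            for v in targets:
                auth[v] += hub[u]
        # Hub update: sum of the authority scores of the pages linked to.
        hub = {n: sum(auth[v] for v in graph.get(n, ())) for n in nodes}
        # Normalize both score vectors so the iteration converges.
        for scores in (auth, hub):
            norm = sum(s * s for s in scores.values()) ** 0.5 or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth

# Toy usage: rank three mutually linked pages by authority.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
hub, auth = hits(links)
print(sorted(auth, key=auth.get, reverse=True))
```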

The hyperlinked structure of Wikipedia and the ongoing, incremental editing process behind it make it an interesting and unexplored target domain for network analysis techniques. In particular, we explored the relevance of the notion of HITS's authority on this encyclopedic corpus.

We have developed a crawler that extensively scans the structure of English-language Wikipedia articles and keeps track, for each entry, of all other Wikipedia articles pointed at in its definition. The result is a directed graph (roughly 500,000 nodes and more than 8 million links), which consists for the most part of one large, loosely connected component. We then applied the HITS algorithm to that component, obtaining a hub and an authority weight for every entry.
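The sketch below shows one way such a component can be isolated from a crawled link graph before running HITS. It is a hypothetical reconstruction, not the crawler's actual code; the dict-of-outlinks representation is an assumption carried over from the example above.

```python
from collections import defaultdict, deque

def largest_weak_component(graph):
    """Restrict a directed link graph (dict of outlink lists) to its
    largest weakly connected component. Illustrative sketch only."""
    # Weak connectivity ignores link direction, so build an undirected view.
    undirected = defaultdict(set)
    for u, targets in graph.items():
        for v in targets:
            undirected[u].add(v)
            undirected[v].add(u)
    seen, best = set(), set()
    for start in undirected:
        if start in seen:
            continue
        # Breadth-first search collects one component at a time.
        component, queue = {start}, deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            for neighbor in undirected[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    component.add(neighbor)
                    queue.append(neighbor)
        if len(component) > len(best):
            best = component
    # Keep only the directed edges whose endpoints both lie in the component.
    return {u: [v for v in graph.get(u, []) if v in best] for u in best}
```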

First results seem to be meaningful in characterizing the notion of authority in this peculiar domain. The highest-ranked authorities are for the most part lexical elements that denote particular and concrete rather than universal and abstract entities. More precisely, at the very top of the authority scale are concepts used to structure space and time, such as country names, city names and other geopolitical entities (the United States and many European countries), historical periods and landmark events (World War II, the 1960s). "Television", "scientific classification" and "animal" are the three most authoritative common nouns. We will also present the first results from applying the well-known PageRank algorithm (Google's popular ranking metric, detailed in [2]) to the Wikipedia entries collected by our crawler.
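For comparison, here is a minimal power-iteration sketch of PageRank in the style of [2]. The damping factor 0.85 is the value suggested in that paper; everything else (graph representation, iteration count, dangling-page handling) is an assumption of this example, not the authors' setup.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Minimal PageRank sketch. graph maps each page to its outlinks.
    (Illustrative only; not the authors' implementation.)"""
    nodes = set(graph) | {v for targets in graph.values() for v in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every page first gets the uniform "teleport" share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for u in nodes:
            targets = graph.get(u, [])
            if targets:
                # A page divides its rank evenly among its outlinks.
                share = damping * rank[u] / len(targets)
                for v in targets:
                    new_rank[v] += share
            else:
                # Dangling pages spread their rank uniformly over all pages.
                for v in nodes:
                    new_rank[v] += damping * rank[u] / n
        rank = new_rank
    return rank
```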

This is a work in progress. We plan to design a set of experiments on sets of words related by specific linguistic relationships such as meronymy, hyponymy or other domain-specific commonalities.

Final version of the paper: "Network Analysis and Wikipedia"

"The aim of our simple experiment is twofold: to gain some understanding of the high level structure of Wikipedia, and to get some insights about its content, and in particular on its hidden cultural biases. Each user usually browse a (relatively) small set of entries during the normal usage of an encyclopedia; and such small sample is more representative of the enquiring user world view, rather that the whole encyclopedia. As a consequence, nobody is really able to have a mile-high fisheye view of Wikipedia’s content; of course it is possible to perform some basic statistic analysis (like counting the entries, or measuring the rate of growth) but these are purely syntactic measures. We maintain that network analysis offers a simple way to have some more "semantic" measures, since it formally analyzes an intrinsically semantic human-generated type of content: the use of terms to define other terms."


Francesco Bellomi (weblog) is interested in knowledge representation, natural language processing and social software. In recent years he has led the development of a number of industrial software applications related to knowledge management and statistical natural language processing. He is currently a Ph.D. candidate at the University of Verona, Italy.

References

[1] J. M. Kleinberg, "Authoritative sources in a hyperlinked environment," Journal of the ACM, vol. 46, no. 5, pp. 604-632, 1999.
[2] S. Brin and L. Page, "The anatomy of a large-scale hypertextual Web search engine," Computer Networks and ISDN Systems, vol. 30, no. 1-7, pp. 107-117, 1998.

A reference to our previous work "Lexical authorities in an encyclopedic corpus: a case study with Wikipedia" can be found here.
