Research:Community visualization using Gephi

From Meta, a Wikimedia project coordination wiki

This page introduces the tools and techniques needed to generate a visual representation of Wikimedia project communities. Gephi, an open-source network analysis and visualization package, is used to generate graphs that represent users and the interactions among them, based on how often they post messages on each other's talk pages.

The following three images show some of the possibilities of such visualization, using the Turkish Wikipedia as a case study.

A node cloud representing the community of the Turkish Wikipedia, generated using Gephi.
A node cloud representing the community of the Turkish Wikipedia, with clustered sub-communities in different colors, generated using Gephi.
A graph showing the users with the highest 'betweenness centrality' among the users of the Turkish Wikipedia, generated using Gephi.


csv file

Data query

The first step in generating a visualization is to query the database of the target Wikimedia project. The following SQL query returns one row each time a user posts a message on another user's talk page: the first column holds the username of the sender (rev_user_text) and the second the username of the talk page's owner (page_title).


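SELECT rev_user_text, page_title FROM revision
JOIN page ON page_id = rev_page
WHERE page_namespace = 3
AND page_is_redirect = 0
AND page_title NOT LIKE '%/%'
-- Registered accounts only
AND rev_user != 0
-- Exclude bot accounts without dropping users who belong to no group
AND NOT EXISTS (
  SELECT 1 FROM user_groups
  WHERE ug_user = rev_user AND ug_group = 'bot'
);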

The above query excludes all edits made by bot accounts, and ignores edits made to subpages of user talk pages.

The query can be customized to meet specific requirements, such as restricting it to a timeframe, or a more sophisticated query can be used to generate data for a list of specific usernames.
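As an illustration of the timeframe customization, the following Python sketch runs a variant of the query against a small in-memory SQLite database. The toy tables, the sample rows, and the helper name messages_between are assumptions made for the example; SQLite only stands in for the real MediaWiki database, the bot filter is left out for brevity, and the sketch relies on rev_timestamp being stored as a YYYYMMDDHHMMSS string.

```python
import sqlite3

# Toy stand-ins for the MediaWiki `page` and `revision` tables
# (only the columns the query needs; the data is invented).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE page (page_id INTEGER, page_namespace INTEGER,
                   page_title TEXT, page_is_redirect INTEGER);
CREATE TABLE revision (rev_page INTEGER, rev_user_text TEXT,
                       rev_timestamp TEXT);
INSERT INTO page VALUES (1, 3, 'B', 0), (2, 3, 'A', 0), (3, 3, 'A/Archive', 0);
INSERT INTO revision VALUES
  (1, 'A', '20130115120000'),
  (2, 'B', '20130220090000'),
  (1, 'A', '20140301000000'),
  (3, 'B', '20130401000000');
""")

# Same shape as the query above, plus a timestamp range.
TIMEFRAME_QUERY = """
SELECT rev_user_text, page_title FROM revision
JOIN page ON page_id = rev_page
WHERE page_namespace = 3
AND page_is_redirect = 0
AND page_title NOT LIKE '%/%'
AND rev_timestamp BETWEEN ? AND ?
ORDER BY rev_timestamp
"""

def messages_between(connection, start, end):
    """Return (sender, talk page owner) rows within [start, end]."""
    return connection.execute(TIMEFRAME_QUERY, (start, end)).fetchall()

# Keep only messages posted during 2013.
rows = messages_between(conn, "20130101000000", "20131231235959")
print(rows)
```

The 2014 revision and the edit to the 'A/Archive' subpage are both filtered out, leaving one message in each direction between A and B.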

Data export

The data returned by the above query should be exported to a csv file in the following format, with one row per message.

A, B
A, B
A, B
B, A
B, A
C, C
D, E
D, E
A, D
D, B

Please note that no quotation marks (") are needed to wrap the usernames, and Gephi will be able to read the full username even if it contains spaces.
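A minimal Python sketch of the export step is shown below. It assumes the query result is already available as a list of (sender, recipient) tuples; the function name write_edge_csv and the sample rows are hypothetical. Python's csv module leaves names containing spaces unquoted, but it will add quotation marks around a name that itself contains a comma.

```python
import csv
import io

def write_edge_csv(rows, out):
    """Write one 'sender,recipient' line per message to a file-like object."""
    writer = csv.writer(out, lineterminator="\n")
    for sender, recipient in rows:
        writer.writerow([sender, recipient])

# Invented sample rows; in practice these come from the SQL query.
rows = [("A", "B"), ("B", "A"), ("User One", "User Two")]

buffer = io.StringIO()  # use open("edges.csv", "w", newline="") for a real file
write_edge_csv(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```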

dl file

Another option is to use *.dl files, which are smaller than csv files. A *.dl file groups all the links between two nodes into a single line, followed by a number representing the weight of that link.
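To make the relationship between the two representations concrete, the following Python sketch collapses row-per-message pairs into one weighted link per ordered pair. The function name weighted_edges is hypothetical, and the sample pairs are the ones shown in the csv section; treating (A, B) and (B, A) as separate links is an assumption that matches the directed rows the queries return.

```python
from collections import Counter

def weighted_edges(pairs):
    """Collapse repeated (sender, recipient) pairs into (sender, recipient, weight)."""
    counts = Counter(pairs)
    # most_common() orders the links from heaviest to lightest.
    return [(a, b, n) for (a, b), n in counts.most_common()]

pairs = [
    ("A", "B"), ("A", "B"), ("A", "B"),
    ("B", "A"), ("B", "A"),
    ("C", "C"),
    ("D", "E"), ("D", "E"),
    ("A", "D"),
    ("D", "B"),
]

edges = weighted_edges(pairs)
for sender, recipient, weight in edges:
    print(sender, recipient, weight)
```

Ten message rows become six weighted links, which is why the weighted form is more compact.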

Data query

The following SQL query generates a table in which the first two columns contain the usernames of two users, and the third column is the number of times the first user posted on the second user's talk page.

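SELECT rev_user_text, page_title, COUNT(*) FROM revision
JOIN page ON page_id = rev_page
WHERE page_namespace = 3
AND page_is_redirect = 0
AND page_title NOT LIKE '%/%'
-- Registered accounts only
AND rev_user != 0
-- Exclude bot accounts without dropping users who belong to no group
AND NOT EXISTS (
  SELECT 1 FROM user_groups
  WHERE ug_user = rev_user AND ug_group = 'bot'
)
GROUP BY rev_user_text, page_title
ORDER BY COUNT(*) DESC;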

The above query excludes all edits made by bot accounts, and ignores edits made to subpages of user talk pages.

The query can be customized to meet specific requirements, such as restricting it to a timeframe, or a more sophisticated query can be used to generate data for a list of specific usernames.

If the output is too large for Gephi to graph, 'LIMIT 10000' can be appended after 'ORDER BY COUNT(*) DESC' in the above query to keep only the 10,000 most frequent links, which in turn limits the number of nodes.

Data conversion

The output of the above SQL query consists of three columns, as shown in the table below.

User1         User2         connection
user name 1   user name 2   4
user name 3   user name 4   5
user name 1   user name 4   6
user name 2   user name 3   3
user name 3   user name 1   1

This output needs to be converted into dl format, which can be read by Gephi.

Add the following header at the top of the file, and replace the value of "n" with the number of rows in the data:

dl
format = edgelist1
n = 5
labels embedded:
data:

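As a result, the *.dl file should look as follows. Note that the spaces within usernames are replaced with underscores, since the columns themselves are separated by spaces: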

dl
format = edgelist1
n = 5
labels embedded:
data:
user_name_1 user_name_2 4
user_name_3 user_name_4 5
user_name_1 user_name_4 6
user_name_2 user_name_3 3
user_name_3 user_name_1 1
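The conversion described above can be sketched in Python as follows. The function name to_dl is hypothetical, and the input is assumed to be a list of (user1, user2, connection) tuples taken from the query output. Following the instructions on this page, "n" is set to the number of data rows, and spaces in usernames are replaced with underscores.

```python
def to_dl(rows):
    """Build the text of a *.dl file from (user1, user2, weight) rows."""
    lines = [
        "dl",
        "format = edgelist1",
        f"n = {len(rows)}",  # number of data rows, as described on this page
        "labels embedded:",
        "data:",
    ]
    for user1, user2, weight in rows:
        # Columns are space-separated, so spaces inside names become underscores.
        lines.append(
            f"{user1.replace(' ', '_')} {user2.replace(' ', '_')} {weight}"
        )
    return "\n".join(lines) + "\n"

rows = [
    ("user name 1", "user name 2", 4),
    ("user name 3", "user name 4", 5),
    ("user name 1", "user name 4", 6),
    ("user name 2", "user name 3", 3),
    ("user name 3", "user name 1", 1),
]

dl_text = to_dl(rows)
print(dl_text)
```

The returned string can be written to a file with a .dl extension and opened in Gephi.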

Gephi

Gephi graph of a raw data set.
  • Open the data file, choosing the appropriate 'File Format' in the open dialog. Once the file is loaded, the raw data is displayed as a gray, square-shaped node cloud.
  • Choose a suitable layout from the layout box located on the left. 'Yifan Hu' was found to be the best layout for visualizing community data. Once the layout is chosen, click 'Run', and the graph will start taking a circular shape.
  • Add color to the graph nodes using the 'Ranking' box. More information on how to add colors and labels to graphs can be found in the official Gephi tutorials.
  • To find the most central users within the community, run the 'Avg. Path Length' calculation from the 'Statistics' menu on the right. Once the calculation is done, new parameters are added to the 'Ranking' menu. Choosing 'Betweenness Centrality' and clicking 'Result list' orders the users by their centrality within the whole network.
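For readers who want to cross-check the centrality ranking outside Gephi, the following Python sketch computes betweenness centrality with Brandes' algorithm. It is not Gephi's implementation: it assumes an undirected, unweighted graph, ignores self-links, and does not normalize the scores, so the absolute values may differ from Gephi's depending on the options chosen there. The function name and the sample links (taken from the csv example) are illustrative.

```python
from collections import deque

def betweenness_centrality(edges):
    """Unnormalized betweenness on an undirected, unweighted graph (Brandes)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set())
        adj.setdefault(v, set())
        if u != v:  # ignore self-links
            adj[u].add(v)
            adj[v].add(u)

    score = dict.fromkeys(adj, 0.0)
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        order = []
        preds = {v: [] for v in adj}
        sigma = dict.fromkeys(adj, 0)
        dist = dict.fromkeys(adj, -1)
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies, farthest nodes first.
        delta = dict.fromkeys(adj, 0.0)
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]

    # Each unordered pair of endpoints was counted from both ends.
    return {v: c / 2 for v, c in score.items()}

links = [("A", "B"), ("B", "A"), ("C", "C"), ("D", "E"), ("A", "D"), ("D", "B")]
centrality = betweenness_centrality(links)
ranking = sorted(centrality, key=centrality.get, reverse=True)
print(ranking[0], centrality[ranking[0]])
```

In this small sample, D is the only user lying on shortest paths between other users (E to A and E to B), so it ranks first.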


Further reading