Good morning everyone. I hope everyone is awake, and if not, we'll try to wake you up during the presentation. I'll try to explain all the quantitative work that we did in this paper.
So I'm gonna present the paper that the GroupLens research team published last year: "WP:Clubhouse? An Exploration of Wikipedia's Gender Gap."
We had so much fun working on this paper. My lead author, Angelo Cartly, is presenting his dissertation, which is why he cannot be here and why I'm here. And the interesting fact here is that the last four authors are all professors. So it's an interesting paper.
Before I go further, let me give you an introduction; I call it the GroupLens advertisement. This is roughly what GroupLens is about.
I'm from the University of Minnesota. It is a beautiful campus.
Is anyone here from Minnesota? Nobody is from Minnesota, so. [laughs]
It's a beautiful campus, with the Mississippi River actually running right through it. It's a beautiful place.
GroupLens is a research team at the University of Minnesota. We are well known for our research on social networks and recommender systems, and also for the Q&A research that we are doing in online community development. We are also currently involved in smart interface development, so that's another key area that we are working on.
We've been exploring a lot of social networks, social participation, and online communities, and Wikipedia is one of the biggest research areas that we focus on.
And some of our products, if you get a chance to take a look at them: MovieLens. Has anyone heard about Netflix? Oh great, fantastic. So MovieLens is a great movie recommender system. It started a long time ago and it has gone through a couple of phases now. Basically you put in ratings for movies that you have seen, you get recommendations, and it is a very good product.
And Cyclopath is a product that we started probably six to seven years ago. It is actually a geowiki. If you look at Google Maps, it's like this static thing that Google provides you: this is the map and this is where you should go.
But the critical thing about Cyclopath is that it is a geowiki maintained by the community, so as a community member you can go and edit and correct the maps. And provide ... to people. A lot of bikers actually use it, so they provide routes that you've never seen before. So that's the best part about it.
And Lenscape is another recommender system. Actually MovieLens has now been switched over to run on Lenscape, so it is no longer the original recommender system that runs on the back end.
And BookLens is a system for readers, similar to MovieLens: you can get book reviews and get recommendations. It's something that my wife is really enthusiastic about, because she reads a lot. So...
Moving on to the paper: why did we really start doing this paper? As a group we do a lot of statistical data analysis of online communities, so what made us interested in the gender gap in Wikipedia?
The main reason was that there was an article in the NY Times; actually I think it had some comments from Sue Gardner.
When we saw this article, actually Lauren saw it first, he jumped at us and said, "There's an article and we should get on this thing." We read it and we didn't find that the article was doing justice to the bigger picture. The article was talking about the gender gap in Wikipedia in terms of Sex and the City, bracelets, friendship bands, and shoes. So it was looking at anecdotal examples, where females are more interested in reading or writing about Sex and the City, bracelets, friendship bands, and shoes, whereas males are more focused on The Sopranos, tin soldiers, and baseball cards.
And we thought this was not really a statistical approach to understanding what's really going on. We also felt that this was rather stereotypical.
This is only what you see on the surface if you look at it without digging in. So we thought, "let's do a rigorous data analysis" and...
So before going further: just before I came to Argentina, I went and checked what the other social media look like at the moment. We had also seen, when we started this research, that the big ones, Twitter and Facebook, have actually turned the tables. They started out as male-dominated places, but over time, as you see, this is the US, and right now we have higher female participation. And if you look at the Argentine population, it's actually a similar story. Females have higher participation; the gap is not as wide as in the US, but there is still higher female participation in Argentina.
So one would think that Wikipedia should look similar, because Wikipedia has been there for so long, and it's one of the premier knowledge bases. So it should show a similar effect.
So with this in mind we started our research. We broke this research project into three main research questions.
The first is "what is the extent of Wikipedia's gender gap, and how has it changed over time". We have seen how other social networks have turned the tables, so why not Wikipedia.
Next, we tried to see how Wikipedia is affected by the gender gap. We don't know at the moment what the gender gap looks like, or, if there is a gender gap, how it has affected Wikipedia.
And last, we wanted to look at conflicts. I'm pretty sure everyone here knows what a conflict is in Wikipedia; if not, I will give a brief overview of Wikipedia conflicts when I start talking about this part of the research.
So we wanted to go and understand what conflict looks like, what the gender gap is in these conflicts, and what impact that has.
So going further means we need data. How do we get free data to analyze these research questions? We used the Wikipedia data dumps. That's been the fantastic thing about Wikipedia, and I hope it will continue like this and they keep providing vast amounts of data that we can play around with. It's the biggest playground for data analysis.
So we used the English Wikipedia 2011 data dumps, along with the Wikipedia APIs. And also something close to us, MovieLens data, a dataset that we make available to everyone. We ended up using MovieLens data because that's something we control: we have the full dataset, and we own it.
So with these three data sources, we started the exploration. And how do you actually get the gender data from Wikipedia? As you know, with Wikipedia user boxes you can put a template on your user page to identify your gender. What we did was extract these user templates to get the gender of the user. Also, in 2009 Wikipedia introduced the preference settings, which was a very good thing, because it reduced the complexity of the user boxes and the technical difficulty of adding those user boxes.
So we also grabbed the data from the user preference settings, and combining all these, we had self-reported gender data for about 440,000 users. That's a lot of data, which is great for us.
So going forward, let's start looking in detail at the extent of Wikipedia's gender gap and how it has changed over time.
To analyze this question, we broke it into two hypotheses.
The first is that Wikipedia has a substantial editor gender gap. The second is that the gap is shrinking: we saw what happened with the other social networks, and we assumed that the gap is shrinking over time.
So we took the users that joined during 2009 for this analysis. Of the editors who joined in 2009, only 16% are female. That means only 1 in 6 are female. And this actually gets worse. From that 16% of editors we only get 9% of the edits, so the contribution from that 16% of people is only 9% of total edits.
So that means only 1 in 11 edits is actually from a female. We wanted to understand this in more detail, and to study at what edit level this gap really expands.
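As a quick check on the "1 in 6" and "1 in 11" figures, the arithmetic is just the reciprocal of each share. This is a small sketch; the function name is illustrative and not from the paper.

```python
def one_in_n(share: float) -> float:
    """Convert a share between 0 and 1 into the N of 'one in N'."""
    if not 0 < share <= 1:
        raise ValueError("share must be in (0, 1]")
    return 1 / share


# 16% of editors who joined in 2009 are female: roughly 1 in 6.
print(round(one_in_n(0.16), 2))
# Those editors account for 9% of edits: roughly 1 in 11.
print(round(one_in_n(0.09), 2))
```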
So to do that, we went ahead and plotted this graph that you see. On your horizontal axis you see the edit count: starting from your left, it's 1 edit, 2 to 3 edits, 4 to 7, 8 and so on, so we broke the edits into buckets. On your vertical axis you see the percent of female editors.
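The bucketing on that graph can be sketched roughly as follows. This assumes the buckets keep doubling (1, 2-3, 4-7, 8-15, ...), which matches the labels read out on the slide but is an assumption beyond the first few; the function names and the toy data are made up for illustration.

```python
from collections import defaultdict


def edit_bucket(edit_count: int) -> str:
    """Return the doubling bucket label for an edit count: '1', '2-3', '4-7', ..."""
    if edit_count < 1:
        raise ValueError("edit_count must be at least 1")
    k = edit_count.bit_length() - 1          # index of the highest set bit
    low, high = 2 ** k, 2 ** (k + 1) - 1
    return str(low) if low == high else f"{low}-{high}"


def percent_female_by_bucket(editors):
    """editors: iterable of (edit_count, gender) pairs, gender being 'F' or 'M'."""
    totals = defaultdict(int)
    females = defaultdict(int)
    for edit_count, gender in editors:
        bucket = edit_bucket(edit_count)
        totals[bucket] += 1
        if gender == "F":
            females[bucket] += 1
    return {b: 100.0 * females[b] / totals[b] for b in totals}


# Toy data only, not the paper's numbers.
sample = [(1, "F"), (1, "M"), (2, "M"), (3, "M"),
          (4, "M"), (5, "F"), (6, "M"), (7, "M")]
print(percent_female_by_bucket(sample))
```

Plotting the resulting percentages against the buckets in order gives the kind of curve described here.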
So as you see, as the number of edits increases you see a downward trend. You have a very low percentage at the higher edit counts: only 6% of the editors with more than 500 edits are female. Which is not really a good thing if you seek female participation and you want to see a reduced gender gap.
So, keeping this in mind, we wanted to see if there has been an increase over time. We know that the share of female editors drops off, based on the previous slide you saw, so we wanted to see if participation has increased. We looked at the user boxes that I showed you earlier. The user box data runs from 2006 to 2011, because the technology was introduced at the end of 2005.
When you look at this graph, you see a little bit of noise, but if you focus on that 10 to 15% range, it's been constant; there's no increase. If you look at that band, it's constantly staying there.
So it does not show us an increase. That was from the user box templates. If you look from 2009 to 2011, this is from the preference settings that users have set up, and this shows less noise, but it also shows a constant, steady line. That means the gap has always been constant; it has not been reduced, so the gap has not been shrinking.
So far we checked if the gap exists. We know that the gap exists. We just saw that the gap has not been shrinking. So that's unfortunate, but that's what the data show, and normally data don't lie.
Going forward, we wanted to look at the second research question, which is how Wikipedia is affected by the gender gap. We know that female participation drops off as edit counts go up. The gap is constant.
So now we want to take a look at whether this gap has had any effect on Wikipedia as a whole. To do that, we analyzed three parts of this research question. We wanted to analyze whether females and males focus on the same areas or on different areas. Then we wanted to look at the coverage, and we hypothesized that coverage of female topics is inferior to coverage of male topics.
And as a third hypothesis, we hypothesized that females are more likely to be involved in social activities and social participation. Research shows that females are more willing to participate in helping others and doing community work, which has been shown by social work research, and Wikipedia being a collaborative place, we wanted to see if that happens in Wikipedia too.
So breaking this into three hypotheses, we started by analyzing the focus differences: what do males and females focus on? And what we saw is that females pay more attention to people and arts, whereas males pay more attention to science and geography, which was very interesting.
And the way we did it is very interesting. You all know categories in Wikipedia, right? Every page in Wikipedia is categorized. So we went through the category structure and labeled pages with areas such as people, arts, and philosophy, and we looked at the editors of those pages, and that's how we analyzed female and male participation.
And we actually used the 2008 Wikipedia data dump.
So as you see, men are more interested in geography and science, and females are more interested in arts and people.
So we wanted to take a look at the coverage, if there's a gap there. The NY Times article mentions that there's a coverage deficit between female articles and male articles, using those anecdotal examples of bracelets, tin soldiers, and Sex and the City. Comparing them, the Sex and the City articles are very short and the content is not precise, but when you look at baseball cards and tin soldiers, those are very lengthy, very developed articles. We wanted to see if this is really true or if it is just a stereotypical, anecdotal thing.
So at large scale we wanted to measure the quality of the article, and I'll explain to you in detail how we measured that. And then we analyzed the gender of the article. This might be a little strange for you, how you define the gender of an article. I'll get to that in a moment.
So keeping those things in mind, we wanted to see if the coverage of Wikipedia articles is worse for female articles. We wanted to see if Wikipedia is actually skewed toward male topics.
So for coverage quality, as I said, article length is a simple and very good predictor. Previous research has shown that it is simple but one of the best ways to assess the quality of an article. And the topic gender we take from the user activity.
So take a Wikipedia page, and let's say you have 5 female editors and 1 male editor. We can say that the article has a high percentage of female participation and it's a female topic.
Because we already know that females and males focus on different topics. So considering that, we could justify that this is a female topic.
We did this for each article with 30 or more known-gender editors. And as I mentioned, based on how many of the editors are female or male, we determined whether the article is a female or a male article.
So topic gender is the percent of editors that are female. Let's take an article, and you see that it has 80% female participation, and you stack the articles up. One has 80%, one has 79%, one has 76%. Once you stack them all in order, the top 20% of those articles are the female articles, the ones with the highest female representation, and the bottom 20% are the male articles.
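The stacking just described can be sketched as follows. It assumes the "30+" threshold refers to editors whose gender is known, and it uses made-up article names and percentages; the function names are illustrative and not from the paper.

```python
def topic_gender(editor_genders, min_known=30):
    """Percent of known-gender editors who are female, or None if too few are known."""
    known = [g for g in editor_genders if g in ("F", "M")]
    if len(known) < min_known:
        return None
    return 100.0 * known.count("F") / len(known)


def label_articles(scores, tail=0.20):
    """Rank articles by percent female and label the top and bottom tails.

    scores maps article title to percent female. The top `tail` fraction is
    labeled 'female', the bottom `tail` fraction 'male', and the rest 'other'.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    cut = int(len(ranked) * tail)
    labels = {}
    for position, title in enumerate(ranked):
        if position < cut:
            labels[title] = "female"
        elif position >= len(ranked) - cut:
            labels[title] = "male"
        else:
            labels[title] = "other"
    return labels


# The 5-female, 1-male example; threshold lowered only for this toy case.
print(topic_gender(["F"] * 5 + ["M"], min_known=1))

# Hypothetical scores for five articles.
scores = {"A": 80.0, "B": 79.0, "C": 76.0, "D": 40.0, "E": 10.0}
print(label_articles(scores))
```

With five articles and a 20% tail, exactly one article lands in each tail, which makes the ranking step easy to follow by hand.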
Hopefully I'm clear about this. With this in mind we went and analyzed the coverage and the article quality. We see that female articles are roughly 28k in size, and male articles are roughly 33k, so there is a definite gap. We also see gender-neutral articles, the articles in the middle range, that 20% of articles in the middle range. So there's a high number of articles with participation from both genders. Which is a good thing, in other words.
But if you look at the male and female articles, there's a definite gap in the article size.
This means that female-interest topics are actually lacking quality in the article, and that's why we needed to dive deeper into this question and analyze what is actually happening.
So we realized that, yes, we have the article quality, but we want to understand the article coverage also.
To do that, we thought it was better to use the data that we have in MovieLens. What we did is a very smart and tricky thing. We took our MovieLens ratings, and we matched each movie in MovieLens to its article in Wikipedia. So this focuses mainly on the articles about movies.
And our intuition is, normally when you go to a movie, you don't think about the gender of the movie. You think about whether the movie is fun, nice, action. Basically, if the movie is good, you go and watch the movie. And rating is the same. You don't think about the gender of the movie when you rate it. You think about whether it is a good movie, a quality movie, directed well, acted well, all of that combined. So that was our intuition, and with that intuition we dove in to see if it could bring us any results.
So users rate movies in MovieLens, and 80% of users provide their demographics, which is higher than in Wikipedia. One of the biggest problems that we had in Wikipedia is that it was really hard to find the gender of the users. We had to jump through so many hoops to do this. But here, since we control the data in MovieLens, we had the upper hand, with 80% gender-known users.
So, similarly to how I identified the gender of the topic for a Wikipedia article, we derived the movie gender from the ratings. If a movie has a higher share of its ratings from females, it is a female movie, or female-oriented movie. And similarly, if it has a lower share of its ratings from females, it's a male movie.
With that in mind, we did something that is fairly simple for us, but I will try to explain it in a very simple way: a multiple regression analysis. In a regression analysis, you have a dependent variable and you have an independent variable. That's a simple regression, but here we are doing a real-world example, so we need a multiple regression. We have our independent variable, which is the gender of the movie, and our dependent variable, which is the article length, standing in for the quality of the article, and then we have control variables.
The variables that we control for are the age of the movie, the movie's popularity, and the movie's quality, because these can contribute to making the article lengthier and better.
So if we want to understand whether there is any gender-related quality disparity here, we really need to control for these. If the movie is from the 1920s, it has the potential to have a lengthy article: it has been there forever, and so many users have seen it, so we need to control for that. And also the popularity. A highly popular movie we need to control for, because everyone has seen it, and everyone has something to say about it. The same goes for movie quality. If it's a well-directed Spielberg movie, it has higher ratings and higher buzz. So we have to control for these variables to understand the relation between article quality and movie gender. By doing that we managed to explain 47% of the variance, as you see. Hopefully I'm clear about this. So we managed to explain 47% of the variance... And this curve shows, let me explain this real quick...
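To make the idea of a regression with controls concrete, here is a minimal sketch of ordinary least squares in plain Python. It is not the model from the paper: the variable names, the toy rows, and the coefficients used to generate them are all invented for illustration, and a real analysis would use a statistics package.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):                        # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ols(rows, y):
    """Fit y on the predictors in rows; return (coefficients, R squared).

    The first coefficient is the intercept. R squared is the share of the
    variance in y that the model explains.
    """
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    xtx = [[sum(x[i] * x[j] for x in design) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * yi for x, yi in zip(design, y)) for i in range(k)]
    beta = solve(xtx, xty)
    predicted = [sum(b * v for b, v in zip(beta, x)) for x in design]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - p) ** 2 for yi, p in zip(y, predicted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return beta, 1 - ss_res / ss_tot


# Toy rows: (movie_gender, age, popularity, quality). All values are made up.
rows = [(-2, 10, 1, 3), (-1, 30, 5, 4), (0, 5, 2, 2),
        (1, 50, 3, 5), (2, 20, 4, 1), (0.5, 80, 6, 3)]
# Article length generated from a known formula, so the fit can be checked.
length = [10 - 2 * g + 0.5 * a + 3 * p + 1 * q for g, a, p, q in rows]

beta, r_squared = ols(rows, length)
print([round(b, 3) for b in beta], round(r_squared, 3))
```

The coefficient on the first predictor is the movie-gender effect with age, popularity, and quality held fixed, which is the sense in which those variables are controlled.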
On your horizontal axis you see the movie gender, from male audience to female audience.
When you see -2, that is two standard deviations more towards male. And if you see +2, it is two standard deviations towards female participation. So when you see a -2 movie with a 1.4 rating effect, it's a more male-related topic. When you see Rambo, look at it, I cannot even show it on the scale because it is so high; quality-wise and length-wise, it's even above the scale.
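Reading the axis in standard deviations amounts to standardizing each movie's gender score. A minimal sketch, assuming the raw score is the share of a movie's ratings that come from female raters (the sample shares below are invented):

```python
from statistics import mean, pstdev


def standardize(values):
    """Return z-scores: how many standard deviations each value is from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("values must not all be equal")
    return [(v - mu) / sigma for v in values]


# Hypothetical shares of ratings from female raters for three movies.
female_share = [10, 20, 30]
z_scores = standardize(female_share)
print([round(z, 3) for z in z_scores])
```

Negative scores are the movies rated mostly by men, and positive scores the ones rated mostly by women.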
Actually Monster's Ball is right in the middle; quality-wise and length-wise it's right about the average we are looking at. And I don't know if anyone has seen this movie, Richard Manson’s coffin, look at that line. The actual length is way below the line, and the quality is really poor. We see that this means female-interest topics actually have lower-quality articles. Also, interestingly, a few numbers related to Monster's Ball and Rambo: Monster's Ball is about 15000 words, where Rambo is 4000 words, so you can see roughly the word gap in an article.
So, but there is a silver lining in all of this. Can anybody tell me who these are? I mean, that's President Obama, and Mother Teresa, and, somebody from South Africa is here, so, Nelson Mandela. They are all Nobel Prize winners. So what we did was take a look at the article quality for Nobel Prize winners. At the same time we also did it for Oscar winners. So if you're famous enough, your article is well off. We don't actually see a gap between male and female Nobel Prize winners here. Keep in mind there are 700-something male Nobel Prize winners and only 40 female Nobel Prize winners, so that's why you don't see the entire 700, because there's no point, and the curve dropped right after we did the translation on the number 16 on the grant.
So going by the size, there's no big gap like you saw with movie gender. If you're famous enough, you don't suffer. Which means if you're famous, your article has really good participation and gender has really no impact on it. Which is a good thing, I guess; maybe we should focus on this to reduce the gender gap in the Wikipedia community and in editor participation.
Next, the social and community participation. As I mentioned, social research shows that females lean more towards helping and participating in social communities, so we wanted to understand the differences in social participation. What we did there is look at the percent of edits happening in user pages or user talk pages. Is there anybody who does not know what a user page or user talk page is? I guess everyone knows about that. Basically, a user page is pretty much where you talk about yourself, and a user talk page is where you can interact with other people. You edit something, somebody doesn't like something about it, they come and post on your talk page, then you go and post on the other person's talk page. It's like a chat session. It's the wiki version of chat.
We analyzed that. The average female has 25% of her edits in user and user talk pages, compared to the male, so I see much higher female participation here. I'm not trying to say that we men talk a lot, but that's it, this is what the Wikipedia research shows, not what I'm saying, don't try to kill me.
And then, what interested us was what kind of percentages we see in becoming an administrator. We see in the real world that females are gaining power: they are becoming heads of places, running for president, and winning Nobel Prizes, and we wanted to see if that has trickled down to Wikipedia. Becoming an administrator is one of the top things in Wikipedia as an editor, so we really wanted to understand what happens. But unfortunately more males are becoming administrators once they join. So we need to understand this and focus on it going forward to really reduce the gender gap. It's very important, because in Wikipedia administrators have a lot to say, and they have control over a lot of things that happen. So if you want to have a proper gender balance, this is something we should, as Wikipedians, address.
And we also looked at the edits by administrators. This is actually another silver lining in the research. We looked at all the administrators and the editors. Female administrators actually have higher edits than the male ones, except for 5. This means that once they get to a certain level they keep going and keep participating. So the target is to try to get these users to this level where they can be very productive. Maybe there's something going on that we don't see that keeps them from getting to this level, but that's something for us to figure out. Once they get to this level, females have higher participation than males.
Just to recap what we discussed so far: we know that a gap exists, and we know that the gap is not shrinking. We see that females and males do focus on different areas, with females focusing more on arts and people and males more on geography and science. The coverage is actually worse when it comes to female topics, and we also know that females are more into social participation, from user talk pages and user pages.
So having established that, we wanted to understand the conflicts. This is actually a really big thing. Conflicts make everything interesting in this Wikipedia world. And we wanted to understand how the gender gap plays out in conflict in Wikipedia.
To do that we broke down this conflict-related research question into 4 hypotheses. It's intuition again.
First hypothesis: females tend to avoid controversial or contentious articles. We thought, ok, maybe females don't really want to argue, they hate to have a fight, and in a conflict they back down, so because of that they stay away from contentious articles.
We also hypothesized that female editors are more likely to have their early edits reverted. I'll show you later that this has a major effect on losing female participation. And we also wanted to see whether female editors are more likely to stop editing or participating once they've been reverted, whether they never come back or just completely stay away from Wikipedia.
And we hypothesized that female editors are less likely to be blocked. Because they participate in social activities, as we saw before, if they get blocked they would probably put up a notice, or put up a post, and get their block released. So considering this, we thought females are less likely to get blocked.
So, conflicts and reverts. Does anybody not know what a revert is?
A revert is when you post something, and another user comes along and restores the page to the previous editor's version.
So that's a revert. And a conflict is what comes after that: a disagreement starts between the two users, and that becomes a conflict on the page.
When we say contentious or controversial articles, that means articles about, for example, public figures such as Obama and other presidents... Sorry about that... So some public figures, some countries' presidents, some football players. Some articles carry the controversial tag, because more people have opinions about those articles, so those are the articles that we look at when we look at controversy.
So actually our hypothesis was not supported here. We hypothesized that females are going to stay away from controversial articles, but what we saw is the complete opposite.
Females actually have higher participation in highly controversial articles than males. And they also participate in a higher number of male controversial articles. So considering Mother Teresa and the articles that we saw, and looking at whether an article is a female topic and whether it's controversial, there are more female-oriented articles that lean controversial than male articles.
Which is an interesting thing, and is also a reversal of our hypothesis. We really didn't expect this to happen, considering the gender gap so far and that the gap has not been shrinking, so we had thought the complete opposite.
So if an article is a female-oriented article, meaning it has a higher number of female editors, it is twice as likely to be a controversial article. I'm not trying to say that female articles are controversial; it's just that a high number of female editors goes along with controversy taking place in an article.
So the next thing we wanted to understand is whether female editors are more likely to have their early edits reverted. When we look at this, we want to stress that we only followed the main space. There are other namespaces in Wikipedia that you can analyze, but we wanted to stick to the main space, and we wanted to look only at edits made in good faith that were reverted, not edits reverted because of anything else. So you go and try to improve an article. Let's say it's an article about Sue Gardner. I go and try to edit Sue on Wikimedia, on Wikicamp. And somebody goes and completely reverts that. I made that edit in good faith, but somebody went and deleted it, or reverted it, thinking that it's not important.
So we wanted to understand how this takes place and what the gap is, and whether females get reverted more when they make a good-faith edit.
The question is, do they really get reverted more?
And... This is the graph that you see. We have broken it down by the stage of the editor, so you see first edit, two to three edits, four to seven edits, and the percentage of edits reverted. Females are the red line and males are the blue line, and you see female newcomers get reverted more often. If you look at the top, it's 7 percent for females and 5 percent for males. So if you're a female and you are a newcomer, which is the first edit through the seventh edit, you are more likely to be reverted. And the next thing that we wanted to understand is when exactly this happens.
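The comparison on that graph can be sketched as a revert rate per gender over newcomer edits. This assumes each edit record carries the editor's gender, the position of the edit in that editor's history, and whether it was reverted; the record layout and the toy data are assumptions for illustration, not the paper's pipeline.

```python
from collections import defaultdict


def newcomer_revert_rates(edits, max_edit_index=7):
    """Percent of newcomer edits reverted, per gender.

    edits: iterable of (gender, edit_index, was_reverted), where edit_index
    is 1 for an editor's first edit. Only edits up to max_edit_index count
    as newcomer edits.
    """
    totals = defaultdict(int)
    reverted = defaultdict(int)
    for gender, edit_index, was_reverted in edits:
        if edit_index > max_edit_index:
            continue
        totals[gender] += 1
        if was_reverted:
            reverted[gender] += 1
    return {g: 100.0 * reverted[g] / totals[g] for g in totals}


# Toy records only, not the paper's data.
toy_edits = [
    ("F", 1, True), ("F", 2, False), ("F", 3, False), ("F", 4, False),
    ("M", 1, False), ("M", 2, False), ("M", 5, True), ("M", 6, False),
    ("M", 7, False), ("M", 8, True),   # eighth edit: outside the newcomer window
]
print(newcomer_revert_rates(toy_edits))
```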
So you log in to Wikipedia as a new user and start editing immediately. That's the first thing people do. And we saw that, regardless of gender, a revert in the first 24 hours of editing has a major impact on leaving Wikipedia. We did a very rigorous, statistically sound analysis of that in our paper, if you want to read about how we really did it, but we see that the effect takes place for both genders, so it's really similar in that case.
So being female and being reverted in the first 24 hours has the same effect as for a male editor. But it doesn't explain why females get reverted more in their first few edits.
We don't know exactly what causes it. That is something that is not analyzed here; this paper was about what is really going on. It is something for us to understand together: is it low quality, are they doing something wrong, are they not understanding the techniques in Wikipedia? There can be many causes. A qualitative analysis is due to understand what's really going on.
So next is the blocks. A block means that you go and edit, maybe for a variety of reasons, we don't know exactly what it is, you keep editing, and at some point you get completely blocked from Wikipedia. That means that you have to start talking to administrators, or posting, or sending emails to give a reason to get unblocked. So we needed to understand: are females blocked more often? Actually that is not the case; it's a very similar percentage, so both genders get blocked at fairly similar rates, and gender is not really significant in this case. But more interestingly, once you get blocked, are you gonna stay blocked indefinitely, or are you gonna come back? What we see is that 3.85% of the blocked users stay indefinitely blocked. Which means that female users maybe never make an attempt to come back, or maybe never participate in administrative conversations to get back. So we really don't know what's causing them not to come back. There can be many reasons, so that's something to understand in the future, what's happening in this case.
So, so far I went through a couple of hypotheses and showed you statistically what exactly is happening on Wikipedia. We know that the gap exists and that it is a large gap; keep in mind that 16% of newcomers are female, that's 1 in 6, and that 16% only contributes 9% of edits. We also see that the gap has not been shrinking: looking at 2008-2009 participation using user preference percentages, what we saw is a flat 15% proportion of female participation.
On coverage, the hypothesis is supported: female articles are lacking in quality. We also saw that females are more likely to participate in social activities, which is a positive thing. And what we saw in the edits is that articles with higher female participation are more contentious; a higher number of female-related articles are bound up in conflict.
We also saw that being reverted as a newcomer has the same effect for both males and females. And the last one is, if you're blocked, females are more likely to be blocked indefinitely.
So considering this, there is something here that I would like to explain to you. The finding to take away is the quality of the female articles. I think that's something we need to address right here, because we don't want that in the world's knowledge repository right now. So we need to have more eyes on this. And editor conflict is actually driving away females. That is something we have to address, maybe as a community, or administrators in Wikipedia have to be a little bit, you know, nicer or... I've had many experiences during my research, just by trying several things in Wikipedia, of my accounts being blocked, my accounts being, you know, so many things have happened to me during this research process, just by trying this.
So maybe it's time for us to really understand why these things happen. This is actually the Wikimedia Foundation's vision. And I think it is proudly stating this... yes... Due changes are: improving female participation, addressing... and it shows we can come to a place where we can improve the quality of female participation, make this a better Wikipedia, and accomplish this goal.
And I think they also have a goal for 2015 to increase it to 25%. Hopefully this research will open the eyes of the Wikipedia population, and become a good place to start talking about it, addressing things, and understanding what is really going on. We have seen recently that CSCW and WikiSym have been focusing on different gender-related topics in Wikipedia, so we are so glad that we started this, and hopefully people can... That's normally a GroupLens research thing: we start things and let other people explore, and hopefully that will go on in the future.