Research:Genealogies of online communities

From Meta, a Wikimedia project coordination wiki
00:34, 2 August 2019 (UTC)
Duration:  2019-October – 2022-September

This page documents a planned research project.
Information may be incomplete and change before the project starts.

Genealogies of Online Communities is a National Science Foundation-funded[1] three-year project to study the emergence of online groups on Reddit and Wikipedia. New online groups do not appear spontaneously but are created by users who participated in existing online groups. A new group's early members carry their own activity history, which reveals their past group memberships as well as the relations between existing groups and the new group. The ability to trace the behavior of users before they start new groups presents a unique opportunity to understand how new groups emerge, how social norms arise in new groups, and what factors contribute to the success of new groups.

This research will quantitatively model the genealogical relationships between online groups by tracing the socio-technical lineage of "child" groups through their early members' previous participation in "parent" groups. Platform administrators or group moderators can use genealogical approaches to adopt norms that already exist in other groups, recommend new groups to users, and identify opportunities to create new groups. A genealogical perspective can also explain the variability in group success as well as how norms spread throughout online groups. This work is intended to inform changes in online communities, either via the redesign of their platforms or through the implementation of new community policies.

The project draws on theories about the formation of online communities, organizational ecology, and kinship to explore how a group's position within a genealogical graph influences the group's identity, norms, and success. This research will analyze log data about user behavior over time from Reddit and Wikipedia. Aggregating the socio-technical lineages of multiple groups together generates a genealogical graph documenting how new groups emerge from the old. These genealogies will be evaluated across platforms, users, and time. The project will advance human-centered data science methods by employing qualitative methods, such as interviews and focus groups, to validate the proposed genealogical graphs, which will inform quantitative methods for building genealogy graphs from large-scale log data sets and analyzing their structure and dynamics.


This project will use a mixed-methods framework to make sense of online group formation and evolution by tracing the foundation of new groups through the existing groups in which their members previously participated. This framework proposes computational methods for reconstructing genealogical relationships from large-scale event log data, a human-centered data science approach for validating these genealogical constructs through inductive and qualitative analyses of digital traces, and an evaluation of how genealogical relationships bear on canonical group processes. Because this interdisciplinary framework generalizes across socio-technical systems, we will analyze the genealogies of groups using large-scale datasets from Reddit and Wikipedia. This project will accomplish three primary goals:

  • Goal 1: Characterizing genealogical graphs. What is a genealogical relationship in an online community? We develop a preliminary quantitative method for identifying parent-child relationships between online groups based on the temporal sequences in their users’ public activity logs.[2] We use Reddit to obtain proof-of-concept results based on initial constructs and parameters. We propose further extensions and plan to conduct the first large-scale characterization of genealogies using datasets from other platforms like Wikipedia.
  • Goal 2: Validating genealogical graphs. Does our genealogical construct capture substantive relationships between online groups? We will employ a battery of mixed-methods approaches such as trace ethnography, trace interviews, and focus groups to validate this construct. This triangulation step will elicit alternative definitions of genealogies, produce labeled data, and identify outliers that will require induction and iteration to generate more robust constructs.
  • Goal 3: Evaluating group processes with genealogical graphs. How do genealogical relationships explain group processes? We analyze how processes like group success and group norms are influenced by genealogical relationships. We propose to examine how genealogical graphs relate to group success through a prediction framework and study the effectiveness of features based on genealogical graphs. Further, we investigate the diffusion and evolution of norms along the genealogical connections using explicit norms documented in organization pages and implicit norms reflected by language use.
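
The parent-child idea in Goal 1 can be illustrated with a minimal sketch. It is not the project's actual method (that is described in reference [2]); it assumes a simplified rule in which a group counts as a "parent" of a "child" group when a sufficient share of the child's earliest members were active in that group before joining the child. The function name `infer_parents`, the parameters `n_early` and `min_share`, and the sample events are all hypothetical.

```python
from collections import Counter, defaultdict


def infer_parents(events, n_early=3, min_share=0.5):
    """Guess parent groups for each group from (user, group, timestamp) events.

    Assumed rule (illustrative only): take the first `n_early` users to appear
    in a group, look at the groups each of them was active in beforehand, and
    keep every earlier group shared by at least `min_share` of those users.
    """
    # Record the first time each user appears in each group.
    first_seen = {}
    for user, group, ts in sorted(events, key=lambda e: e[2]):
        first_seen.setdefault((user, group), ts)

    joiners = defaultdict(list)   # group -> [(first timestamp, user)]
    history = defaultdict(list)   # user  -> [(first timestamp, group)]
    for (user, group), ts in first_seen.items():
        joiners[group].append((ts, user))
        history[user].append((ts, group))

    parents = {}
    for child, members in joiners.items():
        early = sorted(members)[:n_early]
        counts = Counter()
        for joined_at, user in early:
            for ts, group in history[user]:
                if ts < joined_at and group != child:
                    counts[group] += 1
        ranked = [
            (group, n / len(early))
            for group, n in counts.most_common()
            if n / len(early) >= min_share
        ]
        if ranked:
            parents[child] = ranked
    return parents


# Hypothetical activity log: (user, group, timestamp).
events = [
    ("ann", "r/python", 1),
    ("bob", "r/python", 2),
    ("cat", "r/python", 3),
    ("cat", "r/statistics", 4),
    ("ann", "r/datascience", 10),
    ("bob", "r/datascience", 11),
    ("cat", "r/datascience", 12),
]
result = infer_parents(events)
```

Collecting the inferred edges for every group yields a directed graph of the kind the project calls a genealogical graph. The thresholds here are arbitrary and would need the qualitative validation described in Goal 2.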

Research design

We will collect, store, and analyze public-facing data made available by Reddit and Wikipedia for all three goals described above. These data are generated by users and made publicly available under the respective Terms of Service and Privacy Policies of each platform. These data contain the history of users’ contributions on each platform, including the content of their contributions (Wikipedia), posts (Reddit), and comments (Reddit and Wikipedia) as well as associated metadata including timestamps. These data can be accessed programmatically from the platforms through Application Programming Interfaces (APIs) or through database dumps. Databases will be updated on a monthly to annual basis depending on the availability of new data and the priority for using it in the research.
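
As one illustration of programmatic access, the sketch below builds a request URL for the MediaWiki Action API's `usercontribs` list, which returns a user's public contribution history with timestamps. It only constructs the URL and makes no network call. The choice of the English Wikipedia endpoint, the helper name `usercontribs_url`, and the username "Example" are assumptions for illustration, not part of the project's documented pipeline.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Assumed endpoint: the English Wikipedia Action API.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def usercontribs_url(username, limit=50):
    """Build a MediaWiki Action API URL listing a user's public contributions."""
    params = {
        "action": "query",
        "list": "usercontribs",
        "ucuser": username,
        "uclimit": limit,
        "ucprop": "ids|title|timestamp",  # revision ids, page title, edit time
        "format": "json",
    }
    return API_ENDPOINT + "?" + urlencode(params)


# "Example" is a placeholder username.
url = usercontribs_url("Example")
query = parse_qs(urlparse(url).query)
```

For analyses at the scale described here, database dumps are likely a better fit than per-user API requests; the API sketch is mainly useful for spot checks.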

We will also collect data from our trace ethnographies, trace interviews, and focus groups as a part of Goal 2. The trace ethnographies will start with the same public-facing data used in Goals 1 and 3 and will not require participant interaction. The trace interviews and focus groups will be conducted through a combination of remote interviews (e.g., Skype) and in-person sessions at the data collection workshops. Interview data will be collected and stored in digital audio formats and transcribed to digital text formats. Field notes, inductive codes, and other derivative qualitative research products will be created and digitally stored using software like MAXQDA or NVivo.


  • Data collection: Spring 2019 - Spring 2020
  • Characterizing genealogy graphs: Fall 2019 - Fall 2020
  • Cross-platform comparisons: Spring 2021 - Spring 2022
  • Trace ethnographies: Fall 2019 - Fall 2020
  • Trace interviews: Fall 2020 - Fall 2021
  • Focus groups: Spring 2020 - Spring 2022

Policy, Ethics and Human Subjects Research

The proposed research necessarily involves tracking user activity across contexts, which raises important ethical and privacy concerns. Large-scale digital trace data analysis presents very real tensions for bedrock ethical principles like respect for persons, beneficence, and justice, and researchers using online community data have not reached a consensus on ethical best practices. First, users maintain different identities for different groups, but research designs can collapse these contexts together and upset users’ imagined audiences. Second, just because users’ trace data are accessible through public APIs does not automatically exempt them from ethical concerns. Third, while the policies governing ethical review boards in the United States interpret digital trace data as less risky to participants than other research designs, social media users express reservations about their content being used for research. The absence of a professional consensus or clear regulatory guidance is not an exculpatory argument for researchers to employ laissez-faire strategies for collecting and analyzing online user behavior.

We will address the ethical concerns of our research through several approaches. First, we will be transparent about our use of data and findings with the communities whose data we are using. We will participate in appropriate forums for research about these communities, such as Reddit’s /r/TheoryOfReddit or Wikipedia’s “Village pump”, to disclose our research designs and share our results. Second, we will support community- and professional-led deliberation about our research. We will use community-led deliberative genres like “Ask Me Anything” engagements and blog posts to share our research results and to invite feedback or co-creation of research designs. For professional-led deliberation, we will consult with on-campus colleagues as well as other Reddit and Wikipedia researchers through workshops and panels at conferences to assess the risks and benefits of different data and research designs. Third, our analyses under Goals 1 and 3 will employ de-identified and aggregated data from public sources and will not involve joining with other data that could lead to de-anonymization. Because our analyses under Goal 2 will require identifying, tracing, and interviewing individual users, the research protocols for interviews and focus groups will be reviewed by the IRB, and participants will be debriefed and will never be identified in published material. This reflexive, empirically grounded, and multi-pronged approach embodies fundamental ethical values of beneficence, justice, and respect for the public and persons.
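
One way the de-identification and aggregation step for Goals 1 and 3 might look in practice is sketched below. This is an assumption about a possible implementation, not the project's documented procedure: usernames are replaced with keyed hashes, and only per-group counts are reported, with small groups suppressed. The names `pseudonymize` and `aggregate_by_group`, the `min_users` threshold, and the sample events are hypothetical. Keyed hashing is pseudonymization rather than full anonymization, so the key would need to be kept secret and separate from the data.

```python
import hashlib
import hmac
import secrets
from collections import defaultdict


def pseudonymize(username, key):
    """Replace a username with a keyed hash (HMAC-SHA256, truncated)."""
    digest = hmac.new(key, username.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def aggregate_by_group(events, key, min_users=2):
    """Count distinct pseudonymous users per group, dropping small groups.

    `events` is an iterable of (user, group, timestamp) tuples. Groups with
    fewer than `min_users` distinct users are suppressed so that a reported
    count never describes a single person.
    """
    users = defaultdict(set)
    for user, group, _ts in events:
        users[group].add(pseudonymize(user, key))
    return {g: len(u) for g, u in users.items() if len(u) >= min_users}


# Hypothetical data and a randomly generated secret key.
key = secrets.token_bytes(32)
events = [("ann", "group_a", 1), ("bob", "group_a", 2), ("ann", "group_b", 3)]
counts = aggregate_by_group(events, key)
```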

The two types of studies for this project have been screened by CU Boulder's Institutional Review Board. The first study, which uses publicly available revision history data, is not human subjects research because it involves archival data. The second study, which involves interviews and focus groups with editors and administrators, has preliminary IRB approval (Protocol # 19-0424, July 2019 to January 2020) pending additional research design to be informed by other (non-human subjects) findings.


In progress.