Research:Breaking into new Data-Spaces/Proposal

From Meta, a Wikimedia project coordination wiki

This page documents a workshop proposal for CSCW 2016 (due Oct. 16th) intended to explore the needs of scientists who study open online communities by providing a set of CSCW researchers with state-of-the-art data tools and reflecting on what makes replicating and extending past work difficult.

Title options[edit]

List your title proposal in this section and discuss it on the talk page

  • Breaking into New Data-Spaces: Infrastructure for Open Community Science
  • Data Data Everywhere: Strategies for finding the needles within Open Community Science

Organizers[edit]

  • Aaron Halfaker - Aaron Halfaker is a senior research scientist at the Wikimedia Foundation. His research interests include a variety of topics in human-computer interaction and social computing. He is known for his work studying the socio-technical causes of the decline in active contributors to the English Wikipedia. His work brings social science and feminist HCI perspectives to the application of advanced computer science techniques in computer-mediated social communities.
  • Jonathan Morgan - Jonathan Morgan is a senior design researcher at the Wikimedia Foundation. He holds a PhD from the University of Washington, where he runs workshops focused on open data and research literacy and occasionally teaches courses in HCI theory and UX research methods.
  • Sean Goggins
  • Yuvi Panda
  • David Laniado - David Laniado is a researcher in the Digital Humanities department at Eurecat (Barcelona, Spain), working on quantitative approaches to analyze and characterize online communities. His main research interest is the study of interaction patterns in social media, with a special focus on online conversation and discussion, deliberation processes, and collaboration dynamics in peer-production communities. He received his Ph.D. in Information Engineering (2012) and his master's degree in Computer Science (2007) from Politecnico di Milano.
  • William Rand - William Rand is a professor of Marketing and Computer Science in the University of Maryland's Robert H. Smith School of Business and the Institute for Advanced Computer Studies (UMIACS). He is also the Director of the Center for Complexity in Business. He examines the use of computational modeling techniques, like agent-based modeling, geographic information systems, social network analysis, and machine learning, to help understand and analyze complex systems, such as the diffusion of innovation, organizational learning, and economic markets.
  • Elizabeth Thiry - Elizabeth Thiry is currently a research scientist for Boundless working on user experience initiatives. Elizabeth also teaches online master's-level courses in information sciences. She received her PhD in Information Sciences and Technology from Penn State University. In addition to her academic experience, she has 10+ years of technical experience, holding positions such as developer, business analyst, and user experience researcher.
  • Kristen Schuster - Kristen Schuster is a doctoral candidate at the University of Missouri. Her work and research focus on methods for developing, implementing and using metadata schema to support collaboration in interdisciplinary projects.
  • A.J. Million - A.J. Million is a doctoral candidate at the University of Missouri, where he teaches digital media and Web development courses. His interests relate to the creation and administration of public information resources. These interests extend to dataset description, dissemination, and management. His dissertation examines website infrastructure in state government agencies.

Confirmed participants[edit]

  • Dario Taraborelli
  • Cameron Marlow
  • Nitin Bordankar
  • Andrea Wiggins



Despite being easily accessible, open online community (OOC) data can be difficult to use effectively. In order to access and analyze large amounts of data, researchers must first become familiar with the meaning of data values. Then they must find a way to obtain and process the datasets to extract their desired vectors of behavior and content. This process is fraught with problems that are solved (with great difficulty) over and over again by each research team/lab that breaks into the datasets of a new OOC. Rarely does the description of methods presented in research papers provide sufficient depth of discussion to enable straightforward replication or extension studies. Further, those without the technical skills to process large amounts of data effectively are often prevented from even starting work. The result of these factors is a set of missed opportunities around the promise of open data to expedite scientific progress.

In this workshop, we will experiment with strategies -- both technological systems and documentation strategies -- designed to enable our community to more thoroughly reap the benefits of open online data science practice. We will invite participants to attempt the difficult work of breaking into a new dataset using tools and documentation designed to alleviate common difficulties. In the months leading up to the workshop, we will prepare and describe several datasets within an open querying service and invite participants to explore these systems and their functionality through the replication and extension of a selected data-intensive research paper from past CSCW. During the workshop participants will have the opportunity to explore new tools and datasets and to jump start new studies based on our curated documentation and infrastructure. As we observe and interact with our participants, we hope to learn from their successes and struggles and to use these learnings to iteratively improve our tools and documentation protocols.

This work builds on a call to action from a previous CSCW workshop[1] and on ongoing initiatives[2] to build up shared research infrastructure[3] that supports data- and method-sharing practices. The workshop organizers come from many different backgrounds and have extensive experience with using OOC data, developing infrastructure to support access to and analysis of OOC data, and building communities of practice around OOC research.

We will use this workshop to achieve three goals:

  1. identify common challenges and novel strategies for making open community research easier to replicate and extend -- specifically targeting protocols for documenting research methods (e.g. the ODD protocol[4])
  2. inform the design of data management/analysis infrastructures like Quarry, our experimental open querying service[5]
  3. inform the design of metadata indexes like the Open Collaboration Data Factory's wiki[6]

We also hope to foster a community of practice within CSCW around open data management and to plan next steps toward accelerating scientific progress in the study of computer-supported cooperation, just as past workshops informed our plans for this workshop proposal.

The difficulties of using open online data[edit]

Regretfully, the technical availability of OOC datasets has not been a panacea for the study of socio-technical phenomena in these communities. Based on past research and workshops designed to help us explore OOC data science practices, we have identified three key hypotheses about what makes "breaking into" new datasets so difficult: (1) methods descriptions are often insufficient as a guide to replication & extension, (2) technological literacy bars access to processing large datasets, and (3) inconsistent and poorly indexed metadata prevent discovering what data is available and what the items of a dataset mean.

Methods replication[edit]

OOC research has advanced considerably in the last few decades, but it is still difficult to compare and contrast research findings from different pieces of work. Part of this is due to the very nature of the research; it comes from all sorts of fields, from information systems (e.g. [7]) to computer science (e.g. [8]) to information science (e.g. [9]) to marketing (e.g. [10]), and is studied on a wide variety of platforms, from Twitter (e.g. [11]) to wikis (e.g. [12]) to question-and-answer forums (e.g. [13]). As a result, different disciplines and different study venues use different language and different descriptions, making it hard to integrate knowledge gleaned from different origins.

Moreover, the lack of easy translatability between fields and platforms has made the reproducibility of findings very difficult. A researcher in one field may take certain definitions for granted that are not well understood in another field, or at least are not understood in the same way. In other words, it is difficult for a researcher who works with Flickr data to understand how certain concepts are operationalized by researchers who work with blogging data. The field has progressed well so far because there is so much research to do in this space, but it is now time to build a cohesive theory of online communities, to create knowledge built on top of other knowledge, and to provide standards that allow different researchers to reproduce each other's findings.

In order to take the field to the next step, it is necessary to develop a standard of communication that will allow different researchers to communicate how and why they performed their analysis and research the way that they did. One development in other interdisciplinary fields that has helped communication across boundaries is the creation of a uniform methods protocol (e.g. [14]). By creating such standards that describe how data is collected and analyzed, as well as how certain measurements in the theory are operationalized within the data, it is possible to make it easier for different researchers to understand each other's work and to reproduce findings.

Technical literacy[edit]

A screenshot of the Quarry public querying system

While there are many powerful, widely available, free/libre tools for gathering, manipulating, and analyzing large datasets, CSCW is an interdisciplinary field and researchers' expertise with these tools varies widely. Even for researchers with such expertise, the beginning of a large-scale analysis is fraught with technical issues around formatting, types, and structure, resulting in a long process of trial and error. For researchers without such data engineering expertise, these problems can seem intractable. For example, at a past OOC data analysis workshop that we organized for the GROUP '15 conference, our expert participants spent 5 hours (the majority of the workshop day) converting and loading 100-million-row datasets into an analysis framework that would allow the larger group to answer basic questions. Even after the data was loaded, there were substantial concerns about inconsistencies between the documentation and the observed row counts.
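The kind of up-front loading work described above can be sketched in a few lines. The sketch below loads rows into SQLite in bounded-size batches; the column names and the tiny in-memory sample are illustrative stand-ins for a real, much larger dataset dump.

```python
# Sketch: loading a large row-based dataset into SQLite in chunks,
# so memory use stays bounded. Column names are hypothetical.
import csv
import io
import sqlite3

def load_in_chunks(conn, rows, chunk_size=50_000):
    """Insert (user, page, ts) rows in batches, committing each batch."""
    conn.execute("CREATE TABLE IF NOT EXISTS revision (user TEXT, page TEXT, ts TEXT)")
    batch, total = [], 0
    for row in rows:
        batch.append((row["user"], row["page"], row["ts"]))
        if len(batch) >= chunk_size:
            conn.executemany("INSERT INTO revision VALUES (?, ?, ?)", batch)
            conn.commit()
            total += len(batch)
            batch.clear()
    if batch:  # flush the final partial batch
        conn.executemany("INSERT INTO revision VALUES (?, ?, ?)", batch)
        conn.commit()
        total += len(batch)
    return total

# Tiny in-memory stand-in for a multi-gigabyte CSV dump:
sample = io.StringIO("user,page,ts\nalice,Main_Page,2015-01-01\nbob,Main_Page,2015-01-02\n")
conn = sqlite3.connect(":memory:")
n = load_in_chunks(conn, csv.DictReader(sample))
print(n)  # 2
```

Counting loaded rows against the source file is one cheap check for the documentation-versus-row-count inconsistencies mentioned above.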

With the goal of democratizing data analysis, we have been experimenting with open dataset interfaces that allow us to do such basic data engineering work up front and minimize the difficulty that future researchers experience when "breaking into" a dataset. We have identified two components that characterize open dataset interfaces: (1) public GUI sandboxes and query interfaces for lightweight in-situ data exploration and (2) approachable query languages (e.g. SQL).

We have developed such an open dataset interface for Wikimedia datasets in the form of a public SQL querying service called Quarry[15]. Quarry loads row-based datasets into a relational database management system and allows a user to join and filter datasets on the server through a web-based user interface. The service allows both the direct download of datasets and the download/sharing of secondary datasets produced by queries. We have found that non-experts can acquire proficiency in SQL over the course of an hour and that experts can use it to powerful effect. Further, by making past queries public, newcomers are able to learn common and advanced querying strategies on their dataset of interest. This helps non-experts quickly gain proficiency and, thus, become increasingly comfortable with new technologies that support their research agendas.
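For illustration, the sketch below runs the kind of join-and-filter SQL a Quarry user might write, against a tiny in-memory SQLite database; the table and column names here are hypothetical stand-ins, not Quarry's actual schema.

```python
# Illustrative only: a join-and-filter query of the sort Quarry supports.
# Table/column names are invented for this sketch.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE revision (rev_user TEXT, rev_page INTEGER);
CREATE TABLE page (page_id INTEGER, page_title TEXT);
INSERT INTO revision VALUES ('alice', 1), ('alice', 1), ('bob', 2);
INSERT INTO page VALUES (1, 'Main_Page'), (2, 'Sandbox');
""")

# Count revisions per page -- a typical "secondary dataset" produced by a query.
query = """
SELECT page_title, COUNT(*) AS revisions
FROM revision
JOIN page ON page_id = rev_page
GROUP BY page_title
ORDER BY revisions DESC;
"""
for title, count in conn.execute(query):
    print(title, count)
# Main_Page 2
# Sandbox 1
```

The result set itself is small and shareable, which is what makes publishing past queries useful for newcomers.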

We see querying interfaces like these as a key opportunity to make OOC datasets more accessible to both data engineering experts and laypeople. In this workshop, we'll put this conjecture to the test by supplying datasets through Quarry and learning from the experiences of participants.

Metadata & documentation[edit]

In order to break into a new dataset, a researcher needs to discover it and determine how to make use of it. Currently, OOC datasets are scattered across various websites on the internet. They are inconsistently (if at all) documented, and the terms used to describe the characteristics of the data differ based on the discipline of the authors. By gathering and standardizing information about OOC datasets, we can dramatically improve their discoverability and utility.

Classifying OOC datasets so interdisciplinary researchers can discover, access, and use them in collaboration with other scholars requires consistent and agreed upon descriptions. Engaging these challenges in a CSCW workshop will allow us to articulate shared research goals, develop common terminology for describing datasets in generalizable terms, and determine how to document metadata at different descriptive levels so that OOC researchers can use these datasets effectively. 

OOC datasets can be described at three levels: meta, mezzo, and micro. The meta level is descriptive information that helps researchers find datasets and conduct a preliminary evaluation of their value prior to use; this level also supports data management (cf. the Digital Curation Centre's Curation Lifecycle Model). The mezzo level describes the meaning captured in a dataset's content; scholars use mezzo-level information to analyze and understand OOCs across disciplines, create theories, conduct scholarship, etc. Last is the micro level, which granularly describes the contents and structure of a dataset. Tiered metadata schemas allow us to account for different research methods, modes of analysis, storage systems, and disciplinary norms, and to support other considerations such as dataset accessibility (e.g. copyright) and research ethics.
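As a rough sketch of how the three tiers might fit together in one record, consider the structure below; the field names are assumptions for illustration, not a finalized schema.

```python
# Sketch of a three-tier (meta/mezzo/micro) metadata record for an OOC
# dataset. All field names are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class DatasetMetadata:
    # Meta level: discovery and preliminary evaluation
    title: str
    source_url: str
    license: str
    # Mezzo level: what the content means, in cross-disciplinary terms
    unit_of_observation: str
    behavioral_measures: list = field(default_factory=list)
    # Micro level: granular structure of the data itself
    columns: dict = field(default_factory=dict)  # column name -> type/meaning

record = DatasetMetadata(
    title="English Wikipedia revision history (sample)",
    source_url="https://example.org/dataset",  # placeholder URL
    license="CC0",
    unit_of_observation="one edit to one wiki page",
    behavioral_measures=["edits per user per month"],
    columns={"rev_user": "text: editor name", "rev_timestamp": "ISO 8601 text"},
)
print(record.title)
```

A meta-level search would index only the first group of fields, while replication work would lean on the mezzo and micro fields.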

Workshop plan[edit]


Participants will be recruited through a mixture of strategies. We will contact participants from past workshops[16][17]. We'll post announcements on social media (Twitter & Facebook) as well as on open science/HCI-related listservs. Participants will be selected based on their interest and experience working with OOC datasets. We will be inclusive, since we'd like to learn about opportunities to support a wide variety of research experience levels and expertise.

As we accept participants to the workshop, we'll survey them to gather ideas for replication/extension studies that could be mostly completed within the workshop. Participants will be asked to suggest a study to replicate/extend and to describe what datasets and analyses would be necessary.

Workshop preparation[edit]

We'll gather and describe a small set of primary datasets relevant to the replication/extension studies we plan to ask participants to run. We'll supplement the methods descriptions of the paper we have chosen to replicate based on our proposed methods protocol. As part of this work, we'll also perform our own replication in advance to know what time-intensive analyses are involved. We'll take the opportunity to produce secondary datasets that would take too long to reproduce in the course of an 8 hour workshop day.

Datasets will be preloaded in our shared querying environment (R:Quarry) and metadata will be described in our structured database[18]. Both of these systems work as intended today, but we'll be continuing to extend them and add features as the workshop approaches.

Workshop day[edit]

Vision statement
A short presentation and extended discussion about the purpose of the workshop and the larger initiative towards better infrastructure for open community data science.
Hack session
Participants (split into teams) work on the replication/extension task. Participants will have a total of 4.5 hours of time on task, excluding the introduction, breaks, and reflection time. The workshop organizers will work with participants both to answer their questions and to observe their work.
Reporting and reflection
Participant teams report on their progress and reflect on what did and did not work for them. We'll specifically ask whether and how the methods description, querying system, and metadata were helpful.
  • 8:15-9:00: breakfast mingling
  • 9:00 (sharp!): AH intro to the day (process + brief overview task)
  • 9:10-10:00: Vision statement about Infrastructure for OOC studies
  • 10:00-10:15: Data introduction -- Each team/table reviews the task, documentation and infrastructure.
  • 10:15-10:30: coffee break, email breaktime
  • 10:30-12:00: Morning hack session breakouts (one team per table)
  • 12:00-12:30: Lunch serving, email breaktime
  • 12:30-3:15: Afternoon hack session breakouts (one team per table)
  • 3:15-3:30: coffee break, email breaktime
  • 3:30-4:30: Report-out and reflection (surveys)
  • 4:30: Wrap-up, Thanks & Next steps.
  • ~5:00: Victory! Food? Beer? Share contacts.

Summary reporting[edit]

At the end of the day, we will use the last hour as an opportunity for our participants to discuss what worked and what didn't. We will capture their discussion points in a collaborative document that all participants will be invited to edit and extend. We'll use these notes and our observations during the workshop to publish a report summarizing major take-aways to inform future work.

References[edit]

  1. Cite to CSCW 2015 workshop
  2. Cite OCDF proposal or workshop or something
  3. Cite to CSCW 2015 workshop Report
  4. cite the most recent incarnation of ODD
  5. footnote to quarry URL
  6. Footnote to OCDF metadata wiki URL
  7. Ma and Agarwal, 2007, ISR
  8. Shneiderman, 2000, CACM
  9. Preece, 2000, book
  10. Kozinets, 2002, JMR
  11. Vieweg et al., 2010, SIGCHI
  12. Kriplean et al., 2008, CSCW
  13. Zhang et al., 2007, WWW
  14. Grimm et al., 2010, EMS
  15. Footnote to Quarry URL
  16. GROUP workshop 2015
  17. CSCW workshop 2015
  18. Link to OCDF wiki


Anything below this point is not intended to be included in the proposal, but is kept here for quick reference.

Metadata and indexes[edit]

At the meta and mezzo levels, metadata schema(s) are needed that take the following into account:

  1. How different research methods influence how OOC datasets are sampled and collected
  2. Their varying modes of analyzing data
  3. A variety of storage systems for datasets about OOC
  4. That disciplines often report their findings in unique ways
  5. Other factors such as dataset accessibility (e.g. copyright) and research ethics

Recognizing the many differences among the communities of scholars who study OOCs has helped clarify the need to develop modes and methods for collecting, analyzing, and describing datasets so they can be re-used. To do this, the shared needs and interests of OOC scholars need to be better articulated. Doing this in a CSCW workshop will help determine:

  1. Which “level” of description is currently most needed by OOC researchers, along with an associated metadata schema
  2. Shared research goals to help in the creation of terms for describing datasets in generalizable (but not general) terms

Gaining an understanding of interdisciplinary research projects depends on balancing granular metadata terms with generalizable concepts that support search, retrieval, analysis, and the creation of scholarship, because this balance supports:

  1. Systems for recording metadata about OOC datasets that function dynamically, supporting both mezzo-level AND meta-level searches
  2. Well-designed systems that use meta-level metadata effectively to help researchers find datasets, making it possible for them to study and evaluate these datasets in depth
  3. Cross-disciplinary, in-depth study and evaluation of existing datasets at the mezzo level
  4. Varying levels of description that offer scholars enough documentation to foster study replication, as an added benefit to promoting cross-disciplinary collaboration

Discussion of opportunities for training researchers in analysis techniques[edit]

  • CSCW is an interdisciplinary field, and many researchers lack experience with powerful, widely available, free/libre tools for gathering, manipulating, and analyzing large datasets.
    • Researchers who use primarily qualitative methods (i.e. ethnography) may be unfamiliar with the methodological requirements of quantitative data analysis.
    • Researchers who have learned how to perform data analysis within specific GUI software platforms (i.e. SPSS, Excel) may not know how to use programmatic tools (i.e. using R, Python, or MySQL) to manipulate and analyze data.
    • Many researchers across the disciplinary and methodological spectrum do not have a background in computer science or experience with programming languages.
  • A lot of big data is inaccessible.
    • corporate IP, locked behind pay/firewalls, no API (or poorly documented API), unwieldy dumps
  • Well designed open data interfaces are needed
    • documentation designed for researchers (as opposed to developers)
    • public GUI sandboxes and query interfaces for lightweight in-situ data exploration
    • approachable query languages (MySQL, human-readable APIs)
  • Researcher-focused technical literacy training opportunities are needed (Aaron delete this section if you think it's out of scope)
    • data science-oriented programming workshops (Boston Python Workshops, UW Community Data Science Workshops, Wikimedia Research Hackathons)
    • free, self-directed online tutorials and short courses teaching programming concepts necessary for data science, rather than software development
    • training in data science research methods (study design, data validation, statistical analysis)

Description of the ODD Protocol[edit]

ODD Protocol

This was a workshop held to develop standards for replication in agent-based modeling. The organizers began by deciding that they needed a general way to describe the structure of an ABM so that another modeler could easily recreate the model from the description. They initially proposed a protocol for describing models, and then held a workshop where a group of modelers tried to use that protocol to describe their own models. The modelers then commented on how the protocol could be improved; these comments were collected and a revised version of the protocol was issued.

The protocol itself includes three main sections: (1) Overview, (2) Design Concepts, and (3) Details (the first letters spell ODD).

The Overview is supposed to describe the overall purpose of the model. It contains three subsections: (a) Purpose - why was the model built?, (b) State variables and scales - What kinds of entities exist and how are they described?, and (c) Process overview and scheduling - What actions happen when?

Design Concepts is supposed to discuss the more abstract concepts relevant to ABM that the model embodies. Typically it includes a number of common features, such as: Emergence - what system-level phenomena emerge from agents?, Adaptation - do the agents adapt to the environment?, Fitness - is there a notion of better individuals?, Prediction - do individuals predict the future before taking actions?, Sensing - what do individuals know about their environment?, Interaction - how does one agent interact with another?, Stochasticity - are there any nondeterministic components?, Collectives - are individuals grouped at all?, and Observation - what is examined about the model?

The Details section is supposed to describe the lowest-level details necessary to replicate the model. It contains three subsections: (a) Initialization - what is the state of the model in the beginning?, (b) Input - what are the inputs to the model?, and (c) Submodels - any of the models listed in the process overview are described in depth.

It should be noted that these are not meant to all be text, but rather they are to be supplemented with diagrams, math, and pseudocode when appropriate.
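The ODD structure summarized above can be captured as structured data, e.g. to check which parts of a model description remain unfilled. The representation below is our own sketch for illustration, not part of the protocol itself; the section and subsection names follow the description above.

```python
# A minimal sketch of an ODD-style description as structured data.
# Subsections set to None have not yet been written by the author.
odd_description = {
    "Overview": {
        "Purpose": "Why was the model built?",
        "State variables and scales": "What entities exist and how are they described?",
        "Process overview and scheduling": "What actions happen when?",
    },
    "Design concepts": {
        "Emergence": None, "Adaptation": None, "Fitness": None,
        "Prediction": None, "Sensing": None, "Interaction": None,
        "Stochasticity": None, "Collectives": None, "Observation": None,
    },
    "Details": {
        "Initialization": "State of the model at the start",
        "Input": "Inputs to the model",
        "Submodels": "In-depth descriptions of processes from the overview",
    },
}

def missing_fields(desc):
    """List 'Section/Subsection' entries the author has not yet filled in."""
    return [f"{sec}/{sub}" for sec, subs in desc.items()
            for sub, text in subs.items() if not text]

print(len(missing_fields(odd_description)))  # 9
```

A checklist like this is one way a methods protocol could be enforced mechanically rather than by convention.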

TRACE is a follow-on in some ways to the ODD Protocol that seeks to build a comprehensive description of a model that is going to be used for decision support (not necessarily an ABM).