Schema talk:CitationUsage

From Meta, a Wikimedia project coordination wiki
Maintainer:Bmansurov (WMF), LZia (WMF)
Team:Research
Project:Research:Citation Click Data
Status:inactive
Sampling:none
Purge:Auto-purge after 90 days



@Bmansurov (WMF): Did you mean to create a separate event for fnHover (on top of fnClick)? --LZia (WMF) (talk) 19:09, 11 April 2018 (UTC)

A couple more things:

  • Please add pageTitle. That makes eye-balling the data easier.
  • sectionID: how are you planning to do this as of now? I'm wondering if we can do something better than section title (which won't tell us much about where in the page the interaction has happened). See T191086#4115454.

Thanks! --LZia (WMF) (talk) 19:09, 11 April 2018 (UTC)

@LZia (WMF): I've updated the schema with your suggestions. Since, afaik, Parsoid (rather than the MW parser) outputs data-mw-section-id, we cannot use it to select section IDs on article pages. If section title won't give us enough information, we should think about using some other metric. Maybe we should look into using the combination of page height and viewport height.
Also, do we want to set a threshold number of milliseconds before we consider an action a hover? For example, we'd register a hover only when the mouse dwells on a link for at least 100ms.
Are we fine sending both hover and click events? How would we distinguish the user's intent? Maybe the user wanted to click, but since the hover happens before the click, we'd register both events. Should we get rid of hover?
Thanks, Bmansurov (WMF) (talk) 01:30, 12 April 2018 (UTC)
@Bmansurov (WMF): Thanks for the update.
  • re sectionID, since your message, there is now an internal thread about it with Subbu. I'll wait for us to converge there to see what's best to do here. I'm relying on the two of you to do magic. ;)
  • for page previews, Tilman et al. are using 1000 ms. I'd say let's go with that number for now as these references also require some pondering. We can play with this threshold later if needed. We do need to somehow differentiate between click and hover/preview though.
  • re which event to send: I was hoping we can reuse some of the instrumentation for page previews. Does it make sense or am I missing something? :)
Thanks! --LZia (WMF) (talk) 22:51, 12 April 2018 (UTC)
  • Our talk with Subbu came to a conclusion. If we need Parsoid section IDs we'll have to post-process data.
  • OK, I'll use 1000ms to register a hover event after a user dwells on a link.
  • Given the above point, there won't be a conflict with the click event. We'll register a hover event when the user dwells on a link for 1000ms, and a click event when the user clicks on a link.
Bmansurov (WMF) (talk) 10:23, 17 April 2018 (UTC)
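
The hover/click rule agreed above can be sketched as a small replay function. This is only an illustration under stated assumptions, not the actual instrumentation: the `PointerEvent` type, the `events_to_log` helper, and the generic "hover"/"click" labels are hypothetical names introduced here, and the sketch replays recorded pointer events for a single link rather than running a timer in the browser.

```python
from dataclasses import dataclass
from typing import List, Optional

HOVER_THRESHOLD_MS = 1000  # dwell threshold agreed in the discussion above


@dataclass
class PointerEvent:
    """Hypothetical record of one pointer interaction with a link."""
    kind: str      # "enter", "leave", or "click"
    time_ms: int   # timestamp in milliseconds


def events_to_log(pointer_events: List[PointerEvent]) -> List[str]:
    """Return the actions that would be logged for one link, in order."""
    logged: List[str] = []
    entered_at: Optional[int] = None  # when the pointer entered the link
    hover_logged = False              # at most one hover per dwell

    for ev in pointer_events:
        if ev.kind == "enter":
            entered_at = ev.time_ms
            hover_logged = False
        elif ev.kind in ("leave", "click"):
            # A hover counts once the pointer has dwelt for the full threshold.
            dwelt_long_enough = (
                entered_at is not None
                and ev.time_ms - entered_at >= HOVER_THRESHOLD_MS
            )
            if dwelt_long_enough and not hover_logged:
                logged.append("hover")
                hover_logged = True
            if ev.kind == "click":
                logged.append("click")  # clicks are always logged
            else:
                entered_at = None       # pointer left the link
    return logged
```

Under this rule a quick click produces only a click event, while a click that follows a dwell of a second or more produces a hover followed by a click, so the two event types remain distinguishable in the data.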

@Bmansurov (WMF): and @LZia (WMF): thank you both for the updates and many improvements. More comments and questions below:

  • extPosition:
    • Both extPosition and sectionId are proxies for how far a reader has made it through a page. Leila suggested word (or character?) count from the beginning of the article to the click target. If that's possible, it would be a better surrogate than link position, could be used for all 4 event actions, and would obviate the need for extPosition.
  • sectionId:
    • Even if position of click is resolved some other way, my team will still want sectionId to support questions like: H5.2: Readers of WP:Med articles on medical conditions click “references” links more frequently in Diagnosis and Treatment sections than in other article sections.
    • Instead of inInfobox standing alone, should it be merged with sectionId and reported as the value "infobox"?
    • Is sectionId limited to h2 elements or are deeper section values included? For example, if a reader hovers over cite "[100]" in Hepatitis and then clicks one of the two external links, what gets reported? "Hepatitis E" (from h3) or "Treatment" (from h2)? The h2 value "Treatment" would fit our needs better.
  • I reviewed my team's hypotheses document again and see a hole. Q3: Do readers tend to follow links more often when they come from an internal link as opposed to a search engine query? Can sessionToken + post-processing support this question or should the schema include referrer or referrer domain? @LZia (WMF): I hear you loud and clear about needing to justify every data element to the enwiki community, while I ask for more. ;-)
Many thanks to you both! --RyanSteinberg (talk) 04:52, 13 April 2018 (UTC)
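
One way to read the h2 question above is "report the nearest preceding h2, ignoring deeper headings". A minimal sketch of that rule follows, assuming the headings are available as (position, level, title) tuples; the `top_level_section` helper, the tuple layout, and the positions used in the usage example are all hypothetical and do not describe how the schema actually derives sectionId.

```python
from typing import List, Optional, Tuple

# Hypothetical representation: (position in document, heading level, title)
Heading = Tuple[int, int, str]


def top_level_section(
    headings: List[Heading], target_pos: int, level: int = 2
) -> Optional[str]:
    """Return the title of the nearest `level` heading at or before target_pos.

    Returns None when the target sits before the first such heading,
    for example in the lead section.
    """
    current: Optional[str] = None
    for pos, lvl, title in sorted(headings):
        if pos > target_pos:
            break
        if lvl == level:
            current = title  # deeper headings (h3, h4, ...) are skipped
    return current


# Made-up positions, using the section names from the example above.
headings = [(10, 2, "Diagnosis"), (50, 2, "Treatment"), (70, 3, "Hepatitis E")]
section = top_level_section(headings, target_pos=80)
```

With these made-up positions, a citation at position 80 sits under the h3 "Hepatitis E" but is reported as "Treatment", which is the behaviour requested above.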
@RyanSteinberg: Thanks for these. Bmansurov (WMF) and I discussed. We will keep extPosition and sectionID. Re referrer, Baha will check if this information is already captured as part of the event capsule by default. He will let us know. --LZia (WMF) (talk) 17:57, 17 April 2018 (UTC)
I looked into the event capsule and according to https://github.com/wikimedia/eventlogging/blob/master/eventlogging/capsule.py, referrer is not part of it. --Bmansurov (WMF) (talk) 18:12, 17 April 2018 (UTC)

@Bmansurov (WMF): I reviewed the emails, phab tasks and here. Here are the items I see we should capture (some of which you may have already in your notes):

  • We do need referrer information. I assume userAgent and IP (or hashed IP) are part of the event capsule that you can collect. If not, we need those two as well. If the IP is hashed, we need to store the information about hash+salt to be able to link the data with webrequest log table.
  • Let's make sure all information re upClick and fnClick are collected similar to extClick.
  • Hover event with 1000 ms.
  • I would skip registering the duration of the hover action for now.

My suggestion is to add these items, have the code reviewed, turn on the data collection for a few hours, we review the data, and we assess if any change is needed. --LZia (WMF) (talk) 14:06, 8 May 2018 (UTC)
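
To illustrate why the hash and salt would need to be stored: a salted hash of an IP can only be matched against another table if the same salt and hash function are applied on both sides. The sketch below uses HMAC-SHA256 purely as an assumption; it is not a description of how EventLogging or the webrequest pipeline treat IPs, and `hash_ip` and the salt values are hypothetical.

```python
import hashlib
import hmac


def hash_ip(ip: str, salt: bytes) -> str:
    """Return a keyed SHA-256 digest of an IP address as a hex string."""
    return hmac.new(salt, ip.encode("utf-8"), hashlib.sha256).hexdigest()


# 192.0.2.1 is a reserved documentation address; the salts are made up.
salt = b"example-salt"
event_side = hash_ip("192.0.2.1", salt)
webrequest_side = hash_ip("192.0.2.1", salt)
other_salt_side = hash_ip("192.0.2.1", b"another-salt")

# The digests match only when the salt is shared, so the join is
# possible only if the salt used at collection time was retained.
joinable = event_side == webrequest_side
unjoinable = event_side != other_salt_side
```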

Thanks, @LZia (WMF):. A follow-up question:
Will the schema be active on desktop web only? Bmansurov (WMF) (talk) 22:54, 8 May 2018 (UTC)