HTTPS and Wikipedia

This is a joint blog post with Dario Taraborelli, coming from WikiCite 2016.

In 2014 we were taking our first steps along the path that would lead us to Crossref Event Data. At that time I started looking into the DOI resolution logs to see if we could get any interesting information out of them. This project, which became Chronograph, showed which domains were driving traffic to Crossref DOIs.

You can read about the latest results from this analysis in the “Where do DOI Clicks Come From” blog post.
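
To make the idea concrete, here is a minimal sketch of how referring domains might be tallied from a list of referrer values. It is not the Chronograph code: the function name `count_referring_domains`, the input shape (one referrer string per click, empty when none was sent) and the sample URLs are all assumptions made for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit


def count_referring_domains(referrers):
    """Count clicks per referring domain.

    `referrers` is a hypothetical input: an iterable with one referrer
    string per DOI click, or an empty string / None when no referrer
    was sent.
    """
    counts = Counter()
    for referrer in referrers:
        if not referrer:
            counts["(no referrer)"] += 1
            continue
        # hostname is None when the value cannot be parsed as a URL
        host = urlsplit(referrer).hostname
        counts[host or "(unparseable)"] += 1
    return counts


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = [
        "http://en.wikipedia.org/wiki/Digital_object_identifier",
        "https://en.wikipedia.org/",
        "",
        "http://example.com/some/page",
    ]
    for domain, clicks in count_referring_domains(sample).most_common():
        print(domain, clicks)
```

Grouping by hostname rather than full URL means the tally still works when a site sends only its domain name as the referrer.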

Having this data tells us, amongst other things:

  • where people are using DOIs in unexpected places
  • where people are using DOIs in unexpected ways
  • where we knew people were using DOIs but the links are more popular than we realised

By the time the ALM Workshop 2014 rolled around, there was some preliminary data and we realised that Wikipedia fell into the third category. There are lots of DOIs in Wikipedia and people click them!

I met with Dario Taraborelli, head of research at the Wikimedia Foundation, and shared the data. Dario, who co-authored the Altmetrics Manifesto in 2010, has been interested in understanding how scholarly citations are used in Wikipedia. Over the years, Wikipedia contributors have made extensive use of DOIs to reference the scientific literature, and in doing so they have created a resource that in many ways now represents the “front matter to all research”. There is growing interest in the community in understanding how DOIs are being used in Wikipedia and in non-traditional scholarship.

During our discussions the subject of Wikipedia’s gradual transition to HTTPS was raised: we anticipated that this change would affect our data gathering.

Changes

When you’re reading a webpage and click on a link to another page, your web browser will usually tell the server of that second page the last page you were on. This forms the basis of trackers like Google Analytics.

In the days before HTTPS, the next site would know the full URL that you were previously on. With HTTPS this changes: by default the browser sends no data at all if you click from an HTTPS page to an HTTP one, and a site can ask the browser to send just its domain name rather than the full URL.
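
The referrer behaviour described above can be sketched in a few lines of Python. This is a simplified model, not a browser implementation: the function name `referrer_sent` and the example URLs are illustrative assumptions, a "downgrade" is reduced to "HTTPS page linking to an HTTP page", and only three of the referrer policy values that browsers support are covered.

```python
from urllib.parse import urlsplit


def referrer_sent(source_url, target_url, policy="no-referrer-when-downgrade"):
    """Return the referrer a browser would send, or None if it sends nothing.

    Simplified model of three referrer policy values.
    """
    source = urlsplit(source_url)
    target = urlsplit(target_url)
    # Simplification: only an HTTPS -> HTTP click counts as a downgrade.
    downgrade = source.scheme == "https" and target.scheme == "http"

    if policy == "no-referrer":
        return None
    if policy == "no-referrer-when-downgrade":
        if downgrade:
            return None
        # Full URL of the previous page, minus any #fragment.
        return source._replace(fragment="").geturl()
    if policy == "origin":
        # Only scheme and host: the page path is not revealed.
        return f"{source.scheme}://{source.netloc}/"
    raise ValueError(f"policy not covered by this sketch: {policy}")


if __name__ == "__main__":
    page = "https://en.wikipedia.org/wiki/Digital_object_identifier"
    doi_link = "http://dx.doi.org/10.5555/12345678"  # example DOI link
    print(referrer_sent(page, doi_link))                   # None
    print(referrer_sent(page, doi_link, policy="origin"))  # https://en.wikipedia.org/
```

In this model, an HTTPS page linking to an HTTP DOI sends nothing under the default policy, whereas the `origin` policy reveals the domain but not the page the reader was on.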

DOI hyperlinks are just like any other hyperlink, and are mostly HTTP not HTTPS.

Up until 2015, Wikipedia was served over HTTP, only switching to HTTPS when users were logged in or if they requested it. The Wikimedia Foundation started planning to move to HTTPS, and we knew that if they did so while continuing to use HTTP DOIs, we would lose valuable research data.

A Plan

We decided that the best course of action was to try and change the DOIs in Wikipedia to use HTTPS. Simple, right?

After some further research, Dario posted a proposal on how to mitigate the impact of the HTTPS rollout, to make sure that Wikipedia can still signal its importance as a traffic source, while preserving the privacy of its users. Discussion followed and the conclusion was to change the format of every single DOI on Wikipedia, which fortunately could be done without having to edit millions of pages. You can read the full story in this post from a year ago.

The result of this effort was that well in advance of the HTTPS switchover, the DOI links were ready to continue reporting referral data.

The Switch

In June 2015 the Wikimedia Foundation announced that they were finalising the switch, and that within a few weeks all traffic would be HTTPS.

We held our breath. Would it work? Would we lose all referral data from Wikipedia sites? In February 2016 the last piece of the puzzle fell into place as Wikipedia gained a ‘meta referrer’ tag to explicitly specify how they would like referrers to be sent: a detailed report on the effect of this change is coming up on the Wikimedia Foundation’s blog.

The results

As detailed in the last blog post, the traffic that we measured coming from Wikipedia doesn’t seem to have slowed down during 2015:

[Figure: top 10 filtered referring domains by month]

I’d call that a success! Over the period covered in the graph, Wikipedia remained prominent as a non-publisher source of referral traffic to DOIs.

Looking at the balance of HTTP vs HTTPS traffic coming from wikipedia.org, the switchover was dramatic:

[Figure: area chart of daily traffic from wikipedia.org, HTTP vs HTTPS]

Thank you to Dario Taraborelli, Nemo (Federico Leva), Aaron Halfaker, Alex Stinson and everyone who put in this effort.

I’ll leave the last word to Dario:

It’s great to see this data. It shows that the switchover happened successfully, which better protects the privacy of our users whilst still reporting the fact that Wikipedia is a prominent source of traffic. This is important validation of the increasing role that Wikipedia plays in the education and scientific community.

Page owner: Joe Wass   |   Last updated 2016-May-31