Dataset for 'Ad-blocking: A Study on Performance, Privacy and Counter-measures'

Summary:

Many internet ventures rely on advertising for their revenue. However, users are often dissatisfied with the presence of ads on the websites they visit, as the data size of ads is often comparable to that of the actual content. This affects not only the loading time of webpages but also, in some cases, the user's internet bill. In the absence of a mutually agreed procedure for opting out of advertisements, many users resort to ad-blocking browser extensions.

In this work, we study the performance of popular ad-blockers on a large set of news websites. Moreover, we investigate the benefits of ad-blockers on user privacy as well as the mechanisms used by websites to counter them. Finally, we explore the traffic overhead due to the ad-blockers themselves.

Dataset:

We collected two datasets for the project.

Top150: A list of the top 150 news websites, as ranked by Alexa. For each of them, we have a URL that points to the website homepage, for both its desktop and mobile version (unless the latter does not exist).

GDELT: A list of 30,000 URLs pointing to different news articles published on a single day (November 8, 2016). The list was obtained from GDELT, a project that collects news stories from all around the world over the years.

We load the webpage of each URL with a clean instance of the Chrome browser (no cached content or extensions), using the Selenium Python library on a MacBook Pro with 8 cores, with no other major processes running. On each load, we capture the HAR file for that load. A HAR (HTTP Archive) file is a JSON-formatted file that records the interactions between the browser and the website, including network requests, types and sizes of objects, and load times.
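As an illustration of what a HAR file makes available, the following is a minimal sketch (not the collection code used for the dataset) that summarizes a single HAR document. It relies only on standard HAR 1.2 fields (log.entries, response.bodySize, request.url, log.pages[].pageTimings.onLoad). The function name summarize_har and the tiny sample_har document are hypothetical and exist only for illustration.

```python
import json
from urllib.parse import urlparse


def summarize_har(har):
    """Return request count, transferred bytes, distinct hosts and onLoad time."""
    entries = har["log"]["entries"]
    total_bytes = 0
    hosts = set()
    for entry in entries:
        # HAR uses -1 when the body size is unknown, so only add positive sizes.
        body_size = entry["response"].get("bodySize", -1)
        if body_size > 0:
            total_bytes += body_size
        hosts.add(urlparse(entry["request"]["url"]).hostname)
    pages = har["log"].get("pages", [])
    on_load = pages[0].get("pageTimings", {}).get("onLoad") if pages else None
    return {
        "requests": len(entries),
        "bytes": total_bytes,
        "hosts": len(hosts),
        "on_load_ms": on_load,
    }


# Synthetic HAR fragment (made-up values, not taken from the dataset).
sample_har = json.loads("""
{"log": {
  "pages": [{"pageTimings": {"onLoad": 850}}],
  "entries": [
    {"request": {"url": "https://example.com/"},
     "response": {"bodySize": 1200}},
    {"request": {"url": "https://example.com/style.css"},
     "response": {"bodySize": -1}},
    {"request": {"url": "https://ads.example.net/banner.js"},
     "response": {"bodySize": 300}}
  ]}}
""")

summary = summarize_har(sample_har)
print(summary)
```

To apply the same function to a real file, one would load it with json.load(open(path)) and pass the resulting dictionary to summarize_har.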

We load the same URL in six browser modes, all simultaneously: a vanilla mode (no ad-blocker), and one mode for each of five ad-blockers. The ad-blockers are AdBlock, AdblockPlus, Ghostery, uBlock and Privacy Badger, chosen as the most popular ad-blockers on the Chrome Store.
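Because every URL is loaded in the vanilla mode and in each ad-blocker mode, per-mode totals can be compared against the vanilla baseline. The sketch below shows one simple way to do this, as a percentage reduction in transferred bytes. It is an assumption about how such a comparison might be set up, not the analysis from the paper. The function name reduction_vs_vanilla is hypothetical and the byte counts are placeholder values, not measurements.

```python
def reduction_vs_vanilla(bytes_by_mode, baseline="vanilla"):
    """Percentage reduction in transferred bytes of each mode relative to the baseline."""
    base = bytes_by_mode[baseline]
    if base <= 0:
        raise ValueError("baseline transfer size must be positive")
    return {
        mode: 100.0 * (base - size) / base
        for mode, size in bytes_by_mode.items()
        if mode != baseline
    }


# Placeholder byte totals for one URL (made-up numbers for illustration only).
example_totals = {"vanilla": 1000, "AdBlock": 600, "Ghostery": 750}
reductions = reduction_vs_vanilla(example_totals)
print(reductions)
```

A negative value would indicate that a mode transferred more data than the vanilla load, which is relevant when looking at the traffic overhead of the ad-blockers themselves.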

Description of the data: The data contains JSON HAR files for each domain/URL. Depending on the dataset, the size varies from 40 MB (Top150) to 8 GB (GDELT, per setting). A sample HAR file for nytimes.com can be found here (or on the Github link below). We have 10 settings: desktop vs. mobile, across the 5 ad-blockers.

For access to the full dataset, please email Kiran at kiran.garimella XatX aalto.fi

Code:

The code used for data collection is on Github.

Contact:

For any questions/comments, please contact Kiran Garimella, Orestis Kostakis or Michael Mathioudakis (first.last@aalto.fi).

Ad-blocking: A Study on Performance, Privacy and Counter-measures (WebSci'17 short)
