Abstract
Our digital world is awash with cookies, which are simple text files that store website-specific state in the web browser, such as auto-filled login fields. Although cookies are inherently harmless, third-party vendors exploit their tracking capability for commercial profit, e.g., cookie matching for audience targeting. This dissertation analyses browsing behaviour based on a large real-user dataset collected by a browser extension developed for this research, containing data from 2,537 users in 106 countries up to August 2021 (from over 10,000 installers). It then provides solutions to inhibit third-party sharing and profiling, along with automated cookie protection tools.
The first part studies the third-party ecosystem in different countries, revealing how the first-party website's sector and the user's location affect the number of third parties encountered in the wild. Results demonstrate that most users who are interested in a given site category are likely to encounter category-specific third parties, and that around 65% of revisited websites tend to present more third parties to the same user profile. In terms of user location, China tends towards a home-grown third-party ecosystem in contrast to the UK, because the Great Firewall of China blocks access to top third parties (e.g., Google and Facebook).
To better understand the usage of cookies, I utilise the Cookiepedia database as the ground truth for a four-way classification (i.e., strictly necessary, performance, functionality and targeting/advertising cookies). The machine learning-driven framework achieves a 94% F1 score with 1.5 ms latency. Only 9.79% and 13.35% of cookies in the real-user dataset are identified as necessary and functional cookies, respectively. In short, most cookies benefit the website rather than the user experience.
After this preliminary analysis of the status quo, the dissertation proposes two solutions to restrict cookie-based tracking of online behaviour. One is a management assistant for multi-account containers that reduces third-party interconnectivity based on the third parties common to browsing histories (i.e., the "tangle factor"). Evidence shows that removing top third-party vendors decreases interconnectedness more than any ad blocker does. Among ad blockers, uBlock Origin performs best, reducing the raw number of third parties by 60% and the number of required containers by 40%.
The other solution is the automated processing of the GDPR minimal-data option. The General Data Protection Regulation (GDPR) has applied in the EU since 25 May 2018 to protect personal data from being processed without user approval. By the end of 2018, third-party cookies of UK users had dropped by over 10%. However, consent fatigue and the lack of an automatic consent-setting mechanism led to a rebound in third-party cookies in 2019. Therefore, I build and deploy a browser extension that automatically helps users protect their privacy on 85% of websites with GDPR notices, reducing targeting/advertising cookies by 44.6%.
In summary, this dissertation addresses the collection and classification of real-time browsing data in the wild, the privacy risks posed by interconnected third-party tunnels, and the lack of an automated GDPR-enforcing mechanism. The field deployments demonstrate the feasibility and usability of these solutions, strengthening the protection of user privacy while browsing and paving the way for automated global online privacy protection.
Date of Award | 1 Jun 2023 |
---|---|
Original language | English |
Awarding Institution | |
Supervisor | Guillermo Suarez de Tangil Rotaeche (Supervisor) & Nishanth Sastry (Supervisor) |