How users are tracked on the Web: fingerprinting your browser

This article discusses issues that affect everyone who uses the Internet. It does not matter whether you go online from a computer or a smartphone, or which operating system you use: Windows, any flavor of Linux, Android or iOS.

I have long been unpleasantly surprised by intrusive Google AdSense ads based on my old search terms. A lot of time has passed since I showed any interest in those topics, and I have cleared the browser's cookies and cache more than once, yet the ads still appear. How do they manage to get information about me anyway? As it turned out, there are many ways to do this.

A bit of history and general concepts

Web tracking, simply put, means identifying users and following their activity, and it is not very difficult. It relies on a unique identifier for each browser, which is set on the client side when you visit a site.

This tool was conceived for quite decent and useful purposes. For example, it helps to distinguish live people from bots visiting a site, or to determine a person's preferences, save them and apply them on subsequent visits. But everything has two sides: web tracking has also become a handy tool for the advertising industry, and it has been used that way since the mid-90s.

A lot has changed since then. Technology now makes it possible to track users not only with cookies but with other tools as well. The easiest way to track a user is still some kind of stored identifier, such as a cookie. You can also take advantage of the information the browser passes along when sending requests: the IP address, the operating system installed on the device (reported in the HTTP headers), the user's current time, and so on. In addition, users can be distinguished from each other by their habits, for example by the sections of a site they visit most often or by the peculiarities of their cursor movement.

Explicit identifiers

So, in order to recognize that a visitor has already been to a certain resource, the site needs to place some kind of long-lived label on the visitor's side, that is, on the computer, phone or other device, and be able to read it back on a repeat visit. Modern browsers offer many ways to do this, and the process is quite transparent to users.

First, there are the cookies already mentioned above. Second, there are mechanisms that are similar to cookies in functionality: LSOs (Local Shared Objects) used by Adobe Flash, or their Silverlight counterpart. HTML5 offers similar capabilities, including the localStorage, File and IndexedDB APIs.

There are other places for storing tokens: cached objects on the local device, or cache metadata such as Last-Modified and ETag values. Users can also be identified by the fingerprints of Origin Bound Certificates that browsers generate for SSL connections, or by data held in SDCH dictionaries, either their content or their metadata. In general, there are many options.

Cookies

Cookies were the first tool created specifically to store small amounts of data on the client side. That is why they are the first thing that comes to mind when you think about identifiers.

When the user first contacts the server, the server generates a unique label, more correctly called an identifier, and transfers it to the client along with other information to be saved in cookies. On each subsequent request, the client automatically sends this identifier back to the server.
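The exchange described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the cookie name `uid` and the helper `handle_request` are hypothetical, and a real site would do the same thing inside its web framework.

```python
import secrets
from http.cookies import SimpleCookie
from typing import Optional, Tuple

COOKIE_NAME = "uid"  # hypothetical cookie name, chosen for illustration


def handle_request(cookie_header: str) -> Tuple[str, Optional[str]]:
    """Return (visitor_id, Set-Cookie value or None) for one request."""
    jar = SimpleCookie()
    jar.load(cookie_header)
    if COOKIE_NAME in jar:
        # Repeat visit: the browser sent the identifier back on its own.
        return jar[COOKIE_NAME].value, None

    # First visit: the server invents an identifier and asks the client to keep it.
    visitor_id = secrets.token_hex(16)
    out = SimpleCookie()
    out[COOKIE_NAME] = visitor_id
    out[COOKIE_NAME]["max-age"] = 10 * 365 * 24 * 3600  # roughly ten years
    out[COOKIE_NAME]["path"] = "/"
    return visitor_id, out[COOKIE_NAME].OutputString()
```

On the first call with an empty Cookie header the function returns a fresh identifier plus the value for a Set-Cookie header; on later calls it simply reads the identifier the browser presents.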

Most browsers today have user-friendly options for cleaning and configuring cookies, and you can find a huge number of utilities for cleaning or blocking them, yet this tracking method has not lost its popularity. The reason is that most users rarely remember to clear cookies. Think for yourself: when was the last time you did this? One of the most common reasons is the fear of accidentally deleting a needed cookie, because cookies often store information for quick sign-in.

Some modern browsers allow you to restrict third-party cookies, but this setting does not solve the problem. Browsers often confuse third-party and first-party cookies, for example when content is received via HTTP redirects and other similar ways of reaching a page.

Cookies, unlike many other mechanisms, are among the most transparent from the user's point of view. But to recognize a visitor there is no need to store a unique tag in a single cookie. The identifier can be assembled from several cookies or hidden in other fields, for example in the expiration time. Therefore, it can be extremely difficult to figure out whether a particular cookie is used for tracking or not.
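To make the idea of an identifier assembled from several cookies more concrete, here is a minimal sketch. The cookie names and both helper functions are hypothetical and exist only for this illustration; real trackers may split or disguise the value in entirely different ways.

```python
from typing import Dict, List

# Hypothetical, innocent-looking cookie names used only for this illustration.
COOKIE_NAMES = ["theme", "lang_hint", "layout", "ab_bucket"]


def split_id(visitor_id: str, names: List[str] = COOKIE_NAMES) -> Dict[str, str]:
    """Spread the identifier over several cookies, one chunk per cookie."""
    size = -(-len(visitor_id) // len(names))  # ceiling division
    return {
        name: visitor_id[i * size:(i + 1) * size]
        for i, name in enumerate(names)
    }


def join_id(cookies: Dict[str, str], names: List[str] = COOKIE_NAMES) -> str:
    """Reassemble the identifier from the cookies the browser sent back."""
    return "".join(cookies.get(name, "") for name in names)
```

None of the individual cookies looks like a tracking token, yet together they restore the full identifier.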

Local Shared Objects (LSO)

This mechanism is used by Adobe Flash to store information on the client machine. LSOs are considered an analogue of HTTP cookies, but they can hold larger amounts of data with fewer restrictions, so it is extremely difficult to inspect such objects for tracking labels.

Before Adobe Flash 10.3, LSO behavior had to be configured separately from the browser, using the Flash Settings Manager at macromedia.com. Later versions allow you to configure Flash cookies in your browser control panel. In addition, most modern browsers are integrated with the Flash player closely enough that LSOs are deleted when you delete cookies and other temporary files. The integration only goes so far, however: the browser's policy for third-party cookies does not always apply to Local Shared Objects. Visit the Adobe developer site for information on how to disable LSOs manually.

Data in Silverlight

The Silverlight platform is in many ways similar to Adobe Flash in functionality and behavior. It has its own analogue of Flash cookies, called Isolated Storage. But the privacy settings for Silverlight are very different. First, this storage is not integrated with browsers at all, so clearing the cache and other data in the browser does not affect Isolated Storage in any way. Second, the same storage is shared by all browser tabs and windows, except those opened in "Incognito" mode, and by all browser profiles installed on the user's device.

Technically, Isolated Storage can hold ID labels just as easily as LSOs, and it cannot be reached through the user's browser settings. Even so, it is not yet often used as a user identification tool.

HTML5: Tools for Local Storage

HTML5 introduced a whole set of tools for storing structured information on the client's device: localStorage, the File API, and IndexedDB. Each of these mechanisms has its own characteristics, but they all allow persistent storage of binary data bound to a particular site. Importantly, unlike regular cookies or Flash cookies, they impose no significant limits on the size of this data. HTML5 storage is kept in the same place as other site data. But managing it from the browser interface is difficult, since an ordinary user can hardly guess where the necessary settings are located.

For example, to delete information stored in localStorage in Firefox, select "offline website data" or "site preferences" and then set the time interval to "everything". IE has a peculiarity of its own in how it handles HTML5 storage: data is kept only for the lifetime of the tabs that were open when it was saved.

Also, the HTML5 mechanisms do not really follow the rules that apply to cookies. For example, a site can write to localStorage and read from it through cross-domain frames, and this works even when third-party cookies are completely disabled.

Object Caching

Everyone likes it when the browser does not slow down and sites load quickly. That is why a local cache exists for frequently visited resources: it temporarily stores content from site pages so that it does not have to be requested again on a repeat visit. This tool is not intended to identify users, but there are many ways to use it for that purpose. For example, in response to a request the client receives a JavaScript document with an identifier inside it, and a date far in the future is set in the Expires or Cache-Control: max-age header. As a result, a script with a label inside is written to the cache. From then on, any page on the Internet can access it simply by requesting that script.

Of course, from time to time the browser will send requests with the If-Modified-Since header to check whether the file has changed. But if the server answers every time with status code 304 (Not Modified), the file with the tag can be stored forever.
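The server side of this trick might look roughly like the sketch below. It is not tied to any real framework: the function `serve_tracking_script`, the variable name `visitorId` and the fixed date are all assumptions made for illustration, and header lookup is simplified to an exact-case dictionary key.

```python
import secrets
from typing import Dict, Tuple


def serve_tracking_script(request_headers: Dict[str, str]) -> Tuple[int, Dict[str, str], str]:
    """Return (status, headers, body) for a hypothetical cached script."""
    if "If-Modified-Since" in request_headers:
        # The browser is revalidating its cached copy: claim nothing changed,
        # so the copy holding the old identifier stays in the cache.
        return 304, {}, ""

    # First request: embed a fresh identifier and ask for very long caching.
    visitor_id = secrets.token_hex(8)
    headers = {
        "Content-Type": "application/javascript",
        "Cache-Control": "max-age=315360000",  # about ten years
        "Last-Modified": "Thu, 01 Jan 2015 00:00:00 GMT",  # arbitrary fixed date
    }
    body = f'var visitorId = "{visitor_id}";'
    return 200, headers, body
```

Every page that later includes this script gets the same `visitorId` from the cache, without any new identifier being issued.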

What else is interesting about the cache? Unlike cookies, it has no notion of "third-party" and "first-party" objects. And if you give up caching completely, browser performance can drop noticeably. At the same time, it is extremely difficult to determine automatically which resource is being "cunning" and embedding identifiers in documents, since the number of JavaScript documents on the Internet is huge and their complexity varies widely.

Of course, modern browsers allow you to completely clear the cache at any time. But in practice, we all do this quite rarely, if we perform this action at all.

ETag and Last-Modified

Sites are regularly updated, so for caching to work properly the server needs a way to tell the browser that a page has changed. The HTTP/1.1 standard provides two options for this. The first relies on the date the file was last modified; the second uses a special identifier called an ETag.

With ETag, the server responds to a client request with the document and a version tag in the ETag header. On repeat visits, the client sends this value back in the If-None-Match request header, thereby reporting which version it holds locally. If that version is up to date, the server responds with a 304 code and the cached data is used to load the page. If not, a new version of the document is sent to the client along with a new ETag. This is somewhat similar to HTTP cookies: the server stores a value on the local machine solely in order to read it back on repeat visits.
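A minimal sketch of how the ETag exchange could be turned into an identifier is shown below. The function `serve_with_etag` is hypothetical, and the header handling is deliberately simplified (a single tag, exact-case header names).

```python
import secrets
from typing import Dict, Tuple


def serve_with_etag(request_headers: Dict[str, str]) -> Tuple[int, Dict[str, str], str]:
    """Return (status, response headers, visitor_id) for one request."""
    presented = request_headers.get("If-None-Match")
    if presented:
        # Repeat visit: the "version tag" is really the visitor's identifier.
        visitor_id = presented.strip('"')
        return 304, {"ETag": presented}, visitor_id

    # First visit: hand out a unique tag instead of a real content version.
    visitor_id = secrets.token_hex(8)
    return 200, {"ETag": f'"{visitor_id}"'}, visitor_id
```

Because the server always answers 304 to a presented tag, the browser keeps its cached copy and keeps sending the same value.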

The Last-Modified variant stores the date of the last change, which leaves room for at least 32 bits of data. When the page is requested again, this value is sent to the server in the If-Modified-Since header. An interesting point is that most browsers do not validate the date format in any way. It is important to understand that ETag and Last-Modified values are not deleted along with cookies or site data; only clearing the cache helps here.
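One way to see where the "32 bits" come from is to treat the date as a number of seconds. The sketch below uses only Python's standard library; the two helper names are hypothetical, and the numeric identifier is simply interpreted as a Unix timestamp.

```python
from email.utils import formatdate, parsedate_to_datetime


def id_to_last_modified(visitor_id: int) -> str:
    """Encode a numeric identifier as the timestamp of an HTTP date."""
    return formatdate(visitor_id, usegmt=True)


def last_modified_to_id(header_value: str) -> int:
    """Recover the identifier from the If-Modified-Since value."""
    return int(parsedate_to_datetime(header_value).timestamp())
```

For instance, the identifier 1234567890 becomes the perfectly ordinary-looking date "Fri, 13 Feb 2009 23:31:30 GMT", and the server gets the same number back when the browser echoes that date.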

HTML5 AppCache

The Application Cache lets a site specify which parts of it should be saved on the user's disk and remain available even offline. Storage and access rules are defined in special manifests. As with regular caching, unique identifiers can be stored in AppCache both inside the manifest itself and inside the cached resources, which, unlike the usual cache, are kept with no time limit.

AppCache can be seen as an intermediate solution between the traditional cache and the HTML5 storage mechanisms. Some browsers delete this information along with cookies; in others you need to clear your browsing history and cache.

SDCH dictionaries

SDCH is a data compression algorithm developed by Google. It uses a specially compiled dictionary provided by the server, so the client gets better compression than with deflate or gzip.

The method is based on the fact that sites repeatedly transmit large amounts of identical information to the client. This can be, for example, page footers and headers, JS and CSS, and other elements that stay the same when you move from one section to another. To save resources, the server sends a special SDCH dictionary containing the repeated strings once, when the site is first accessed. After that, when a page is requested, the server sends the browser references to dictionary entries, and the client assembles the page from them itself.

Clearly, any information can be added to such a dictionary, including unique identifiers. They can be placed in the dictionary ID: when the user visits the site again, the browser reports the IDs of the dictionaries it holds in the Avail-Dictionary header. Alternatively, a label can be placed in the dictionary content itself and then used just as with the regular cache.

Other storage options

There are other methods of leaving a "tag" for displaying ads later. For example, JavaScript and similar technologies can keep identifiers in a way that survives clearing the history and site data.

For example, such values are successfully stored in sessionStorage or window.name. Even the most thorough browser cleaning will not always help here. If the user left a tab with the site open while deleting cookies and other stored information, the next visit will send the token to the server, and the user will be tied to the data collected during the previous visit. JavaScript in general behaves the same way: any running script keeps its state even if cookies and site data are removed. Moreover, the script need not belong to the site itself; it may hide in frames, web workers and other similar elements. For example, if an ad is loaded in an iframe when you visit a page, deleting site data or history will not disturb it at all. Its script will keep the unique label for future use.

Protocols

In addition to the options described above, which store unique tags using caching, JavaScript or various plugins, modern browsers have other features that help them work faster but at the same time allow identifiers to be stored and retrieved.

Origin Bound Certificates (also known as ChannelID) are a feature that significantly increases the security of HTTPS. They are persistent self-signed certificates that identify the client to the server during a secure connection. A separate certificate is generated and saved for each domain. But websites can also use OBCs for tracking, without taking any additional actions that would be noticeable to the user. One example is the cryptographic hash of the certificate: the certificate is supplied as part of a normal SSL handshake, and its hash may well act as a unique user label.

TLS has similar mechanisms: session tickets and session identifiers. They are used when an HTTPS session has been interrupted: so that the client does not have to go through the full handshake again, the connection is resumed from cached data. This lets the server recognize the client's request as authentic if the interval between requests is small.

Another interesting tool is the browser's DNS cache. It stores a small amount of information that speeds up name resolution on repeated requests and also helps protect the system against DNS rebinding attacks. Suppose, for example, that 16 IP addresses are available to choose from. Then 8-9 cached host names are already enough to identify any device, since 16^8 gives about 4.3 billion combinations. But such a solution is significantly limited by the small size of the browser's DNS cache, and there is a risk of conflicts with resolver caching on the DNS provider's side.
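The arithmetic behind the DNS cache example can be sketched as follows: each host name resolves to one of 16 addresses, so each cached answer carries 4 bits, and 8 names carry 32 bits. The host names, the address range (the 192.0.2.0/24 documentation block) and both helpers are assumptions for illustration; how the server later learns which address the browser cached is left out.

```python
from typing import Dict

HOSTS = [f"h{i}.tracker.example" for i in range(8)]  # hypothetical host names
ADDRESSES = [f"192.0.2.{i}" for i in range(16)]      # documentation-only range


def id_to_dns_answers(visitor_id: int) -> Dict[str, str]:
    """Pick one of 16 addresses per host; each choice carries 4 bits."""
    answers = {}
    for position, host in enumerate(HOSTS):
        digit = (visitor_id >> (4 * position)) & 0xF
        answers[host] = ADDRESSES[digit]
    return answers


def dns_answers_to_id(cached: Dict[str, str]) -> int:
    """Rebuild the identifier from the addresses the browser has cached."""
    visitor_id = 0
    for position, host in enumerate(HOSTS):
        visitor_id |= ADDRESSES.index(cached[host]) << (4 * position)
    return visitor_id
```

With 8 hosts and 16 addresses there are 16**8 = 4,294,967,296 distinct combinations, which is the figure behind the "8-9 names" estimate.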

Device parameters

All the previously described options rely on placing some unique identifier on the user's device, which the server can then read on repeat requests. But there is another, less obvious way to track users. It is based on querying or measuring the parameters of the client device. Each characteristic on its own yields only a few bits of data, but several of them taken together can single out almost any computer or phone. Such probing is much harder to detect and block, and it also makes it possible to identify a user who goes online from different browsers or uses "incognito" mode.
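The remark about "a few bits" per characteristic can be made concrete with a little arithmetic. In the sketch below the shares are invented purely for illustration (real values would have to be measured), and adding the bits together is only valid if the attributes are independent of each other.

```python
import math

# Hypothetical share of visitors who have the same value as this visitor
# for each attribute; real numbers would come from measurement.
SHARE_WITH_SAME_VALUE = {
    "user_agent": 1 / 1500,
    "screen_size": 1 / 30,
    "time_zone": 1 / 12,
    "font_list": 1 / 8000,
}


def bits_of_information(share: float) -> float:
    """Information carried by one observed value: rarer values carry more bits."""
    return -math.log2(share)


def total_bits(shares: dict) -> float:
    """Sum of bits over all attributes, assuming they are independent."""
    return sum(bits_of_information(s) for s in shares.values())
```

For scale, about 33 bits are enough to single out one person among roughly 8.6 billion, since 2**33 = 8,589,934,592.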

Browser fingerprinting

One of the simplest tracking methods is to create a unique tag by combining parameters that are available in the browser. Each of them separately is not very interesting for analysis. But their combination helps to "recognize" and identify the user's browser:

  1. User-Agent. From this header the server learns the version of the operating system and browser, and sometimes certain custom add-ons as well. If the User-Agent is unavailable or its accuracy is in doubt, the browser version can be determined by checking for features that were changed or first introduced in a particular version.
  2. Clock time. If the system clock is not synchronized with one of the standard time servers on the Internet, it gradually drifts away from the reference time. Specialized JavaScript can measure this discrepancy with microsecond precision. Even a synchronized clock shows a small offset, imperceptible to a person, that a script can still measure.
  3. Data about the CPU and GPU. It can be queried directly through GL_RENDERER or inferred from tests and benchmarks implemented in JavaScript.
  4. Size of the monitor or monitors (if the system is multi-monitor), as well as a similar parameter for the browser window.
  5. Custom fonts installed in the system. The list can be obtained using the getComputedStyle API.
  6. Availability and versions of ActiveX controls, plugins and Browser Helper Objects. The list is obtained through navigator.plugins[]. HTTP headers are studied for the same purpose, because many plugins announce themselves there.
  7. Information about installed software and extensions. For example, ad blockers noticeably change how a site's pages are loaded, and from the nature of these changes you can determine which extension was used and with which settings.
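Once such parameters have been collected, combining them into a single tag can be as simple as hashing them. The sketch below is one possible way to do it, not a description of any particular tracker; the function name and the attribute keys in the usage note are hypothetical.

```python
import hashlib
import json


def fingerprint(attributes: dict) -> str:
    """Hash the reported attributes into one stable identifier."""
    # Sorting the keys makes the result independent of collection order.
    canonical = json.dumps(attributes, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Calling `fingerprint({"user_agent": "...", "screen": "1920x1080", "fonts": ["Arial", "Calibri"]})` twice with the same values yields the same 64-character tag, while changing any single attribute produces a different one. That is also the weakness of such a naive scheme: one browser update changes the tag, so real systems are likely to use fuzzier matching.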

Network fingerprinting

Users can also be tracked through parameters related to the architecture of the local network and the behavior of network protocols.

This method allows users to be tracked regardless of the browser they are using. These parameters cannot be hidden by incognito mode, other private browsing options, or various security applications.

The list of this information includes:

  1. External IP address. This parameter is especially interesting with IPv6. In many cases the last octets of the address are derived from the device's MAC address, so they do not change when you change providers. In addition, the system almost always numbers the source ports of outgoing TCP/IP connections sequentially.
  2. Local IP address, which is assigned to users behind NAT or an HTTP proxy. Together with the external IP, it almost always allows accurate user identification.
  3. Information from the X-Forwarded-For HTTP header. It reveals the proxy servers used by the system. Together with the real IP address, it identifies the user with almost 100% certainty. The real IP itself is determined by one of the proxy bypass methods.
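The link between an IPv6 address and a MAC address mentioned in the first item is the modified EUI-64 scheme: one bit of the first octet is flipped and the bytes ff:fe are inserted in the middle. The helper below is a minimal sketch of that derivation; the function name and the sample MAC address are made up for illustration, and many systems now use randomized interface identifiers instead.

```python
def mac_to_interface_id(mac: str) -> str:
    """Derive the modified EUI-64 interface identifier from a MAC address."""
    octets = [int(part, 16) for part in mac.split(":")]
    if len(octets) != 6:
        raise ValueError("expected a MAC address like 00:25:96:12:34:56")
    octets[0] ^= 0x02                               # flip the universal/local bit
    eui64 = octets[:3] + [0xFF, 0xFE] + octets[3:]  # insert ff:fe in the middle
    groups = [f"{eui64[i]:02x}{eui64[i + 1]:02x}" for i in range(0, 8, 2)]
    return ":".join(groups)
```

For the MAC address 00:25:96:12:34:56 this gives 0225:96ff:fe12:3456. These last 64 bits stay the same whichever network prefix the provider assigns, which is what makes them useful for tracking.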

Behavior analysis

Modern technologies make it possible to study not only identifier tags and hardware features, but also the behavior of the person himself or herself, including regional settings and habits. This makes it possible to identify a user even if he or she changes browsers or uses "incognito" mode.

The following data are used to identify behavioral factors:

  1. Default language, system time zone, default character encoding. This information is sent in HTTP headers or is easy to obtain with a JS script.
  2. Browsing history and cache. Scripts probe the cache over time to find the longest-lived items, which reveals frequently visited sites. The easiest way to find out whether a site is in the cache is to measure how fast a page loads and compare the time expected on a first visit with the time needed when content comes from the cache. It is also possible to obtain a list of addresses from the browser history, although in modern browsers this requires some help from the user.
  3. Keyboard use (frequency and duration of key presses), features of mouse gestures, accelerometer data. Every person is unique here.
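The timing comparison from the second item can be reduced to a very small decision rule. The sketch below assumes the load times have already been measured elsewhere (in a browser this would be done by a script); the function name and the threshold ratio of 0.25 are arbitrary assumptions, not measured values.

```python
from statistics import median
from typing import List


def looks_cached(load_times_ms: List[float], uncached_baseline_ms: float,
                 ratio: float = 0.25) -> bool:
    """Guess that a resource came from the cache if it loaded much faster
    than a known uncached fetch of comparable size."""
    return median(load_times_ms) < uncached_baseline_ms * ratio
```

Taking the median of several samples makes the guess less sensitive to a single slow or fast measurement.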

Changed fonts and page scaling, the zoom level, and the use of special features also help to identify the user. So do browser features that the user configures: blocking of cookies, pop-ups and ads, DNS prefetching, Flash security settings, and so on. Paradoxically, the more diligently you customize a convenient and secure setup for yourself, the more information you give away for identifying you.

And this is only part of such information, the most popular and obvious. If you look deeper, there may be much more.

Conclusions

As you can see, there are in fact many ways to track users. Some are consequences of implementation bugs or oversights and can in theory be fixed. Others cannot be closed off without completely changing the way computers, networks, browsers and software work. Some methods can be countered, for example by cleaning the browser and the system regularly. Others you will never notice, however hard you try, and they are almost impossible to fight. Hence the most important rule of using the Internet: even if you use the best solutions for anonymous and secure browsing, remember that it is still possible to track you.
