Browser Fingerprinting
An exploration of the math to determine the uniqueness of a browser fingerprint.
In my previous post, I wrote about browser fingerprinting in a general way. This is an exploration of the math involved with determining uniqueness of a fingerprint. I initially took notes while reading Panopticlick’s excellent paper, and this is the result of that. Feel free to ignore this if you don’t want a deep dive.
In a browser fingerprint, some values hold more information than others. Let’s take a look at what we care about in a fingerprint.
(Information)
Each piece of a fingerprint contains a certain amount of information.
(Does a user have Javascript?)
(T/F): 1 bit of information
This is a boolean value, and therefore has one bit of information. It can be zero (false), or one (true).
(User Agent)
A variable length string: many bits of information
If my Chrome user agent is:
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36
This has 121 characters. In ASCII, each character is stored as 8 bits, or one byte (an interesting aside: ASCII only needs seven bits, but eight is used as a standard). Therefore, the total information for my Chrome browser is (121 * 8 =) 968 bits.
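As a quick check of that arithmetic, here is a minimal Python sketch. The variable names are my own, and it assumes one 8-bit byte per ASCII character, as described above.

```python
# Raw size of the user agent string above, assuming one 8-bit byte
# per ASCII character.
user_agent = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/81.0.4044.138 Safari/537.36"
)

num_chars = len(user_agent.encode("ascii"))  # one byte per ASCII character
raw_bits = num_chars * 8

print(num_chars, raw_bits)  # 121 968
```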
However, although there is almost a kilobit of information, not all of it helps to uniquely identify you. My user agent on another machine is:
Mozilla/5.0 (X11; CrOS armv7l 12871.91.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.127 Safari/537.36
Much of this string is the same as above. We only care about the information that makes you stand out from the crowd.
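To make the overlap concrete, the sketch below splits both user agent strings on spaces and lists which tokens they share. Splitting on spaces is a simplification I am assuming for illustration; it is not how the paper measures anything.

```python
# Compare the two user agent strings token by token.
# Tokenizing on spaces is an illustrative simplification.
ua_mac = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/81.0.4044.138 Safari/537.36"
)
ua_cros = (
    "Mozilla/5.0 (X11; CrOS armv7l 12871.91.0) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/81.0.4044.127 Safari/537.36"
)

tokens_mac = ua_mac.split(" ")
tokens_cros = ua_cros.split(" ")

# Tokens present in both strings carry nothing that tells the two apart.
shared = [t for t in tokens_mac if t in tokens_cros]
only_mac = [t for t in tokens_mac if t not in tokens_cros]
only_cros = [t for t in tokens_cros if t not in tokens_mac]

print("shared:   ", shared)
print("only mac: ", only_mac)
print("only cros:", only_cros)
```

Only the platform details and the Chrome patch version end up in the "only" lists; everything else is common to both machines.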
(Surprisal)
It took me a while to understand this concept. I thought it meant “A given browser has a percent chance of not being the expected value. The surprisal is the percent chance.” Not exactly. Surprisal is a measure of uniqueness. What it actually means is “The more surprisal a browser fingerprint has, the more likely it is to be distinguished from other browsers.” It is not a percent, but instead a value that can go up to infinity.
We measure surprisal in bits. Computers measure everything with bits (though that is only partly the reason they are used here). The information collected in a given fingerprint can be measured with bits.
Surprisal can be thought of as an amount of information about the identity of the object that is being fingerprinted, where each bit of information cuts the number of possibilities in half. — Panopticlick
(A bit of math)
Suppose that we have a browser fingerprinting algorithm F(·), such that when new browser installations x come into being, the outputs of F(x) upon them follow a discrete probability density function P(f_n), n ∈ [0, 1, ..., N]. Recall that the “self-information” or “surprisal” of a particular output from the algorithm is given by:
$$I(F(x) = f_n) = -\log_2 P(f_n)$$
Equation for Surprisal (also called Information Content)
The surprisal function looks a bit weird (with a negative sign). However, remember that probabilities are <= 1. Therefore, the log of a probability will be negative or zero, so surprisal is >= 0. The curve steepens as the probability approaches zero (from the right), and the surprisal increases towards infinity as it does. If the probability of a fingerprint is P(f) = 1/2, then the surprisal will equal 1. If P(f) = 1/4, I(f) = 2; if P(f) = 1/8, I(f) = 3; and so on.
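The same relationship can be written as a small Python function. This is a sketch; the name `surprisal` and the range check are my own additions.

```python
import math


def surprisal(probability: float) -> float:
    """Surprisal (self-information) in bits: I = -log2(P)."""
    if not 0 < probability <= 1:
        raise ValueError("probability must be in the range (0, 1]")
    return -math.log2(probability)


# Each halving of the probability adds exactly one bit of surprisal.
for p in (1 / 2, 1 / 4, 1 / 8):
    print(p, surprisal(p))
```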
For our purposes, we only care about the values between 0 and 1
The paper states that ~84% of the browsers had unique fingerprints. There were a total of 470,161 samples taken. Let’s say that each unique instance has probability P(f) ~ 1/470161, and therefore has ~ 19 bits of surprisal.
From the paper: most browsers are uniquely identified
If we look at the above graph, the most common fingerprint accounted for 1186 visitors. Each of these was a different person, but their fingerprint was the same. The probability of that fingerprint was P(f) ~ 1186/470161, and it would have ~ 8.6 bits of surprisal.
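Both estimates, for a unique fingerprint and for the most common one, come from the same calculation. Here is a sketch, where `surprisal_from_count` is a hypothetical helper name and the probability is approximated as count / total, as in the text.

```python
import math

TOTAL_SAMPLES = 470_161  # fingerprints in the paper's data set


def surprisal_from_count(count: int, total: int = TOTAL_SAMPLES) -> float:
    """Surprisal in bits of a fingerprint shared by `count` of `total` browsers."""
    return -math.log2(count / total)


unique_bits = surprisal_from_count(1)          # a fingerprint seen once
most_common_bits = surprisal_from_count(1186)  # the most common fingerprint

print(f"unique:      {unique_bits:.2f} bits")
print(f"most common: {most_common_bits:.2f} bits")
```

The results land close to the ~19 and ~8.6 bits quoted above.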
Disclaimer: the probability P(f) is not exactly correct. They took > 400,000 fingerprints, but the true probability of any one fingerprint could be much lower. Out of all people on earth, it is unlikely that 1 in every 400,000 has the same fingerprint. The probability could also be a bit higher, since we are making a few assumptions here. The paper goes a bit more into this.
(Some background)
The concept of surprisal, or information content, as Wikipedia calls it, comes from a branch of mathematics called Information Theory. In 1948, Claude Shannon was working on a way to identify useful signals from surrounding noise. His method was to compare a given signal to the average of all signals. If something was close to average, it was probably uninteresting. The more distinctive a signal was, the more information value it had.
For instance, the knowledge that some particular number will not be the winning number of a lottery provides very little information, because any particular chosen number will almost certainly not win. However, knowledge that a particular number will win a lottery has high value because it communicates the outcome of a very low probability event. — Wikipedia
With 19 bits of surprisal, these unique fingerprints communicate a lot of information.
(Surprisal of Individual Fingerprint Features)
We took a look at some examples of surprisal for an entire fingerprint. Now let’s take a look at surprisal for some of the pieces that make up the fingerprint. The paper does not give all of the data here, but it does give the entropy for many of the values.
Entropy is just the average of all surprisals
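Written out with the same symbols as the surprisal definition, this is the standard Shannon entropy. The notation H(F) is shorthand I am assuming here; each surprisal is weighted by how likely that fingerprint value is:

```latex
H(F) = \sum_{n=0}^{N} P(f_n)\, I(f_n) = -\sum_{n=0}^{N} P(f_n) \log_2 P(f_n)
```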
Disclaimer: The paper does not supply their full data, so some data I infer based on their anonymity sets.
(Does a user have Javascript?)
There are two equal options: the information is boolean. There are also two options for surprisal, but they may not be equal!
- Users without javascript: ~11,000
- Users with javascript: ~459,000
Therefore, the surprisal of each is:
- I(no JS): ~5.42 bits
- I(has JS): ~0.03 bits
Even though they both convey the same amount of information, the surprisal is very different. If you have disabled JS, this will make you more identifiable. We can also compare this to the entropy:
H(whether or not a user has JS enabled) = H(JS) = (11/470) * 5.42 + (459/470) * 0.03 = 0.16 bits
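The two surprisals and the entropy can be reproduced with a few lines of Python. This sketch uses the rounded counts inferred above, so the total is 470,000 rather than the paper's 470,161, and the variable names are my own.

```python
import math

# Approximate counts inferred in the text (not exact figures from the paper).
no_js_users = 11_000
js_users = 459_000
total = no_js_users + js_users

p_no_js = no_js_users / total
p_js = js_users / total

# Surprisal of each of the two possible values.
i_no_js = -math.log2(p_no_js)
i_js = -math.log2(p_js)

# Entropy is the probability-weighted average of the surprisals.
entropy_js = p_no_js * i_no_js + p_js * i_js

print(f"I(no JS)  = {i_no_js:.2f} bits")
print(f"I(has JS) = {i_js:.2f} bits")
print(f"H(JS)     = {entropy_js:.2f} bits")
```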
Since the surprisal of no JS is significantly higher than the entropy (the average), it is very useful relative to the crowd. The opposite value, has JS, is not at all useful relative to the average.
The interesting thing about Javascript is that it will have an effect on later values.
(User Agent)
In Appendix A of the paper, it states that the entropy of the user agent is 10 bits (the entropy is the average surprisal). Although a user agent may have 900+ bits of information (the one from Chrome above had 968), the difference between user agents is much less. User agents that have a surprisal above 10 bits will convey more unique information about the user.
(Plugins)
Again from the paper: the entropy of plugins is 15 bits. We do not have the exact data for each user, but we can make an inference here. If the surprisal of your plugin list is greater than 15 bits, you deviate from the norm significantly. Conversely, if your surprisal is less, you will not stand out.
This links back to having JS enabled. Every user who disables JS will not show plugins. This means that ~11,000 users have the same value for plugins. Sharing a value made those users stand out for JS, where there were only two options, but there are many more possible values for plugins. Looking at their example plugin list, there are well over 2000 characters for just one example! My plugin list looks like this: "Chrome PDF Plugin; Portable Document Format; internal-pdf-viewer; ,Chrome PDF Viewer; ; mhjfbmdgcfjbbpaeojofohoefgiehjai; ,Native Client; ; internal-nacl-plugin; ". With so many different options for the plugins, this means that the probability of the empty, no-JS plugin value is actually relatively high. With a high probability relative to the average, the surprisal is low. If you have Javascript disabled, you will not stand out from the crowd (with respect to plugins).
We can infer the surprisal of plugins for users with no Javascript:
- Users without javascript: ~11,000
The surprisal of plugins for a user with no Javascript will once again be ~5.42 bits. That may seem high, but remember that the average surprisal is 15 bits. Let's imagine a user who has a unique list of plugins. They would have P(f) ~ 1/470161, and then have a surprisal of ~ 19 bits. Let's compare the total surprisal for each user.
- I(has JS? | no JS) = 5.42 bits
- I(plugins | no JS) = 5.42 bits
- I(both | no JS) = 10.84 bits
- I(has JS? | has JS) = 0.03 bits
- I(plugins | has JS) = 18.84 bits
- I(both | has JS) = 18.87 bits
Although the user without Javascript had a higher surprisal initially, their total surprisal will be much lower.
If a user has Javascript enabled, and they share their plugins with 127 other people, that user will still have a higher total surprisal.
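The comparison can be checked with a short Python sketch. The helper name `bits` is hypothetical, and the count of JS users is taken as the total minus the ~11,000 inferred above. It follows the text in simply adding the per-feature surprisals, which treats the two features as if they were independent.

```python
import math

TOTAL = 470_161        # fingerprints in the paper's data set
NO_JS_USERS = 11_000   # approximate count inferred in the text


def bits(count: int, total: int = TOTAL) -> float:
    """Surprisal in bits of a value shared by `count` of `total` browsers."""
    return -math.log2(count / total)


# JS disabled: both features fall into the same ~11,000-user group.
no_js_total = bits(NO_JS_USERS) + bits(NO_JS_USERS)

# JS enabled, plugin list shared with 127 others (128 users in total).
js_shared_total = bits(TOTAL - NO_JS_USERS) + bits(128)

# JS enabled, plugin list unique in the data set.
js_unique_total = bits(TOTAL - NO_JS_USERS) + bits(1)

print(f"no JS:              {no_js_total:.2f} bits")
print(f"JS, shared plugins: {js_shared_total:.2f} bits")
print(f"JS, unique plugins: {js_unique_total:.2f} bits")
```

Even with 128 users sharing the same plugin list, the JS-enabled total comes out above the no-JS total.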
Blocking JS makes you stand out for one feature, but will make you blend in with any feature that requires JS to find the values.
(In Conclusion)
Math is hard.
Even though I made a few estimates from the data, we can still infer some interesting things.
Sending more information does not necessarily mean you stand out.
Values in a fingerprint are not independent.
Think about how you compare with the crowd. There are multiple crowds you should consider.
(Iffy Statistics)
There are statisticians (and other amateurs) who are smarter than me, that may get angry at me. A few liberties were taken, and I wrote disclaimers about them. I think that my assumptions are within reason, and the conclusions I come to are also reasonable.
However, I may not have caught all of the assumptions. Feel free to let me know if there is anything I missed. You can open an issue on GitHub for my blog.
(Further Reading & Sources)
- How Unique is Your Browser? (a whitepaper by the EFF)
- Panopticlick (EFF will show your fingerprint to you)
- Entropy & Information Content on Wikipedia
Originally published at https://github.com.