方法1
统计逻辑:统计一个loginname的所有去重的通讯录数C,统计这个Loginname对应的每个设备对应的通讯录c1,c2,c3…cn;
X=(c1/c+c2/c+c3/c+….cn/c)/n=(c1+c2+…cn)/(c*n)
数据表现如下:
将3个以上设备计算出来的结果关联上黑名单库,在8092个loginname中,有3915个,
占48.4%。将是否黑名单当作Y值,X值为自变量进行回归:
Y与X之间的相关性为0.1229,显著性正相关。
回归统计 | ||||||||
Multiple R | 0.12294864 | |||||||
R Square | 0.01511637 | |||||||
Adjusted R Square | 0.01499461 | |||||||
标准误差 | 0.49600959 | |||||||
观测值 | 8091 | |||||||
方差分析 | ||||||||
df | SS | MS | F | Significance F | ||||
回归分析 | 1 | 30.54481 | 30.54481 | 124.153 | 1.26E-28 | |||
残差 | 8089 | 1990.1 | 0.246026 | |||||
总计 | 8090 | 2020.645 | ||||||
Coefficients | 标准误差 | t Stat | P-value | Lower 95% | Upper 95% | 下限 95.0% | 上限 95.0% | |
Intercept | 0.65722473 | 0.016506 | 39.81648 | 0 | 0.624868 | 0.689581 | 0.624868 | 0.689581 |
X Variable 1 | -0.2788014 | 0.025022 | -11.1424 | 1.26E-28 | -0.32785 | -0.22975 | -0.32785 | -0.22975 |
方法2
统计逻辑:计算出每2个设备之间通讯录的相同率,如Xab,Xac,Xad,Xcd…XnXn-1,关联上黑名单库,在当时7383的大于等于3个设备的loginname中,按照设备个数的不同分组,然后进行回归,
3个设备的loginname有6512个,其中黑名单3294个,占比50.6%,X与Y的关系程正弱相关,前2个之间的相同率(自变量)对Y的影响最大。
Multiple R | 0.119978 | |||||||
R Square | 0.014395 | |||||||
Adjusted R Square | 0.01394 | |||||||
标准误差 | 0.496507 | |||||||
观测值 | 6512 | |||||||
方差分析 | ||||||||
df | SS | MS | F | Significance F | ||||
回归分析 | 3 | 23.43156 | 7.810518 | 31.68321 | 2.52E-20 | |||
残差 | 6508 | 1604.347 | 0.246519 | |||||
总计 | 6511 | 1627.778 | ||||||
Coefficients | 标准误差 | t Stat | P-value | Lower 95% | Upper 95% | 下限 95.0% | 上限 95.0% | |
Intercept | 0.40805 | 0.011882 | 34.34126 | 1.1E-237 | 0.384757 | 0.431344 | 0.384757 | 0.431344 |
X Variable 1 | 0.115797 | 0.039147 | 2.958019 | 0.003107 | 0.039056 | 0.192537 | 0.039056 | 0.192537 |
X Variable 2 | 0.163985 | 0.039448 | 4.156987 | 3.27E-05 | 0.086654 | 0.241316 | 0.086654 | 0.241316 |
X Variable 3 | 0.087144 | 0.038996 | 2.234676 | 0.025472 | 0.010699 | 0.16359 | 0.010699 | 0.16359 |
4个设备的组,统计回归相对高0.05个百分点的相关性,但是相关性依然较弱,第一个和第6个变量起到决定性影响,5个以上的设备存在高度正相关,但是因为数据量太少,
不认为具有统计学代表意义。
回归统计 |
|
| ||||||||
Multiple R | 0.171505 |
|
| |||||||
R Square | 0.029414 |
|
| |||||||
Adjusted R Square | 0.020577 |
|
| |||||||
标准误差 | 0.493803 |
|
| |||||||
观测值 | 666 |
|
| |||||||
|
| |||||||||
方差分析 |
|
| ||||||||
df | SS | MS |
| F | Significance F |
| ||||
回归分析 | 6 | 4.869808 | 0.811635 |
| 3.32853 | 0.003088 |
| |||
残差 | 659 | 160.6918 | 0.243842 |
|
| |||||
总计 | 665 | 165.5616 |
|
| ||||||
|
| |||||||||
Coefficients | 标准误差 | t Stat |
| P-value | Lower 95% | Upper 95% |
| 下限 95.0% | 上限 95.0% | |
Intercept | 0.355842 | 0.071268 | 4.993029 |
| 7.62E-07 | 0.215903 | 0.495781 |
| 0.215903 | 0.495781 |
X Variable 1 | 0.209455 | 0.131209 | 1.596339 |
| 0.110893 | -0.04818 | 0.467094 |
| -0.04818 | 0.467094 |
X Variable 2 | 0.089427 | 0.136561 | 0.654849 |
| 0.512793 | -0.17872 | 0.357575 |
| -0.17872 | 0.357575 |
X Variable 3 | 0.09917 | 0.127629 | 0.777021 |
| 0.437425 | -0.15144 | 0.349778 |
| -0.15144 | 0.349778 |
X Variable 4 | -0.10775 | 0.144002 | -0.74823 |
| 0.454591 | -0.3905 | 0.175012 |
| -0.3905 | 0.175012 |
X Variable 5 | -0.14237 | 0.121381 | -1.17293 |
| 0.241249 | -0.38071 | 0.095969 |
| -0.38071 | 0.095969 |
方法3
没能获取用户更换设备的时间,分析业务逻辑,可以通过手机号登陆APP时候对应的版本号来间接代表更换设备的时间。成为自变量的X有:设备数,更换设备的次数,每2个设备更换的最大时间差,最小时间差,平均更换时间,更换设备时间差的标准差,累计更换时间,进行回归得到如下结果:设备累计更换时间和更换时间差的标准差成显著反比。
回归统计 | | |
| | | | | | |
Multiple R | 0.154438366 | | |
| | | | | |
R Square | 0.023851209 | | |
| | | | | |
Adjusted R Square | 0.023259245 | | |
| | | | | |
标准误差 | 0.492116372 | | |
| | | | | |
观测值 | 9901 | | |
| | | | | |
| | | |
| | | | | |
方差分析 | | | |
| | | | | |
df | SS | MS |
| F | Significance F | | | | |
回归分析 | 6 | 58.54663 | 9.757771812 |
| 40.29164794 | 9.68003E-49 | | | |
残差 | 9894 | 2396.114 | 0.242178524 |
| | | | | |
总计 | 9900 | 2454.661 |
| | | | |||
| | | |
| | | | | |
Coefficients | 标准误差 | t Stat |
| P-value | Lower 95% | Upper 95% | 下限 95.0% | 上限 95.0% | |
Intercept | 0.448879765 | 0.032716 | 13.72029481 |
| 1.87579E-42 | 0.384748795 | 0.513010734 | 0.384749 | 0.513011 |
X Variable 1 | 0.054091353 | 0.010083 | 5.364858051 |
| 8.28418E-08 | 0.034327536 | 0.073855171 | 0.034328 | 0.073855 |
X Variable 2 | -0.014132904 | 0.001316 | -10.74041732 |
| 9.23172E-27 | -0.016712261 | -0.011553547 | -0.01671 | -0.01155 |
X Variable 3 | 0.000177389 | 3.03E-05 | 5.85593915 |
| 4.8943E-09 | 0.00011801 | 0.000236767 | 0.000118 | 0.000237 |
X Variable 4 | 1.01037E-08 | 1.25E-09 | 8.067239472 |
| 8.02511E-16 | 7.64868E-09 | 1.25587E-08 | 7.65E-09 | 1.26E-08 |
X Variable 5 | -7.0701E-09 | 1.23E-09 | -5.741173492 |
| 9.67925E-09 | -9.48403E-09 | -4.65616E-09 | -9.5E-09 | -4.7E-09 |
X Variable 6 | -3.40952E-08 | 3.46E-09 | -9.855884854 |
| 8.23224E-23 | -4.08763E-08 | -2.73141E-08 | -4.1E-08 | -2.7E-08 |
方法4
回归统计 | ||||||||
Multiple R | 0.211204 | |||||||
R Square | 0.044607 | |||||||
Adjusted R Square | 0.043425 | |||||||
标准误差 | 0.488799 | |||||||
观测值 | 8091 | |||||||
方差分析 | ||||||||
df | SS | MS | F | Significance F | ||||
回归分析 | 10 | 90.13552 | 9.013552 | 37.72553 | 3.87E-73 | |||
残差 | 8080 | 1930.51 | 0.238924 | |||||
总计 | 8090 | 2020.645 | ||||||
Coefficients | 标准误差 | t Stat | P-value | Lower 95% | Upper 95% | 下限 95.0% | 上限 95.0% | |
Intercept | 0.824792 | 0.042672 | 19.32859 | 2.08E-81 | 0.741144 | 0.90844 | 0.741144 | 0.90844039 |
X Variable 1 | 0.000604 | 0.01188 | 0.050819 | 0.959471 | -0.02269 | 0.023893 | -0.02269 | 0.02389257 |
X Variable 2 | -0.01201 | 0.001585 | -7.57658 | 3.94E-14 | -0.01512 | -0.0089 | -0.01512 | -0.0089046 |
X Variable 3 | 0.000106 | 3.51E-05 | 3.01029 | 0.002618 | 3.68E-05 | 0.000174 | 3.68E-05 | 0.0001743 |
X Variable 4 | 1.11E-08 | 2.04E-09 | 5.45342 | 5.09E-08 | 7.12E-09 | 1.51E-08 | 7.12E-09 | 1.5115E-08 |
X Variable 5 | -7.5E-09 | 1.78E-09 | -4.239 | 2.27E-05 | -1.1E-08 | -4E-09 | -1.1E-08 | -4.05E-09 |
X Variable 6 | 4.7E-10 | 2.6E-09 | 0.181014 | 0.856361 | -4.6E-09 | 5.56E-09 | -4.6E-09 | 5.5622E-09 |
X Variable 7 | -2.2E-08 | 4.59E-09 | -4.85498 | 1.23E-06 | -3.1E-08 | -1.3E-08 | -3.1E-08 | -1.329E-08 |
X Variable 8 | 2.45E-11 | 1.14E-11 | 2.149089 | 0.031657 | 2.15E-12 | 4.68E-11 | 2.15E-12 | 4.68E-11 |
X Variable 9 | -0.00266 | 0.000444 | -5.99087 | 2.18E-09 | -0.00353 | -0.00179 | -0.00353 | -0.0017881 |
X Variable 10 | -0.24645 | 0.02494 | -9.88193 | 6.72E-23 | -0.29534 | -0.19757 | -0.29534 | -0.1975655 |
单个X对Y都有一定影响,现在将设备更换时间的频率、更换频率标准偏差、每个设备的平均更换周期、设备相同率几个维度组合成Xn变量回归,结果显示X对Y的总体复相关性已经有较大幅度增加,达到0.211204。
虽然这些变量对判定是否黑名单的决定性怡然较小,需要挖掘更多的用户特征进行丰富。
dc | dcs | abrankcount | maxtime | mintime | avgtime | stddevt | totaltime | interval | pc |
3 | 3 | 3 | 3237421 | -4064 | 2155571.333 | 1527094 | 6466714 | 13 | 0.457187 |
3 | 3 | 3 | 20561466 | 9930757 | 13707644 | 4854801 | 41122932 | 80 | 0.57037 |
3 | 3 | 3 | 18509989 | -19642609 | -755080 | 15578021 | 2265240 | 76 | 0.343496 |
3 | 3 | 3 | -14023 | -1893173 | -1262115.333 | 882553.1 | 3786346 | 8 | 0.335878 |
方法5
回归统计 | ||||||||
Multiple R | 0.506696 | |||||||
R Square | 0.256741 | |||||||
Adjusted R Square | 0.256666 | |||||||
标准误差 | 0.429309 | |||||||
观测值 | 9901 | |||||||
方差分析 | ||||||||
df | SS | MS | F | Significance F | ||||
回归分析 | 1 | 630.2114 | 630.2114 | 3419.367 | 0 | |||
残差 | 9899 | 1824.45 | 0.184306 | |||||
总计 | 9900 | 2454.661 | ||||||
Coefficients | 标准误差 | t Stat | P-value | Lower 95% | Upper 95% | 下限 95.0% | 上限 95.0% | |
Intercept | 0.760723 | 0.006787 | 112.093 | 0 | 0.74742 | 0.774026 | 0.74742 | 0.774026 |
X Variable 1 | -0.19139 | 0.003273 | -58.4754 | 0 | -0.1978 | -0.18497 | -0.1978 | -0.18497 |
求出每个手机更换的设备中是IOS的数量,回归出来的结果显示X单变量对Y的影响是显著正相关,达到0.506696
方法6
将以上所有变量加入,形成的特征如下:
ifblack | dc | dcs | abrankcount | maxtime | mintime | avgtime | stddevt | totaltime | interval | pc | deiosnum |
1 | 3 | 3 | 3 | 3237421 | -4064 | 2155571.333 | 1527094 | 6466714 | 13 | 0.457 | 2 |
0 | 3 | 3 | 3 | 20561466 | 9930757 | 13707644 | 4854801 | 41122932 | 80 | 0.57 | 2 |
1 | 3 | 3 | 3 | 18509989 | -19642609 | -755080 | 15578021 | 2265240 | 76 | 0.343 | 0 |
0 | 3 | 3 | 3 | -14023 | -1893173 | -1262115.333 | 882553.1 | 3786346 | 8 | 0.336 | 1 |
1 | 3 | 3 | 3 | -91956 | -724724 | -483149.3333 | 279151.3 | 1449448 | 3 | 0.944 | 1 |
1 | 3 | 3 | 3 | 412948 | -845102 | -288102.6667 | 523600.1 | 864308 | 4 | 0.344 | 0 |
0 | 3 | 3 | 3 | 5092634 | 1290609 | 3395089.333 | 1578617 | 10185268 | 20 | 0.335 | 1 |
回归分析相关复系数达到0.547115,较高的因果关系
回归统计 | ||||||||
Multiple R | 0.547115 | |||||||
R Square | 0.299335 | |||||||
Adjusted R Square | 0.298556 | |||||||
标准误差 | 0.417037 | |||||||
观测值 | 9901 | |||||||
方差分析 | ||||||||
df | SS | MS | F | Significance F | ||||
回归分析 | 11 | 734.7663 | 66.79693 | 384.067 | 0 | |||
残差 | 9889 | 1719.895 | 0.17392 | |||||
总计 | 9900 | 2454.661 | C | |||||
Coefficients | 标准误差 | t Stat | P-value | Lower 95% | Upper 95% | 下限 95.0% | 上限 95.0% | |
Intercept | 0.173297 | 0.033632 | 5.152794 | 2.62E-07 | 0.107372 | 0.239222 | 0.107372 | 0.239222 |
X Variable 1 | 0.190591 | 0.010059 | 18.94818 | 1.12E-78 | 0.170874 | 0.210307 | 0.170874 | 0.210307 |
X Variable 2 | -0.00304 | 0.001233 | -2.46576 | 0.013689 | -0.00546 | -0.00062 | -0.00546 | -0.00062 |
X Variable 3 | 1.19E-05 | 2.81E-05 | 0.423152 | 0.672194 | -4.3E-05 | 6.7E-05 | -4.3E-05 | 6.7E-05 |
X Variable 4 | 5.53E-09 | 1.51E-09 | 3.651315 | 0.000262 | 2.56E-09 | 8.49E-09 | 2.56E-09 | 8.49E-09 |
X Variable 5 | -6.4E-09 | 1.32E-09 | -4.80573 | 1.56E-06 | -8.9E-09 | -3.8E-09 | -8.9E-09 | -3.8E-09 |
X Variable 6 | 1.14E-09 | 1.92E-09 | 0.592423 | 0.553581 | -2.6E-09 | 4.91E-09 | -2.6E-09 | 4.91E-09 |
X Variable 7 | -2.4E-08 | 3.39E-09 | -7.09928 | 1.34E-12 | -3.1E-08 | -1.7E-08 | -3.1E-08 | -1.7E-08 |
X Variable 8 | 6.86E-12 | 8.73E-12 | 0.786271 | 0.431727 | -1E-11 | 2.4E-11 | -1E-11 | 2.4E-11 |
X Variable 9 | 0.000974 | 0.000332 | 2.938598 | 0.003305 | 0.000324 | 0.001624 | 0.000324 | 0.001624 |
X Variable 10 | 0.070007 | 0.013677 | 5.118475 | 3.14E-07 | 0.043197 | 0.096818 | 0.043197 | 0.096818 |
X Variable 11 | -0.21499 | 0.003476 | -61.8428 | 0 | -0.22181 | -0.20818 | -0.22181 | -0.20818 |