



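1. Parameter Estimation

Descriptive statistics

# - Data: year-over-year housing price growth rates by residential community
# - Column name - meaning
# - dis_name - name of the residential community
# - rate - year-over-year housing price growth rate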
import pandas as pd

house_price_gr = pd.read_csv(r'../data/house_price_gr.csv', encoding='gbk')
house_price_gr


     dis_name          rate
0    东城区甘南小区      0.169747
1    东城区察慈小区      0.165484
2    东城区胡家园小区    0.141358
3    东城区台基厂小区    0.063197
4    东城区青年湖小区    0.101528
...  ...               ...
145  密云县沿湖小区      0.121524
146  密云县东菜园小区    0.104666
147  密云县花园小区      0.137225
148  开发区鹿鸣苑        0.073119
149  开发区星岛嘉园      0.048391

150 rows × 2 columns

house_price_gr.describe(include='all')


        dis_name              rate
count   150                   150.000000
unique  150                   NaN
top     丰台区角门东里小区    NaN
freq    1                     NaN
mean    NaN                   0.110061
std     NaN                   0.041333
min     NaN                   0.029540
25%     NaN                   0.080027
50%     NaN                   0.104908
75%     NaN                   0.140066
max     NaN                   0.243743

Histogram and kernel density curve

import seaborn as sns
from scipy import stats

%matplotlib inline
sns.distplot(house_price_gr.rate, kde=True, fit=stats.norm) # Histogram with KDE and a fitted normal curve (distplot is deprecated since seaborn 0.11)
<matplotlib.axes._subplots.AxesSubplot at 0x2a0d19b2548>

[Figure: histogram of rate with kernel density curve and fitted normal curve]

Q-Q plot

import statsmodels.api as sm
from matplotlib import pyplot as plt

fig = sm.qqplot(house_price_gr.rate, fit=True, line='45')
plt.show()  # fig.show() triggers a "non-GUI backend" warning under the inline backend

[Figure: Q-Q plot of rate]
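The points behind a Q-Q plot can also be computed without drawing anything. The sketch below uses scipy.stats.probplot on synthetic, normally distributed data (the sample is made up for illustration and is not the housing data). A correlation r close to 1 between the theoretical quantiles and the ordered sample values suggests the sample is close to normal.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for a roughly normal variable (NOT the housing data)
rng = np.random.default_rng(0)
sample = rng.normal(loc=0.11, scale=0.04, size=150)

# probplot returns the Q-Q points plus a least-squares line fitted through them
(theoretical_q, ordered_values), (slope, intercept, r) = stats.probplot(sample, dist="norm")

print(f"slope={slope:.4f}, intercept={intercept:.4f}, r={r:.4f}")
```

For a normal sample, the slope and intercept of the fitted line are rough estimates of the standard deviation and the mean.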

Box plot

house_price_gr.plot(kind='box') # Box Plots
<matplotlib.axes._subplots.AxesSubplot at 0x2a0d478e7c8>

[Figure: box plot of rate]
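The whiskers of a box plot follow the 1.5 × IQR rule by default in matplotlib, which pandas uses for plotting. As a rough check, the fences can be worked out from the quartiles that describe() printed earlier; the numbers below are copied from that output, and the variable names are only illustrative.

```python
# Quartiles and extremes copied from the describe() output for `rate`
q1, q3 = 0.080027, 0.140066
lowest, highest = 0.029540, 0.243743

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points beyond the fences are drawn as individual outlier markers
has_low_outliers = lowest < lower_fence
has_high_outliers = highest > upper_fence
print(f"fences=({lower_fence:.4f}, {upper_fence:.4f}), "
      f"low outliers={has_low_outliers}, high outliers={has_high_outliers}")
```

The maximum lies above the upper fence, so the plot should show at least one outlier on the high side and none on the low side.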

Confidence interval estimation

se = house_price_gr.rate.std() / len(house_price_gr) ** 0.5
LB = house_price_gr.rate.mean() - 1.98 * se
UB = house_price_gr.rate.mean() + 1.98 * se
(LB, UB)
(0.1033788285317501, 0.11674316487209627)
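The multiplier 1.98 above is a rounded critical value of the t distribution. A minimal sketch of where it comes from, assuming a two-sided 95% interval and the 150 rows of this dataset:

```python
from scipy import stats

n = 150  # number of rows in house_price_gr
t_crit = stats.t.ppf(0.975, df=n - 1)  # two-sided 95% critical value of the t distribution
z_crit = stats.norm.ppf(0.975)         # large-sample normal approximation
print(f"t_crit={t_crit:.4f}, z_crit={z_crit:.4f}")
```

The exact t value is slightly below 1.98, which is why the interval computed with 1.98 is a little wider than the one obtained from the t distribution directly.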
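# For a confidence interval at an arbitrary confidence level, write a small function
def confint(x, alpha=0.05):
    n = len(x)
    xb = x.mean()
    df = n - 1
    # half-width: standard error times the t critical value
    tmp = (x.std() / n ** 0.5) * stats.t.ppf(1 - alpha / 2, df)
    return {'Mean': xb, 'Degree of Freedom': df, 'LB': xb - tmp, 'UB': xb + tmp}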

confint(house_price_gr.rate, 0.05)
{'Mean': 0.11006099670192318,
'Degree of Freedom': 149,
'LB': 0.10339228338892811,
'UB': 0.11672971001491825}
# Or use DescrStatsW
d1 = sm.stats.DescrStatsW(house_price_gr.rate)
d1.tconfint_mean(0.05)
(0.10339228338892814, 0.11672971001491828)

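2. Hypothesis Testing and the One-Sample t-Test

Did this year's housing price growth rate exceed the 10% threshold?

Large, central, small

The law of large numbers, the central limit theorem, and small-probability events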

d1 = sm.stats.DescrStatsW(house_price_gr.rate)
print('t-statistic=%6.4f, p-value=%6.4f, df=%s' %d1.ttest_mean(0.1))
t-statistic=2.9812, p-value=0.0034, df=149.0
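The t statistic reported above can be reproduced by hand from the summary statistics. The sketch below copies n, the mean and the standard deviation from the describe() output, so it agrees only up to rounding. It also shows a one-sided p-value, which arguably fits the question "did the rate exceed 10%?" better than the two-sided default. In statsmodels this should correspond to ttest_mean(0.1, alternative='larger'), though that call is not used in the original.

```python
from scipy import stats

# Summary statistics copied from the describe() output above
n, mean, std = 150, 0.110061, 0.041333
mu0 = 0.10  # hypothesised growth rate under H0

se = std / n ** 0.5                      # standard error of the mean
t_stat = (mean - mu0) / se
p_two_sided = 2 * stats.t.sf(abs(t_stat), df=n - 1)
p_one_sided = stats.t.sf(t_stat, df=n - 1)   # H1: mean > mu0
print(f"t={t_stat:.4f}, two-sided p={p_two_sided:.4f}, one-sided p={p_one_sided:.4f}")
```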

3. Two-Sample t-Test

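# Data: credit card spending data

Field            Meaning
id               id
Acc              Whether a card has been opened (1 = opened)
avg_exp          Average monthly credit card spending (yuan)
avg_exp_ln       Natural log of average monthly credit card spending
gender           Gender (male = 1)
Age              Age
Income           Annual income (10,000 yuan)
Ownrent          Owns a home (yes = 1; no = 0)
Selfempl         Self-employed (1 = yes, 0 = no)
dist_home_val    Average home price in the residential community (10,000 yuan)
dist_avg_income  Local average income
high_avg         Income minus the local average income
edu_class        Education level: primary school or below = 0, secondary school = 1, bachelor's = 2, postgraduate = 3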

creditcard = pd.read_csv(r'../data/creditcard_exp.csv', skipinitialspace=True)
# skipinitialspace : boolean, default False
# Skip whitespace after the delimiter (the default, False, keeps it).
creditcard


     id  Acc  avg_exp  avg_exp_ln  gender  Age    Income  Ownrent  Selfempl  dist_home_val  dist_avg_income  age2   high_avg  edu_class
0    19    1  1217.03    7.104169       1   40  16.03515        1         1          99.93        15.932789  1600   0.102361          3
1     5    1  1251.50    7.132098       1   32  15.84750        1         0          49.88        15.796316  1024   0.051184          2
2    95    0      NaN         NaN       1   36   8.40000        0         0          88.61         7.490000  1296   0.910000          1
3    86    1   856.57    6.752936       1   41  11.47285        1         0          16.10        11.275632  1681   0.197218          3
4    50    1  1321.83    7.186772       1   28  13.40915        1         0         100.39        13.346474   784   0.062676          2
...  ...  ...      ...         ...     ...  ...       ...      ...       ...            ...              ...   ...        ...        ...
95   96    0      NaN         NaN       0   22   1.56000        0         0          68.33         1.840000   484  -0.280000          1
96   43    1   593.92    6.386745       0   30   4.37960        0         0         124.23         5.040632   900  -0.661032          1
97   60    1   418.78    6.037346       0   21   3.49390        0         0          34.46         3.828842   441  -0.334942          1
98   28    1   163.18    5.094854       0   22   3.81590        0         0          63.27         3.997789   484  -0.181889          0
99   94    0      NaN         NaN       0   35   1.50000        0         0         109.16         1.930000  1225  -0.430000          1

100 rows × 14 columns

creditcard['Income'].groupby(creditcard['Acc']).describe()


     count      mean       std     min       25%       50%       75%       max
Acc
0     30.0  3.149333  1.406482  1.5000  2.285000  2.905000  3.807500   8.40000
1     70.0  7.424706  3.077986  3.4939  5.175662  6.443525  8.494237  16.90015

Step 1: test for homogeneity of variance

Suc0 = creditcard[creditcard['Acc'] == 0]['Income']
Suc1 = creditcard[creditcard['Acc'] == 1]['Income']
leveneTestRes = stats.levene(Suc0, Suc1, center='median')
print('w-value=%6.4f, p-value=%6.4f' %leveneTestRes)
w-value=7.1829, p-value=0.0086

Step 2: t-test

# Levene's p-value is below 0.05, so equal variances are rejected: use Welch's t-test (equal_var=False)
stats.ttest_ind(Suc0, Suc1, equal_var=False)
# Or try: sm.stats.ttest_ind(Suc0, Suc1, usevar='unequal')
Ttest_indResult(statistic=-9.529516968736448, pvalue=1.3263066753296544e-15)
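The Welch statistic above can be checked from the group summaries alone. This sketch copies the counts, means and standard deviations of Income from the groupby describe() output (so it matches only up to rounding) and compares a hand computation with scipy.stats.ttest_ind_from_stats.

```python
from scipy import stats

# Group summaries of Income copied from the groupby describe() output
n0, mean0, std0 = 30, 3.149333, 1.406482   # Acc == 0
n1, mean1, std1 = 70, 7.424706, 3.077986   # Acc == 1

# Welch's statistic: difference in means over the unpooled standard error
se = (std0 ** 2 / n0 + std1 ** 2 / n1) ** 0.5
t_manual = (mean0 - mean1) / se

# The same test via SciPy, using summary statistics only
res = stats.ttest_ind_from_stats(mean0, std0, n0, mean1, std1, n1, equal_var=False)
print(f"manual t={t_manual:.4f}, scipy t={res.statistic:.4f}, p={res.pvalue:.3g}")
```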
# Test whether gender affects average monthly spending.
# Note how missing values are handled.
creditcard['avg_exp'].groupby(creditcard['gender']).describe()


        count       mean         std     min       25%       50%        75%      max
gender
0        50.0   925.7052  430.833365  163.18  593.3125   813.650  1204.7775  1992.39
1        20.0  1128.5310  462.281389  648.15  829.8600  1020.005  1238.2025  2430.03

female = creditcard[creditcard['gender'] == 0]['avg_exp'].dropna()
male = creditcard[creditcard['gender'] == 1]['avg_exp'].dropna()
leveneTestRes = stats.levene(female, male, center='median')
print('w-value=%6.4f, p-value=%6.4f' %leveneTestRes)
w-value=0.0683, p-value=0.7946
stats.ttest_ind(female, male, equal_var=True)
Ttest_indResult(statistic=-1.742901386808629, pvalue=0.08587122878448449)
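Both examples follow the same two-step pattern: Levene's test decides whether the variances can be treated as equal, and that decision sets equal_var in the t-test. A minimal sketch of the pattern as one helper; the function name two_sample_ttest, the 0.05 cut-off and the synthetic data are assumptions for illustration, not part of the original analysis.

```python
import numpy as np
from scipy import stats

def two_sample_ttest(a, b, alpha=0.05):
    """Levene's test first, then a pooled or Welch t-test accordingly (hypothetical helper)."""
    _, levene_p = stats.levene(a, b, center='median')
    # Failing to reject equal variances -> pooled t-test; otherwise Welch's t-test
    equal_var = bool(levene_p > alpha)
    t_stat, p_value = stats.ttest_ind(a, b, equal_var=equal_var)
    return {'levene_p': levene_p, 'equal_var': equal_var, 't': t_stat, 'p': p_value}

# Synthetic groups (made up for illustration): same mean, very different spread
rng = np.random.default_rng(0)
narrow = rng.normal(loc=5.0, scale=1.0, size=200)
wide = rng.normal(loc=5.0, scale=5.0, size=200)

result = two_sample_ttest(narrow, wide)
print(result)
```

With real data, drop missing values first, as done for female and male above.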

4. Analysis of Variance (ANOVA)

- One-way ANOVA

pd.set_option('display.max_columns', None) # show all columns
creditcard.groupby('edu_class')[['avg_exp']].describe().T


edu_class               0           1            2            3
avg_exp count    2.000000   23.000000    23.000000    22.000000
        mean   207.370000  641.937826   973.321304  1422.280909
        std     62.494097  147.577741   229.163196   435.281442
        min    163.180000  418.780000   610.250000   816.030000
        25%    185.275000  525.595000   807.820000  1166.997500
        50%    207.370000  593.920000   959.830000  1343.025000
        75%    229.465000  736.140000  1075.270000  1661.412500
        max    251.560000  987.660000  1472.820000  2430.030000

# ANOVA via a regression model
import statsmodels.api as sm
from statsmodels.formula.api import ols

sm.stats.anova_lm(ols('avg_exp ~ C(edu_class)',data=creditcard).fit())


                df        sum_sq       mean_sq          F        PR(>F)
C(edu_class)   3.0  8.126056e+06  2.708685e+06  31.825683  7.658362e-13
Residual      66.0  5.617263e+06  8.511005e+04        NaN           NaN
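The F value in the one-way table is the ratio of the two mean squares, and each mean square is a sum of squares divided by its degrees of freedom. A short worked check using the numbers printed in that table (copied by hand, so agreement is up to rounding):

```python
from scipy import stats

# Sums of squares and degrees of freedom copied from the one-way ANOVA table
ss_between, df_between = 8.126056e6, 3   # C(edu_class)
ss_within, df_within = 5.617263e6, 66    # Residual

ms_between = ss_between / df_between
ms_within = ss_within / df_within
f_stat = ms_between / ms_within

# Upper-tail probability of the F distribution gives PR(>F)
p_value = stats.f.sf(f_stat, df_between, df_within)
print(f"F={f_stat:.4f}, p={p_value:.3e}")
```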

- Multi-way ANOVA

# Without the interaction term
sm.stats.anova_lm(ols('avg_exp ~ C(edu_class)+C(gender)',data=creditcard).fit())


                df        sum_sq       mean_sq          F        PR(>F)
C(edu_class)   3.0  8.126056e+06  2.708685e+06  31.578365  1.031496e-12
C(gender)      1.0  4.178273e+04  4.178273e+04   0.487111  4.877082e-01
Residual      65.0  5.575481e+06  8.577662e+04        NaN           NaN

# With the interaction term (in a formula, a*b already expands to a + b + a:b)
sm.stats.anova_lm(ols('avg_exp ~ C(edu_class)+C(gender)+C(edu_class)*C(gender)',data=creditcard).fit())


                          df        sum_sq       mean_sq          F        PR(>F)
C(edu_class)             3.0  8.126056e+06  2.708685e+06  33.839350  3.753889e-13
C(gender)                1.0  4.178273e+04  4.178273e+04   0.521988  4.726685e-01
C(edu_class):C(gender)   3.0  6.790792e+05  2.263597e+05   2.827891  4.557660e-02
Residual                63.0  5.042862e+06  8.004544e+04        NaN           NaN

5. Correlation Analysis

Correlation methods: "spearman", "pearson" and "kendall"

# Scatter plot
creditcard.plot(x='Income', y='avg_exp', kind='scatter')
# When the scatter plot fans out, first take the log of Y, and also try taking the log of X
creditcard.plot(x='Income', y='avg_exp_ln', kind='scatter')

# import numpy as np
# creditcard['Income_ln'] = np.log(creditcard['Income'])

creditcard[['avg_exp_ln', 'Income']].corr(method='pearson')


            avg_exp_ln   Income
avg_exp_ln     1.00000  0.63489
Income         0.63489  1.00000
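The three methods answer slightly different questions: Pearson measures linear association, while Spearman and Kendall work on ranks and so capture any monotonic relationship. A small sketch on synthetic data (made up for illustration, unrelated to the credit card data), which also shows why taking logs can help when the relationship is curved:

```python
import numpy as np
from scipy import stats

# Synthetic data: y grows exponentially with x, so the relation is monotonic but not linear
x = np.linspace(0.0, 5.0, 50)
y = np.exp(x)

pearson_r, _ = stats.pearsonr(x, y)
spearman_r, _ = stats.spearmanr(x, y)
kendall_tau, _ = stats.kendalltau(x, y)
print(f"pearson={pearson_r:.3f}, spearman={spearman_r:.3f}, kendall={kendall_tau:.3f}")

# Taking the log of y makes the relationship linear, so Pearson recovers it fully
pearson_log, _ = stats.pearsonr(x, np.log(y))
print(f"pearson after log={pearson_log:.3f}")
```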

[Figure: scatter plot of avg_exp against Income]

[Figure: scatter plot of avg_exp_ln against Income]

6. Chi-Square Test

cross_table = pd.crosstab(creditcard.edu_class, columns=creditcard.Acc)
# Or try this: creditcard.pivot_table(index='edu_class', columns='Acc', values='id', aggfunc='count')
cross_table


Acc         0   1
edu_class
0          16   2
1          14  23
2           0  23
3           0  22

cross_table_rowpct = cross_table.div(cross_table.sum(1),axis = 0)
cross_table_rowpct


Acc               0         1
edu_class
0          0.888889  0.111111
1          0.378378  0.621622
2          0.000000  1.000000
3          0.000000  1.000000

print('chisq = %6.4f\n p-value = %6.4f\n dof = %i\n expected_freq = %s'  %stats.chi2_contingency(cross_table))
chisq = 50.0930
p-value = 0.0000
dof = 3
expected_freq = [[ 5.4 12.6]
[11.1 25.9]
[ 6.9 16.1]
[ 6.6 15.4]]
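The expected frequencies and the chi-square statistic can be reproduced by hand: each expected count is row total × column total / grand total, and the statistic sums (observed − expected)² / expected over all cells. The sketch below copies the observed counts from cross_table; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Observed counts copied from cross_table (rows: edu_class 0-3, columns: Acc 0/1)
observed = np.array([[16, 2], [14, 23], [0, 23], [0, 22]])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()   # counts expected under independence

chisq = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_value = stats.chi2.sf(chisq, dof)
print(f"chisq={chisq:.4f}, dof={dof}, p={p_value:.3g}")
```

The p-value printed as 0.0000 above is not exactly zero; it is just smaller than four decimal places can show.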

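Summary

Overview of methods for testing the relationship between two variables

[Figure: overview chart of the test methods]

Being related means not being independent; whether two variables are related mostly comes down to whether the group means are equal.

Start from descriptive statistics: in a plot, do the means look the same?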

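Parameter estimation

Population

Sample (representative of the population)

Point estimate

Interval estimate (confidence interval)

Standard deviation of the sample mean = standard error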
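The statement that the standard error is the standard deviation of the sample mean can be checked by simulation. A minimal sketch with made-up parameters (sigma, n and the number of repetitions are illustrative choices): draw many samples, take each sample's mean, and compare the spread of those means with sigma / sqrt(n).

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, n, n_repeats = 2.0, 50, 5000  # illustrative values, not taken from the article

# Draw n_repeats samples of size n and record each sample mean
sample_means = rng.normal(loc=10.0, scale=sigma, size=(n_repeats, n)).mean(axis=1)

empirical_se = sample_means.std(ddof=1)   # spread of the sample means
theoretical_se = sigma / np.sqrt(n)       # standard error formula
print(f"empirical={empirical_se:.4f}, theoretical={theoretical_se:.4f}")
```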


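Hypothesis testing procedure

1) State the null and alternative hypotheses

2) Choose the significance level α according to the sample size

3) Collect the data

4) Compute the test statistic and convert it to a probability: t statistic --> p-value

p-value > α: fail to reject the null hypothesis

p-value < α: reject the null hypothesis

"Significant" means that the p-value is below α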

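Notes on p-values and sample size

1) Statisticians cannot give a p-value threshold for large samples (more than 10,000 observations), so the traditional rules of thumb break down. With a large dataset, however, the sample size can be reduced by grouping or sampling, after which the usual statistical methods apply.

2) Among machine learning algorithms, only linear regression and logistic regression involve p-values. In addition, the CHAID decision tree model uses the chi-square test, so sample size matters there as well.

3) Significance level α by sample size n

Sample size n      Significance level α
n < 100            10%
100 < n < 500      5%
500 < n < 1000     1%
1000 < n < 2000    0.1%

This rule of thumb requires a sample size n < 5000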
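The rule of thumb above can be written as a small lookup. This is only a sketch: the function name suggested_alpha is hypothetical, the last row is read as 1000 < n < 2000, and the handling of boundary values and of sizes between 2000 and 5000 is an assumption, since the table does not specify them.

```python
def suggested_alpha(n):
    """Rule-of-thumb significance level for sample size n (hypothetical helper).

    Boundary values (n == 100, 500, 1000) and 2000 <= n < 5000 are not covered
    by the table; assigning them to the stricter level is an assumption.
    """
    if n >= 5000:
        raise ValueError("the rule of thumb only covers n < 5000")
    if n < 100:
        return 0.10
    if n < 500:
        return 0.05
    if n < 1000:
        return 0.01
    return 0.001

# The housing data has 150 rows and the credit card data has 100
print(suggested_alpha(150), suggested_alpha(100))
```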

When the dataset is very large, looking at the p-value is not meaningful.
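A quick simulation illustrates the point. With a million observations per group, a difference in means of 0.01 standard deviations, which is negligible in practice, still produces a tiny p-value; a random subsample brings the test back to a scale where the p-value is informative. The data are synthetic, and the sizes and the 0.01 gap are illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic populations whose means differ by a practically negligible 0.01
a = rng.normal(loc=0.00, scale=1.0, size=1_000_000)
b = rng.normal(loc=0.01, scale=1.0, size=1_000_000)

# Full data: the huge sample size makes even this tiny gap "significant"
_, p_full = stats.ttest_ind(a, b)

# Random subsample of 500 per group, as suggested in note 1)
a_sub = rng.choice(a, size=500, replace=False)
b_sub = rng.choice(b, size=500, replace=False)
_, p_sub = stats.ttest_ind(a_sub, b_sub)

print(f"full data p={p_full:.3g}, subsample p={p_sub:.3g}")
```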