Data Analysis Engineer - Lecture 03: Pandas Data Analysis Skills (Part 2)


Contents
  • Grouping / groupby
  • Aggregation / agg
  • Concatenating data / concat
  • Merging data / merge & join
  • Mini project
Grouping / groupby

As an example, suppose we have a table of income records for every employee at a company:

import pandas as pd
import numpy as np
salaries = pd.DataFrame({
    'Name':['BOSS','HanMeimei','HanMeimei','Han','BOSS','BOSS','HanMeimei','BOSS'],
    'Year':[2016,2016,2016,2016,2017,2017,2017,2017],
    'Salary':[40000,5000,4000,3000,38000,42000,6000,39000],
    'Bonus':[12000,3000,3500,1200,16000,18000,7000,21000]
})
salaries



   Bonus       Name  Salary  Year
0  12000       BOSS   40000  2016
1   3000  HanMeimei    5000  2016
2   3500  HanMeimei    4000  2016
3   1200        Han    3000  2016
4  16000       BOSS   38000  2017
5  18000       BOSS   42000  2017
6   7000  HanMeimei    6000  2017
7  21000       BOSS   39000  2017

groupby does exactly what its name suggests: it partitions the rows into groups according to the distinct values of some column.

group_by_name = salaries.groupby('Name')
type(group_by_name)
pandas.core.groupby.DataFrameGroupBy
# iterate over the groups
for item in group_by_name:
    print(item)
    print(item[0]) # the group key
    print(item[1]) # the group's rows
    print("\n")
('BOSS',    Bonus  Name  Salary  Year
0  12000  BOSS   40000  2016
4  16000  BOSS   38000  2017
5  18000  BOSS   42000  2017
7  21000  BOSS   39000  2017)
BOSS
   Bonus  Name  Salary  Year
0  12000  BOSS   40000  2016
4  16000  BOSS   38000  2017
5  18000  BOSS   42000  2017
7  21000  BOSS   39000  2017

('Han',    Bonus Name  Salary  Year
3   1200  Han    3000  2016)
Han
   Bonus Name  Salary  Year
3   1200  Han    3000  2016

('HanMeimei',    Bonus       Name  Salary  Year
1   3000  HanMeimei    5000  2016
2   3500  HanMeimei    4000  2016
6   7000  HanMeimei    6000  2017)
HanMeimei
   Bonus       Name  Salary  Year
1   3000  HanMeimei    5000  2016
2   3500  HanMeimei    4000  2016
6   7000  HanMeimei    6000  2017


After grouping with groupby you can run aggregate statistics on each group.

Group-wise sum
group_by_name.sum() # sum within each group



           Bonus  Salary  Year
Name
BOSS       67000  159000  8067
Han         1200    3000  2016
HanMeimei  13500   15000  6049

group_by_name.mean() # mean within each group



             Bonus   Salary         Year
Name
BOSS       16750.0  39750.0  2016.750000
Han         1200.0   3000.0  2016.000000
HanMeimei   4500.0   5000.0  2016.333333

Selecting specific columns for statistics
group_by_name[['Bonus','Salary']].sum()



           Bonus  Salary
Name
BOSS       67000  159000
Han         1200    3000
HanMeimei  13500   15000

Group order: sort=False keeps the groups in order of first appearance instead of sorting by Name
salaries.groupby('Name', sort=False).agg(sum)



           Bonus  Salary  Year
Name
BOSS       67000  159000  8067
HanMeimei  13500   15000  6049
Han         1200    3000  2016

salaries.groupby('Name', sort=False).sum()



           Bonus  Salary  Year
Name
BOSS       67000  159000  8067
HanMeimei  13500   15000  6049
Han         1200    3000  2016

# sum / mean / median: group-wise sum, average, and median
salaries.groupby('Name').median()



           Bonus  Salary  Year
Name
BOSS       17000   39500  2017
Han         1200    3000  2016
HanMeimei   3500    5000  2016

Frequency: how many times each group appears
salaries.groupby('Name').size()  # number of rows per group
Name
BOSS         4
Han          1
HanMeimei    3
dtype: int64
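A related way to get the same counts, sorted by frequency instead of by group key; just a quick equivalent sketch:

# value_counts on the Name column yields the same frequencies,
# ordered by count rather than by name
salaries['Name'].value_counts()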
salaries.info() # basic information about the frame
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 4 columns):
Bonus     8 non-null int64
Name      8 non-null object
Salary    8 non-null int64
Year      8 non-null int64
dtypes: int64(3), object(1)
memory usage: 336.0+ bytes
salaries.describe() # descriptive statistics



              Bonus        Salary         Year
count      8.000000      8.000000     8.000000
mean   10212.500000  22125.000000  2016.500000
std     7581.262145  18893.971072     0.534522
min     1200.000000   3000.000000  2016.000000
25%     3375.000000   4750.000000  2016.000000
50%     9500.000000  22000.000000  2016.500000
75%    16500.000000  39250.000000  2017.000000
max    21000.000000  42000.000000  2017.000000

Viewing statistics by group
salaries.groupby('Name').describe()  # describe() within each Name group



          Bonus                                                               Salary           ...    Year
          count     mean          std      min      25%      50%      75%      max  count     mean  ...      75%      max  count         mean      std     min      25%     50%     75%     max
Name                                                                                                ...
BOSS        4.0  16750.0  3774.917218  12000.0  15000.0  17000.0  18750.0  21000.0    4.0  39750.0  ...  40500.0  42000.0    4.0  2016.750000  0.50000  2016.0  2016.75  2017.0  2017.0  2017.0
Han         1.0   1200.0          NaN   1200.0   1200.0   1200.0   1200.0   1200.0    1.0   3000.0  ...   3000.0   3000.0    1.0  2016.000000      NaN  2016.0  2016.00  2016.0  2016.0  2016.0
HanMeimei   3.0   4500.0  2179.449472   3000.0   3250.0   3500.0   5250.0   7000.0    3.0   5000.0  ...   5500.0   6000.0    3.0  2016.333333  0.57735  2016.0  2016.00  2016.5  2017.0  2017.0

3 rows × 24 columns

salaries.groupby('Name')[['Bonus', 'Salary']].agg(['sum', 'mean', 'std', 'median'])



           Bonus                               Salary
             sum   mean          std  median      sum   mean          std  median
Name
BOSS       67000  16750  3774.917218   17000   159000  39750  1707.825128   39500
Han         1200   1200          NaN    1200     3000   3000          NaN    3000
HanMeimei  13500   4500  2179.449472    3500    15000   5000  1000.000000    5000

salaries.groupby('Name')[['Bonus','Salary']].agg([np.sum, np.mean, np.std, np.median])



           Bonus                               Salary
             sum   mean          std  median      sum   mean          std  median
Name
BOSS       67000  16750  3774.917218   17000   159000  39750  1707.825128   39500
Han         1200   1200          NaN    1200     3000   3000          NaN    3000
HanMeimei  13500   4500  2179.449472    3500    15000   5000  1000.000000    5000

salaries.groupby('Name')[['Bonus']].agg([np.sum, np.mean, np.std, np.median])



           Bonus
             sum   mean          std  median
Name
BOSS       67000  16750  3774.917218   17000
Han         1200   1200          NaN    1200
HanMeimei  13500   4500  2179.449472    3500

Note that single brackets select a Series, while double brackets select a one-column DataFrame:

type(salaries['Bonus'])
pandas.core.series.Series
type(salaries[['Bonus']])
pandas.core.frame.DataFrame
Transforming / transform
nvda = pd.read_csv('1_pandas_part2_data/pandas_part2_data/NVDA.csv', index_col=0, parse_dates=['Date'])
# use the first column as the index and parse the Date column as datetimes
nvda.head()  # first five rows



                Open      High       Low     Close  Adj Close    Volume
Date
1999-01-22  1.750000  1.953125  1.552083  1.640625   1.523430  67867200
1999-01-25  1.770833  1.833333  1.640625  1.812500   1.683028  12762000
1999-01-26  1.833333  1.869792  1.645833  1.671875   1.552448   8580000
1999-01-27  1.677083  1.718750  1.583333  1.666667   1.547611   6109200
1999-01-28  1.666667  1.677083  1.651042  1.661458   1.542776   5688000

nvda.loc[:,'year'] = nvda.index.year  # extract the year from the DatetimeIndex into a new column
nvda.head()



                Open      High       Low     Close  Adj Close    Volume  year
Date
1999-01-22  1.750000  1.953125  1.552083  1.640625   1.523430  67867200  1999
1999-01-25  1.770833  1.833333  1.640625  1.812500   1.683028  12762000  1999
1999-01-26  1.833333  1.869792  1.645833  1.671875   1.552448   8580000  1999
1999-01-27  1.677083  1.718750  1.583333  1.666667   1.547611   6109200  1999
1999-01-28  1.666667  1.677083  1.651042  1.661458   1.542776   5688000  1999

Calling unique() on a Series: the distinct values in a column
Calling value_counts() on a Series: the distinct values in a column, plus the number of times each occurs
How many distinct values a column has => unique
# the distinct values in the year column
nvda['year'].unique()
array([1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009,
       2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017], dtype=int64)
How many distinct values a column has, and how often each occurs => value_counts()
# how many times each distinct year appears
nvda['year'].value_counts()
2008    253
2015    252
2011    252
2004    252
2016    252
2005    252
2009    252
2013    252
2000    252
2002    252
2010    252
2014    252
2003    252
2006    251
2007    251
2012    250
2001    248
1999    239
2017    138
Name: year, dtype: int64
tmp = nvda['year'].value_counts().to_frame() # convert the Series to a DataFrame
tmp = tmp.reset_index()
tmp.columns = ['year','count']
tmp.head()



   year  count
0  2008    253
1  2015    252
2  2011    252
3  2004    252
4  2016    252

tmp.sort_values(by='year').head()



    year  count
17  1999    239
8   2000    252
16  2001    248
9   2002    252
12  2003    252
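The same year-ordered view can be had without the intermediate frame; a minimal equivalent sketch:

# sort the counts by the index (the year) instead of by frequency
nvda['year'].value_counts().sort_index().head()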

nvda.head()



                Open      High       Low     Close  Adj Close    Volume  year
Date
1999-01-22  1.750000  1.953125  1.552083  1.640625   1.523430  67867200  1999
1999-01-25  1.770833  1.833333  1.640625  1.812500   1.683028  12762000  1999
1999-01-26  1.833333  1.869792  1.645833  1.671875   1.552448   8580000  1999
1999-01-27  1.677083  1.718750  1.583333  1.666667   1.547611   6109200  1999
1999-01-28  1.666667  1.677083  1.651042  1.661458   1.542776   5688000  1999

nvda.groupby('year').agg(['mean', 'std'])



            Open                   High                    Low                  Close              Adj Close                   Volume
            mean        std        mean        std         mean        std        mean        std        mean        std         mean           std
year
1999    1.950782   0.588882    2.007317   0.614302     1.883559   0.571658    1.947230   0.601041    1.808134   0.558107  6.433220e+06  8.142949e+06
2000    8.781084   2.999908    9.222697   3.114186     8.360522   2.904761    8.778826   3.013104    8.151729   2.797869  1.104182e+07  7.985374e+06
2001   13.091254   3.839777   13.600750   3.829838    12.680548   3.830944   13.181552   3.833637   12.239956   3.559789  2.782387e+07  1.384318e+07
2002    9.690344   6.561287    9.955093   6.664226     9.344391   6.375212    9.614749   6.519053    8.927940   6.053379  3.168655e+07  1.558742e+07
2003    5.902434   1.461862    6.042659   1.491260     5.764960   1.423422    5.900344   1.459852    5.478865   1.355570  2.430220e+07  1.899657e+07
2004    6.484735   1.467445    6.608810   1.482036     6.353558   1.444797    6.465913   1.456575    6.004034   1.352528  1.706331e+07  1.191968e+07
2005    9.512381   1.580061    9.659656   1.591274     9.353175   1.571138    9.513823   1.589762    8.834223   1.476201  1.542825e+07  9.623837e+06
2006   18.057902   3.675092   18.425126   3.718616    17.720279   3.657584   18.095963   3.700960   16.803316   3.436590  1.534446e+07  6.616879e+06
2007   27.762045   6.111437   28.251673   6.225662    27.206056   5.902620   27.724542   6.087681   25.744098   5.652820  1.514562e+07  5.818216e+06
2008   16.004308   6.862760   16.426245   6.964528    15.521462   6.696381   15.945613   6.811527   14.806572   6.324960  2.022721e+07  8.552974e+06
2009   11.825119   2.638097   12.114762   2.612499    11.565952   2.640537   11.850873   2.631664   11.004331   2.443677  1.919821e+07  8.291987e+06
2010   13.576349   2.888884   13.802659   2.904905    13.318532   2.843065   13.563175   2.884261   12.594318   2.678230  1.853295e+07  8.434693e+06
2011   16.912540   3.404884   17.267540   3.490368    16.512143   3.305507   16.887540   3.404032   15.681214   3.160872  2.289352e+07  1.270114e+07
2012   13.526200   1.176957   13.717400   1.191775    13.319800   1.165419   13.507880   1.185139   12.551166   1.091736  1.207757e+07  5.050116e+06
2013   14.173571   1.251508   14.329802   1.253287    14.035278   1.253372   14.189127   1.250883   13.412278   1.260152  8.843986e+06  4.202323e+06
2014   18.543056   1.293283   18.745476   1.283480    18.348214   1.276038   18.547064   1.284932   17.875053   1.312833  7.098902e+06  3.140560e+06
2015   23.680595   4.106327   23.979524   4.152229    23.411071   4.079351   23.718254   4.128879   23.262283   4.154678  7.756520e+06  3.933075e+06
2016   53.630833  21.714540   54.415397  22.182621    52.895119  21.263517   53.761190  21.803927   53.475737  21.824367  1.107062e+07  7.547056e+06
2017  120.481305  22.027821  122.300725  22.510244   118.402754  21.281863  120.547971  21.991898  120.436863  22.056290  1.907742e+07  1.073342e+07

def my_transform(x):
    # standardize: subtract the mean and divide by the standard deviation (z-score)
    return (x-x.mean())/x.std()
tmp_arr = np.array(range(10))
tmp_arr
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
my_transform(tmp_arr)
array([-1.5666989 , -1.21854359, -0.87038828, -0.52223297, -0.17407766,
        0.17407766,  0.52223297,  0.87038828,  1.21854359,  1.5666989 ])
transformed = nvda.groupby('year').transform(my_transform)
transformed.head()



                Open      High       Low     Close  Adj Close    Volume
Date
1999-01-22 -0.340955 -0.088217 -0.579850 -0.510124  -0.510124  7.544438
1999-01-25 -0.305578 -0.283222 -0.424964 -0.224161  -0.224161  0.777210
1999-01-26 -0.199444 -0.223871 -0.415854 -0.458130  -0.458131  0.263637
1999-01-27 -0.464778 -0.469747 -0.525185 -0.466795  -0.466798 -0.039791
1999-01-28 -0.482465 -0.537575 -0.406741 -0.475462  -0.475461 -0.091517
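Unlike agg, which collapses each group to a single row, transform broadcasts its result back to the shape of the original frame. A quick sanity check, sketched here with the built-in 'mean' reducer:

# agg: one value per year; transform: one value per original row
print(nvda.groupby('year')['Adj Close'].agg('mean').shape)
print(nvda.groupby('year')['Adj Close'].transform('mean').shape)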

%matplotlib inline
compare_df = pd.DataFrame({'Origin':nvda['Adj Close'], 'Transformed':transformed['Adj Close']})
compare_df.head()



              Origin  Transformed
Date
1999-01-22  1.523430    -0.510124
1999-01-25  1.683028    -0.224161
1999-01-26  1.552448    -0.458131
1999-01-27  1.547611    -0.466798
1999-01-28  1.542776    -0.475461

compare_df.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x2354e970c18>

[Figure: line plot of the 'Origin' and 'Transformed' Adj Close series over time]

compare_df['Transformed'].plot()
<matplotlib.axes._subplots.AxesSubplot at 0x235517b0978>

[Figure: line plot of the 'Transformed' column alone]

The transformation function apply
salaries



   Bonus       Name  Salary  Year
0  12000       BOSS   40000  2016
1   3000  HanMeimei    5000  2016
2   3500  HanMeimei    4000  2016
3   1200        Han    3000  2016
4  16000       BOSS   38000  2017
5  18000       BOSS   42000  2017
6   7000  HanMeimei    6000  2017
7  21000       BOSS   39000  2017

salaries.loc[:,'tmp_col'] = (salaries['Salary']*2-1500)/0.8  # vectorized arithmetic over the whole column
salaries



   Bonus       Name  Salary  Year   tmp_col
0  12000       BOSS   40000  2016   98125.0
1   3000  HanMeimei    5000  2016   10625.0
2   3500  HanMeimei    4000  2016    8125.0
3   1200        Han    3000  2016    5625.0
4  16000       BOSS   38000  2017   93125.0
5  18000       BOSS   42000  2017  103125.0
6   7000  HanMeimei    6000  2017   13125.0
7  21000       BOSS   39000  2017   95625.0

def trans(x):
    return (x*2-1500)/0.8
salaries.loc[:,'tmp_col2'] = salaries['Salary'].apply(trans)
salaries



   Bonus       Name  Salary  Year   tmp_col  tmp_col2
0  12000       BOSS   40000  2016   98125.0   98125.0
1   3000  HanMeimei    5000  2016   10625.0   10625.0
2   3500  HanMeimei    4000  2016    8125.0    8125.0
3   1200        Han    3000  2016    5625.0    5625.0
4  16000       BOSS   38000  2017   93125.0   93125.0
5  18000       BOSS   42000  2017  103125.0  103125.0
6   7000  HanMeimei    6000  2017   13125.0   13125.0
7  21000       BOSS   39000  2017   95625.0   95625.0

def trans2(x):
    if x=='Han':
        return 'HanXiaoyang'
    else:
        return x
salaries.loc[:,'full_name'] = salaries['Name'].apply(trans2)
salaries



   Bonus       Name  Salary  Year   tmp_col  tmp_col2    full_name
0  12000       BOSS   40000  2016   98125.0   98125.0         BOSS
1   3000  HanMeimei    5000  2016   10625.0   10625.0    HanMeimei
2   3500  HanMeimei    4000  2016    8125.0    8125.0    HanMeimei
3   1200        Han    3000  2016    5625.0    5625.0  HanXiaoyang
4  16000       BOSS   38000  2017   93125.0   93125.0         BOSS
5  18000       BOSS   42000  2017  103125.0  103125.0         BOSS
6   7000  HanMeimei    6000  2017   13125.0   13125.0    HanMeimei
7  21000       BOSS   39000  2017   95625.0   95625.0         BOSS

help(pd.Series.apply)
Help on function apply in module pandas.core.series:

apply(self, func, convert_dtype=True, args=(), **kwds)
    Invoke function on values of Series. Can be ufunc (a NumPy function
    that applies to the entire Series) or a Python function that only works
    on single values
    
    Parameters
    ----------
    func : function
    convert_dtype : boolean, default True
        Try to find better dtype for elementwise function results. If
        False, leave as dtype=object
    args : tuple
        Positional arguments to pass to function in addition to the value
    Additional keyword arguments will be passed as keywords to the function
    
    Returns
    -------
    y : Series or DataFrame if func returns a Series
    
    See also
    --------
    Series.map: For element-wise operations
    Series.agg: only perform aggregating type operations
    Series.transform: only perform transformating type operations
    
    Examples
    --------
    
    Create a series with typical summer temperatures for each city.
    
    >>> import pandas as pd
    >>> import numpy as np
    >>> series = pd.Series([20, 21, 12], index=['London',
    ... 'New York','Helsinki'])
    >>> series
    London      20
    New York    21
    Helsinki    12
    dtype: int64
    
    Square the values by defining a function and passing it as an
    argument to ``apply()``.
    
    >>> def square(x):
    ...     return x**2
    >>> series.apply(square)
    London      400
    New York    441
    Helsinki    144
    dtype: int64
    
    Square the values by passing an anonymous function as an
    argument to ``apply()``.
    
    >>> series.apply(lambda x: x**2)
    London      400
    New York    441
    Helsinki    144
    dtype: int64
    
    Define a custom function that needs additional positional
    arguments and pass these additional arguments using the
    ``args`` keyword.
    
    >>> def subtract_custom_value(x, custom_value):
    ...     return x-custom_value
    
    >>> series.apply(subtract_custom_value, args=(5,))
    London      15
    New York    16
    Helsinki     7
    dtype: int64
    
    Define a custom function that takes keyword arguments
    and pass these arguments to ``apply``.
    
    >>> def add_custom_values(x, **kwargs):
    ...     for month in kwargs:
    ...         x+=kwargs[month]
    ...         return x
    
    >>> series.apply(add_custom_values, june=30, july=20, august=25)
    London      95
    New York    96
    Helsinki    87
    dtype: int64
    
    Use a function from the Numpy library.
    
    >>> series.apply(np.log)
    London      2.995732
    New York    3.044522
    Helsinki    2.484907
    dtype: float64


salaries



   Bonus       Name  Salary  Year   tmp_col  tmp_col2    full_name
0  12000       BOSS   40000  2016   98125.0   98125.0         BOSS
1   3000  HanMeimei    5000  2016   10625.0   10625.0    HanMeimei
2   3500  HanMeimei    4000  2016    8125.0    8125.0    HanMeimei
3   1200        Han    3000  2016    5625.0    5625.0  HanXiaoyang
4  16000       BOSS   38000  2017   93125.0   93125.0         BOSS
5  18000       BOSS   42000  2017  103125.0  103125.0         BOSS
6   7000  HanMeimei    6000  2017   13125.0   13125.0    HanMeimei
7  21000       BOSS   39000  2017   95625.0   95625.0         BOSS

# Python's conditional expression, used inside the lambda in the next section
gender = 'male'
sex = '男' if gender=='male' else '女'
sex
'男'
Anonymous functions with lambda
salaries.loc[:,'new_name'] = salaries['Name'].apply(lambda x: 'HanXiaoyang' if x=='Han' else x)
Operating on several columns at once
# for the BOSS, return salary plus bonus; for everyone else, just the salary
def my_fun(name, salary, bonus):
    if name=='BOSS':
        return salary+bonus
    else:
        return salary
salaries.loc[:,'my_s_result'] = list(map(lambda x,y,z:my_fun(x,y,z), \
                                         salaries['Name'],\
                                         salaries['Salary'],\
                                         salaries['Bonus']))
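An alternative that stays inside pandas is a row-wise apply with axis=1; a sketch that should produce the same values (the column name my_s_result2 is ours, for illustration):

# each row is passed to the lambda as a Series
salaries.loc[:,'my_s_result2'] = salaries.apply(
    lambda row: row['Salary'] + row['Bonus'] if row['Name']=='BOSS' else row['Salary'],
    axis=1)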
Summary
  • groupby to pull out the contents of each group
  • statistics on the groups with agg, e.g. agg([np.sum, 'median', 'std']) (see the sketch below)
  • describe and transform after groupby
  • apply to transform a column (define a function)
  • extra: transform several columns with map(lambda x,y,z,a: my_fun(x,y,z,a), df['x'], df['y'], ...)
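As the agg bullet notes, a single agg call can mix function objects and string names; a minimal sketch:

# mixing a NumPy function with string aggregator names
salaries.groupby('Name')[['Salary']].agg([np.sum, 'median', 'std'])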

Concatenating and merging data

  • concat
  • merge
  • join
df1 = pd.DataFrame({'apts':[55000,60000], 'cars':[200000,300000]}, index=['Shanghai','Beijing'])
df1



           apts    cars
Shanghai  55000  200000
Beijing   60000  300000

df2 = pd.DataFrame({'apts':[35000, 45000], 'cars':[150000, 180000]}, index=['Hangzhou','Guangzhou'])
df2



            apts    cars
Hangzhou   35000  150000
Guangzhou  45000  180000

df3 = pd.DataFrame({'apts':[30000, 10000], 'cars':[120000, 100000]}, index=['Nanjing','Chongqing'])
df3



            apts    cars
Nanjing    30000  120000
Chongqing  10000  100000

# concat
result = pd.concat([df1,df2,df3])
result



            apts    cars
Shanghai   55000  200000
Beijing    60000  300000
Hangzhou   35000  150000
Guangzhou  45000  180000
Nanjing    30000  120000
Chongqing  10000  100000

# axis=1: align on the index and concatenate side by side
pd.concat([df1,df2,df3], axis=1)



              apts      cars     apts      cars     apts      cars
Beijing    60000.0  300000.0      NaN       NaN      NaN       NaN
Chongqing      NaN       NaN      NaN       NaN  10000.0  100000.0
Guangzhou      NaN       NaN  45000.0  180000.0      NaN       NaN
Hangzhou       NaN       NaN  35000.0  150000.0      NaN       NaN
Nanjing        NaN       NaN      NaN       NaN  30000.0  120000.0
Shanghai   55000.0  200000.0      NaN       NaN      NaN       NaN

# axis=0 (the default): align on the columns and stack vertically
pd.concat([df1,df2,df3], axis=0)



            apts    cars
Shanghai   55000  200000
Beijing    60000  300000
Hangzhou   35000  150000
Guangzhou  45000  180000
Nanjing    30000  120000
Chongqing  10000  100000

# append is shorthand for pd.concat([df1, df1]); note it was removed in pandas 2.0
df1.append(df1)



           apts    cars
Shanghai  55000  200000
Beijing   60000  300000
Shanghai  55000  200000
Beijing   60000  300000

result



            apts    cars
Shanghai   55000  200000
Beijing    60000  300000
Hangzhou   35000  150000
Guangzhou  45000  180000
Nanjing    30000  120000
Chongqing  10000  100000

Merging / merge
def my_trans_apts(x):
    if x<45000:
        return 45000
    else:
        return 60000
result.loc[:,'new_apts'] = result['apts'].apply(my_trans_apts)
result



            apts    cars  new_apts
Shanghai   55000  200000     60000
Beijing    60000  300000     60000
Hangzhou   35000  150000     45000
Guangzhou  45000  180000     60000
Nanjing    30000  120000     45000
Chongqing  10000  100000     45000

new_df = pd.DataFrame({'new_apts':[60000,45000], 'bonus':[100000, 50000]})
new_df



    bonus  new_apts
0  100000     60000
1   50000     45000

pd.merge(result, new_df, on=['new_apts'], how='inner')



    apts    cars  new_apts   bonus
0  55000  200000     60000  100000
1  60000  300000     60000  100000
2  45000  180000     60000  100000
3  35000  150000     45000   50000
4  30000  120000     45000   50000
5  10000  100000     45000   50000

new_df2 = pd.DataFrame({'new_apts':[65000,45000], 'bonus':[100000, 50000]})
new_df2



    bonus  new_apts
0  100000     65000
1   50000     45000

result



            apts    cars  new_apts
Shanghai   55000  200000     60000
Beijing    60000  300000     60000
Hangzhou   35000  150000     45000
Guangzhou  45000  180000     60000
Nanjing    30000  120000     45000
Chongqing  10000  100000     45000

pd.merge(result, new_df2, on=['new_apts'], how='left')



    apts    cars  new_apts    bonus
0  55000  200000     60000      NaN
1  60000  300000     60000      NaN
2  35000  150000     45000  50000.0
3  45000  180000     60000      NaN
4  30000  120000     45000  50000.0
5  10000  100000     45000  50000.0

pd.merge(result, new_df2, on=['new_apts'], how='right')



      apts      cars  new_apts   bonus
0  35000.0  150000.0     45000   50000
1  30000.0  120000.0     45000   50000
2  10000.0  100000.0     45000   50000
3      NaN       NaN     65000  100000

df1



           apts    cars
Shanghai  55000  200000
Beijing   60000  300000

df2



            apts    cars
Hangzhou   35000  150000
Guangzhou  45000  180000

df3



            apts    cars
Nanjing    30000  120000
Chongqing  10000  100000

df4 = pd.DataFrame({'salaries':[10000,30000,30000,20000,15000]}, index=['Suzhou', 'Beijing','Shanghai','Guangzhou','Tianjin'])
df4



           salaries
Suzhou        10000
Beijing       30000
Shanghai      30000
Guangzhou     20000
Tianjin       15000

join: merging data on the index
df1.join(df4)



           apts    cars  salaries
Shanghai  55000  200000     30000
Beijing   60000  300000     30000

df2.join(df4)



            apts    cars  salaries
Hangzhou   35000  150000       NaN
Guangzhou  45000  180000   20000.0

help(pd.DataFrame.join)
Help on function join in module pandas.core.frame:

join(self, other, on=None, how='left', lsuffix='', rsuffix='', sort=False)
    Join columns with other DataFrame either on index or on a key
    column. Efficiently Join multiple DataFrame objects by index at once by
    passing a list.
    
    Parameters
    ----------
    other : DataFrame, Series with name field set, or list of DataFrame
        Index should be similar to one of the columns in this one. If a
        Series is passed, its name attribute must be set, and that will be
        used as the column name in the resulting joined DataFrame
    on : column name, tuple/list of column names, or array-like
        Column(s) in the caller to join on the index in other,
        otherwise joins index-on-index. If multiples
        columns given, the passed DataFrame must have a MultiIndex. Can
        pass an array as the join key if not already contained in the
        calling DataFrame. Like an Excel VLOOKUP operation
    how : {'left', 'right', 'outer', 'inner'}, default: 'left'
        How to handle the operation of the two objects.
    
        * left: use calling frame's index (or column if on is specified)
        * right: use other frame's index
        * outer: form union of calling frame's index (or column if on is
          specified) with other frame's index, and sort it
          lexicographically
        * inner: form intersection of calling frame's index (or column if
          on is specified) with other frame's index, preserving the order
          of the calling's one
    lsuffix : string
        Suffix to use from left frame's overlapping columns
    rsuffix : string
        Suffix to use from right frame's overlapping columns
    sort : boolean, default False
        Order result DataFrame lexicographically by the join key. If False,
        the order of the join key depends on the join type (how keyword)
    
    Notes
    -----
    on, lsuffix, and rsuffix options are not supported when passing a list
    of DataFrame objects
    
    Examples
    --------
    >>> caller = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],
    ...                        'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
    
    >>> caller
        A key
    0  A0  K0
    1  A1  K1
    2  A2  K2
    3  A3  K3
    4  A4  K4
    5  A5  K5
    
    >>> other = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
    ...                       'B': ['B0', 'B1', 'B2']})
    
    >>> other
        B key
    0  B0  K0
    1  B1  K1
    2  B2  K2
    
    Join DataFrames using their indexes.
    
    >>> caller.join(other, lsuffix='_caller', rsuffix='_other')
    
    >>>     A key_caller    B key_other
        0  A0         K0   B0        K0
        1  A1         K1   B1        K1
        2  A2         K2   B2        K2
        3  A3         K3  NaN       NaN
        4  A4         K4  NaN       NaN
        5  A5         K5  NaN       NaN


    If we want to join using the key columns, we need to set key to be
    the index in both caller and other. The joined DataFrame will have
    key as its index.

    >>> caller.set_index('key').join(other.set_index('key'))

    >>>      A    B
        key
        K0   A0   B0
        K1   A1   B1
        K2   A2   B2
        K3   A3  NaN
        K4   A4  NaN
        K5   A5  NaN

    Another option to join using the key columns is to use the on
    parameter. DataFrame.join always uses other's index but we can use any
    column in the caller. This method preserves the original caller's
    index in the result.

    >>> caller.join(other.set_index('key'), on='key')

    >>>     A key    B
        0  A0  K0   B0
        1  A1  K1   B1
        2  A2  K2   B2
        3  A3  K3  NaN
        4  A4  K4  NaN
        5  A5  K5  NaN

    See also
    --------
    DataFrame.merge : For column(s)-on-columns(s) operations

    Returns
    -------
    joined : DataFrame


Summary
  • concat: concatenation; axis specifies the axis to concatenate along
  • merge: join on one or more key columns (an outer-join sketch follows below)
  • join: merge on the index
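The merge examples above covered how='inner', 'left', and 'right'; for completeness, a minimal sketch of an outer merge on the frames defined earlier, which keeps the union of keys from both sides:

# outer merge: union of keys; cells with no match become NaN
pd.merge(result, new_df2, on=['new_apts'], how='outer')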

Mini case study: bike rental analysis

bikes = pd.read_csv('1_pandas_part2_data/pandas_part2_data/bikes.csv', sep=';', encoding='latin1', \
                    parse_dates=['Date'], index_col='Date')
bikes.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 310 entries, 2012-01-01 to 2012-05-11
Data columns (total 9 columns):
Berri 1                                310 non-null int64
Brébeuf (données non disponibles)      0 non-null float64
Côte-Sainte-Catherine                  310 non-null int64
Maisonneuve 1                          310 non-null int64
Maisonneuve 2                          310 non-null int64
du Parc                                310 non-null int64
Pierre-Dupuy                           310 non-null int64
Rachel1                                310 non-null int64
St-Urbain (données non disponibles)    0 non-null float64
dtypes: float64(2), int64(7)
memory usage: 24.2 KB
bikes.head()



            Berri 1  Brébeuf (données non disponibles)  Côte-Sainte-Catherine  Maisonneuve 1  Maisonneuve 2  du Parc  Pierre-Dupuy  Rachel1  St-Urbain (données non disponibles)
Date
2012-01-01       35                                NaN                      0             38             51       26            10       16                                  NaN
2012-02-01       83                                NaN                      1             68            153       53             6       43                                  NaN
2012-03-01      135                                NaN                      2            104            248       89             3       58                                  NaN
2012-04-01      144                                NaN                      1            116            318      111             8       61                                  NaN
2012-05-01      197                                NaN                      2            124            330       97            13       95                                  NaN

bikes.shape
(310, 9)
# dropna removes missing data; by default it drops every row that contains a NaN
bikes.dropna()



Empty DataFrame
Columns: [Berri 1, Brébeuf (données non disponibles), Côte-Sainte-Catherine, Maisonneuve 1, Maisonneuve 2, du Parc, Pierre-Dupuy, Rachel1, St-Urbain (données non disponibles)]
Index: []

bikes.dropna(axis=1, how='all').head() # drop the columns in which every value is missing



            Berri 1  Côte-Sainte-Catherine  Maisonneuve 1  Maisonneuve 2  du Parc  Pierre-Dupuy  Rachel1
Date
2012-01-01       35                      0             38             51       26            10       16
2012-02-01       83                      1             68            153       53             6       43
2012-03-01      135                      2            104            248       89             3       58
2012-04-01      144                      1            116            318      111             8       61
2012-05-01      197                      2            124            330       97            13       95

bikes.shape
(310, 9)
bikes.dropna(axis=1, how='all', inplace=True)
bikes.shape
(310, 7)
bikes.loc[:,'weekday'] = bikes.index.weekday  # 0 = Monday ... 6 = Sunday
bikes.head()



            Berri 1  Côte-Sainte-Catherine  Maisonneuve 1  Maisonneuve 2  du Parc  Pierre-Dupuy  Rachel1  weekday
Date
2012-01-01       35                      0             38             51       26            10       16        6
2012-02-01       83                      1             68            153       53             6       43        2
2012-03-01      135                      2            104            248       89             3       58        3
2012-04-01      144                      1            116            318      111             8       61        6
2012-05-01      197                      2            124            330       97            13       95        1

weekday_counts = bikes.groupby('weekday').agg(sum)
weekday_counts.head()



         Berri 1  Côte-Sainte-Catherine  Maisonneuve 1  Maisonneuve 2  du Parc  Pierre-Dupuy  Rachel1
weekday
0         132446                  57940          90828         163042    89338         41524   126215
1         119895                  52113          80865         145389    79585         35967   114622
2         146785                  64189          99674         177105    96340         45103   130796
3         147630                  61855         102801         177285    93386         46600   135268
4         150183                  61432         102317         181651    95731         47272   143115

%matplotlib inline
weekday_counts['Berri 1'].plot(kind='bar')
<matplotlib.axes._subplots.AxesSubplot at 0x7f157e27c4e0>

[Figure: bar chart of 'Berri 1' ride counts by weekday]
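For a more readable chart, the numeric weekday index can be relabeled with day names before plotting; a small optional sketch (the label list is ours, assuming all seven weekdays appear in the data):

# 0 = Monday ... 6 = Sunday, matching DatetimeIndex.weekday
weekday_counts.index = ['Monday','Tuesday','Wednesday','Thursday',
                        'Friday','Saturday','Sunday']
weekday_counts['Berri 1'].plot(kind='bar')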