Data Analysis Engineer, Lecture 03: Pandas Data Analysis Skills (Part 2)
Contents
- Grouping / groupby
- Aggregation / agg
- Concatenating data / concat
- Merging data / merge / join
- Mini project
Grouping / groupby
As an example, suppose we have a record of every employee's pay at a company:
import pandas as pd
import numpy as np

salaries = pd.DataFrame({
    'Name':['BOSS','HanMeimei','HanMeimei','Han','BOSS','BOSS','HanMeimei','BOSS'],
    'Year':[2016,2016,2016,2016,2017,2017,2017,2017],
    'Salary':[40000,5000,4000,3000,38000,42000,6000,39000],
    'Bonus':[12000,3000,3500,1200,16000,18000,7000,21000]
})
salaries
Bonus | Name | Salary | Year | |
0 | 12000 | BOSS | 40000 | 2016 |
1 | 3000 | HanMeimei | 5000 | 2016 |
2 | 3500 | HanMeimei | 4000 | 2016 |
3 | 1200 | Han | 3000 | 2016 |
4 | 16000 | BOSS | 38000 | 2017 |
5 | 18000 | BOSS | 42000 | 2017 |
6 | 7000 | HanMeimei | 6000 | 2017 |
7 | 21000 | BOSS | 39000 | 2017 |
groupby partitions the rows into groups according to the values of a chosen column.

group_by_name = salaries.groupby('Name')
type(group_by_name)    # pandas.core.groupby.DataFrameGroupBy

# iterate over the individual groups
for item in group_by_name:
    print(item)
    print(item[0])  # group key
    print(item[1])  # the group's rows
    print("\n")
('BOSS', Bonus Name Salary Year
0 12000 BOSS 40000 2016
4 16000 BOSS 38000 2017
5 18000 BOSS 42000 2017
7 21000 BOSS 39000 2017)
BOSS
Bonus Name Salary Year
0 12000 BOSS 40000 2016
4 16000 BOSS 38000 2017
5 18000 BOSS 42000 2017
7 21000 BOSS 39000 2017
('Han', Bonus Name Salary Year
3 1200 Han 3000 2016)
Han
Bonus Name Salary Year
3 1200 Han 3000 2016
('HanMeimei', Bonus Name Salary Year
1 3000 HanMeimei 5000 2016
2 3500 HanMeimei 4000 2016
6 7000 HanMeimei 6000 2017)
HanMeimei
Bonus Name Salary Year
1 3000 HanMeimei 5000 2016
2 3500 HanMeimei 4000 2016
6 7000 HanMeimei 6000 2017
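Instead of looping, a single group can also be pulled out directly from the grouped object; a minimal sketch using group_by_name from above:

```python
# All group keys mapped to the row labels they contain
group_by_name.groups

# Fetch just the rows of one group
group_by_name.get_group('BOSS')
```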
After grouping with groupby, you can run aggregations and other statistics on each group.
Group-wise sum
group_by_name.sum()  # sum within each group
Bonus | Salary | Year | |
Name | |||
BOSS | 67000 | 159000 | 8067 |
Han | 1200 | 3000 | 2016 |
HanMeimei | 13500 | 15000 | 6049 |
group_by_name.mean()  # mean within each group
Bonus | Salary | Year | |
Name | |||
BOSS | 16750.0 | 39750.0 | 2016.750000 |
Han | 1200.0 | 3000.0 | 2016.000000 |
HanMeimei | 4500.0 | 5000.0 | 2016.333333 |
Selecting columns for statistics
group_by_name[['Bonus','Salary']].sum()
Bonus | Salary | |
Name | ||
BOSS | 67000 | 159000 |
Han | 1200 | 3000 |
HanMeimei | 13500 | 15000 |
Sorting by Name
By default groupby sorts the group keys; sort=False keeps them in order of first appearance:
salaries.groupby('Name', sort=False).agg(sum)
Bonus | Salary | Year | |
Name | |||
BOSS | 67000 | 159000 | 8067 |
HanMeimei | 13500 | 15000 | 6049 |
Han | 1200 | 3000 | 2016 |
salaries.groupby('Name', sort=False).sum()
Bonus | Salary | Year | |
Name | |||
BOSS | 67000 | 159000 | 8067 |
HanMeimei | 13500 | 15000 | 6049 |
Han | 1200 | 3000 | 2016 |
# sum: total, mean: average, median: middle value
salaries.groupby('Name').median()
Bonus | Salary | Year | |
Name | |||
BOSS | 17000 | 39500 | 2017 |
Han | 1200 | 3000 | 2016 |
HanMeimei | 3500 | 5000 | 2016 |
Frequency / how many times each value occurs
salaries.groupby('Name').size()  # occurrences per name
Name
BOSS 4
Han 1
HanMeimei 3
dtype: int64

salaries.info()  # basic frame info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 4 columns):
Bonus 8 non-null int64
Name 8 non-null object
Salary 8 non-null int64
Year 8 non-null int64
dtypes: int64(3), object(1)
memory usage: 336.0+ bytes

salaries.describe()  # summary statistics
Bonus | Salary | Year | |
count | 8.000000 | 8.000000 | 8.000000 |
mean | 10212.500000 | 22125.000000 | 2016.500000 |
std | 7581.262145 | 18893.971072 | 0.534522 |
min | 1200.000000 | 3000.000000 | 2016.000000 |
25% | 3375.000000 | 4750.000000 | 2016.000000 |
50% | 9500.000000 | 22000.000000 | 2016.500000 |
75% | 16500.000000 | 39250.000000 | 2017.000000 |
max | 21000.000000 | 42000.000000 | 2017.000000 |
Group-wise summary statistics
salaries.groupby('Name').describe()  # summary statistics within each Name group
Bonus | Salary | Year | |||||||||||||||||||
count | mean | std | min | 25% | 50% | 75% | max | count | mean | ... | 75% | max | count | mean | std | min | 25% | 50% | 75% | max | |
Name | |||||||||||||||||||||
BOSS | 4.0 | 16750.0 | 3774.917218 | 12000.0 | 15000.0 | 17000.0 | 18750.0 | 21000.0 | 4.0 | 39750.0 | ... | 40500.0 | 42000.0 | 4.0 | 2016.750000 | 0.50000 | 2016.0 | 2016.75 | 2017.0 | 2017.0 | 2017.0 |
Han | 1.0 | 1200.0 | NaN | 1200.0 | 1200.0 | 1200.0 | 1200.0 | 1200.0 | 1.0 | 3000.0 | ... | 3000.0 | 3000.0 | 1.0 | 2016.000000 | NaN | 2016.0 | 2016.00 | 2016.0 | 2016.0 | 2016.0 |
HanMeimei | 3.0 | 4500.0 | 2179.449472 | 3000.0 | 3250.0 | 3500.0 | 5250.0 | 7000.0 | 3.0 | 5000.0 | ... | 5500.0 | 6000.0 | 3.0 | 2016.333333 | 0.57735 | 2016.0 | 2016.00 | 2016.0 | 2016.5 | 2017.0 |
3 rows × 24 columns
salaries.groupby('Name')[['Bonus', 'Salary']].agg(['sum', 'mean', 'std', 'median'])
Bonus | Salary | |||||||
sum | mean | std | median | sum | mean | std | median | |
Name | ||||||||
BOSS | 67000 | 16750 | 3774.917218 | 17000 | 159000 | 39750 | 1707.825128 | 39500 |
Han | 1200 | 1200 | NaN | 1200 | 3000 | 3000 | NaN | 3000 |
HanMeimei | 13500 | 4500 | 2179.449472 | 3500 | 15000 | 5000 | 1000.000000 | 5000 |
salaries.groupby('Name')[['Bonus','Salary']].agg([np.sum, np.mean, np.std, np.median])
Bonus | Salary | |||||||
sum | mean | std | median | sum | mean | std | median | |
Name | ||||||||
BOSS | 67000 | 16750 | 3774.917218 | 17000 | 159000 | 39750 | 1707.825128 | 39500 |
Han | 1200 | 1200 | NaN | 1200 | 3000 | 3000 | NaN | 3000 |
HanMeimei | 13500 | 4500 | 2179.449472 | 3500 | 15000 | 5000 | 1000.000000 | 5000 |
salaries.groupby('Name')[['Bonus']].agg([np.sum, np.mean, np.std, np.median])
Bonus | ||||
sum | mean | std | median | |
Name | ||||
BOSS | 67000 | 16750 | 3774.917218 | 17000 |
Han | 1200 | 1200 | NaN | 1200 |
HanMeimei | 13500 | 4500 | 2179.449472 | 3500 |
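On newer pandas (0.25+), the same multi-statistic table can be written with named aggregation, which produces flat, explicit output columns instead of a column MultiIndex; a minimal sketch (the output column names here are my own):

```python
# Named aggregation: each keyword names an output column as (source_column, func)
salaries.groupby('Name').agg(
    bonus_total=('Bonus', 'sum'),
    bonus_mean=('Bonus', 'mean'),
    salary_median=('Salary', 'median'),
)
```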
type(salaries['Bonus'])    # pandas.core.series.Series
type(salaries[['Bonus']])  # pandas.core.frame.DataFrame

Transformation / transform
nvda = pd.read_csv('1_pandas_part2_data/pandas_part2_data/NVDA.csv', index_col=0, parse_dates=['Date'])
# use the first column as the index; parse the Date column as dates
nvda.head()  # first five rows
Open | High | Low | Close | Adj Close | Volume | |
Date | ||||||
1999-01-22 | 1.750000 | 1.953125 | 1.552083 | 1.640625 | 1.523430 | 67867200 |
1999-01-25 | 1.770833 | 1.833333 | 1.640625 | 1.812500 | 1.683028 | 12762000 |
1999-01-26 | 1.833333 | 1.869792 | 1.645833 | 1.671875 | 1.552448 | 8580000 |
1999-01-27 | 1.677083 | 1.718750 | 1.583333 | 1.666667 | 1.547611 | 6109200 |
1999-01-28 | 1.666667 | 1.677083 | 1.651042 | 1.661458 | 1.542776 | 5688000 |
nvda.loc[:,'year'] = nvda.index.year
nvda.head()
Open | High | Low | Close | Adj Close | Volume | year | |
Date | |||||||
1999-01-22 | 1.750000 | 1.953125 | 1.552083 | 1.640625 | 1.523430 | 67867200 | 1999 |
1999-01-25 | 1.770833 | 1.833333 | 1.640625 | 1.812500 | 1.683028 | 12762000 | 1999 |
1999-01-26 | 1.833333 | 1.869792 | 1.645833 | 1.671875 | 1.552448 | 8580000 | 1999 |
1999-01-27 | 1.677083 | 1.718750 | 1.583333 | 1.666667 | 1.547611 | 6109200 | 1999 |
1999-01-28 | 1.666667 | 1.677083 | 1.651042 | 1.661458 | 1.542776 | 5688000 | 1999 |
On a Series, unique() returns the distinct values in the column.
On a Series, value_counts() returns the distinct values together with how many times each occurs.
How many distinct values a column has => unique()
# distinct values in the year column
nvda['year'].unique()
array([1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009,
       2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017], dtype=int64)
How many distinct values there are, and how often each occurs => value_counts()
# how often each value in the year column occurs
nvda['year'].value_counts()
2008 253
2015 252
2011 252
2004 252
2016 252
2005 252
2009 252
2013 252
2000 252
2002 252
2010 252
2014 252
2003 252
2006 251
2007 251
2012 250
2001 248
1999 239
2017 138
Name: year, dtype: int64

tmp = nvda['year'].value_counts().to_frame()  # convert the Series to a DataFrame
tmp = tmp.reset_index()
tmp.columns = ['year','count']
tmp.head()
year | count | |
0 | 2008 | 253 |
1 | 2015 | 252 |
2 | 2011 | 252 |
3 | 2004 | 252 |
4 | 2016 | 252 |
tmp.sort_values(by='year').head()
year | count | |
17 | 1999 | 239 |
8 | 2000 | 252 |
16 | 2001 | 248 |
9 | 2002 | 252 |
12 | 2003 | 252 |
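The to_frame / reset_index / column-rename sequence above can be collapsed into a single chain (exact output naming can differ slightly across pandas versions); nunique() gives the count of distinct values directly:

```python
# One chain instead of to_frame() + reset_index() + manual column rename
tmp = (nvda['year']
       .value_counts()
       .rename_axis('year')         # name the index before resetting it
       .reset_index(name='count'))  # 'count' becomes the values column
tmp.sort_values(by='year').head()

nvda['year'].nunique()  # number of distinct years (here: 19)
```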
nvda.head()
Open | High | Low | Close | Adj Close | Volume | year | |
Date | |||||||
1999-01-22 | 1.750000 | 1.953125 | 1.552083 | 1.640625 | 1.523430 | 67867200 | 1999 |
1999-01-25 | 1.770833 | 1.833333 | 1.640625 | 1.812500 | 1.683028 | 12762000 | 1999 |
1999-01-26 | 1.833333 | 1.869792 | 1.645833 | 1.671875 | 1.552448 | 8580000 | 1999 |
1999-01-27 | 1.677083 | 1.718750 | 1.583333 | 1.666667 | 1.547611 | 6109200 | 1999 |
1999-01-28 | 1.666667 | 1.677083 | 1.651042 | 1.661458 | 1.542776 | 5688000 | 1999 |
nvda.groupby('year').agg(['mean', 'std'])
Open | High | Low | Close | Adj Close | Volume | |||||||
mean | std | mean | std | mean | std | mean | std | mean | std | mean | std | |
year | ||||||||||||
1999 | 1.950782 | 0.588882 | 2.007317 | 0.614302 | 1.883559 | 0.571658 | 1.947230 | 0.601041 | 1.808134 | 0.558107 | 6.433220e+06 | 8.142949e+06 |
2000 | 8.781084 | 2.999908 | 9.222697 | 3.114186 | 8.360522 | 2.904761 | 8.778826 | 3.013104 | 8.151729 | 2.797869 | 1.104182e+07 | 7.985374e+06 |
2001 | 13.091254 | 3.839777 | 13.600750 | 3.829838 | 12.680548 | 3.830944 | 13.181552 | 3.833637 | 12.239956 | 3.559789 | 2.782387e+07 | 1.384318e+07 |
2002 | 9.690344 | 6.561287 | 9.955093 | 6.664226 | 9.344391 | 6.375212 | 9.614749 | 6.519053 | 8.927940 | 6.053379 | 3.168655e+07 | 1.558742e+07 |
2003 | 5.902434 | 1.461862 | 6.042659 | 1.491260 | 5.764960 | 1.423422 | 5.900344 | 1.459852 | 5.478865 | 1.355570 | 2.430220e+07 | 1.899657e+07 |
2004 | 6.484735 | 1.467445 | 6.608810 | 1.482036 | 6.353558 | 1.444797 | 6.465913 | 1.456575 | 6.004034 | 1.352528 | 1.706331e+07 | 1.191968e+07 |
2005 | 9.512381 | 1.580061 | 9.659656 | 1.591274 | 9.353175 | 1.571138 | 9.513823 | 1.589762 | 8.834223 | 1.476201 | 1.542825e+07 | 9.623837e+06 |
2006 | 18.057902 | 3.675092 | 18.425126 | 3.718616 | 17.720279 | 3.657584 | 18.095963 | 3.700960 | 16.803316 | 3.436590 | 1.534446e+07 | 6.616879e+06 |
2007 | 27.762045 | 6.111437 | 28.251673 | 6.225662 | 27.206056 | 5.902620 | 27.724542 | 6.087681 | 25.744098 | 5.652820 | 1.514562e+07 | 5.818216e+06 |
2008 | 16.004308 | 6.862760 | 16.426245 | 6.964528 | 15.521462 | 6.696381 | 15.945613 | 6.811527 | 14.806572 | 6.324960 | 2.022721e+07 | 8.552974e+06 |
2009 | 11.825119 | 2.638097 | 12.114762 | 2.612499 | 11.565952 | 2.640537 | 11.850873 | 2.631664 | 11.004331 | 2.443677 | 1.919821e+07 | 8.291987e+06 |
2010 | 13.576349 | 2.888884 | 13.802659 | 2.904905 | 13.318532 | 2.843065 | 13.563175 | 2.884261 | 12.594318 | 2.678230 | 1.853295e+07 | 8.434693e+06 |
2011 | 16.912540 | 3.404884 | 17.267540 | 3.490368 | 16.512143 | 3.305507 | 16.887540 | 3.404032 | 15.681214 | 3.160872 | 2.289352e+07 | 1.270114e+07 |
2012 | 13.526200 | 1.176957 | 13.717400 | 1.191775 | 13.319800 | 1.165419 | 13.507880 | 1.185139 | 12.551166 | 1.091736 | 1.207757e+07 | 5.050116e+06 |
2013 | 14.173571 | 1.251508 | 14.329802 | 1.253287 | 14.035278 | 1.253372 | 14.189127 | 1.250883 | 13.412278 | 1.260152 | 8.843986e+06 | 4.202323e+06 |
2014 | 18.543056 | 1.293283 | 18.745476 | 1.283480 | 18.348214 | 1.276038 | 18.547064 | 1.284932 | 17.875053 | 1.312833 | 7.098902e+06 | 3.140560e+06 |
2015 | 23.680595 | 4.106327 | 23.979524 | 4.152229 | 23.411071 | 4.079351 | 23.718254 | 4.128879 | 23.262283 | 4.154678 | 7.756520e+06 | 3.933075e+06 |
2016 | 53.630833 | 21.714540 | 54.415397 | 22.182621 | 52.895119 | 21.263517 | 53.761190 | 21.803927 | 53.475737 | 21.824367 | 1.107062e+07 | 7.547056e+06 |
2017 | 120.481305 | 22.027821 | 122.300725 | 22.510244 | 118.402754 | 21.281863 | 120.547971 | 21.991898 | 120.436863 | 22.056290 | 1.907742e+07 | 1.073342e+07 |
def my_transform(x):
    return (x - x.mean()) / x.std()

tmp_arr = np.array(range(10))
tmp_arr
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
my_transform(tmp_arr)
array([-1.5666989 , -1.21854359, -0.87038828, -0.52223297, -0.17407766,
        0.17407766,  0.52223297,  0.87038828,  1.21854359,  1.5666989 ])
transformed = nvda.groupby('year').transform(my_transform)
transformed.head()
Open | High | Low | Close | Adj Close | Volume | |
Date | ||||||
1999-01-22 | -0.340955 | -0.088217 | -0.579850 | -0.510124 | -0.510124 | 7.544438 |
1999-01-25 | -0.305578 | -0.283222 | -0.424964 | -0.224161 | -0.224161 | 0.777210 |
1999-01-26 | -0.199444 | -0.223871 | -0.415854 | -0.458130 | -0.458131 | 0.263637 |
1999-01-27 | -0.464778 | -0.469747 | -0.525185 | -0.466795 | -0.466798 | -0.039791 |
1999-01-28 | -0.482465 | -0.537575 | -0.406741 | -0.475462 | -0.475461 | -0.091517 |
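A quick way to confirm that transform applied the z-score within each year: group the transformed column by year again and check that the mean is roughly 0 and the std roughly 1. A minimal sketch:

```python
# Per-year mean/std of the z-scored column should be ~0 and ~1
check = transformed['Adj Close'].groupby(nvda['year']).agg(['mean', 'std'])
print(check.round(6).head())
```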
%matplotlib inline
compare_df = pd.DataFrame({'Origin': nvda['Adj Close'], 'Transformed': transformed['Adj Close']})
compare_df.head()
Origin | Transformed | |
Date | ||
1999-01-22 | 1.523430 | -0.510124 |
1999-01-25 | 1.683028 | -0.224161 |
1999-01-26 | 1.552448 | -0.458131 |
1999-01-27 | 1.547611 | -0.466798 |
1999-01-28 | 1.542776 | -0.475461 |
compare_df.plot()
compare_df['Transformed'].plot()
The apply transformation function
salaries
Bonus | Name | Salary | Year | |
0 | 12000 | BOSS | 40000 | 2016 |
1 | 3000 | HanMeimei | 5000 | 2016 |
2 | 3500 | HanMeimei | 4000 | 2016 |
3 | 1200 | Han | 3000 | 2016 |
4 | 16000 | BOSS | 38000 | 2017 |
5 | 18000 | BOSS | 42000 | 2017 |
6 | 7000 | HanMeimei | 6000 | 2017 |
7 | 21000 | BOSS | 39000 | 2017 |
salaries.loc[:,'tmp_col'] = (salaries['Salary']*2 - 1500) / 0.8
salaries
Bonus | Name | Salary | Year | tmp_col | |
0 | 12000 | BOSS | 40000 | 2016 | 98125.0 |
1 | 3000 | HanMeimei | 5000 | 2016 | 10625.0 |
2 | 3500 | HanMeimei | 4000 | 2016 | 8125.0 |
3 | 1200 | Han | 3000 | 2016 | 5625.0 |
4 | 16000 | BOSS | 38000 | 2017 | 93125.0 |
5 | 18000 | BOSS | 42000 | 2017 | 103125.0 |
6 | 7000 | HanMeimei | 6000 | 2017 | 13125.0 |
7 | 21000 | BOSS | 39000 | 2017 | 95625.0 |
def trans(x):
    return (x*2 - 1500) / 0.8

salaries.loc[:,'tmp_col2'] = salaries['Salary'].apply(trans)
salaries
Bonus | Name | Salary | Year | tmp_col | tmp_col2 | |
0 | 12000 | BOSS | 40000 | 2016 | 98125.0 | 98125.0 |
1 | 3000 | HanMeimei | 5000 | 2016 | 10625.0 | 10625.0 |
2 | 3500 | HanMeimei | 4000 | 2016 | 8125.0 | 8125.0 |
3 | 1200 | Han | 3000 | 2016 | 5625.0 | 5625.0 |
4 | 16000 | BOSS | 38000 | 2017 | 93125.0 | 93125.0 |
5 | 18000 | BOSS | 42000 | 2017 | 103125.0 | 103125.0 |
6 | 7000 | HanMeimei | 6000 | 2017 | 13125.0 | 13125.0 |
7 | 21000 | BOSS | 39000 | 2017 | 95625.0 | 95625.0 |
def trans2(x):
    if x=='Han':
        return 'HanXiaoyang'
    else:
        return x

salaries.loc[:,'full_name'] = salaries['Name'].apply(trans2)
salaries
Bonus | Name | Salary | Year | tmp_col | tmp_col2 | full_name | |
0 | 12000 | BOSS | 40000 | 2016 | 98125.0 | 98125.0 | BOSS |
1 | 3000 | HanMeimei | 5000 | 2016 | 10625.0 | 10625.0 | HanMeimei |
2 | 3500 | HanMeimei | 4000 | 2016 | 8125.0 | 8125.0 | HanMeimei |
3 | 1200 | Han | 3000 | 2016 | 5625.0 | 5625.0 | HanXiaoyang |
4 | 16000 | BOSS | 38000 | 2017 | 93125.0 | 93125.0 | BOSS |
5 | 18000 | BOSS | 42000 | 2017 | 103125.0 | 103125.0 | BOSS |
6 | 7000 | HanMeimei | 6000 | 2017 | 13125.0 | 13125.0 | HanMeimei |
7 | 21000 | BOSS | 39000 | 2017 | 95625.0 | 95625.0 | BOSS |
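For a simple two-way condition like trans2, np.where is a vectorized alternative to apply; a minimal sketch (full_name2 is just an illustrative column name):

```python
# Vectorized version of trans2: condition, value-if-true, value-if-false
salaries.loc[:, 'full_name2'] = np.where(salaries['Name'] == 'Han',
                                         'HanXiaoyang',
                                         salaries['Name'])
```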
help(pd.Series.apply)

Help on function apply in module pandas.core.series:
apply(self, func, convert_dtype=True, args=(), **kwds)
Invoke function on values of Series. Can be ufunc (a NumPy function
that applies to the entire Series) or a Python function that only works
on single values
Parameters
----------
func : function
convert_dtype : boolean, default True
Try to find better dtype for elementwise function results. If
False, leave as dtype=object
args : tuple
Positional arguments to pass to function in addition to the value
Additional keyword arguments will be passed as keywords to the function
Returns
-------
y : Series or DataFrame if func returns a Series
See also
--------
Series.map: For element-wise operations
Series.agg: only perform aggregating type operations
Series.transform: only perform transformating type operations
Examples
--------
Create a series with typical summer temperatures for each city.
>>> import pandas as pd
>>> import numpy as np
>>> series = pd.Series([20, 21, 12], index=['London',
... 'New York','Helsinki'])
>>> series
London 20
New York 21
Helsinki 12
dtype: int64
Square the values by defining a function and passing it as an
argument to ``apply()``.
>>> def square(x):
... return x**2
>>> series.apply(square)
London 400
New York 441
Helsinki 144
dtype: int64
Square the values by passing an anonymous function as an
argument to ``apply()``.
>>> series.apply(lambda x: x**2)
London 400
New York 441
Helsinki 144
dtype: int64
Define a custom function that needs additional positional
arguments and pass these additional arguments using the
``args`` keyword.
>>> def subtract_custom_value(x, custom_value):
... return x-custom_value
>>> series.apply(subtract_custom_value, args=(5,))
London 15
New York 16
Helsinki 7
dtype: int64
Define a custom function that takes keyword arguments
and pass these arguments to ``apply``.
>>> def add_custom_values(x, **kwargs):
... for month in kwargs:
... x+=kwargs[month]
... return x
>>> series.apply(add_custom_values, june=30, july=20, august=25)
London 95
New York 96
Helsinki 87
dtype: int64
Use a function from the Numpy library.
>>> series.apply(np.log)
London 2.995732
New York 3.044522
Helsinki 2.484907
dtype: float64

salaries
Bonus | Name | Salary | Year | tmp_col | tmp_col2 | full_name | |
0 | 12000 | BOSS | 40000 | 2016 | 98125.0 | 98125.0 | BOSS |
1 | 3000 | HanMeimei | 5000 | 2016 | 10625.0 | 10625.0 | HanMeimei |
2 | 3500 | HanMeimei | 4000 | 2016 | 8125.0 | 8125.0 | HanMeimei |
3 | 1200 | Han | 3000 | 2016 | 5625.0 | 5625.0 | HanXiaoyang |
4 | 16000 | BOSS | 38000 | 2017 | 93125.0 | 93125.0 | BOSS |
5 | 18000 | BOSS | 42000 | 2017 | 103125.0 | 103125.0 | BOSS |
6 | 7000 | HanMeimei | 6000 | 2017 | 13125.0 | 13125.0 | HanMeimei |
7 | 21000 | BOSS | 39000 | 2017 | 95625.0 | 95625.0 | BOSS |
gender = 'male'
sex = '男' if gender=='male' else '女'
sex    # '男'

Lambda / anonymous functions
salaries.loc[:,'new_name'] = salaries['Name'].apply(lambda x: 'HanXiaoyang' if x=='Han' else x)
Operating on several columns
# if the employee is BOSS, return salary + bonus; otherwise just the salary
def my_fun(name, salary, bonus):
    if name=='BOSS':
        return salary + bonus
    else:
        return salary

salaries.loc[:,'my_s_result'] = list(map(lambda x,y,z: my_fun(x,y,z),
                                         salaries['Name'],
                                         salaries['Salary'],
                                         salaries['Bonus']))
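The same multi-column logic can also be written with DataFrame.apply over rows (axis=1); it is usually slower than map over columns but reads more directly. A minimal sketch (my_s_result2 is an illustrative column name):

```python
# axis=1 passes each row as a Series, so columns are accessed by label
salaries.loc[:, 'my_s_result2'] = salaries.apply(
    lambda row: my_fun(row['Name'], row['Salary'], row['Bonus']),
    axis=1)
```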
Summary
- groupby retrieves the group contents
- after grouping, aggregate with agg([np.sum, 'median', 'std'])
- describe and transform also work after groupby
- apply transforms a column (define a function, or use a lambda)
- bonus: transform several columns at once with map(lambda x,y,z,a: my_fun(x,y,z,a), df['x'], df['y'], ...)
Concatenating and merging data
- concat
- merge
- join
df1 = pd.DataFrame({'apts':[55000,60000], 'cars':[200000,300000]}, index=['Shanghai','Beijing'])
df1
apts | cars | |
Shanghai | 55000 | 200000 |
Beijing | 60000 | 300000 |
df2 = pd.DataFrame({'apts':[35000, 45000], 'cars':[150000, 180000]}, index=['Hangzhou','Guangzhou'])
df2
apts | cars | |
Hangzhou | 35000 | 150000 |
Guangzhou | 45000 | 180000 |
df3 = pd.DataFrame({'apts':[30000, 10000], 'cars':[120000, 100000]}, index=['Nanjing','Chongqing'])
df3
apts | cars | |
Nanjing | 30000 | 120000 |
Chongqing | 10000 | 100000 |
# concat
result = pd.concat([df1,df2,df3])
result
apts | cars | |
Shanghai | 55000 | 200000 |
Beijing | 60000 | 300000 |
Hangzhou | 35000 | 150000 |
Guangzhou | 45000 | 180000 |
Nanjing | 30000 | 120000 |
Chongqing | 10000 | 100000 |
# axis=1: align on the row index and place the frames side by side
pd.concat([df1,df2,df3], axis=1)
apts | cars | apts | cars | apts | cars | |
Beijing | 60000.0 | 300000.0 | NaN | NaN | NaN | NaN |
Chongqing | NaN | NaN | NaN | NaN | 10000.0 | 100000.0 |
Guangzhou | NaN | NaN | 45000.0 | 180000.0 | NaN | NaN |
Hangzhou | NaN | NaN | 35000.0 | 150000.0 | NaN | NaN |
Nanjing | NaN | NaN | NaN | NaN | 30000.0 | 120000.0 |
Shanghai | 55000.0 | 200000.0 | NaN | NaN | NaN | NaN |
# axis=0 (default): align on columns and stack the rows
pd.concat([df1,df2,df3], axis=0)
apts | cars | |
Shanghai | 55000 | 200000 |
Beijing | 60000 | 300000 |
Hangzhou | 35000 | 150000 |
Guangzhou | 45000 | 180000 |
Nanjing | 30000 | 120000 |
Chongqing | 10000 | 100000 |
#append
df1.append(df1)
apts | cars | |
Shanghai | 55000 | 200000 |
Beijing | 60000 | 300000 |
Shanghai | 55000 | 200000 |
Beijing | 60000 | 300000 |
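Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current versions the same stacking is done with pd.concat; keys= additionally records which input each block came from. A minimal sketch (the key labels are my own):

```python
# Modern equivalent of df1.append(df1)
pd.concat([df1, df1])

# keys= builds a MultiIndex marking the source of each block
pd.concat([df1, df2, df3], keys=['batch1', 'batch2', 'batch3'])
```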
result
apts | cars | |
Shanghai | 55000 | 200000 |
Beijing | 60000 | 300000 |
Hangzhou | 35000 | 150000 |
Guangzhou | 45000 | 180000 |
Nanjing | 30000 | 120000 |
Chongqing | 10000 | 100000 |
Merging / merge
def my_trans_apts(x):
    if x < 45000:
        return 45000
    else:
        return 60000

result.loc[:,'new_apts'] = result['apts'].apply(my_trans_apts)
result
apts | cars | new_apts | |
Shanghai | 55000 | 200000 | 60000 |
Beijing | 60000 | 300000 | 60000 |
Hangzhou | 35000 | 150000 | 45000 |
Guangzhou | 45000 | 180000 | 60000 |
Nanjing | 30000 | 120000 | 45000 |
Chongqing | 10000 | 100000 | 45000 |
new_df = pd.DataFrame({'new_apts':[60000,45000], 'bonus':[100000, 50000]})
new_df
bonus | new_apts | |
0 | 100000 | 60000 |
1 | 50000 | 45000 |
pd.merge(result, new_df, on=['new_apts'], how='inner')
apts | cars | new_apts | bonus | |
0 | 55000 | 200000 | 60000 | 100000 |
1 | 60000 | 300000 | 60000 | 100000 |
2 | 45000 | 180000 | 60000 | 100000 |
3 | 35000 | 150000 | 45000 | 50000 |
4 | 30000 | 120000 | 45000 | 50000 |
5 | 10000 | 100000 | 45000 | 50000 |
new_df2 = pd.DataFrame({'new_apts':[65000,45000], 'bonus':[100000, 50000]})
new_df2
bonus | new_apts | |
0 | 100000 | 65000 |
1 | 50000 | 45000 |
result
apts | cars | new_apts | |
Shanghai | 55000 | 200000 | 60000 |
Beijing | 60000 | 300000 | 60000 |
Hangzhou | 35000 | 150000 | 45000 |
Guangzhou | 45000 | 180000 | 60000 |
Nanjing | 30000 | 120000 | 45000 |
Chongqing | 10000 | 100000 | 45000 |
pd.merge(result, new_df2, on=['new_apts'], how='left')
apts | cars | new_apts | bonus | |
0 | 55000 | 200000 | 60000 | NaN |
1 | 60000 | 300000 | 60000 | NaN |
2 | 35000 | 150000 | 45000 | 50000.0 |
3 | 45000 | 180000 | 60000 | NaN |
4 | 30000 | 120000 | 45000 | 50000.0 |
5 | 10000 | 100000 | 45000 | 50000.0 |
pd.merge(result, new_df2, on=['new_apts'], how='right')
apts | cars | new_apts | bonus | |
0 | 35000.0 | 150000.0 | 45000 | 50000 |
1 | 30000.0 | 120000.0 | 45000 | 50000 |
2 | 10000.0 | 100000.0 | 45000 | 50000 |
3 | NaN | NaN | 65000 | 100000 |
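Besides left/right/inner there is how='outer', which keeps rows from both sides; indicator=True adds a _merge column showing where each row came from. A minimal sketch:

```python
# Outer join: union of the keys; _merge is one of left_only / right_only / both
pd.merge(result, new_df2, on=['new_apts'], how='outer', indicator=True)
```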
df1
apts | cars | |
Shanghai | 55000 | 200000 |
Beijing | 60000 | 300000 |
df2
apts | cars | |
Hangzhou | 35000 | 150000 |
Guangzhou | 45000 | 180000 |
df3
apts | cars | |
Nanjing | 30000 | 120000 |
Chongqing | 10000 | 100000 |
df4 = pd.DataFrame({'salaries':[10000,30000,30000,20000,15000]}, index=['Suzhou', 'Beijing','Shanghai','Guangzhou','Tianjin'])
df4
salaries | |
Suzhou | 10000 |
Beijing | 30000 |
Shanghai | 30000 |
Guangzhou | 20000 |
Tianjin | 15000 |
join: a function that merges data on the index
df1.join(df4)
apts | cars | salaries | |
Shanghai | 55000 | 200000 | 30000 |
Beijing | 60000 | 300000 | 30000 |
df2.join(df4)
apts | cars | salaries | |
Hangzhou | 35000 | 150000 | NaN |
Guangzhou | 45000 | 180000 | 20000.0 |
help(pd.DataFrame.join)

Help on function join in module pandas.core.frame:
join(self, other, on=None, how='left', lsuffix='', rsuffix='', sort=False)
Join columns with other DataFrame either on index or on a key
column. Efficiently Join multiple DataFrame objects by index at once by
passing a list.
Parameters
----------
other : DataFrame, Series with name field set, or list of DataFrame
Index should be similar to one of the columns in this one. If a
Series is passed, its name attribute must be set, and that will be
used as the column name in the resulting joined DataFrame
on : column name, tuple/list of column names, or array-like
Column(s) in the caller to join on the index in other,
otherwise joins index-on-index. If multiples
columns given, the passed DataFrame must have a MultiIndex. Can
pass an array as the join key if not already contained in the
calling DataFrame. Like an Excel VLOOKUP operation
how : {'left', 'right', 'outer', 'inner'}, default: 'left'
How to handle the operation of the two objects.
* left: use calling frame's index (or column if on is specified)
* right: use other frame's index
* outer: form union of calling frame's index (or column if on is
specified) with other frame's index, and sort it
lexicographically
* inner: form intersection of calling frame's index (or column if
on is specified) with other frame's index, preserving the order
of the calling's one
lsuffix : string
Suffix to use from left frame's overlapping columns
rsuffix : string
Suffix to use from right frame's overlapping columns
sort : boolean, default False
Order result DataFrame lexicographically by the join key. If False,
the order of the join key depends on the join type (how keyword)
Notes
-----
on, lsuffix, and rsuffix options are not supported when passing a list
of DataFrame objects
Examples
--------
>>> caller = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],
... 'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
>>> caller
A key
0 A0 K0
1 A1 K1
2 A2 K2
3 A3 K3
4 A4 K4
5 A5 K5
>>> other = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
... 'B': ['B0', 'B1', 'B2']})
>>> other
B key
0 B0 K0
1 B1 K1
2 B2 K2
Join DataFrames using their indexes.
>>> caller.join(other, lsuffix='_caller', rsuffix='_other')
>>> A key_caller B key_other
0 A0 K0 B0 K0
1 A1 K1 B1 K1
2 A2 K2 B2 K2
3 A3 K3 NaN NaN
4 A4 K4 NaN NaN
5 A5 K5 NaN NaN
If we want to join using the key columns, we need to set key to be
the index in both caller and other. The joined DataFrame will have
key as its index.
>>> caller.set_index('key').join(other.set_index('key'))
>>> A B
key
K0 A0 B0
K1 A1 B1
K2 A2 B2
K3 A3 NaN
K4 A4 NaN
K5 A5 NaN
Another option to join using the key columns is to use the on
parameter. DataFrame.join always uses other's index but we can use any
column in the caller. This method preserves the original caller's
index in the result.
>>> caller.join(other.set_index('key'), on='key')
>>> A key B
0 A0 K0 B0
1 A1 K1 B1
2 A2 K2 B2
3 A3 K3 NaN
4 A4 K4 NaN
5 A5 K5 NaN
See also
--------
DataFrame.merge : For column(s)-on-columns(s) operations
Returns
-------
joined : DataFrame

Summary
- concat: concatenation; the axis argument selects the dimension to concatenate along
- merge: joins tables on one or more key columns
- join: merges data on the index (see the sketch below)
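merge and join overlap: an index-on-index merge with how='left' reproduces DataFrame.join. A minimal sketch using the frames above:

```python
# join is index-based merging; these two produce identical frames
a = df1.join(df4)
b = pd.merge(df1, df4, left_index=True, right_index=True, how='left')
assert a.equals(b)
```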
Mini case study: bike rental analysis
bikes = pd.read_csv('1_pandas_part2_data/pandas_part2_data/bikes.csv', sep=';', encoding='latin1', \
                    parse_dates=['Date'], index_col='Date')
bikes.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 310 entries, 2012-01-01 to 2012-05-11
Data columns (total 9 columns):
Berri 1 310 non-null int64
Brébeuf (données non disponibles) 0 non-null float64
Côte-Sainte-Catherine 310 non-null int64
Maisonneuve 1 310 non-null int64
Maisonneuve 2 310 non-null int64
du Parc 310 non-null int64
Pierre-Dupuy 310 non-null int64
Rachel1 310 non-null int64
St-Urbain (données non disponibles) 0 non-null float64
dtypes: float64(2), int64(7)
memory usage: 24.2 KB

bikes.head()
Berri 1 | Brébeuf (données non disponibles) | Côte-Sainte-Catherine | Maisonneuve 1 | Maisonneuve 2 | du Parc | Pierre-Dupuy | Rachel1 | St-Urbain (données non disponibles) | |
Date | |||||||||
2012-01-01 | 35 | NaN | 0 | 38 | 51 | 26 | 10 | 16 | NaN |
2012-02-01 | 83 | NaN | 1 | 68 | 153 | 53 | 6 | 43 | NaN |
2012-03-01 | 135 | NaN | 2 | 104 | 248 | 89 | 3 | 58 | NaN |
2012-04-01 | 144 | NaN | 1 | 116 | 318 | 111 | 8 | 61 | NaN |
2012-05-01 | 197 | NaN | 2 | 124 | 330 | 97 | 13 | 95 | NaN |
bikes.shape    # (310, 9)

# dropna removes missing values; by default it drops every row that contains a NaN
bikes.dropna()
Berri 1 | Brébeuf (données non disponibles) | Côte-Sainte-Catherine | Maisonneuve 1 | Maisonneuve 2 | du Parc | Pierre-Dupuy | Rachel1 | St-Urbain (données non disponibles) | |
Date |
bikes.dropna(axis=1, how='all').head()  # drop columns whose values are all missing
Berri 1 | Côte-Sainte-Catherine | Maisonneuve 1 | Maisonneuve 2 | du Parc | Pierre-Dupuy | Rachel1 | |
Date | |||||||
2012-01-01 | 35 | 0 | 38 | 51 | 26 | 10 | 16 |
2012-02-01 | 83 | 1 | 68 | 153 | 53 | 6 | 43 |
2012-03-01 | 135 | 2 | 104 | 248 | 89 | 3 | 58 |
2012-04-01 | 144 | 1 | 116 | 318 | 111 | 8 | 61 |
2012-05-01 | 197 | 2 | 124 | 330 | 97 | 13 | 95 |
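dropna has finer controls than how='all': thresh keeps only rows or columns with at least that many non-missing values, and subset restricts the check to chosen columns. A minimal sketch (the thresh value here is arbitrary):

```python
# Keep columns having at least 300 non-missing values
bikes.dropna(axis=1, thresh=300).head()

# Drop only rows where 'Berri 1' itself is missing
bikes.dropna(subset=['Berri 1']).head()
```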
bikes.shape    # (310, 9)
bikes.dropna(axis=1, how='all', inplace=True)
bikes.shape    # (310, 7)
bikes.loc[:,'weekday'] = bikes.index.weekday
bikes.head()
Berri 1 | Côte-Sainte-Catherine | Maisonneuve 1 | Maisonneuve 2 | du Parc | Pierre-Dupuy | Rachel1 | weekday | |
Date | ||||||||
2012-01-01 | 35 | 0 | 38 | 51 | 26 | 10 | 16 | 6 |
2012-02-01 | 83 | 1 | 68 | 153 | 53 | 6 | 43 | 2 |
2012-03-01 | 135 | 2 | 104 | 248 | 89 | 3 | 58 | 3 |
2012-04-01 | 144 | 1 | 116 | 318 | 111 | 8 | 61 | 6 |
2012-05-01 | 197 | 2 | 124 | 330 | 97 | 13 | 95 | 1 |
weekday_counts = bikes.groupby('weekday').agg(sum)
weekday_counts.head()
Berri 1 | Côte-Sainte-Catherine | Maisonneuve 1 | Maisonneuve 2 | du Parc | Pierre-Dupuy | Rachel1 | |
weekday | |||||||
0 | 132446 | 57940 | 90828 | 163042 | 89338 | 41524 | 126215 |
1 | 119895 | 52113 | 80865 | 145389 | 79585 | 35967 | 114622 |
2 | 146785 | 64189 | 99674 | 177105 | 96340 | 45103 | 130796 |
3 | 147630 | 61855 | 102801 | 177285 | 93386 | 46600 | 135268 |
4 | 150183 | 61432 | 102317 | 181651 | 95731 | 47272 | 143115 |
%matplotlib inline
weekday_counts['Berri 1'].plot(kind='bar')
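The bars are labelled 0 through 6 because pandas encodes weekday with Monday=0; relabelling the index before plotting makes the chart self-explanatory. A minimal sketch (the English day names are my own labels):

```python
# pandas DatetimeIndex.weekday: Monday=0 ... Sunday=6
weekday_counts.index = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
                        'Friday', 'Saturday', 'Sunday']
weekday_counts['Berri 1'].plot(kind='bar')
```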
















