意见箱
恒创运营部门将仔细参阅您的意见和建议,必要时将通过预留邮箱与您保持联络。感谢您的支持!
意见/建议
提交建议

pandas 之 groupby 聚合函数

来源:恒创科技 编辑:恒创科技编辑部
2022-08-22 14:40:25

数据分析重点. 同维度下,对不同字段聚合 groupbby(key).agg({'字段1':'aggfunc1', '字段1':'aggfunc2''..}

import numpy as np 
import pandas as pd
聚合函数

Aggregations refer to any data transformation that produces scalar values from arrays(输入是数组, 输出是标量值). The preceding examples have used several of them, including mean, count, min, and sum You may wonder what is going on when you invoke mean() on a GroupBy object, Many common aggregations such as those found in Table 10-1, have optimized implementations. However, you are not limited to only this set of methods.

countsummeanmedianstd, varmin, maxprodfirst, last

You can use aggregations of your own devising and additionally call any method that is also dedined on the grouped object.
For example, you might recall that quantile computes sample quantiles of a Series or a DataFrame.


pandas 之 groupby 聚合函数

While quantile is not explicitly implemented for GroupBy, it's a Series method an thus available for use. Internally, GroupBy efficiently slices up the Series, call piece.quantile(0.9) for each piece, and then assembles those results together into the result object:

您可以使用您自己设计的聚合,并额外调用在分组对象上也禁用的任何方法。例如,您可能还记得分位数计算序列或数据流的样本分位数。虽然分位数没有显式地为GroupBy实现,但它是一个系列方法,因此可以使用。在内部,GroupBy有效地分割该系列,为每个片段调用piece.quantile(0.9),然后将这些结果组合到result对象中

df = pd.DataFrame({
'key1': 'a a b b a'.split(),
'key2': ['one', 'two', 'one', 'two', 'one'],
'data1': np.random.randn(5),
'data2': np.random.randn(5)
})

df

key1

key2

data1

data2

0

a

one

1.296733

-0.756093

1

a

two

-1.389859

-1.027718

2

b

one

-0.846801

-0.802681

3

b

two

1.200620

-1.328187

4

a

one

0.002991

-1.223807

grouped = df.groupby('key1')

grouped['data1'].quantile(0.9) # 0.9分位数
key1
a 1.037985
b 0.995878
Name: data1, dtype: float64

To use your own aggregation functions, pass any function that aggregates an array to the aggregate or agg method

def peak_to_peak(arr):
"""计算数组的极差"""
return arr.max() - arr.min()

grouped.agg(peak_to_peak) # 计算各组类的极差, 类似apply

size

tip

tip_pct

total_bill

day

smoker

Fri

No

1

2.00

0.067349

10.29

Yes

3

3.73

0.159925

34.42

Sat

No

3

8.00

0.235193

41.08

Yes

4

9.00

0.290095

47.74

Sun

No

4

4.99

0.193226

39.40

Yes

3

5.00

0.644685

38.10

Thur

No

5

5.45

0.193350

33.68

Yes

2

3.00

0.151240

32.77

You may notice that some methods like describe also work, even though they are not aggregations, strictly speaking(鸭子类型吗?):

grouped.describe()  # 描述分组信息, 只会对数值相关的哦

data1

data2

count

mean

std

min

25%

50%

75%

max

count

mean

std

min

25%

50%

75%

max

key1

a

3.0

-0.030045

1.343601

-1.389859

-0.693434

0.002991

0.649862

1.296733

3.0

-1.002539

0.234871

-1.223807

-1.125762

-1.027718

-0.891905

-0.756093

b

2.0

0.176910

1.447745

-0.846801

-0.334946

0.176910

0.688765

1.200620

2.0

-1.065434

0.371589

-1.328187

-1.196811

-1.065434

-0.934057

-0.802681

I will explain in more detail what has happend here in Section 10.3 Apply: General split-apply-combine on page 302

列的多功能扩展

Let's return to the tipping dataset from earlier examples. After loading it with read_csv, we add a tipping percentage column tip_pct

新增一列, tip所占的百分比

tips = pd.read_csv('../examples/tips.csv')

tips.info()

tips.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 6 columns):
total_bill 244 non-null float64
tip 244 non-null float64
smoker 244 non-null object
day 244 non-null object
time 244 non-null object
size 244 non-null int64
dtypes: float64(2), int64(1), object(3)
memory usage: 11.5+ KB

total_bill

tip

smoker

day

time

size

0

16.99

1.01

No

Sun

Dinner

2

1

10.34

1.66

No

Sun

Dinner

3

2

21.01

3.50

No

Sun

Dinner

3

3

23.68

3.31

No

Sun

Dinner

2

4

24.59

3.61

No

Sun

Dinner

4

"新增一列 tip_pct"

tips['tip_pct'] = tips['tip'] / tips['total_bill']

tips[:6]
'新增一列 tip_pct'

total_bill

tip

smoker

day

time

size

tip_pct

0

16.99

1.01

No

Sun

Dinner

2

0.059447

1

10.34

1.66

No

Sun

Dinner

3

0.160542

2

21.01

3.50

No

Sun

Dinner

3

0.166587

3

23.68

3.31

No

Sun

Dinner

2

0.139780

4

24.59

3.61

No

Sun

Dinner

4

0.146808

5

25.29

4.71

No

Sun

Dinner

4

0.186240

As you've already seen, aggregating a Series or all of the columns of a DataFrame is a matter of using aggregate with the desired function or calling a method like mean or std. However, you may want to aggregate using a different function depending o the column, or multiple functions at once. Fortunately, this is possible to do, which i'll illustrate through a number of examples. First, i'll group the tips by day and smoker:

对Series, DataFrame 的某列or所有的一个聚合, 通常会使用一个聚合函数去映射, 然而, 你可能想要总使用不同的函数去映射不同的列. 幸运的是,这是可以做到的,我将通过一些例子来说明这一点. 首先,我将把每天的小费和吸烟者分成两组

grouped = tips.groupby(['day', 'smoker'])

Note that for descriptive statistics like those in Table 10-1, you can pass the name of the function a s a string:

grouped_pct = grouped['tip_pct']
grouped_pct.agg('mean')
day   smoker
Fri No 0.151650
Yes 0.174783
Sat No 0.158048
Yes 0.147906
Sun No 0.160113
Yes 0.187250
Thur No 0.160298
Yes 0.163863
Name: tip_pct, dtype: float64
# cj  分组统计一波
grouped_pct.agg('count')
day   smoker
Fri No 4
Yes 15
Sat No 45
Yes 42
Sun No 57
Yes 19
Thur No 45
Yes 17
Name: tip_pct, dtype: int64

If you pass a list of functions or function names instead, you get back a DataFrame with column names taken from the functions:

"对1or多个列, 进行1or多个聚合, 并排展开, 厉害了"

grouped_pct.agg(['mean', 'std', peak_to_peak])
'对1or多个列, 进行1or多个聚合, 并排展开, 厉害了'

mean

std

peak_to_peak

day

smoker

Fri

No

0.151650

0.028123

0.067349

Yes

0.174783

0.051293

0.159925

Sat

No

0.158048

0.039767

0.235193

Yes

0.147906

0.061375

0.290095

Sun

No

0.160113

0.042347

0.193226

Yes

0.187250

0.154134

0.644685

Thur

No

0.160298

0.038774

0.193350

Yes

0.163863

0.039389

0.151240

Here we passed a list of aggregations functions to agg to evaluate indepedently on the data groups.

You don't need to accept the names that GroupBy gives to the columns; notably(尤其) lambda functions have the name <lambda which makes them hard to identify(you can see for yourself by looking at a function's __ name__ attribute.) Thus, if you pass a list of (name, function) tuples, the first element of each tuple will be used as the DataFrame column names.
(you can think of a list of 2-tuple as an ordered mapping)

"给分组字段取别名"
grouped_pct.agg([('foo', 'mean'), ('bar', np.std)])
'给分组字段取别名'

foo

bar

day

smoker

Fri

No

0.151650

0.028123

Yes

0.174783

0.051293

Sat

No

0.158048

0.039767

Yes

0.147906

0.061375

Sun

No

0.160113

0.042347

Yes

0.187250

0.154134

Thur

No

0.160298

0.038774

Yes

0.163863

0.039389

With a DataFrame you have more options, as you can specify a list of functions to apply to all of the columns or different functions per column (对不同的列进行不同的函数映射apply). To start, suppose we wanted to compute the same three statistics for the tip_pct and total_bill columns:

functions = ['count', 'mean', 'max']

"实现对任意字段的任意操作, 分别"
result = grouped['tip_pct', 'total_bill'].agg(functions)

result
'实现对任意字段的任意操作, 分别'

tip_pct

total_bill

count

mean

max

count

mean

max

day

smoker

Fri

No

4

0.151650

0.187735

4

18.420000

22.75

Yes

15

0.174783

0.263480

15

16.813333

40.17

Sat

No

45

0.158048

0.291990

45

19.661778

48.33

Yes

42

0.147906

0.325733

42

21.276667

50.81

Sun

No

57

0.160113

0.252672

57

20.506667

48.17

Yes

19

0.187250

0.710345

19

24.120000

45.35

Thur

No

45

0.160298

0.266312

45

17.113111

41.19

Yes

17

0.163863

0.241255

17

19.190588

43.11

As you can see, the resulting DataFrame has hierarchical columns, the same as you would get aggregating each column separately and using concat to glue(粘合) the results together using the column names as the keys argument.

result['tip_pct']  # 多层索引的选取哦

count

mean

max

day

smoker

Fri

No

4

0.151650

0.187735

Yes

15

0.174783

0.263480

Sat

No

45

0.158048

0.291990

Yes

42

0.147906

0.325733

Sun

No

57

0.160113

0.252672

Yes

19

0.187250

0.710345

Thur

No

45

0.160298

0.266312

Yes

17

0.163863

0.241255

As befor, a list of tuples with custom names can be passed:

ftuples = [('Durchschnitt', 'mean'), ('Abweichung', np.var)]

grouped['tip_pct', 'total_bill'].agg(ftuples)

tip_pct

total_bill

Durchschnitt

Abweichung

Durchschnitt

Abweichung

day

smoker

Fri

No

0.151650

0.000791

18.420000

25.596333

Yes

0.174783

0.002631

16.813333

82.562438

Sat

No

0.158048

0.001581

19.661778

79.908965

Yes

0.147906

0.003767

21.276667

101.387535

Sun

No

0.160113

0.001793

20.506667

66.099980

Yes

0.187250

0.023757

24.120000

109.046044

Thur

No

0.160298

0.001503

17.113111

59.625081

Yes

0.163863

0.001551

19.190588

69.808518

Now suppose you wanted to apply potentially different functions to one or more of the columns. To do this, pass a dict to agg that contains a mapping of column names to any of the function specifications listed so far:

grouped.agg({'tip':np.max, 'size': 'sum'})

tip

size

day

smoker

Fri

No

3.50

9

Yes

4.73

31

Sat

No

9.00

115

Yes

10.00

104

Sun

No

6.00

167

Yes

6.50

49

Thur

No

6.70

112

Yes

5.00

40

grouped.agg({'tip_pct':['min', 'max', 'mean', 'std', 'sum'], 
'size':'sum'})

tip_pct

size

min

max

mean

std

sum

sum

day

smoker

Fri

No

0.120385

0.187735

0.151650

0.028123

0.606602

9

Yes

0.103555

0.263480

0.174783

0.051293

2.621746

31

Sat

No

0.056797

0.291990

0.158048

0.039767

7.112145

115

Yes

0.035638

0.325733

0.147906

0.061375

6.212055

104

Sun

No

0.059447

0.252672

0.160113

0.042347

9.126438

167

Yes

0.065660

0.710345

0.187250

0.154134

3.557756

49

Thur

No

0.072961

0.266312

0.160298

0.038774

7.213414

112

Yes

0.090014

0.241255

0.163863

0.039389

2.785676

40

A DataFrame will have hierarchical columns only if multiple functions are applied to at least one column.

结果去掉行索引as_index=False

In all of the examples up until now, the aggregated data comes back with an index, potentially hierarchical, composed from the unique group key combinations. Since this isn't always describe, you can diable this behavior in most cases by passing as_index=False to groupby:

tips.groupby(['day', 'smoker'], as_index=False).mean()

day

smoker

total_bill

tip

size

tip_pct

0

Fri

No

18.420000

2.812500

2.250000

0.151650

1

Fri

Yes

16.813333

2.714000

2.066667

0.174783

2

Sat

No

19.661778

3.102889

2.555556

0.158048

3

Sat

Yes

21.276667

2.875476

2.476190

0.147906

4

Sun

No

20.506667

3.167895

2.929825

0.160113

5

Sun

Yes

24.120000

3.516842

2.578947

0.187250

6

Thur

No

17.113111

2.673778

2.488889

0.160298

7

Thur

Yes

19.190588

3.030000

2.352941

0.163863

Of course, it's always possible to obtain the result in this format by calling reset_index on the result. Using the as_index=False method avoids some unnecessary computations.

耐心和恒心, 总会获得回报的.



上一篇: 租用美国服务器:潜在的风险与应对策略。 下一篇: PC/SC Winscard SCardTransmit error 0x00000016 returned