Numerical Statistics and Aggregation

# Numeric statistical operations

These statistical operations are meant for objects whose elements are numeric.

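Method	Description
sum	Sum
mean	Arithmetic mean
median	Median (50% quantile)
min, max	Minimum and maximum
idxmin, idxmax	Index label of the minimum and maximum
argmin, argmax	Integer position of the minimum and maximum
mad	Mean absolute deviation around the mean (removed in pandas 2.0)
var	Variance
std	Standard deviation
count	Number of non-NaN values
value_counts	Counts or relative frequencies of unique values
describe	Summary statistics for a Series or for each column of a DataFrame
skew	Sample skewness
kurt	Sample kurtosis
cumsum	Cumulative sum
cummin, cummax	Cumulative minimum and maximum
cumprod	Cumulative product
diff	First-order difference
pct_change	Percentage change between consecutive elements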
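
As a quick illustration of the cumulative and difference methods listed in the table, the sketch below applies them to a small Series. The name `s_demo` and its values are made up for this example and do not appear elsewhere on this page.

```python
import pandas as pd

# Hypothetical sample data, chosen so the results are easy to check by hand
s_demo = pd.Series([1, 3, 6, 10])

running_total = s_demo.cumsum()      # 1, 4, 10, 20
running_product = s_demo.cumprod()   # 1, 3, 18, 180
first_diff = s_demo.diff()           # NaN, 2, 3, 4 (first element has no predecessor)
pct = s_demo.pct_change()            # NaN, 2.0, 1.0, 0.666...

print(running_total.tolist())
print(first_diff.tolist())
print(pct.round(3).tolist())
```

Both `diff` and `pct_change` leave a NaN in the first position because there is no earlier element to compare against.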

# Univariate statistics

# `.sum()`

DataFrame.sum(axis='index')

axis: 'index' sums down each column (one result per column); 'columns' sums across each row (one result per row).

import numpy as np
import pandas as pd

df = pd.DataFrame([[1,2],[3,5]], index=['a','b'], columns = ['A','B'])
df

	A	B
a	1	2
b	3	5

df.sum()  # sum down each column

A    4
B    7
dtype: int64

df.sum(axis = 'columns')  # sum across each row

a    3
b    8
dtype: int64

# `.mean(), .std(), .var()`

Mean, standard deviation, and variance.
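
The section gives no example for these three methods, so here is a minimal sketch using the same `df` as in the `.sum()` example. Note that pandas uses the sample estimators by default (`ddof=1`); the `ddof=0` call is included only to show the population variant.

```python
import pandas as pd

# Same data as the df used in the .sum() example
df = pd.DataFrame([[1, 2], [3, 5]], index=['a', 'b'], columns=['A', 'B'])

col_mean = df.mean()       # A: 2.0, B: 3.5
col_var = df.var()         # sample variance (ddof=1): A: 2.0, B: 4.5
col_std = df.std()         # square root of the sample variance
pop_var = df.var(ddof=0)   # population variance: A: 1.0, B: 2.25

print(col_mean)
print(col_var)
print(col_std)
```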

# `.max(), .min(), .median(), .idxmax(), .idxmin()`

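Maximum, minimum, and median. `idxmax` and `idxmin` return the index label at which the maximum and minimum occur.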
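
A short sketch of these methods on the same `df` as above; each call works column by column by default.

```python
import pandas as pd

# Same data as the df used in the .sum() example
df = pd.DataFrame([[1, 2], [3, 5]], index=['a', 'b'], columns=['A', 'B'])

col_max = df.max()         # A: 3, B: 5
col_min = df.min()         # A: 1, B: 2
col_median = df.median()   # A: 2.0, B: 3.5
where_max = df.idxmax()    # row label of each column's maximum: A: 'b', B: 'b'
where_min = df.idxmin()    # row label of each column's minimum: A: 'a', B: 'a'

print(where_max)
print(where_min)
```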

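`mad` computes the mean absolute deviation around the mean. It was deprecated in pandas 1.5 and removed in pandas 2.0, so the call below only runs on older versions.

df.mad(axis = 'index')

A    1.0
B    1.5
dtype: float64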
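
Because `DataFrame.mad` is not available in pandas 2.0 and later, the same quantity can be computed by hand. The sketch below is one way to do it, following the definition (mean of the absolute deviations from the column mean); the name `mad_manual` is only illustrative.

```python
import pandas as pd

# Same data as the df used in the .sum() example
df = pd.DataFrame([[1, 2], [3, 5]], index=['a', 'b'], columns=['A', 'B'])

# Mean absolute deviation per column: mean(|x - mean(x)|)
mad_manual = (df - df.mean()).abs().mean()

print(mad_manual)
```

For column A the values 1 and 3 each lie 1 away from the mean 2, giving 1.0; for column B the values 2 and 5 each lie 1.5 away from the mean 3.5, giving 1.5.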

# Bivariate statistics

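These methods compute a statistic for every pair of columns and return a DataFrame whose row index and column index are both the original column labels.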

# `.cov()`

DataFrame.cov(min_periods=None)

min_periods: the minimum number of observations required for each pair of columns, after dropping NaN, for the result to be valid.

df1 = pd.DataFrame([[1,2],[2,0]], columns = ['B','C'])
df1

	B	C
0	1	2
1	2	0

df1.cov()

	B	C
B	0.5	-1.0
C	-1.0	2.0
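
The example above has no missing values, so `min_periods` has no visible effect. The following sketch, with made-up data in a hypothetical frame called `data`, shows how pairs with too few shared observations become NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical data: column y has only two non-NaN observations
data = pd.DataFrame({
    'x': [1.0, 2.0, 3.0, 4.0],
    'y': [2.0, np.nan, 6.0, np.nan],
})

# Default: each pair uses whatever rows are non-NaN in both columns
cov_all = data.cov()

# Require at least 3 shared observations per pair of columns
cov_min = data.cov(min_periods=3)

print(cov_all)
print(cov_min)
```

With `min_periods=3`, the x/y and y/y entries are NaN because only two rows have a value for `y`, while the x/x entry is still computed from all four rows.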

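# `.corr()`

Computes the pairwise correlation between columns (Pearson correlation by default). With only two rows, as in `df1`, any two non-constant columns are perfectly correlated, so every entry is 1 or -1.

df1.corr()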

	B	C
B	1.0	-1.0
C	-1.0	1.0
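
To make the link between `.cov()` and `.corr()` concrete, the sketch below rebuilds the Pearson correlation matrix from the covariance matrix and the column standard deviations. The frame `data` and the name `corr_manual` are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row data so the correlation is not simply +1 or -1
data = pd.DataFrame({'B': [1, 2, 2], 'C': [2, 0, 3]})

cov = data.cov()
std = data.std()

# Pearson correlation: cov(i, j) / (std(i) * std(j))
corr_manual = cov / np.outer(std, std)

print(corr_manual)
print(data.corr())
```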

# `.corrwith()`

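`corr` measures relationships among the columns of a single DataFrame, whereas `corrwith` computes correlations against a different DataFrame or Series. Keep in mind that the two objects are aligned first, so the calculation only uses columns with the same name and rows with the same index label.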

DataFrame.corrwith(other, axis=0, drop=False)

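other: another DataFrame or Series
axis: 'index' or 'columns'
drop: whether to drop labels that are missing from either object from the result (otherwise they appear as NaN)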

df1 = pd.DataFrame([[1,2],[2,0],[2,3]],index = [0,1,2],columns = ['B','C'])
df1

	B	C
0	1	2
1	2	0
2	2	3

df

	A	B
a	1	2
b	3	5

df.corrwith(df1)  # computed only over columns with the same name and rows with the same index label

B   NaN
A   NaN
C   NaN
dtype: float64
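
Every entry in the output above is NaN: `df` is indexed by 'a' and 'b' while `df1` is indexed by 0, 1, 2, so no rows align, and columns A and C each exist in only one of the two objects. The sketch below uses two made-up frames, `left` and `right`, that share an index, to show a case where `corrwith` returns a real value and what `drop=True` changes.

```python
import numpy as np
import pandas as pd

# Hypothetical frames sharing the same row labels and one column name ('B')
left = pd.DataFrame({'A': [1, 2, 3], 'B': [2, 4, 7]}, index=['a', 'b', 'c'])
right = pd.DataFrame({'B': [1, 2, 3], 'C': [3, 1, 2]}, index=['a', 'b', 'c'])

# Only column B exists in both frames; A and C come back as NaN
result = left.corrwith(right)

# drop=True removes the labels that are not shared by both frames
result_dropped = left.corrwith(right, drop=True)

print(result)
print(result_dropped)
```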

s = pd.Series([1,2], index = [0,1], name = 'B')
s

0    1
1    2
Name: B, dtype: int64

df

	A	B
a	1	2
b	3	5

df.corrwith(s)

A   NaN
B   NaN
dtype: float64

Both values are NaN here as well, because `s` is indexed by 0 and 1 while `df` is indexed by 'a' and 'b', so there are no overlapping rows to correlate.

# Categorical statistical operations

# `value_counts()`

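The form described here applies to Series and Index, not to DataFrame. (Since pandas 1.1, DataFrame has its own `value_counts`, which counts unique rows and takes different parameters.)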

Series/Index.value_counts(normalize=False, ascending=False, bins=None)

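normalize: True or False; return raw counts (False) or relative frequencies (True).
ascending: True or False; sort order, descending by default.
bins: int; a shortcut for pd.cut that groups values into intervals instead of counting unique values. Works well for continuous numeric data.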

s = pd.Series([1,2,1,2,1,3])
s

0    1
1    2
2    1
3    2
4    1
5    3
dtype: int64

s.value_counts()

1    3
2    2
3    1
dtype: int64

s.value_counts(ascending = True)

3    1
2    2
1    3
dtype: int64

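s.value_counts( bins = 2)   # an int splits the range into equal-width intervals, open on the left and closed on the right; the left edge is extended by about 0.1% of the range so the minimum value is included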

(0.997, 2.0]    5
(2.0, 3.0]      1
dtype: int64
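
The `normalize` parameter is described above but not demonstrated. A minimal sketch on the same Series:

```python
import pandas as pd

# Same data as the Series s used above
s = pd.Series([1, 2, 1, 2, 1, 3])

# Relative frequencies instead of raw counts; the values sum to 1
freq = s.value_counts(normalize=True)

print(freq)
```

Here the value 1 accounts for 3 of 6 elements, so its frequency is 0.5.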

# `.count()`

Counts the non-NaN elements in each column or row. It is a quick way to see which features or samples have the most missing values.

DataFrame.count(axis=0)

axis: 0 counts per column, 1 counts per row.

df

	A	B
a	1	2
b	3	5

df.count(axis = 0)

A    2
B    2
dtype: int64

type(df.count(axis = 1))

pandas.core.series.Series
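
The `df` above has no missing values, so every count equals the full length. The sketch below uses a made-up frame `data` containing NaN to show how `count` can reveal missing data; the `missing_ratio` name is only illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values
data = pd.DataFrame(
    {'A': [1.0, np.nan, 3.0], 'B': [np.nan, np.nan, 6.0]},
    index=['a', 'b', 'c'],
)

per_column = data.count(axis=0)   # non-NaN values per column: A: 2, B: 1
per_row = data.count(axis=1)      # non-NaN values per row: a: 1, b: 0, c: 2

# Share of missing values in each column
missing_ratio = 1 - per_column / len(data)

print(per_column)
print(per_row)
print(missing_ratio)
```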

Last updated: 2023/11/01, 03:11:44
