pandas DataFrame: select all rows whose group has a row count greater than x
How do I select all rows for groups with a row count >= 2?
I have the following pandas DataFrame.
df = pd.DataFrame({
    "date": ["2000-01-03", "2000-01-04", "2000-01-04", "2000-01-04", "2000-01-04",
             "2000-01-03", "2000-01-04", "2000-01-05", "2000-01-05",
             "2000-01-03", "2000-01-05", "2000-01-05",
             "2000-01-04", "2000-01-05"],
    "sym": ["A", "A", "A", "A", "A", "B", "B", "B", "B", "C", "C", "C", "D", "E"],
    "val1": [1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 2, 2],
    "val2": [2, 2, 2, 2, 2, 2, 3, 3, 3, 1, 1, 2, 2, 2],
})
df
date sym val1 val2
0 2000-01-03 A 1 2
1 2000-01-04 A 1 2
2 2000-01-04 A 1 2
3 2000-01-04 A 1 2
4 2000-01-04 A 1 2
5 2000-01-03 B 2 2
6 2000-01-04 B 2 3
7 2000-01-05 B 2 3
8 2000-01-05 B 2 3
9 2000-01-03 C 3 1
10 2000-01-05 C 3 1
11 2000-01-05 C 3 2
12 2000-01-04 D 2 2
13 2000-01-05 E 2 2
I applied
df.groupby(['date', 'sym'], as_index=False).mean().sort_values(['sym','date'])
to compute the mean of val1 and val2 for each symbol on each date.
date sym val1 val2
0 2000-01-03 A 1.0 2.0
3 2000-01-04 A 1.0 2.0
1 2000-01-03 B 2.0 2.0
4 2000-01-04 B 2.0 3.0
6 2000-01-05 B 2.0 3.0
2 2000-01-03 C 3.0 1.0
7 2000-01-05 C 3.0 1.5
5 2000-01-04 D 2.0 2.0
8 2000-01-05 E 2.0 2.0
Next, I need to select all rows for every "sym" that has a row count >= 2.
In this example, the resulting df would contain all rows for sym = A, B, C.
Expected output:
date sym val1 val2
0 2000-01-03 A 1.0 2.0
3 2000-01-04 A 1.0 2.0
1 2000-01-03 B 2.0 2.0
4 2000-01-04 B 2.0 3.0
6 2000-01-05 B 2.0 3.0
2 2000-01-03 C 3.0 1.0
7 2000-01-05 C 3.0 1.5
I tried combinations of groupby, pivot, and count, but without success.
See:
import pandas as pd

df = pd.DataFrame({
    "date": ["2000-01-03", "2000-01-04", "2000-01-04", "2000-01-04",
             "2000-01-04", "2000-01-03", "2000-01-04", "2000-01-05",
             "2000-01-05", "2000-01-03", "2000-01-05", "2000-01-05",
             "2000-01-04", "2000-01-05"],
    "sym": ["A", "A", "A", "A", "A", "B", "B", "B", "B", "C", "C", "C",
            "D", "E"],
    "val1": [1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 2, 2],
    "val2": [2, 2, 2, 2, 2, 2, 3, 3, 3, 1, 1, 2, 2, 2],
})

# Mean of val1 and val2 per (date, sym), ordered by symbol and then date.
df = (
    df.groupby(['date', 'sym'], as_index=False)
      .mean()
      .sort_values(['sym', 'date'])
)

# value_counts() gives the number of rows per sym; map() looks that count up
# for every row, so the comparison is a row-aligned boolean mask.
df = df[df['sym'].map(df['sym'].value_counts()) >= 2]
print(df)
Output:
date sym val1 val2
0 2000-01-03 A 1.0 2.0
3 2000-01-04 A 1.0 2.0
1 2000-01-03 B 2.0 2.0
4 2000-01-04 B 2.0 3.0
6 2000-01-05 B 2.0 3.0
2 2000-01-03 C 3.0 1.0
7 2000-01-05 C 3.0 1.5
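The same selection can also be written with groupby itself, which may read more directly as "keep groups of size >= 2". The following is a minimal sketch, not part of the original answer; the names means, by_transform and by_filter are chosen here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2000-01-03", "2000-01-04", "2000-01-04", "2000-01-04",
             "2000-01-04", "2000-01-03", "2000-01-04", "2000-01-05",
             "2000-01-05", "2000-01-03", "2000-01-05", "2000-01-05",
             "2000-01-04", "2000-01-05"],
    "sym": ["A", "A", "A", "A", "A", "B", "B", "B", "B", "C", "C", "C",
            "D", "E"],
    "val1": [1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 2, 2],
    "val2": [2, 2, 2, 2, 2, 2, 3, 3, 3, 1, 1, 2, 2, 2],
})

# Same aggregation as above: one row per (date, sym) pair.
means = (
    df.groupby(["date", "sym"], as_index=False)
      .mean()
      .sort_values(["sym", "date"])
)

# Option 1: transform("size") broadcasts each group's row count back onto
# its rows, giving a boolean mask aligned with means.
by_transform = means[means.groupby("sym")["sym"].transform("size") >= 2]

# Option 2: filter() keeps or drops whole groups based on a predicate.
# It calls a Python function per group, so it tends to be slower on many groups.
by_filter = means.groupby("sym").filter(lambda g: len(g) >= 2)

print(by_transform)
```

Both options keep the rows for sym A, B and C and drop D and E, which each have a single row after aggregation. Changing the threshold 2 to another value x generalizes the selection.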