How to take a representative sample of a DataFrame from a CSV using pandas
I have a DataFrame like the one shown below:
print (df)
column 1 column 2 column 3
0 mobile a Blanks
1 mobile b Blanks
2 mobile c cricket
3 laptop d cricket
4 phone e football
5 phone NaN football
6 phone g football
7 phone h football
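For each value in column c1, I want only one row, and it should contain no blanks.
After applying the sampling method, df should look like this: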
c1 c2 c3
mobile a Blanks
mobile c cricket
laptop d cricket
phone g football
Please tell me which sampling method is suitable for this.
First, remove all rows containing NaN with dropna.
If you only need one random row per group of column 1 and column 3, you can use groupby with a custom function that uses iloc to select a random position chosen by numpy.random.choice:
import numpy as np

# Drop rows with missing values, then pick one random row per group.
df = df.dropna()
df = df.groupby(['column 1','column 3'], as_index=False) \
       .apply(lambda x: x.iloc[np.random.choice(np.arange(len(x)), 1)]) \
       .reset_index(drop=True)
print (df)
column 1 column 2 column 3
0 laptop d cricket
1 mobile b Blanks
2 mobile c cricket
3 phone h football
Or use sample:
df = df.groupby(['column 1','column 3'], as_index=False) \
.apply(lambda x: x.sample(n=1)) \
.reset_index(drop=True)
print (df)
column 1 column 2 column 3
0 laptop d cricket
1 mobile b Blanks
2 mobile c cricket
3 phone g football
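For reference, the two steps above (dropna, then one random row per group) can be sketched as a single self-contained script. This is a minimal sketch rather than the answer's exact code: it rebuilds the question's data by hand, uses GroupBy.sample (available in pandas 1.1 and later) instead of apply, and fixes random_state, which is an assumption added here only so that repeated runs return the same rows.

```python
import numpy as np
import pandas as pd

# Rebuild the question's example data; np.nan marks the missing value.
df = pd.DataFrame({
    "column 1": ["mobile", "mobile", "mobile", "laptop",
                 "phone", "phone", "phone", "phone"],
    "column 2": ["a", "b", "c", "d", "e", np.nan, "g", "h"],
    "column 3": ["Blanks", "Blanks", "cricket", "cricket",
                 "football", "football", "football", "football"],
})

# Drop incomplete rows, then draw one random row per (column 1, column 3) group.
# random_state is an assumption for reproducibility; omit it for a fresh draw.
sampled = (
    df.dropna()
      .groupby(["column 1", "column 3"])
      .sample(n=1, random_state=0)
      .reset_index(drop=True)
)
print(sampled)
```

Which row is picked inside each group depends on the seed, so the rows may differ from the outputs shown above, but there is always exactly one complete row per group.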
Code:
import pandas as pd
df = pd.read_table('44652428.tsv')
print(df.groupby('column 1').first().reset_index())
Output:
column 1 column 2 column 3
0 laptop d cricket
1 mobile a Blanks
2 phone e football
Input file 44652428.tsv:
column 1 column 2 column 3
mobile a Blanks
mobile b Blanks
mobile c cricket
laptop d cricket
phone e football
phone NaN football
phone g football
phone h football
See the documentation for read_table, groupby and reset_index.
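Note that grouping on column 1 alone yields three rows, while the desired output in the question has four (mobile appears once per column 3 value). Also, GroupBy.first takes the first non-null entry of each column separately, so in general its result need not correspond to a single original row. If a deterministic pick is acceptable in place of a random one, a variant that keeps whole rows might look like the sketch below. It is not part of the original answer, and it builds the data inline instead of reading 44652428.tsv.

```python
import numpy as np
import pandas as pd

# Same data as the input file above, built inline for a self-contained example.
df = pd.DataFrame({
    "column 1": ["mobile", "mobile", "mobile", "laptop",
                 "phone", "phone", "phone", "phone"],
    "column 2": ["a", "b", "c", "d", "e", np.nan, "g", "h"],
    "column 3": ["Blanks", "Blanks", "cricket", "cricket",
                 "football", "football", "football", "football"],
})

# Keep the first complete row for each (column 1, column 3) pair.
first_rows = (
    df.dropna()
      .drop_duplicates(subset=["column 1", "column 3"], keep="first")
      .reset_index(drop=True)
)
print(first_rows)
```

This keeps rows a, c, d and e from column 2, in their original order.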