Filtering a Pandas DataFrame based on the number of non-null values
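I'm analyzing a consumer survey and using Pandas for the data cleaning. One question asks participants how often they see advertisements ('Daily': 1, 'Multiple times a week': 2, 'Once a week': 3, 'Once a year': 4, 'Never': 5).
If participants answer that they see advertisements at least weekly (1, 2, 3), they get a new set of questions about whether, and how often, they are confronted with certain product categories. The survey system does not ask about all categories; it picks them at random, so each respondent is asked about 4 categories. The DataFrame with the answers looks like this: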
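Respondent ID | Question Frequency Ads | Question Product 1 | Question Product 2 | Question Product 3 | ... | Question Product 19 | Question Product 20 |
---|---|---|---|---|---|---|---|
1 | 5 | | | | | | |
2 | 4 | | | | | | |
3 | 2 | 1 | 2 | 3 | | 5 | |
4 | 1 | 1 | | 3 | | 5 | 2 |
5 | 1 | | 5 | 3 | | 5 | 2 |
5 | 1 | 4 | 5 | | | 5 | 2 |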
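So I want to filter for the respondents who filled in the questionnaire as they should have. Respondents who answered that they see advertisements at least weekly should have answered at least 4 product questions. I have tried the following code: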
data = data[((data['Question Frequency Ads'].isin([1,2,3])) & (data['Question Product 1'].isnull() + data['Question Product 2'].isnull() + data['Question Product 3'].isnull() + ... + data['Question Product 20'].isnull())) == (20-4)]
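I realize this code doesn't work: adding these boolean expressions only gives a single True/False value per row, not an integer count of how many of them are True. Can anyone help me find the right expression for this?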
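As a small illustration of the behaviour described above, here is a sketch on a made-up two-column frame (the names `df`, `a`, and `b` are hypothetical and not part of the survey data). It assumes the columns hold ordinary NumPy-backed values, so `isnull()` returns plain boolean Series.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not the survey data.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, np.nan, 2.0]})

# Adding two boolean Series behaves like a logical OR:
# the result is still boolean, so no count can be read from it.
added = df["a"].isnull() + df["b"].isnull()

# Summing the boolean frame across columns yields an integer count per row.
null_count = df.isnull().sum(axis=1)

print(added.tolist())
print(null_count.tolist())
```

Summing along `axis=1` treats each True as 1, which is what the comparison against a number of missing answers needs.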
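IIUC, try keeping all rows where:
- Question Frequency Ads is in [1, 2, 3], and
- the number of answered product questions is exactly 4 (use .ge(4) instead of .eq(4) if more than 4 answers should also count).
>>> data[data["Question Frequency Ads"].isin([1,2,3]) & data.filter(like="Question Product").count(axis=1).eq(4)]
Or:
>>> data[data["Question Frequency Ads"].isin([1,2,3]) & data.drop(["Respondent ID", "Question Frequency Ads"], axis=1).count(axis=1).eq(4)]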
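To make the row-wise count concrete, here is a minimal sketch on invented data. The frame below is an assumption for illustration only: it has five product columns instead of twenty, and the values are made up. It shows what `count(axis=1)` returns and how `.eq(4)` and `.ge(4)` differ.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: five product columns instead of twenty, invented values.
data = pd.DataFrame({
    "Respondent ID": [1, 2, 3, 4],
    "Question Frequency Ads": [5, 2, 1, 3],
    "Question Product 1": [np.nan, 1, 1, 2],
    "Question Product 2": [np.nan, 2, np.nan, 4],
    "Question Product 3": [np.nan, 3, np.nan, 1],
    "Question Product 4": [np.nan, 5, 3, 5],
    "Question Product 5": [np.nan, np.nan, np.nan, 2],
})

# Number of non-null answers per respondent across the product columns.
answered = data.filter(like="Question Product").count(axis=1)

# Respondents who see ads at least weekly.
weekly = data["Question Frequency Ads"].isin([1, 2, 3])

exactly_four = data[weekly & answered.eq(4)]   # exactly 4 answers
at_least_four = data[weekly & answered.ge(4)]  # 4 or more answers

print(answered.tolist())
print(exactly_four["Respondent ID"].tolist())
print(at_least_four["Respondent ID"].tolist())
```

In this made-up frame respondent 4 answered five questions, so that row passes only the `.ge(4)` variant.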
You can use:
m1 = df['Question Frequency Ads'].between(1,3)
m2 = df.filter(like='Question Product').notnull().sum(1).ge(4)
df.loc[m1&m2, 'valid'] = 'valid'
Output:
   Respondent ID  Question Frequency Ads  Question Product 1  \
0              1                       5                 NaN
1              2                       4                 NaN
2              3                       2                 1.0
3              4                       1                 1.0
4              5                       1                 NaN
5              5                       1                 4.0

   Question Product 2  Question Product 3  ...  Question Product 19  \
0                 NaN                 NaN  NaN                  NaN
1                 NaN                 NaN  NaN                  NaN
2                 2.0                 3.0  NaN                  5.0
3                 NaN                 3.0  NaN                  5.0
4                 5.0                 3.0  NaN                  5.0
5                 5.0                 NaN  NaN                  5.0

   Question Product 20  valid
0                  NaN    NaN
1                  NaN    NaN
2                  NaN  valid
3                  2.0  valid
4                  2.0  valid
5                  2.0  valid
Update
IIUC:
m1 = df['Question Frequency Ads'].le(3)
m2 = df.iloc[:, 2:].notna().sum(axis=1).ge(4)
out = df[~m1 | (m1 & m2)]
~m1: if Question Frequency Ads > 3, we keep the respondent, because they did not have to answer the product questions.
m1 & m2: if Question Frequency Ads <= 3, we keep the respondent only if they answered at least 4 questions.
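The two masks can be checked on a small invented frame. This is only a sketch: the column values are made up and there are five product columns rather than twenty. It also shows that, for boolean masks, `~m1 | (m1 & m2)` selects the same rows as the shorter `~m1 | m2`.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: invented values, five product columns.
df = pd.DataFrame({
    "Respondent ID": [1, 2, 3, 4],
    "Question Frequency Ads": [5, 2, 1, 3],
    "Question Product 1": [np.nan, 1, 1, 2],
    "Question Product 2": [np.nan, 2, np.nan, 4],
    "Question Product 3": [np.nan, 3, np.nan, 1],
    "Question Product 4": [np.nan, 5, 3, 5],
    "Question Product 5": [np.nan, np.nan, np.nan, 2],
})

m1 = df["Question Frequency Ads"].le(3)         # had to answer product questions
m2 = df.iloc[:, 2:].notna().sum(axis=1).ge(4)   # answered at least 4 of them

keep = ~m1 | (m1 & m2)
out = df[keep]

# By boolean absorption, the shorter mask selects the same rows.
same_as_short_form = bool((keep == (~m1 | m2)).all())

print(out["Respondent ID"].tolist())
print(same_as_short_form)
```

Respondent 1 is kept because no product questions were required, while respondent 3 is dropped for answering only two of them.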
Old answer
You can use:
out = df[df['Question Frequency Ads'].le(3) & df.iloc[:, 2:].notna().sum(axis=1).ge(4)]
Output:
Respondent ID | Question Frequency Ads | Question Product 1 | Question Product 2 | Question Product 3 | Question Product 19 | Question Product 20 |
---|---|---|---|---|---|---|
3 | 2 | 1 | 2 | 3 | 5 | |
4 | 1 | 1 | | 3 | 5 | 2 |
5 | 1 | | 5 | 3 | 5 | 2 |
5 | 1 | 4 | 5 | | 5 | 2 |
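Here is a different approach:
def is_int(x):
    # True if x converts to int; NaN raises ValueError, None raises TypeError
    try:
        int(x)
    except (TypeError, ValueError):
        return False
    return True

def is_valid(line):
    # Count the answered product questions (every column after the first two)
    count = 0
    for el in line.iloc[2:]:
        if is_int(el):
            count += 1
    return count == 4

# Build both masks on the full frame so their indexes align with df
df[df['Question Frequency Ads'].between(1, 3) & df.apply(is_valid, axis=1)]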