Create new indicator columns based on values in another column
I have some data that looks like this:
import pandas as pd
fruits = ['apple', 'pear', 'peach']
df = pd.DataFrame({'col1':['i want an apple', 'i hate pears', 'please buy a peach and an apple', 'I want squash']})
print(df.head())
                              col1
0                  i want an apple
1                     i hate pears
2  please buy a peach and an apple
3                    I want squash
I need a solution that creates a column for each item in fruits, holding a 1 or 0 to indicate whether col1 contains that item. Ideally, the output would look like this:
goal_df = pd.DataFrame({'col1':['i want an apple', 'i hate pears', 'please buy a peach and an apple', 'I want squash'],
'apple': [1, 0, 1, 0],
'pear': [0, 1, 0, 0],
'peach': [0, 0, 1, 0]})
print(goal_df.head())
                              col1  apple  pear  peach
0                  i want an apple      1     0      0
1                     i hate pears      0     1      0
2  please buy a peach and an apple      1     0      1
3                    I want squash      0     0      0
I tried this, but it didn't work:
for i in fruits:
    if df['col1'].str.contains(i):
        df[i] = 1
    else:
        df[i] = 0
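A note on why the loop above fails: `df['col1'].str.contains(i)` returns one boolean per row, while `if` needs a single True/False, so pandas raises a ValueError. The following is a minimal sketch that reproduces the error on the question's data; the names `mask` and `error_message` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'col1': ['i want an apple', 'i hate pears',
                            'please buy a peach and an apple', 'I want squash']})

# str.contains is vectorized: it yields one boolean per row, not a single bool
mask = df['col1'].str.contains('apple')

error_message = None
try:
    if mask:  # ambiguous: pandas cannot reduce four booleans to one
        pass
except ValueError as exc:
    error_message = str(exc)

print(mask.tolist())
print(error_message)
```

Because the mask already has one value per row, it can be converted to integers and assigned as a column directly, which is what the answers do.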
You can use the following for the apple column and do the same for the others:
def has_apple(st):
    if "apple" in st.lower():
        return 1
    return 0

df['apple'] = df['col1'].apply(has_apple)
items = ['apple', 'pear', 'peach']
for it in items:
    df[it] = df['col1'].str.contains(it, case=False).astype(int)
Output:
>>> df
                              col1  apple  pear  peach
0                  i want an apple      1     0      0
1                     i hate pears      0     1      0
2  please buy a peach and an apple      1     0      1
3                    I want squash      0     0      0
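If the column may contain missing values, or the search terms may contain regex metacharacters, the same loop can be hardened a little. This is only a sketch under those assumptions: `add_indicator_columns` is a hypothetical helper name, and the extra `None` row is added purely to show the `na=False` behaviour.

```python
import re
import pandas as pd

def add_indicator_columns(frame, text_col, terms):
    """Return a copy of frame with one 0/1 column per term (hypothetical helper)."""
    out = frame.copy()
    for term in terms:
        out[term] = (
            out[text_col]
            # re.escape treats the term literally; na=False maps missing text to False
            .str.contains(re.escape(term), case=False, na=False)
            .astype(int)
        )
    return out

df = pd.DataFrame({'col1': ['i want an apple', 'i hate pears',
                            'please buy a peach and an apple', 'I want squash', None]})
result = add_indicator_columns(df, 'col1', ['apple', 'pear', 'peach'])
print(result)
```

Without `na=False`, a missing value would come back as a missing result and the `astype(int)` step would fail.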
Use str.extractall to extract the words, then pd.crosstab:
pattern = f"({'|'.join(fruits)})"
s = df['col1'].str.extractall(pattern)
df[fruits] = (pd.crosstab(s.index.get_level_values(0), s[0].values)
                .reindex(index=df.index, columns=fruits, fill_value=0)
             )
Output:
                              col1  apple  pear  peach
0                  i want an apple      1     0      0
1                     i hate pears      0     1      0
2  please buy a peach and an apple      1     0      1
3                    I want squash      0     0      0
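One thing to keep in mind with this approach: pd.crosstab counts occurrences, so a row that mentions the same fruit twice gets a 2 rather than a 1. The sketch below uses a made-up sample row to show this, and clips the counts to get strict 0/1 indicators; the `clip` step is an addition, not part of the answer above.

```python
import pandas as pd

fruits = ['apple', 'pear', 'peach']
# Hypothetical data: the first row mentions "apple" twice
df = pd.DataFrame({'col1': ['apple pie with apple sauce', 'i hate pears', 'I want squash']})

pattern = f"({'|'.join(fruits)})"
s = df['col1'].str.extractall(pattern)  # one row per match, indexed by (row, match number)

# Count matches per original row and fruit; rows and fruits with no match are filled with 0
counts = (pd.crosstab(s.index.get_level_values(0), s[0].values)
            .reindex(index=df.index, columns=fruits, fill_value=0))

# Cap every count at 1 to turn counts into indicators
indicators = counts.clip(upper=1)

print(counts)
print(indicators)
```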
Try np.where from the numpy library:
import numpy as np

fruit = ['apple', 'pear', 'peach']
for i in fruit:
    df[i] = np.where(df.col1.str.contains(i), 1, 0)
Try:
- Use str.extractall to get all matching fruits
- Use pd.get_dummies to get the indicator values
- join them to the original DataFrame
matches = pd.get_dummies(df["col1"].str.extractall(f"({'|'.join(fruits)})")[0].droplevel(1, 0))
output = df.join(matches.groupby(level=0).sum()).fillna(0)
>>> output
                              col1  apple  peach  pear
0                  i want an apple    1.0    0.0   0.0
1                     i hate pears    0.0    0.0   1.0
2  please buy a peach and an apple    1.0    1.0   0.0
3                    I want squash    0.0    0.0   0.0
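The result above comes out as floats and in alphabetical column order, because the join introduces missing values for rows without a match. If integer 0/1 columns in the order of fruits are wanted, one possible variation is sketched below. It swaps `sum` for `max` so that repeated matches stay at 1, and reindexes before joining; both changes are assumptions about the desired result, not part of the answer above.

```python
import pandas as pd

fruits = ['apple', 'pear', 'peach']
df = pd.DataFrame({'col1': ['i want an apple', 'i hate pears',
                            'please buy a peach and an apple', 'I want squash']})

# One entry per match, indexed by the original row label
found = df['col1'].str.extractall(f"({'|'.join(fruits)})")[0].droplevel(1)

# Integer dummies, collapsed to one row per original row (max keeps values at 0/1)
matches = pd.get_dummies(found, dtype=int).groupby(level=0).max()

# Add rows and fruits that never matched, in the order given by fruits
indicators = matches.reindex(index=df.index, columns=fruits, fill_value=0)

output = df.join(indicators)
print(output)
```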
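I came up with another, completely different one-liner. str.get_dummies returns its columns sorted alphabetically (apple, peach, pear), so the result is reindexed to the order of items before assignment; otherwise the pear and peach values end up under the wrong labels:
df[items] = df['col1'].str.findall('|'.join(items)).str.join('|').str.get_dummies('|').reindex(columns=items, fill_value=0)
Output:
>>> df
                              col1  apple  pear  peach
0                  i want an apple      1     0      0
1                     i hate pears      0     1      0
2  please buy a peach and an apple      1     0      1
3                    I want squash      0     0      0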