Python DataFrame: One-Hot Encode Rows Containing a Specific Substring

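I have a DataFrame that contains strings. I want to create another DataFrame, via one-hot encoding, that indicates whether each string contains a particular month.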

Take the following as an example:

import pandas as pd

data = {
    'User': ['1', '2', '3', '4'],
    'Months': ['January; February', 'March; August', 'October; January', 'August, December']}

df = pd.DataFrame(data, columns=['User', 'Months'])

I would like to produce a DataFrame of the following kind:

         | January | August |
User | 1 |    1    |    0   |
     | 2 |    0    |    1   |
     | 3 |    1    |    0   |
     | 4 |    0    |    1   |

I tried the following, but it raises a ValueError, and it would not produce the desired DataFrame anyway:

if df[df['Months'].str.contains('January')]:
    print("1")
else:
    print("0")

Thanks in advance!

df = pd.concat([df["User"], df.Months.str.split(r"[,;]")], axis=1).explode(
    "Months"
)
print(pd.crosstab(df["User"], df["Months"]))

Prints:

Months   August   December   February   January  August  January  March  October
User                                                                            
1             0          0          1         0       0        1      0        0
2             1          0          0         0       0        0      1        0
3             0          0          0         1       0        0      0        1
4             0          1          0         0       1        0      0        0
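
The repeated month names in the output above (for example, two August columns) appear because splitting on [,;] alone leaves a leading space on every month after the first, so " August" and "August" are counted as different values. A minimal sketch of one way around this, assuming the same data as in the question: split on the delimiter together with any surrounding whitespace, then keep only the months of interest. The names wanted, exploded, and result are introduced here purely for illustration.

```python
import pandas as pd

data = {
    "User": ["1", "2", "3", "4"],
    "Months": ["January; February", "March; August", "October; January", "August, December"],
}
df = pd.DataFrame(data, columns=["User", "Months"])

wanted = ["January", "August"]  # hypothetical name: the months to encode

# Split on ',' or ';' plus surrounding whitespace, so no month keeps a leading space
exploded = df.assign(Months=df["Months"].str.split(r"\s*[,;]\s*")).explode("Months")

# Keep only the months of interest, then count occurrences per user
exploded = exploded[exploded["Months"].isin(wanted)]
result = pd.crosstab(exploded["User"], exploded["Months"]).reindex(
    index=df["User"], columns=wanted, fill_value=0
)
print(result)
```

The reindex call is there so that a user with none of the wanted months would still get a row of zeros instead of being dropped from the table.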

You can use Series.str.extract first to extract the specific substrings, pass the result to get_dummies, and then join it back:

l = ['January', 'August']
# df is the original DataFrame from the question; dtype=int keeps the 0/1 output
out = df[['User']].join(
    pd.get_dummies(df['Months'].str.extract(f"({'|'.join(l)})", expand=False), dtype=int))

print(out)

  User  August  January
0    1       0        1
1    2       1        0
2    3       0        1
3    4       1        0
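
One caveat with str.extract is that it captures only the first match in each string, so a row listing both January and August would be flagged for just one of them. A hedged alternative sketch, assuming the same data as in the question, builds one column per month with str.contains. This also shows why the original attempt failed: str.contains returns a boolean Series (one value per row), which cannot be used directly as the condition of an if statement. The names months and flags are introduced here for illustration.

```python
import pandas as pd

data = {
    "User": ["1", "2", "3", "4"],
    "Months": ["January; February", "March; August", "October; January", "August, December"],
}
df = pd.DataFrame(data, columns=["User", "Months"])

months = ["January", "August"]  # hypothetical name: the substrings to look for

# One 0/1 column per month; regex=False treats each month as a literal substring
flags = pd.DataFrame(
    {m: df["Months"].str.contains(m, regex=False).astype(int) for m in months},
    index=df.index,
)

out = df[["User"]].join(flags)
print(out)
```

Because each month is tested independently, a row can carry a 1 in several columns at once.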