Python DataFrame: One-Hot Encode Rows Containing a Specific Substring

Question

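I have a DataFrame containing strings. I want to create another DataFrame that uses one-hot encoding to indicate whether each string contains a specific month.

Take the following as an example: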

import pandas as pd

data = {
    'User': ['1', '2', '3', '4'],
    'Months': ['January; February', 'March; August', 'October; January', 'August, December']}

df = pd.DataFrame(data, columns = ['User','Months'])

I would like to produce a DataFrame like this:

         | January | August |
User | 1 |    1    |    0   |
     | 2 |    0    |    1   |
     | 3 |    1    |    0   |
     | 4 |    0    |    1   |

I tried the following, but it raises a ValueError, and it would not produce the desired DataFrame anyway:

if df[df['Months'].str.contains('January')]:
    print("1")
else:
    print("0")

Thanks in advance!

Answer 1

df = pd.concat([df["User"], df.Months.str.split(r"[,;]")], axis=1).explode(
    "Months"
)
print(pd.crosstab(df["User"], df["Months"]))

Prints:

Months   August   December   February   January  August  January  March  October
User                                                                            
1             0          0          1         0       0        1      0        0
2             1          0          0         0       0        0      1        0
3             0          0          0         1       0        0      0        1
4             0          1          0         0       1        0      0        0
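
The output above lists some months twice (" August" next to "August") because the split leaves the whitespace that follows each separator. A minimal sketch of one way to avoid that, assuming the sample data from the question; the names `exploded` and `table` are introduced here for illustration only. It splits on the separator together with any surrounding whitespace, then keeps just the months the question asks about:

```python
import pandas as pd

data = {
    "User": ["1", "2", "3", "4"],
    "Months": ["January; February", "March; August", "October; January", "August, December"],
}
df = pd.DataFrame(data, columns=["User", "Months"])

# Split on "," or ";" and swallow the whitespace around the separator,
# then give every month its own row (with a fresh, unique index).
exploded = (
    df.assign(Months=df["Months"].str.strip().str.split(r"\s*[,;]\s*"))
    .explode("Months")
    .reset_index(drop=True)
)

# Count User/month pairs; each month now appears under a single column.
table = pd.crosstab(exploded["User"], exploded["Months"])

# Keep only the months of interest.
print(table[["January", "August"]])
```

Unlike the original snippet, this does not overwrite `df`, so the original frame stays available for other approaches.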

Answer 2

You can use series.str.extract first to extract the specific substrings, pass the result to get_dummies, and then join it back:

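l = ['January', 'August']
# Extract the first listed month found in each row, one-hot encode it, and join it to User.
# dtype=int keeps the 0/1 output on pandas versions where get_dummies defaults to bool.
out = df[['User']].join(
    pd.get_dummies(df['Months'].str.extract(f"({'|'.join(l)})", expand=False), dtype=int))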

print(out)

  User  August  January
0    1       0        1
1    2       1        0
2    3       0        1
3    4       1        0
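
One caveat: str.extract captures only the first match in each row, so a row listing both January and August would be flagged for just one of them. If that matters, something along these lines may work instead. It is only a sketch, assuming the sample data from the question; `months`, `encoded`, and `out` are illustrative names. It tests each month separately with str.contains, which returns a boolean Series per month. (A whole DataFrame or Series has no single truth value, which is why putting one directly in an `if` raises the ValueError mentioned in the question.)

```python
import pandas as pd

data = {
    "User": ["1", "2", "3", "4"],
    "Months": ["January; February", "March; August", "October; January", "August, December"],
}
df = pd.DataFrame(data, columns=["User", "Months"])

months = ["January", "August"]

# One boolean column per month, converted to 0/1. Rows that mention
# several of the listed months get a 1 in each matching column.
encoded = pd.DataFrame(
    {m: df["Months"].str.contains(m, regex=False).astype(int) for m in months},
    index=df.index,
)

# Attach the indicator columns to the users and index by User.
out = df[["User"]].join(encoded).set_index("User")
print(out)
```

Because each month is checked independently, the separators and spacing in the Months column do not need to be cleaned first.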
