Python DataFrame: One-Hot Encode Rows Containing a Specific Substring
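I have a DataFrame containing strings. I would like to create another DataFrame, via one-hot encoding, that indicates whether each string contains specific months.
Take the following as an example: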
import pandas as pd

data = {
    'User': ['1', '2', '3', '4'],
    'Months': ['January; February', 'March; August', 'October; January', 'August, December']}
df = pd.DataFrame(data, columns=['User', 'Months'])
I would like to produce a DataFrame of the following form:
     |   | January | August |
User | 1 |    1    |   0    |
     | 2 |    0    |   1    |
     | 3 |    1    |   0    |
     | 4 |    0    |   1    |
I tried the following, but it raises a ValueError, and it would not produce the desired DataFrame anyway:
if df[df['Months'].str.contains('January')]:
    print("1")
else:
    print("0")
Thanks in advance!
df = pd.concat([df["User"], df.Months.str.split(r"[,;]")], axis=1).explode(
"Months"
)
print(pd.crosstab(df["User"], df["Months"]))
Prints:
Months   August   December   February   January  August  January  March  October
User
1             0          0          1         0       0        1      0        0
2             1          0          0         0       0        0      1        0
3             0          0          0         1       0        0      0        1
4             0          1          0         0       1        0      0        0
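The output above has separate columns for " August" and "August" because splitting on [,;] leaves the space after each delimiter in place. A minimal sketch of one way around that, assuming the same sample data: split on the delimiter together with any surrounding whitespace, then keep only the months of interest. The names long_df, wanted and table are illustrative and not part of the original answer.

```python
import pandas as pd

data = {
    "User": ["1", "2", "3", "4"],
    "Months": ["January; February", "March; August", "October; January", "August, December"],
}
df = pd.DataFrame(data, columns=["User", "Months"])

# Split on ',' or ';' plus surrounding whitespace, so no month keeps a leading space.
# (A multi-character pattern is treated as a regular expression by str.split.)
long_df = df.assign(Months=df["Months"].str.split(r"\s*[,;]\s*")).explode("Months")

# Illustrative list of the months to keep as columns.
wanted = ["January", "August"]

# crosstab counts occurrences per user; reindex keeps only the wanted columns
# and fills with 0 any wanted month that never appears.
table = pd.crosstab(long_df["User"], long_df["Months"]).reindex(columns=wanted, fill_value=0)
print(table)
```

Because crosstab counts occurrences, a month listed twice for the same user would show up as 2; clipping the result to 1 would be needed if strict 0/1 flags are required.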
You can use Series.str.extract first to extract the specific substrings, pass the result to get_dummies, and then join it back:
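l = ['January', 'August']
out = df[['User']].join(
    # dtype=int keeps the 0/1 output on pandas versions where get_dummies returns booleans
    pd.get_dummies(df['Months'].str.extract(f"({'|'.join(l)})", expand=False), dtype=int))
print(out)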
  User  August  January
0    1       0        1
1    2       1        0
2    3       0        1
3    4       1        0
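One caveat: str.extract captures only the first match in each string, so a row containing both January and August would be flagged for just one of them. A hedged alternative, assuming the same sample data, is to test each month separately with str.contains and cast the boolean result to integers. This also avoids the ValueError from the question, which comes from using a whole DataFrame as an if condition. The names months and flags are illustrative.

```python
import pandas as pd

data = {
    "User": ["1", "2", "3", "4"],
    "Months": ["January; February", "March; August", "October; January", "August, December"],
}
df = pd.DataFrame(data, columns=["User", "Months"])

# Illustrative list of the months to flag.
months = ["January", "August"]

# One 0/1 column per month: str.contains returns a boolean Series for the
# whole column, and astype(int) turns True/False into 1/0.
flags = pd.DataFrame(
    {m: df["Months"].str.contains(m, regex=False).astype(int) for m in months},
    index=df.index,
)

# Attach the flags to the user ids and use User as the row label.
out = df[["User"]].join(flags).set_index("User")
print(out)
```

Each month is checked independently, so a string that mentions several of the listed months gets a 1 in every matching column.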