How to split a pandas dataframe column with key/value pairs into multiple columns?

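Situation

I have run Google's NLP sentiment analysis, which returns a 'sentiment' column containing key/value pairs for magnitude and score, as shown below:

Sentiment analysis results

These are my results in the sentiment column for dataframe df03.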

index text02 sentiment
01 Max Muncy is great! magnitude: 0.8999999761581421\nscore: 0.8999999761581421
02 The worst Dodger is Max muncy. magnitude: 0.800000011920929\nscore: -0.800000011920929
03 Max Muncy was great, but not so much now. magnitude: 0.4000000059604645\nscore: -0.4000000059604645
04 What a fantastic guy, that Max muncy. magnitude: 0.8999999761581421\nscore: 0.8999999761581421

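Goal

I want to split the sentiment column into two columns, titled sentiment - magnitude and sentiment - score, with the values placed in the matching column.

The data is newline-delimited:

magnitude: 0.8999999761581421\nscore: 0.899999…

So I am trying the Series.str.split method, like this: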

df03['sentiment'].str.split(pat="\n", expand=True)

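I'm not very familiar with regex, but I did notice that \n stands for a line feed, so I assumed it was the right value to pass as the pat parameter.

The result is that every value comes back as NaN.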

index 0
01 NaN
02 NaN
03 NaN
04 NaN

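I have tried a few different approaches, but none of them worked:

df03['sentiment'].str.split(r"\n", expand=True)
df03['sentiment'].str.split(pat=r"\n", expand=True)

I think the problem is that the \ is creating some kind of regex escape that invalidates the n, but I have not found anything on regexr.com to confirm that.

There is also the matter of splitting out the words magnitude and score so they end up in the headers; I don't know whether expand=True takes care of that or not.

Any input on what I am doing wrong and where I should troubleshoot is greatly appreciated.

Doug

Addendum

The dataframe as originally created: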

index text02
01 Max Muncy is great!
02 The worst Dodger is Max muncy.
03 Max Muncy was great, but not so much now.
04 What a fantastic guy, that Max muncy.

df03['sentiment']

01    magnitude: 0.8999999761581421\nscore: 0.899999...
02    magnitude: 0.800000011920929\nscore: -0.800000...
03    magnitude: 0.4000000059604645\nscore: -0.40000...
04    magnitude: 0.8999999761581421\nscore: 0.899999...
Name: sentiment, dtype: object

Addendum 02

Running this

df03['sentiment'].astype(str).str.split(pat=r"\n| ", expand=True)

returned this (not sure how to format it like the tables above)

|index|0|1|2|
|---|---|---|---|
|01|magnitude:|0.8999999761581421<br>score:|0.8999999761581421<br>|
|02|magnitude:|0.800000011920929<br>score:|-0.800000011920929<br>|
|03|magnitude:|0.4000000059604645<br>score:|-0.4000000059604645<br>|
|04|magnitude:|0.8999999761581421<br>score:|0.8999999761581421<br>|

You need to specify the regex like this (with two backslashes, and as a raw string):

df['sentiment'].str.split(pat=r"\\n", expand=True)

Here df and df['sentiment'] evaluate to:

df
index text02 sentiment
1 Max Muncy is great! magnitude: 0.8999999761581421\nscore: 0.89999...
2 The worst Dodger is Max muncy. magnitude: 0.800000011920929\nscore: -0.80000...
3 Max Muncy was great, but not so much now. magnitude: 0.4000000059604645\nscore: -0.4000...
4 What a fantastic guy, that Max muncy. magnitude: 0.8999999761581421\nscore: 0.89999...
df['sentiment']
index             
1    magnitude: 0.8999999761581421\nscore: 0.89999...
2    magnitude: 0.800000011920929\nscore: -0.80000...
3    magnitude: 0.4000000059604645\nscore: -0.4000...
4    magnitude: 0.8999999761581421\nscore: 0.89999...
Name: sentiment, dtype: object

(which I assume is your df03).

With these inputs, df['sentiment'].str.split(pat=r"\\n", expand=True) gives:

index 0 1
1 magnitude: 0.8999999761581421 score: 0.8999999761581421
2 magnitude: 0.800000011920929 score: -0.800000011920929
3 magnitude: 0.4000000059604645 score: -0.4000000059604645
4 magnitude: 0.8999999761581421 score: 0.8999999761581421
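The doubled backslash can be checked outside pandas. The following is a minimal sketch using Python's re module; the shortened sample strings are made up for illustration and are not the OP's data. It assumes the cell text holds the two characters backslash and n, as in the df shown above, and contrasts that with a string holding a real line feed.

```python
import re

# Two characters, backslash + n, as in the df reproduced above (sample value).
literal = r"magnitude: 0.9\nscore: 0.9"
# One real line-feed character between the two pairs (sample value).
real = "magnitude: 0.9\nscore: 0.9"

# r"\\n" is the regex for a literal backslash followed by n.
split_literal = re.split(r"\\n", literal)   # splits into two parts
# r"\n" is the regex for a line feed, so it finds nothing in `literal`.
unsplit_literal = re.split(r"\n", literal)  # one part, unchanged
# The same pattern does split a string that holds a real line feed.
split_real = re.split(r"\n", real)          # splits into two parts
```

Which of the two patterns applies therefore depends on what the cells actually contain, which the printed dataframe alone does not reveal.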

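To rename the columns to Magnitude and Score, and to remove those label strings from the dataframe, you can modify the regex to split on either the newline or a space, then rename the columns. Then select only the ones you want to keep, giving: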

new = df['sentiment'].str.split(pat=r"\\n| ", expand=True)
new.columns = ["", "Magnitude", "", "Score"]
new[["Magnitude", "Score"]]
index Magnitude Score
1 0.8999999761581421 0.8999999761581421
2 0.800000011920929 -0.800000011920929
3 0.4000000059604645 -0.4000000059604645
4 0.8999999761581421 0.8999999761581421
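As an alternative to splitting and then renaming, the values can be pulled out directly with named capture groups. This is a sketch of a different technique (Series.str.extract), not the approach used above; the two-row sample frame and the pattern are assumptions made for illustration. The group names become the column names, and the result is converted to floats.

```python
import re

import pandas as pd

# Sample frame (assumed): cells hold a literal backslash + n between the pairs.
df = pd.DataFrame(
    {
        "sentiment": [
            r"magnitude: 0.8999999761581421\nscore: 0.8999999761581421",
            r"magnitude: 0.800000011920929\nscore: -0.800000011920929",
        ]
    },
    index=[1, 2],
)

# Named groups become column names; ".*?" skips whatever separates the pairs.
pattern = r"magnitude:\s*(?P<Magnitude>-?[\d.]+).*?score:\s*(?P<Score>-?[\d.]+)"

# re.DOTALL lets ".*?" also cross a real line feed, should the data contain one.
new = df["sentiment"].str.extract(pattern, flags=re.DOTALL).astype(float)
```

Because the pattern anchors on the key names rather than on the separator, it does not depend on whether the separator is a literal \n or a real line break.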

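Addendum

The OP had to make a few additional adjustments to get the result I got. They used astype(str) to explicitly convert the values to strings, and ended up dropping the regex entirely: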

new = df['sentiment'].astype(str).str.split(expand=True)

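By default, Series.str.split() splits on any whitespace, which suggests that the actual input had some unusual formatting, where the last cell contains a line break that is not represented as \n. Without seeing the original data, this is still a bit unclear.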
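One possible explanation, offered as an assumption rather than something the OP confirmed, is that the column held non-string objects whose string form contains real line feeds. The sketch below uses a hypothetical FakeSentiment class as a stand-in for such an object; it is not part of any Google library. It shows that .str methods return NaN for non-string elements, and that astype(str) followed by the default whitespace split recovers the values.

```python
import pandas as pd


class FakeSentiment:
    """Hypothetical stand-in for a non-string sentiment object."""

    def __init__(self, magnitude, score):
        self.magnitude = magnitude
        self.score = score

    def __str__(self):
        # Real line feeds here, not the two characters backslash + n.
        return f"magnitude: {self.magnitude}\nscore: {self.score}\n"


df03 = pd.DataFrame(
    {"sentiment": [FakeSentiment(0.5, 0.5), FakeSentiment(0.25, -0.25)]},
    index=["01", "02"],
)

# .str methods yield NaN for elements that are not strings.
as_objects = df03["sentiment"].str.split(pat="\n", expand=True)

# After astype(str), the default split() breaks on any whitespace,
# including real line feeds, giving: label, value, label, value.
parts = df03["sentiment"].astype(str).str.split(expand=True)

# Keep the value columns by position, convert to float, then name them.
result = parts[[1, 3]].astype(float)
result.columns = ["Magnitude", "Score"]
```

If the real column behaves like this, it would account for both the all-NaN output in the question and the fact that astype(str) was needed before any split worked.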