How do I split text with multiple sentences in a column into multiple rows in Python pandas?
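I am trying to split the comments column, 'Food_Text', into multiple rows, one sentence per row. I used the following Stack Overflow thread as a reference, since it aims for a similar result.
Reference link: pandas: How do I split text in a column into multiple rows?
Sample data for the DataFrame is below.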
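Id  Team  Food_Text
1   X     The food is good. It is cooked well. Delicious!
2   X     I hate squid. The food is not cooked well. It really is.
3   X     Please don't do anything over here
4   Y     I love fish. Awesome and delicious.
5   Y     Good for desserts. The meat tastes bad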
Each 'Food_Text' record can contain multiple sentences separated by a full stop (period). I used the following code:
import numpy as np
import pandas as pd
survey_data = pd.read_csv("Food_Dummy.csv")
survey_text = survey_data[['Id','Team','Food_Text']]
# Build s, a Series with the text split on full stops, one sentence per row
s = survey_text["Food_Text"].str.split('.').apply(pd.Series,1).stack()
s.index = s.index.droplevel(-1) # to line up with df's index
s.name = 'Food_Text' # needs a name to join
# Blank (empty) values remain after the split; remove them
s.replace('', np.nan, inplace=True)
s.dropna(inplace=True)
x=s.to_frame(name='Food_Text1')
x.head(10)
# Joining should ideally get me proper output. But I am getting original dataframe instead of split one.
survey_text.join(x)
survey_text.head(10)
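I am not sure why the join does not give me a proper DataFrame with more rows, with the other columns repeated according to the split index. For example, Id=1 has 3 sentences, so there should be 3 records with all other data the same and the Food_Text column holding each sentence from the Id=1 comment. Likewise for the other records.
Thanks in advance for your help!
Regards,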
索希尔沙
In the code you posted, the result of join is only printed and never assigned, so if you want to change the value of survey_text, the code should be:
survey_text = survey_text.join(x)
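To illustrate why the assignment matters: `DataFrame.join` returns a new DataFrame and leaves the caller unchanged. The following is a minimal sketch with made-up data; the frames `left` and `right` are hypothetical stand-ins for `survey_text` and `x`, not the question's actual file.

```python
import pandas as pd

# Hypothetical stand-ins for survey_text and x
left = pd.DataFrame({"Id": [1, 2], "Team": ["X", "Y"]})
right = pd.DataFrame(
    {"Food_Text1": ["Food is good", "Cooked well", "I like fish"]},
    index=[0, 0, 1],  # repeated index labels, like those left by stack() + droplevel()
)

joined = left.join(right)  # join returns a new DataFrame
print(left.shape)          # left itself is not modified
print(joined.shape)        # one row per matching index label in right
```

Because index label 0 appears twice in `right`, the row with Id=1 is repeated in `joined`, which is the row expansion the question is after.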
Or, if you want to simplify the code, the following will do:
import numpy as np
import pandas as pd
survey_data = pd.read_csv("Food_Dummy.csv")
survey_text = survey_data[['Id','Team','Food_Text']]
# Build s, a Series with the text split on full stops, one sentence per row
s = survey_text["Food_Text"].str.split('.').apply(pd.Series,1).stack()
s.index = s.index.droplevel(-1) # to line up with df's index
s.name = 'Food_Text' # needs a name to join
# Blank (empty) values remain after the split; remove them
s.replace('', np.nan, inplace=True)
s.dropna(inplace=True)
# Drop the original column, then join the split sentences back on the shared index
del survey_text['Food_Text']
survey_text = survey_text.join(s)
survey_text.head(10)
This way your DataFrame will not end up with multiple "Food_Text" columns.
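As an alternative sketch, assuming a pandas version that provides `DataFrame.explode` (0.25 or later), the split, stack, and join steps can be collapsed into one chain. The sample rows below are made up to resemble the question's data; they are not read from Food_Dummy.csv.

```python
import pandas as pd

# Made-up rows resembling the question's data
survey_text = pd.DataFrame({
    "Id": [1, 4],
    "Team": ["X", "Y"],
    "Food_Text": ["Food is good. Cooked well. Delicious!", "I like fish. Really tasty."],
})

result = (
    survey_text
    .assign(Food_Text=survey_text["Food_Text"].str.split("."))  # list of pieces per row
    .explode("Food_Text")                                        # one row per piece
    .reset_index(drop=True)
)
result["Food_Text"] = result["Food_Text"].str.strip()              # trim stray spaces
result = result[result["Food_Text"] != ""].reset_index(drop=True)  # drop empty pieces
print(result)
```

The other columns (`Id`, `Team`) are repeated automatically for every piece, so no separate join is needed.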
Instead of
s = survey_text["Food_Text"].str.split('.').apply(pd.Series,1).stack()
a better way to split the text into sentences is to use the NLTK sentence tokenizer:
from nltk.tokenize import sent_tokenize
s = survey_text["Food_Text"].apply(lambda x : sent_tokenize(x)).apply(pd.Series,1).stack()
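If installing NLTK is not an option, a rough stand-in is a regular expression that splits after sentence-ending punctuation. This is only a sketch: the helper `split_sentences` is a hypothetical name, the sample strings are made up, and unlike a trained tokenizer it will wrongly split after abbreviations such as "e.g.".

```python
import re
import pandas as pd

def split_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace; drop empty pieces."""
    pieces = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in pieces if p]

# Made-up comments resembling the question's data
comments = pd.Series(["Food is good. Cooked well. Delicious!", "Please do nothing here"])

# explode() keeps the original row label for every sentence,
# so the result can be joined back without dropping an index level
s = comments.apply(split_sentences).explode()
print(s)
```

Unlike `str.split('.')`, this keeps the punctuation with each sentence and also breaks on "!" and "?".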