Using regex to extract words from sentences in Pandas for network analysis
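I have a Pandas DataFrame, and I want to extract every word from the sentences in one column and build a new DataFrame in which each word has its own row. The original DataFrame also has a rating that should be carried over to each new row.
The DataFrame looks like this: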
base_network
Body Rating
0 Very satisfied 4
1 My daughter lost 2 spoons, so I adjusted them ... 5
2 It was a fiftieth birthday present for my elde... 5
3 Love the shape, shine & elegance of the candle... 5
4 Poor description of what I was buying 3
... ... ...
476 Nice quality but it is too small, description ... 3
477 Edited 6 January 2020As you will have seen, th... 3
478 I love this piece of jewelleryIt is elegant an... 5
479 The leather cord is a little stiff…but I guess... 4
480 Unfortunately the lens is too small and not ve... 1
481 rows × 2 columns
I tried to use regex to split the sentences into words and store them in a new DataFrame, and then to attach the matching rating, using this code:
spaces = r"\s+"
words = pd.DataFrame()
df = pd.DataFrame()
for rows in base_network:
    words = re.split(spaces, base_network['Body'])
    words['Rating'] = base_network['Rating']
    df = df.append(words)
df.head()
I get the following error:
TypeError Traceback (most recent call last)
<ipython-input-19-4ff5191a493d> in <module>()
5
6 for rows in base_network:
----> 7 words = re.split(spaces, base_network['Body'])
8 words['Rating'] = base_network['Rating']
9 df = df.append(words)
/usr/lib/python3.7/re.py in split(pattern, string, maxsplit, flags)
213 and the remainder of the string is returned as the final element
214 of the list."""
--> 215 return _compile(pattern, flags).split(string, maxsplit)
216
217 def findall(pattern, string, flags=0):
TypeError: expected string or bytes-like object
I have already tried converting the Body column to string type, but that did not solve the problem.
Does this do what you need?
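# df here is the DataFrame from the question (base_network)
# Split each sentence on runs of whitespace; the raw string keeps "\s" from being read as a string escape
df.Body = df.Body.str.split(pat=r"\s+")
# "Explode" the list column into long format, one word per row.
# The Rating column is repeated for every word of the same sentence.
df.explode("Body")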
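For reference, here is a minimal self-contained sketch of the same split-and-explode idea. The two-row frame is a made-up stand-in for base_network, and the names word_rows and Word are only illustrative. Note that re.split in the question fails because it receives a whole Series rather than a single string, whereas the .str accessor applies the split to each row.

```python
import pandas as pd

# Hypothetical stand-in for base_network; the real frame has 481 rows.
base_network = pd.DataFrame(
    {
        "Body": ["Very satisfied", "Poor description of what I was buying"],
        "Rating": [4, 3],
    }
)

# Split each sentence on runs of whitespace (the pattern is treated as a regex),
# then give every word its own row. explode() repeats Rating for each word.
word_rows = (
    base_network
    .assign(Body=base_network["Body"].str.split(r"\s+"))
    .explode("Body")
    .rename(columns={"Body": "Word"})
    .reset_index(drop=True)
)

print(word_rows)
```

The result has one row per word, with the Rating of the sentence the word came from.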
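Some additional thoughts
- You may need to adjust the regular expression so that it also splits on punctuation and similar characters.
- Watch your input data. In row 477, "Edited 6 January 2020As you..." appears to be missing a space.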
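One possible way to handle the punctuation point is to match the words instead of splitting on the gaps between them. The sketch below assumes that a "word" is a run of ASCII letters, digits, or apostrophes, which is a choice you may want to change; the sample frame and the names reviews, token_pattern, and tokens are made up for illustration. It does not repair fused text such as "2020As", which would still come out as a single token.

```python
import pandas as pd

# Made-up sample rows in the shape of the question's data.
reviews = pd.DataFrame(
    {
        "Body": [
            "Love the shape, shine & elegance",
            "Nice quality but it is too small",
        ],
        "Rating": [5, 3],
    }
)

# Assumed definition of a word: letters, digits, and apostrophes.
# Commas, ampersands, and other punctuation are simply not matched.
token_pattern = r"[A-Za-z0-9']+"

tokens = (
    reviews
    .assign(Word=reviews["Body"].str.lower().str.findall(token_pattern))
    .explode("Word")
    .drop(columns="Body")
    .reset_index(drop=True)
)

print(tokens)
```

Lowercasing first means "Love" and "love" end up as the same node if the words are later used to build a network.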