Deleting stop-words and punctuation signs
I'm parsing information from news websites. Each news item is a dictionary stored in the translated_news variable; every item has a title, a url and a country.
Then I try to iterate over every news title and delete stop-words and punctuation signs. I wrote this code:
for new in translated_news:
    tk = tokenize(new['title'])
    # delete punctuation signs & stop-words
    for t in tk:
        if (t in punkts) or (t+'\n' in stops):
            tk.remove(t)
    tokens.append(tk)
tokenize is a function that returns a list of tokens. Here is an example of its output:
['medium', ':', 'russian', 'athlete', 'will', 'be', 'admit', 'to', 'the', '2018', 'olympics', 'in', 'neutral', 'status']
And here is the same output with stop-words and punctuation removed:
['medium', 'russian', 'athlete', 'be', 'admit', 'the', 'olympics', 'neutral', 'status']
The problem is that 'the' and 'be' were not deleted from the news title even though both are in my stop-words list. However, with other titles it sometimes works fine:
['wada', 'acknowledge', 'the', 'reliable', 'information', 'provide', 'to', 'rodchenkov']
['wada', 'acknowledge', 'reliable', 'information', 'provide', 'rodchenkov']
Here 'the' was deleted from the title. I don't understand what is wrong with the code and why the output is sometimes perfect and sometimes not.
Try stripping the newline characters, like this:
tk = [x for x in tokenize(new['title']) if x not in stops and x not in string.punctuation]
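For context, here is a minimal, self-contained sketch of how that comprehension could slot into the loop from the question. The translated_news data and the tokenize function below are toy stand-ins just to make the sketch runnable, not the asker's real data or tokenizer:

import string

stops = ['will', 'be', 'to', 'the', 'in']

# toy stand-ins so the sketch runs on its own
translated_news = [{'title': 'medium : russian athlete will be admit to the '
                             '2018 olympics in neutral status'}]

def tokenize(text):
    # crude whitespace tokenizer, only for this example
    return text.split()

tokens = []
for new in translated_news:
    tk = [x for x in tokenize(new['title'])
          if x not in stops and x not in string.punctuation]
    tokens.append(tk)

print(tokens)
# [['medium', 'russian', 'athlete', 'admit', '2018', 'olympics', 'neutral', 'status']]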
You have to iterate over the output of tokenize(new['title']) and build a new list, and you can use De Morgan's laws to simplify the if statement:
import string

stops = ['will', 'be', 'to', 'the', 'in']
# example output of tokenize(new['title'])
tokenized = ['medium', ':', 'russian', 'athlete', 'will', 'be', 'admit', 'to', 'the',
             '2018', 'olympics', 'in', 'neutral', 'status']

# delete punctuation signs & stop-words by building a new list
tk = []
for t in tokenized:  # in the real code: for t in tokenize(new['title']):
    # if not ((t in string.punctuation) or (t in stops)):
    if (t not in string.punctuation) and (t not in stops):  # De Morgan's laws
        tk.append(t)

print(tk)
This will print:
['medium', 'russian', 'athlete', 'admit', '2018', 'olympics', 'neutral', 'status']
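A small optional variant, not part of the original answer: the two membership tests can be collapsed into a single lookup against a set, which also makes the check constant-time per token:

import string

stops = ['will', 'be', 'to', 'the', 'in']
tokenized = ['medium', ':', 'russian', 'athlete', 'will', 'be', 'admit', 'to', 'the',
             '2018', 'olympics', 'in', 'neutral', 'status']

# one combined set of everything that should be dropped
to_remove = set(stops) | set(string.punctuation)

tk = [t for t in tokenized if t not in to_remove]
print(tk)
# ['medium', 'russian', 'athlete', 'admit', '2018', 'olympics', 'neutral', 'status']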
You can strip the newlines from your stop-words:
stops = ['will\n', 'be\n', 'to\n', 'the\n', 'in\n']
stops = [item.strip() for item in stops]
print(stops)
This will print:
['will', 'be', 'to', 'the', 'in']
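Those trailing '\n' characters typically come from reading the stop-word list line by line from a file, so it is easiest to strip them once at load time. A minimal sketch, with io.StringIO standing in for a hypothetical stop-words file whose lines end with '\n':

import io

# stands in for something like open('stopwords.txt')
stopword_file = io.StringIO('will\nbe\nto\nthe\nin\n')
stops = [line.strip() for line in stopword_file]
print(stops)  # ['will', 'be', 'to', 'the', 'in']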
The solution suggested by incanus86 does work:
tk = [x for x in tokenize(new['title']) if x not in stops and x not in string.punctuation]
But you wouldn't be asking on SO if you knew about list comprehensions.
"I don't understand what is wrong with the code and why sometimes the output is perfect and sometimes not."
You do miss 'be' and 'the' because you are removing items from tk while iterating over it, as this code shows:
import string

stops = ['will', 'be', 'to', 'the', 'in']
tk = [
    'medium',    # 0
    ':',         # 1
    'russian',   # 2
    'athlete',   # 3
    'will',      # 4
    'be',        # 5
    'admit',     # 6
    'to',        # 7
    'the',       # 8
    '2018',      # 9
    'olympics',  # 10
    'in',        # 11
    'neutral',   # 12
    'status'     # 13
]

# delete punctuation signs & stop-words
for t in tk:
    print(len(tk), t, tk.index(t))
    if (t in string.punctuation) or (t in stops):
        tk.remove(t)

print(tk)
This will print:
(14, 'medium', 0)
(14, ':', 1)
(13, 'athlete', 2)
(13, 'will', 3)
(12, 'admit', 4)
(12, 'to', 5)
(11, '2018', 6)
(11, 'olympics', 7)
(11, 'in', 8)
(10, 'status', 9)
['medium', 'russian', 'athlete', 'be', 'admit', 'the', '2018', 'olympics', 'neutral', 'status']
You do indeed miss "russian", "be", "the" and "neutral": the loop never visits them.
The index of "athlete" is 2 and the index of "will" is 3 because you removed ':' from tk.
The index of "admit" is 4 and the index of "to" is 5 because you removed "will" from tk.
The index of "2018" is 6, of "olympics" 7, of "in" 8 and of "status" 9.
You must not change a list while iterating over it!
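One common workaround (a sketch, not part of the original answer) is to iterate over a copy of the list, so that removing items from tk no longer shifts the positions the loop is walking through:

import string

stops = ['will', 'be', 'to', 'the', 'in']
tk = ['medium', ':', 'russian', 'athlete', 'will', 'be', 'admit', 'to', 'the',
      '2018', 'olympics', 'in', 'neutral', 'status']

# tk[:] is a shallow copy; the loop walks the copy while tk itself shrinks
for t in tk[:]:
    if (t in string.punctuation) or (t in stops):
        tk.remove(t)

print(tk)
# ['medium', 'russian', 'athlete', 'admit', '2018', 'olympics', 'neutral', 'status']

Building a new list, as the other answers do, is usually the cleaner option.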