Why does this list comprehension only work in df.apply?
I'm trying to remove stopwords from my data. It starts from this:
data['text'].head(5)
Out[25]:
0 go until jurong point, crazy.. available only ...
1 ok lar... joking wif u oni...
2 free entry in 2 a wkly comp to win fa cup fina...
3 u dun say so early hor... u c already then say...
4 nah i don't think he goes to usf, he lives aro...
Name: text, dtype: object
to this:
data['newt'].head(5)
Out[26]:
0 [go, jurong, point,, crazy.., available, bugis...
1 [ok, lar..., joking, wif, u, oni...]
2 [free, entry, 2, wkly, comp, win, fa, cup, fin...
3 [u, dun, say, early, hor..., u, c, already, sa...
4 [nah, think, goes, usf,, lives, around, though]
Name: newt, dtype: object
I have two options for how to do this. I'm trying each option separately, so that one doesn't overwrite the other. First, I apply a function to the data column. This works; it removes the stopwords as I intended:
def process(data):
    data = data.lower()                                   # lowercase the string
    data = data.split()                                   # split into a list of words
    data = [row for row in data if row not in stopwords]  # drop the stopwords
    return data

data['newt'] = data['text'].apply(process)
The second option doesn't use apply. It does exactly the same thing as the function, so why does it return TypeError: unhashable type: 'list'? I've checked that the if row not in stopwords part of the line is what causes it, because when I remove it, the code runs but doesn't remove the stopwords:
data['newt'] = data['text'].str.lower()
data['newt'] = data['newt'].str.split()
data['newt'] = [row for row in data['newt'] if row not in stopwords]
Your list comprehension fails because it checks whether each entire dataframe row, i.e. a whole list of words, is in the stopwords. If stopwords is a set, that membership test has to hash the list, which is exactly what raises TypeError: unhashable type: 'list'; if stopwords were a plain list, the test would run but simply never be true, so [row for row in data['newt'] if row not in stopwords] would just reproduce the list of values from the original data['newt'] column.
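To see this in isolation, here is a minimal sketch (the tiny stopword set and the sample rows are made up for illustration; your stopwords is presumably a set, e.g. NLTK's):
import pandas as pd

stopwords = {"i", "the", "to"}  # hypothetical mini stopword set
newt = pd.Series([["go", "to", "the", "point"], ["ok", "i", "guess"]])

row = newt[0]         # iterating the column yields whole lists, e.g. ['go', 'to', 'the', 'point']
row not in stopwords  # raises TypeError: unhashable type: 'list' when stopwords is a set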
I think that, following your logic, your last line for removing the stopwords could be:
data['newt'] = data['text'].str.lower()
data['newt'] = data['newt'].str.split()
data['newt'] = [[word for word in row if word not in stopwords] for row in data['newt']]
If you're OK with using apply, the last line can be replaced with:
data['newt'] = data['newt'].apply(lambda row: [word for word in row if word not in stopwords])
Finally, you can also call
data['newt'].apply(lambda row: " ".join(row))
at the end of the process to get the strings back.
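Putting the three steps together, a minimal end-to-end sketch (with a made-up mini stopword set standing in for the real one):
import pandas as pd

stopwords = {"i", "don't", "he", "to", "so"}  # hypothetical mini stopword set

data = pd.DataFrame({"text": ["nah i don't think he goes to usf"]})

data['newt'] = data['text'].str.lower()
data['newt'] = data['newt'].str.split()
data['newt'] = data['newt'].apply(lambda row: [word for word in row if word not in stopwords])
data['newt'] = data['newt'].apply(lambda row: " ".join(row))

print(data['newt'][0])  # nah think goes usf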
Note that str.split may not be the best way to do tokenization; you may opt for a solution that uses a dedicated library such as spacy, with a combination of removing stop words using spacy and adding custom stopwords.
To convince yourself of the argument above, try the following code:
import spacy
sent = "She said: 'beware, your sentences may contain a lot of funny chars!'"
# spacy tokenization
spacy.cli.download("en_core_web_sm")
nlp = spacy.load('en_core_web_sm')
doc = nlp(sent)
print([token.text for token in doc])
# simple split
print(sent.split())
and compare the two outputs.
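For completeness, here is a hedged sketch of the spaCy route mentioned above: tokenize with spaCy, register a custom stopword, and filter on token.is_stop (the word "lar" is just an arbitrary example picked from the sample data):
import spacy

nlp = spacy.load("en_core_web_sm")

nlp.Defaults.stop_words.add("lar")  # extend spaCy's default stopword set
nlp.vocab["lar"].is_stop = True     # flag the lexeme so token.is_stop picks it up

doc = nlp("ok lar... joking wif u oni...")
print([token.text for token in doc if not token.is_stop])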