word_tokenize with same code and same dataset, but different result, why?

Question

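Last month I tokenized some text and built a bag of words to see which words occur most often. Today I ran the same code on the same dataset again. It still runs, but the results are different, and today's results are clearly wrong because the word frequencies are significantly lower.

Here is my code: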

from nltk.tokenize import sent_tokenize, word_tokenize
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from nltk.stem import WordNetLemmatizer
import nltk
from collections import Counter

sent = nltk.word_tokenize(str(df.description))
lower_token = [t.lower() for t in sent]
alpha = [t for t in lower_token if t.isalpha()]
stop_word =  [t for t in alpha if t not in ENGLISH_STOP_WORDS]
k = WordNetLemmatizer()
lemma = [k.lemmatize(t) for t in stop_word]
bow = Counter(lemma)
print(bow.most_common(20))

Here is a sample of my dataset

This dataset comes from Kaggle and is named "Wine Reviews".

Answer 1

Welcome to Stack Overflow.

There are two possible causes for your problem.

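1) You may have modified the dataset. I would inspect the dataset and check whether the data itself has changed, because your code works on other examples and contains no random element, so it will not behave differently from one day to the next.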
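A quick way to test the first possibility is to compare a checksum of the file you have now against a fresh download or a backup. The sketch below is only an illustration: the helper name file_sha256 and the throwaway file names are hypothetical, and the demo writes two small temporary files instead of touching the real dataset.

```python
import hashlib
import os
import tempfile


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo with throwaway files (hypothetical names, not the real dataset).
with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "original.csv")
    edited = os.path.join(tmp, "edited.csv")
    with open(original, "w") as f:
        f.write("description\nA ripe wine\n")
    with open(edited, "w") as f:
        f.write("description\nA ripe wine\nAn extra row\n")

    same = file_sha256(original) == file_sha256(original)
    changed = file_sha256(original) != file_sha256(edited)

print(same, changed)
```

If the digest of your local file matches the digest of a fresh copy, the data did not change and the cause lies elsewhere.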

2) The second possibility is the way you use df.description when you reference the dataframe column in this line:

sent = nltk.word_tokenize(str(df.description))

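You get truncated output. Check the type of df.description: it is a Series object, and calling str() on a Series returns its printable summary rather than the full contents.

I created another example, as follows: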

from nltk.tokenize import word_tokenize
import pandas as pd

df = pd.DataFrame({'description' : ['The OP is asking a question and I referred him to the Minimum Verifible Example page which states: When asking a question, people will be better able to provide help if you provide code that they can easily understand and use to reproduce the problem. This is referred to by community members as creating a minimal, reproducible example (reprex), a minimal, complete and verifiable example (mcve), or a minimal, workable example (mwe). Regardless of how it\'s communicated to you, it boils down to ensuring your code that reproduces the problem follows the following guidelines:']})


print(df.description)

0    The OP is asking a question and I referred him...
Name: description, dtype: object

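As you can see above, the output is truncated; it is not the full text of the description column.

My suggestion for your code is to look into this line and find a different way to implement it: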

sent = nltk.word_tokenize(str(df.description))

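Note that the approach in your code also picks up the index numbers (which, as far as I can tell, you filter out with isalpha) as well as the Name: description, dtype: object footer that is not part of your data.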
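To make the effect concrete, the following sketch compares str() of a Series with a plain join of its values. The 100-row frame of repeated text is made up for illustration; the exact number of rows and characters that survive in the summary depends on your pandas display settings.

```python
import pandas as pd

# Made-up data: 100 rows, each holding the word "word" 40 times.
long_text = ("word " * 40).strip()
df = pd.DataFrame({"description": [long_text] * 100})

# What the question's code tokenizes: the printable summary of the Series.
as_text = str(df.description)

# The complete data: every row joined into one string.
full_text = " ".join(df.description.astype(str))

summary_count = as_text.count("word")
full_count = full_text.count("word")

print(summary_count, full_count)
print("Name: description" in as_text)
```

The summary contains only a fraction of the words, plus the footer text, which would explain word counts that are far lower than expected.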

One way is to use map to process your data. For example:

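# Display only: show full column text when printing (None on pandas >= 1.0; older versions used -1)
pd.set_option('display.max_colwidth', None)
# Tokenize each row's full text instead of the Series summary
df['tokenized'] = df['description'].map(str).map(nltk.word_tokenize)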

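Do the same for the other operations. A simple approach is to build one preprocessing function that applies all the preprocessing steps you want to your dataframe.

Hope this helps.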
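A minimal sketch of such a preprocessing function is shown below. The function name preprocess and its parameters are hypothetical. The tokenizer, stop-word set and lemmatizer are passed in as arguments, and the demo uses trivial stand-ins (str.split, a four-word stop list, an identity function) so that it runs without any NLTK data downloads.

```python
from collections import Counter

import pandas as pd


def preprocess(text, tokenize, stop_words, lemmatize):
    """Apply the question's steps to ONE description and return its tokens."""
    tokens = [t.lower() for t in tokenize(str(text))]
    tokens = [t for t in tokens if t.isalpha()]          # drop numbers and punctuation
    tokens = [t for t in tokens if t not in stop_words]  # drop stop words
    return [lemmatize(t) for t in tokens]


# Made-up sample data for the demo.
df = pd.DataFrame({"description": ["A ripe and fruity wine", "The wine is smooth"]})

# Stand-ins for the demo; swap in the real tokenizer, stop words and lemmatizer.
demo_stop_words = {"a", "and", "the", "is"}
df["tokens"] = df["description"].map(
    lambda text: preprocess(text, str.split, demo_stop_words, lambda t: t)
)

# Count over every row, not over the printed summary of the column.
bow = Counter(token for row in df["tokens"] for token in row)
print(bow.most_common(20))
```

With NLTK and its data installed, the same function can be called with the pieces from the question, for example preprocess(text, nltk.word_tokenize, ENGLISH_STOP_WORDS, WordNetLemmatizer().lemmatize).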

python

text-mining

tokenize

nltk