Unicode error in the tokenization step only when doing stop words removal in Python 2

Question

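I am trying to run this script. The only difference is that instead of its TEST_SENTENCES, I need to read my own dataset (the text column), and I need to apply stop word removal to that column before passing it to the rest of the code.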

df = pd.DataFrame({'text': ['the "superstar breakfast" is shrink wrapped muffins that can be bought at a convenience store.',
                            'The wireless internet was unreliable. ', 'i am still her . :). ',
                            'I appreciate your help ', 'I appreciate your help '], 'sentiment':[
    'positive', 'negative', 'neutral', 'positive', 'neutral']})

When I build the DataFrame this way, I do not get an error. When I use a CSV file containing exactly the same data, I do.

The error appears when I add this line of code to remove the stop words:

df['text_without_stopwords'] = df['text'].apply(lambda x: ' '.join([word.encode('latin1', 'ignore').decode('latin1') for word in x.split() if word not in (stop)]))
TEST_SENTENCES = df['text_without_stopwords']

I keep getting this error: ValueError: All sentences should be Unicode-encoded!

The error is raised in the tokenization step:

tokenized, _, _ = st.tokenize_sentences(TEST_SENTENCES)

I would like to know what is causing this error, and what the correct way to fix the code is.

(I have tried different encodings, such as utf-8, but none of them worked.)

Answer 1

I do not know the reason yet, but when I added

df['text_without_stopwords'] = df['text_without_stopwords'].astype('unicode')

it worked.
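For anyone who prefers to make the conversion explicit rather than rely on astype, here is a minimal sketch of a per-value conversion. The names to_text and text_type are made up for this example, and the utf-8 default is an assumption about how the CSV was encoded. It is written to run on both Python 2 and Python 3.

```python
import sys

# Text type: `unicode` on Python 2, `str` on Python 3.
text_type = unicode if sys.version_info[0] == 2 else str  # noqa: F821


def to_text(value, encoding='utf-8'):
    """Return `value` as a text (unicode) string.

    `encoding` is an assumption about the source file; adjust it to match
    the encoding of your CSV.
    """
    if isinstance(value, text_type):
        return value                   # already text, nothing to do
    if isinstance(value, bytes):       # `bytes` is plain `str` on Python 2
        return value.decode(encoding)
    return text_type(value)            # numbers and other objects


# Mixed input: byte strings, a text string, and an empty byte string.
raw = [b'wireless internet unreliable.', u'appreciate help', b'']
sentences = [to_text(s) for s in raw]
```

With pandas, the same function could be applied per row with df['text_without_stopwords'].apply(to_text) before passing the column to the tokenizer.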

I am still very curious why this happens only when I do the stop words removal.
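One way to investigate is to look at which Python types actually end up in the column after stop word removal. The sketch below is only a diagnostic, and count_value_types is a made-up helper name. A possible explanation worth checking (an assumption, not confirmed in this thread): in Python 2, ' '.join([]) returns an empty byte string, so a row whose words are all stop words would come out as str instead of unicode, even when every other row is unicode.

```python
from collections import Counter


def count_value_types(values):
    """Count how many values of each Python type an iterable contains."""
    return Counter(type(v).__name__ for v in values)


# Example with a plain list; a pandas Series can be passed the same way,
# since it is iterable.
sample = [u'superstar breakfast', u'appreciate help', 1.5]
type_counts = count_value_types(sample)
```

On Python 2, a result containing both 'unicode' and 'str' keys would point to the rows that trigger the ValueError.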


csv

unicode

stop-words

python-2.7

pandas