Stemming a pandas dataframe

Question

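I have a dataset of tweets (taken from NLTK) that is currently in a pandas dataframe, but I need to stem it. I have tried many different approaches and got several different errors, such as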

AttributeError: 'Series' object has no attribute 'lower'
and
KeyError: 'text'

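I don't understand the KeyError, because the column really is called 'text'. I do know that I need to convert the dataframe to a string for the stemmer to work (I think).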

Here is an example of the data

from nltk.corpus import twitter_samples
from nltk.stem.snowball import SnowballStemmer
from pandas import DataFrame

stemmer = SnowballStemmer("english")

negative_tweets = twitter_samples.strings('negative_tweets.json')

negtweetsdf = DataFrame(negative_tweets,columns=['text'])

print(stemmer.stem(negtweetstr))

Answer 1

You need to apply the stemming function to the series, like this

negtweetsdf['text'].apply(stemmer.stem)

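This returns a new series; the original dataframe is left unchanged.

Functions that expect a single string value (or something similar) will not simply work on a pandas dataframe or series. They have to be applied to every element of the series, which is why .apply is used.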
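As a minimal sketch of that difference, the snippet below calls the stemmer directly on a series and then through .apply. The `words` series is hypothetical single-word data made up for illustration; it is not the asker's tweet data.

```python
import pandas as pd
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

# Hypothetical single-word data, used only for illustration
words = pd.Series(["running", "tweets", "cats"])

# Passing the whole Series fails: stem() expects a str and calls .lower() on it
direct_call_failed = False
try:
    stemmer.stem(words)
except AttributeError as err:
    direct_call_failed = True
    print("direct call failed:", err)

# .apply calls stemmer.stem once per element and collects the results
stems = words.apply(stemmer.stem)
print(stems.tolist())
```

The direct call reproduces the AttributeError from the question, while the .apply call hands the stemmer one string at a time.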

Here is a working example with lists in the dataframe column.

import pandas as pd
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

df = pd.DataFrame([['some extremely exciting tweet'],['another']], columns=['tweets'])

# wrap each row's string in a list
df = pd.DataFrame(df.apply(list,axis=1), columns=['tweets'])

# for each row (apply), run the stemmer on each item in the list
# and return a list containing the stems
df['tweets'] = df['tweets'].apply(lambda x: [stemmer.stem(y) for y in x])
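Note that the example above wraps each whole tweet in a one-element list, so the stemmer receives the entire tweet as a single string. If per-word stems are wanted, one possible approach (an assumption about the goal, not something stated in the question) is to tokenize first with NLTK's TweetTokenizer and then stem each token. The column names `tokens` and `stems` are hypothetical.

```python
import pandas as pd
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import TweetTokenizer

stemmer = SnowballStemmer("english")
tokenizer = TweetTokenizer()

df = pd.DataFrame({"tweets": ["some extremely exciting tweet", "another"]})

# split each tweet string into a list of tokens
df["tokens"] = df["tweets"].apply(tokenizer.tokenize)

# stem every token, keeping one list of stems per tweet
df["stems"] = df["tokens"].apply(lambda tokens: [stemmer.stem(t) for t in tokens])

print(df[["tokens", "stems"]])
```

Keeping the tokens and stems in separate columns leaves the original text available for comparison.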
