How do I get the number of occurrences of a list of words (substrings) in a pandas dataframe?
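I have a pandas data frame with approximately 1.5 million rows. I want to find the number of occurrences of specific, selected words (all of them known in advance) in a certain column. This works for a single word: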
d = df["Content"].str.contains("word").value_counts()
But I want to find the occurrences of multiple known words from a list, such as "word1" and "word2". Also, word2 could appear as either word2 or wordtwo, like so:
word1 40
word2/wordtwo 120
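How can I accomplish this?
IMO one of the most efficient approaches is to use sklearn.feature_extraction.text.CountVectorizer, passing it a vocabulary (the list of words you want to count).
Demo (assuming numpy and pandas are already imported as np and pd):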
In [21]: text = """
    ...: I have a pandas data frame with approximately 1.5 million rows. I want to find the number of occurrences of specific, selected words in a certain column. This works for a single word. But I want to find out the occurrences of multiple, known words like "word1", "word2" from a list. Also word2 could be word2 or wordtwo, like so"""
In [22]: df = pd.DataFrame(text.split('. '), columns=['Content'])
In [23]: df
Out[23]:
Content
0 \nI have a pandas data frame with approximatel...
1 I want to find the number of occurrences of sp...
2 This works for a single word
3 But I want to find out the occurrences of mult...
4 Also word2 could be word2 or wordtwo, like so
In [24]: from sklearn.feature_extraction.text import CountVectorizer
In [25]: vocab = ['word', 'words', 'word1', 'word2', 'wordtwo']
In [26]: vect = CountVectorizer(vocabulary=vocab)
In [27]: res = pd.Series(np.ravel(vect.fit_transform(df['Content']).sum(axis=0)),
    ...:                 index=vect.get_feature_names_out())  # get_feature_names() was removed in scikit-learn 1.2
In [28]: res
Out[28]:
word 1
words 2
word1 1
word2 3
wordtwo 1
dtype: int64
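The question also asks for word2 and wordtwo to be reported as a single entry. A minimal sketch of one way to do that, starting from the counts printed in Out[28]: the `groups` mapping and the "word2/wordtwo" label are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

# Counts copied from Out[28] above (in the session this would be `res`).
res = pd.Series([1, 2, 1, 3, 1],
                index=['word', 'words', 'word1', 'word2', 'wordtwo'])

# Hypothetical mapping from each spelling variant to the label it should be
# reported under; words not listed here keep their own name.
groups = {'word2': 'word2/wordtwo', 'wordtwo': 'word2/wordtwo'}

# Relabel the index, then sum all entries that now share the same label.
merged = res.rename(index=groups).groupby(level=0).sum()
print(merged)
```

Here word2 (3) and wordtwo (1) end up under one label with a combined count of 4, while the other words are left unchanged.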
You can create a dictionary like this:
{w: df["Content"].str.contains(w).sum() for w in words}
assuming words is the list of words.
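Note that str.contains counts the rows containing each word as a substring, not the total number of occurrences, and it treats the word as a regular expression by default. The sketch below extends the dictionary idea to grouped variants such as word2/wordtwo and shows both row counts and total occurrences. The sample data, the `variants` mapping and the `build_pattern` helper are hypothetical names introduced for illustration.

```python
import re
import pandas as pd

# Small illustrative frame; the real data would be the 1.5M-row df.
df = pd.DataFrame({"Content": ["word1 and word2",
                               "wordtwo, then word2 again",
                               "nothing here"]})

# Hypothetical mapping: report label -> spellings that count towards it.
variants = {"word1": ["word1"], "word2/wordtwo": ["word2", "wordtwo"]}

def build_pattern(words):
    """Return a regex matching any of `words` as a whole word."""
    alternatives = "|".join(re.escape(w) for w in words)
    return r"\b(?:" + alternatives + r")\b"

content = df["Content"].fillna("")  # treat missing values as empty strings

# Number of rows that contain at least one variant.
row_counts = {label: int(content.str.contains(build_pattern(ws)).sum())
              for label, ws in variants.items()}

# Total number of matches across all rows.
total_counts = {label: int(content.str.count(build_pattern(ws)).sum())
                for label, ws in variants.items()}

print(row_counts)
print(total_counts)
```

The `\b` anchors restrict matches to whole words; dropping them would restore the substring behaviour of the one-liner above.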