Python Word Frequencies with pre-defined words
I have a set of data in a text file and I would like to build a frequency table for a set of pre-defined words (drive, street, i, lives). Below is an example:
ID | Text
---|--------------------------------------------------------------------
1 | i drive to work everyday in the morning and i drive back in the evening on main street
2 | i drive back in a car and then drive to the gym on 5th street
3 | Joe lives in Newyork on NY street
4 | Tod lives in Jersey city on NJ street
Here is the output I would like to get:
ID | drive | street | i | lives
----|--------|----------|------|-------
1 | 2 | 1 | 2 | 0
2 | 2 | 1 | 1 | 0
3 | 0 | 1 | 0 | 1
4 | 0 | 1 | 0 | 1
Here is the code I am using. I can get overall word counts, but that does not solve my requirement: I want counts for a pre-defined set of words, as shown above.
from nltk.corpus import stopwords
from collections import Counter

# raw string so backslashes in the Windows path are not treated as escapes
xy = open(r'C:\Python\data\file.txt').read().split()
words = [w.lower() for w in xy]
stopset = set(stopwords.words('english'))
filtered_words = [word for word in words if word not in stopset]
print(Counter(filtered_words))
print(len(filtered_words))
Something like sklearn.feature_extraction.text.CountVectorizer seems close to what you are looking for. Also, collections.Counter might be helpful. How do you plan to use this data structure? If you happen to be doing machine learning/prediction, it is worth looking into the different vectorizers in sklearn.feature_extraction.text.
Edit:
text = ['i drive to work everyday in the morning and i drive back in the evening on main street',
'i drive back in a car and then drive to the gym on 5th street',
'Joe lives in Newyork on NY street',
'Tod lives in Jersey city on NJ street']
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
vocab = ['drive', 'street', 'i', 'lives']
vectorizer = CountVectorizer(vocabulary = vocab)
# turn the text above into a matrix of shape R X C
# where R is number of rows (elements in your text array)
# and C is the number of elements in the set of all words in your text array
X = vectorizer.fit_transform(text)
# sparse to dense matrix
X = X.toarray()
# get the feature names from the already-fitted vectorizer
# (in scikit-learn versions before 1.0 this method was called get_feature_names)
vectorizer_feature_names = vectorizer.get_feature_names_out()
# prove that the vectorizer's feature names are identical to the vocab you specified above
assert list(vectorizer_feature_names) == vocab
# make a table with word frequencies as values and vocab as columns
out_df = pd.DataFrame(data = X, columns = vectorizer_feature_names)
print(out_df)
And here is your result:
drive street i lives
0 2 1 0 0
1 2 1 0 0
2 0 1 0 1
3 0 1 0 1
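One caveat: CountVectorizer's default token_pattern drops one-character tokens, which is why the i column above is all zeros instead of matching the counts in the question. Passing a token_pattern that also keeps single-letter words should recover them (a sketch, assuming the same text and vocab as above):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

text = ['i drive to work everyday in the morning and i drive back in the evening on main street',
        'i drive back in a car and then drive to the gym on 5th street',
        'Joe lives in Newyork on NY street',
        'Tod lives in Jersey city on NJ street']
vocab = ['drive', 'street', 'i', 'lives']

# the default token_pattern is r"(?u)\b\w\w+\b" (two or more word characters);
# \b\w+\b also keeps one-character tokens such as "i"
vectorizer = CountVectorizer(vocabulary=vocab, token_pattern=r"(?u)\b\w+\b")
X = vectorizer.fit_transform(text).toarray()
print(pd.DataFrame(X, columns=vocab))
```

With this pattern the i column comes out as 2, 1, 0, 0, matching the table in the question.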
Just ask for the words you want, rather than filtering out the stopwords you don't:
filtered_words = [word for word in xyz if word in ['drive', 'street', 'i', 'lives']]
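For a single row, that filter feeds straight into collections.Counter (a sketch using the first example sentence):

```python
from collections import Counter

s = 'i drive to work everyday in the morning and i drive back in the evening on main street'
wanted = ['drive', 'street', 'i', 'lives']

# count only the pre-defined words, ignoring everything else
counts = Counter(word for word in s.split() if word in wanted)
print(counts)
```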
If you want to find the number of occurrences of a word in a list, you can use list.count(word). So if you have a list of words whose frequencies you want, you can do something like this:
wanted_words = ["drive", "street", "i", "lives"]
frequencies = [xy.count(i) for i in wanted_words]
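Note that xy here is the whole file, so this gives totals across all rows. To get one row of counts per line, as in the desired output, the same list.count idea can be applied line by line (a sketch, with the example text inlined instead of read from the file):

```python
wanted_words = ["drive", "street", "i", "lives"]
lines = ['i drive to work everyday in the morning and i drive back in the evening on main street',
         'i drive back in a car and then drive to the gym on 5th street',
         'Joe lives in Newyork on NY street',
         'Tod lives in Jersey city on NJ street']

# one list of counts per line, in wanted_words order
table = [[line.lower().split().count(w) for w in wanted_words] for line in lines]
for row in table:
    print(row)
```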
Building on Alex Hall's idea of pre-filtering, afterwards just use a defaultdict. It is really convenient for counting:
from collections import defaultdict
s = 'i drive to work everyday in the morning and i drive back in the evening on main street'
filtered_words = [word for word in s.split()
                  if word in ['drive', 'street', 'i', 'lives']]

d = defaultdict(int)
for k in filtered_words:
    d[k] += 1
print(d)