Count frequency of specific words in several articles in a text file
I want to count the occurrences of a list of words in each of the articles contained in a single text file. Each article can be identified because they all start with a common tag, "<p>Advertisement".
Here is a sample of the text file:
"[<p>Advertisement , By TIM ARANGO , SABRINA TAVERNISE and CEYLAN YEGINSU JUNE 28, 2016
,Credit Ilhas News Agency, via Agence France-Presse — Getty Images,ISTANBUL ......]
[<p>Advertisement , By MILAN SCHREUER and ALISSA J. RUBIN OCT. 5, 2016
, BRUSSELS — A man wounded two police officers with a knife in Brussels around noon
on Wednesday in what the authorities called “a potential terrorist attack.” ,
The two ......]"
What I would like to do is count the frequency of each word from a csv file I have (20 words) and write the output like this:
id, attack, war, terrorism, people, killed, said
article_1, 45, 5, 4, 6, 2, 1
article_2, 10, 3, 2, 1, 0, 0
The words in the csv are stored like this:
attack
people
killed
attacks
state
islamic
Following a suggestion, I first tried splitting the whole text file on the <p> tag before starting the word count, and then I tokenized the resulting list of text.
This is what I have so far:
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# read the csv with the 20 target words and extract them
opener = open("News_words_most_common.csv")
words = opener.read()
my_pattern = r'\w+'
x = re.findall(my_pattern, words)

# read the articles file, lower-case it and split it on the <p> tag
file_open = open("Training_News_6.csv")
files = file_open.read()
r = files.lower()
stops = set(stopwords.words("english"))
words = r.split("<p>")

# word_tokenize expects a string, so convert the list of articles first
string = str(words)
token = word_tokenize(string)
print(token)
This is the output:
['[', "'", "''", '|', '[', "'", ',', "'advertisement",
',', 'by', 'milan', 'schreuer'.....']', '|', "''", '\n', "'", ']']
The next step will be to loop over the split articles (now lists of tokenized words) and count the frequency of the words from the first file in each of them. If you have any suggestion on how to iterate and count, please let me know!
I am using Python 3.5 on Anaconda.
You could try reading your text file and then splitting it on '<p>' (if, as you say, it marks the beginning of a new article); you will then have a list of articles. A simple loop with a counter will do.
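For example, a minimal sketch of that loop could look like this (the input file names are taken from your snippet, the word list is assumed to be one word per line as shown above, and 'word_counts.csv' is just a hypothetical output name):

import re
from collections import Counter

# the 20 target words, one per line (as in the question)
vocabulary = [w.strip().lower() for w in open('News_words_most_common.csv') if w.strip()]
# the articles, split on the common tag
articles = open('Training_News_6.csv').read().lower().split('<p>')

with open('word_counts.csv', 'w') as out:                   # hypothetical output file
    out.write('id,' + ','.join(vocabulary) + '\n')
    for i, article in enumerate(articles[1:], start=1):     # skip the text before the first tag
        counts = Counter(re.findall(r'\w+', article))
        out.write('article_%d,' % i + ','.join(str(counts[w]) for w in vocabulary) + '\n')

Counter returns 0 for words it has never seen, so words that do not appear in an article simply come out as 0 in that row.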
I would also suggest having a look at the nltk module. I am not sure what your final goal is, but nltk has very easy-to-use functions for this and much more (for example, instead of just looking at the number of times a word appears in each article, you can compute its frequency, and even scale it by inverse document frequency, the so-called tf-idf).
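As a rough illustration of the nltk side (a sketch only; the sample sentence and the word 'police' are just taken from the excerpt in the question):

from nltk import FreqDist
from nltk.tokenize import word_tokenize

# one article's text, taken from the sample in the question
article = "A man wounded two police officers with a knife in Brussels around noon"
tokens = word_tokenize(article.lower())
fdist = FreqDist(tokens)
print(fdist['police'])        # raw count of the word in this article
print(fdist.freq('police'))   # relative frequency (count / total tokens)

For the tf-idf scaling mentioned above, scikit-learn's TfidfVectorizer can be used in the same way as the CountVectorizer shown in the answer below.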
Maybe I didn't do my task well...
If you are going to do text classification, it might be handy to use a standard scikit vectorizer, such as Bag of Words, which takes a text and returns an array of words. You can use it directly in a classifier or, if you really need a csv, write it out to csv. It is already included in scikit and Anaconda.
Another way is to split the text manually: you can load the data, split it into words, count them, exclude the stopwords (what are those?) and write the result to the output file. Like this:
import re
from collections import Counter
from nltk.corpus import stopwords

stops = set(stopwords.words('english'))        # the stopwords to exclude
txt = open('file.txt', 'r').read()
words = re.findall(r'[a-z]+', txt, re.I)       # all alphabetic tokens, case-insensitive
cnt = Counter(w for w in words if w.lower() not in stops)
And here is the scikit version with CountVectorizer, restricted to your vocabulary:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# the 20 target words, one per line
vocabulary = [word.strip() for word in open('vocabulary.txt').readlines()]
# split the raw text into articles on the common tag
corpus = open('articles.txt').read().split('<p>Advertisement')
vectorizer = CountVectorizer(min_df=1, vocabulary=vocabulary)
words_matrix = vectorizer.fit_transform(corpus)
df = pd.DataFrame(data=words_matrix.todense(),
                  index=('article_%s' % i for i in range(words_matrix.shape[0])),
                  columns=vectorizer.get_feature_names())
df.index.name = 'id'
df.to_csv('articles.csv')
In the file articles.csv:
$ cat articles.csv
id,attack,people,killed,attacks,state,islamic
article_0,0,0,0,0,0,0
article_1,0,0,0,0,0,0
article_2,1,0,0,0,0,0
How about this:
import re
from collections import Counter

# toy stand-in for the tokenized rows read from the csv
csv_data = [["'", "\n", ","], ['fox'],
            ['the', 'fox', 'jumped'],
            ['over', 'the', 'fence'],
            ['fox'], ['fence']]
key_words = ['over', 'fox']

words_list = []
for i in csv_data:
    for j in i:
        # keep only the alphabetic part of each token
        line_of_words = ",".join(re.findall("[a-zA-Z]+", j))
        words_list.append(line_of_words)

word_count = Counter(words_list)

# keep only the counts of the key words
match_dict = {}
for aword, freq in word_count.items():
    if aword in key_words:
        match_dict[aword] = freq
This results in:
print('Article words: ', words_list)
print('Article Word Count: ', word_count)
print('Matches: ', match_dict)
Article words: ['', 'n', '', 'fox', 'the', 'fox', 'jumped', 'over', 'the', 'fence', 'fox', 'fence']
Article Word Count: Counter({'fox': 3, '': 2, 'the': 2, 'fence': 2, 'n': 1, 'over': 1, 'jumped': 1})
Matches: {'over': 1, 'fox': 3}