Walk a directory, count words in all files and subdirectories, and accumulate the totals
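Hello, Stack Overflow community! For years I have used this community for small one-off projects for work, school, and personal exploration; however, this is the first question I have posted... so be gentle ;)
I am trying to read every file in a directory and all of its subdirectories, then accumulate the results into a single Python dictionary. Right now the script (see below) reads all the files as intended, but the results for each file are separate. I am looking for help accumulating them into one.
Code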
import re
import os
import sys
import os.path
import fnmatch
import collections

def search( file ):
    if os.path.isdir(path) == True:
        for root, dirs, files in os.walk(path):
            for file in files:
                # words = re.findall('\w+', open(file).read().lower())
                words = re.findall('\w+', open(os.path.join(root, file)).read().lower())
                ignore = ['the','a','if','in','it','of','or','on','and','to']
                counter=collections.Counter(x for x in words if x not in ignore)
                print(counter.most_common(10))
    else:
        words = re.findall('\w+', open(path).read().lower())
        ignore = ['the','a','if','in','it','of','or','on','and','to']
        counter=collections.Counter(x for x in words if x not in ignore)
        print(counter.most_common(10))

path = raw_input("Enter file and path")
Results
Enter file and path./dirTest
[('this', 1), ('test', 1), ('is', 1), ('just', 1)]
[('this', 1), ('test', 1), ('is', 1), ('just', 1)]
[('test', 2), ('is', 2), ('just', 2), ('this', 1), ('really', 1)]
[('test', 3), ('just', 2), ('this', 2), ('is', 2), ('power', 1),
('through', 1), ('really', 1)]
[('this', 2), ('another', 1), ('is', 1), ('read', 1), ('can', 1),
('file', 1), ('test', 1), ('you', 1)]
Desired result (example)
[('this', 5), ('another', 1), ('is', 5), ('read', 1), ('can', 1),
('file', 1), ('test', 5), ('you', 1), ('power', 1), ('through', 1),
('really', 2)]
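Any guidance would be greatly appreciated!
I see you are trying to find certain keywords in a file/directory scan and get the number of times each one occurs.
Basically, you can collect a list of all such occurrences and then find the count of each one like this: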
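def count_all(array):
    # Deduplicate first so each distinct word is reported once
    nodup = list(set(array))
    for i in nodup:
        print(i, array.count(i))

array = ['this','this','this','is','is']
# count_all prints its results itself and returns None, so call it directly
count_all(array)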
out:
('this', 3)
('is', 2)
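For comparison, here is a minimal sketch of the same tally done with collections.Counter, which the question already uses. It counts the list in a single pass instead of calling array.count once per distinct word. The sample list is the one from the snippet above; nothing else is assumed.

```python
from collections import Counter

array = ['this', 'this', 'this', 'is', 'is']

# Counter tallies every element in one pass over the list
counts = Counter(array)

# most_common() yields (word, count) pairs, highest count first
for word, n in counts.most_common():
    print(word, n)
```

Because a Counter is a dict subclass, the totals can also be looked up directly, for example counts['this'].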
The problem lies in your print statement and in how you use the Counter object. I suggest the following.
import collections
import os
import re

ignore = ['the', 'a', 'if', 'in', 'it', 'of', 'or', 'on', 'and', 'to']

def extract(file_path, counter):
    # Add the words of one file to the shared counter
    with open(file_path) as f:
        words = re.findall(r'\w+', f.read().lower())
    counter.update(x for x in words if x not in ignore)

def search(path):
    counter = collections.Counter()  # one Counter for the whole run
    if os.path.isdir(path):
        for root, dirs, files in os.walk(path):
            for file in files:
                extract(os.path.join(root, file), counter)
    else:
        extract(path, counter)
    print(counter.most_common(10))  # print once, after every file is counted
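You can factor the shared lines of code out into a function. Also, os.path.isdir(path) returns a bool, so you can use it directly in the if condition without comparing it to True.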
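Initial solution:
My solution was to append all the words to a single list and then use that list with Counter. That way you produce one output from all of your results.
Given the performance impact that @ShadowRanger mentioned, you can instead update the counter directly rather than building a separate list.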
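The code for this answer is not shown here, so the following is only a rough sketch of the two approaches it describes. The function names count_via_list and count_via_update and the sample word lists are made up for illustration; the ignore list is the one from the question.

```python
from collections import Counter

IGNORE = {'the', 'a', 'if', 'in', 'it', 'of', 'or', 'on', 'and', 'to'}

def count_via_list(word_lists):
    """Initial approach: gather every word into one list, then count once."""
    all_words = []
    for words in word_lists:
        all_words.extend(w for w in words if w not in IGNORE)
    return Counter(all_words)

def count_via_update(word_lists):
    """Revised approach: update one Counter directly, no intermediate list."""
    counter = Counter()
    for words in word_lists:
        counter.update(w for w in words if w not in IGNORE)
    return counter

# Hypothetical per-file word lists standing in for real file contents
sample = [['this', 'is', 'a', 'test'], ['this', 'is', 'just', 'a', 'test']]
assert count_via_list(sample) == count_via_update(sample)
```

Both functions produce the same totals. The second one avoids holding every word of every file in memory at the same time, which is presumably the performance concern raised in the comment.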
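It looks like you want a single Counter that holds all of the accumulated statistics and is printed at the end, but you are making a Counter for each file, printing it, and then throwing it away. You just need to move the Counter initialization and the printing outside the loop, and have each file only update the "one true Counter":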
def search(path):
    # Initialize empty Counter up front
    counter = collections.Counter()
    # Create ignore only once, and make it a set, so membership tests go faster
    ignore = {'the','a','if','in','it','of','or','on','and','to'}
    if os.path.isdir(path):  # Comparing to True is anti-pattern; removed
        for root, dirs, files in os.walk(path):
            for file in files:
                words = re.findall(r'\w+', open(os.path.join(root, file)).read().lower())
                # Update common Counter
                counter.update(x for x in words if x not in ignore)
    else:
        words = re.findall(r'\w+', open(path).read().lower())
        # Update common Counter
        counter.update(x for x in words if x not in ignore)
    # Do a single print at the end
    print(counter.most_common(10))
If you like, you can factor out the common code here, e.g.:
def update_counts_for_file(path, counter, ignore=()):
    with open(path) as f:  # Using with statements is good, always do it
        words = re.findall(r'\w+', f.read().lower())
    counter.update(x for x in words if x not in ignore)
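This lets you replace the repeated code with calls to the factored-out function, but unless the code becomes much more complex, it is probably not worth factoring out two lines that are repeated only twice.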
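For completeness, here is one way the factored-out helper and the single shared Counter might fit together into a self-contained script. This is a sketch under a few assumptions that are not in the answer: the name count_words, returning the Counter instead of printing it, and errors='ignore' on open (Python 3) so that files that are not valid text do not abort the walk.

```python
import os
import re
from collections import Counter

IGNORE = frozenset({'the', 'a', 'if', 'in', 'it', 'of', 'or', 'on', 'and', 'to'})

def update_counts_for_file(path, counter, ignore=()):
    # errors='ignore' is an assumption: skip undecodable bytes rather than fail
    with open(path, errors='ignore') as f:
        words = re.findall(r'\w+', f.read().lower())
    counter.update(x for x in words if x not in ignore)

def count_words(path, ignore=IGNORE):
    """Return one Counter covering a single file or a whole directory tree."""
    counter = Counter()
    if os.path.isdir(path):
        for root, dirs, files in os.walk(path):
            for name in files:
                update_counts_for_file(os.path.join(root, name), counter, ignore)
    else:
        update_counts_for_file(path, counter, ignore)
    return counter
```

With this in place, print(count_words('./dirTest').most_common(10)) prints one combined list for the directory from the question. Returning the Counter rather than printing inside the function keeps the totals available to the caller.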