How can I effectively pull out human readable strings/terms from code automatically?

Question

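I'm trying to determine the most commonly used words, or "terms" (I think that's the right name), as I iterate over many different files.

Example - for this line of code found in a file: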

for w in sorted(strings, key=strings.get, reverse=True):

I would want these unique strings/terms returned, to be entered into my dictionary as keys:

for
w
in
sorted
strings
key
strings
get
reverse
True

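However, I want this code to be tunable, so that I can also return strings with periods or other characters between them, because I won't know what makes sense until I run the script and count up the "terms" a few times: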

strings.get

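How can I approach this problem? It would help to understand how I can do this one line at a time, so that I can loop over it as I read in the file's lines. I've got the basic logic down, but I'm currently just tallying by unique line instead of by "term":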

strings = dict()
fname = '/tmp/bigfile.txt'

with open(fname, "r") as f:
    for line in f:
        if line in strings:
            strings[line] += 1
        else:
            strings[line] = 1

for w in sorted(strings, key=strings.get, reverse=True):
    print(str(w).rstrip() + " : " + str(strings[w]))

(Yes, I used the code from my little snippet here as the example at the top.)

Answer 1

If the only Python tokens you want to keep together are the object.attr construct, then all the tokens you are interested in fit the regular expression

\w+\.?\w*

This basically means "one or more word characters (letters, digits, or _), optionally followed by a . and then zero or more further word characters".
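As a quick check, here is the pattern applied to the example line from the question. This is only a minimal sketch; the names `line` and `tokens` are illustrative.

```python
import re

pattern = re.compile(r"\w+\.?\w*")

# The example line from the question
line = "for w in sorted(strings, key=strings.get, reverse=True):"

# With no capture groups in the pattern, findall returns the full matches in order
tokens = pattern.findall(line)
print(tokens)
# ['for', 'w', 'in', 'sorted', 'strings', 'key', 'strings.get', 'reverse', 'True']
```

Here strings.get stays together as one token, while the bare strings before the comma is counted on its own.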

Note that this will also match number literals like 42 or 7.6, but those are easy enough to filter out afterwards.
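One possible way to do that filtering is to drop every token that consists only of digits with an optional decimal point. This is a sketch under that assumption; the `number` pattern and the sample line are made up for illustration, and other literal forms such as hex or exponent notation would need a broader check.

```python
import re

pattern = re.compile(r"\w+\.?\w*")
# Assumed definition of "number literal": digits, optional dot, optional digits
number = re.compile(r"\d+\.?\d*")

line = "x = 42 + 7.6 * self.scale"

# Keep only the tokens that are not plain numeric literals
tokens = [t for t in pattern.findall(line) if not number.fullmatch(t)]
print(tokens)  # ['x', 'self.scale']
```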

Then you can use collections.Counter to do the actual counting for you:

import collections
import re

pattern = re.compile(r"\w+\.?\w*")

#here I'm using the source file for `collections` as the test example
with open(collections.__file__, "r") as f:
    tokens = collections.Counter(t.group() for t in pattern.finditer(f.read()))
    for token, count in tokens.most_common(5): #show only the top 5
        print(token, count)

Running this on Python version 3.6.0a1, the output is:

self 226
def 173
return 170
self.data 129
if 102

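This makes sense for the collections module, since it is full of classes that use self and define methods. It also shows that the pattern does capture self.data, which fits the construct you are interested in.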
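Since the question asks for something tunable that works one line at a time, the same idea could be wrapped up roughly as follows. This is a sketch, not the only way to do it: `build_pattern`, `count_terms`, and the `joiners` parameter are hypothetical names, and the pattern is slightly different from the one above in that a joiner character only counts when another word follows it, so chains like a.b.c stay together and a trailing "." is not captured.

```python
import collections
import re

def build_pattern(joiners="."):
    """Build a regex for words optionally chained by any character in `joiners`."""
    if not joiners:
        # No joiners: every run of word characters is its own term
        return re.compile(r"\w+")
    joiner_class = "[" + re.escape(joiners) + "]"
    # A joiner is only consumed when it is followed by at least one word character
    return re.compile(r"\w+(?:" + joiner_class + r"\w+)*")

def count_terms(lines, joiners="."):
    """Count terms across an iterable of lines, processing one line at a time."""
    pattern = build_pattern(joiners)
    counts = collections.Counter()
    for line in lines:
        counts.update(pattern.findall(line))
    return counts

sample = [
    "for w in sorted(strings, key=strings.get, reverse=True):",
    "    print(strings[w])",
]
counts = count_terms(sample, joiners=".")
for term, count in counts.most_common(3):
    print(term, count)
```

Because a file object is itself an iterable of lines, an open file can be passed directly in place of `sample`, for example `count_terms(f)` inside a `with open(fname) as f:` block. Rerunning with a different `joiners` value (say "" or ".-") is a cheap way to compare what the resulting terms look like.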

