How can you use Python to count the unique words (without special characters/cases interfering) in a text document

Question

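I'm new to Python and need some help developing a text content analyzer that finds 7 things in a text file:

Total word count
Total count of unique words (without case and special characters interfering)
Number of sentences
Average number of words in a sentence
Frequently used phrases (a phrase of 3 or more words used more than 3 times)
A list of the words used, in order of descending frequency (without case and special characters interfering)
The ability to accept input from STDIN or from a file specified on the command line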

So far I have this Python program to print the total word count:

with open('/Users/name/Desktop/20words.txt', 'r') as f:
    p = f.read()
    words = p.split()  # split on any whitespace
    wordCount = len(words)
    print("The total word count is:", wordCount)

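So far I have this Python program to print the unique words and their frequencies (the output is not in order, and it treats tokens such as dog, dog., "dog, and dog, as different words):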

with open("/Users/name/Desktop/20words.txt", "r") as file:
    wordcount = {}
    # Tally each whitespace-separated token exactly as it appears in the file
    for word in file.read().split():
        if word not in wordcount:
            wordcount[word] = 1
        else:
            wordcount[word] += 1

for k, v in wordcount.items():
    print(k, v)

Thank you for any help you can offer!

Answer 1

If you know which characters you want to avoid, you can use str.strip to remove them from both ends of each word.

word = word.strip().strip("'").strip('"')...

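This removes those characters wherever they occur at either end of the word. It may not be as effective as using an NLP library, but it gets the job done.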
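As a minimal sketch of that idea, the chained strip calls can be collapsed into a single call that is passed every character to remove. Using string.punctuation as that character set, and the helper name normalize, are assumptions made for illustration; adjust the set to whatever you consider noise.

```python
import string
from collections import Counter

def normalize(word):
    """Lower-case a token and strip punctuation from both ends only."""
    # str.strip never touches the middle of the token, so "don't" keeps its apostrophe.
    return word.strip(string.punctuation).lower()

text = 'The dog saw a dog. "Dog," she said, and the dog, too.'

# Normalize every whitespace-separated token; drop tokens that were pure punctuation.
tokens = [w for w in (normalize(t) for t in text.split()) if w]
counts = Counter(tokens)

print(len(counts))     # number of unique words
print(counts["dog"])   # dog, dog., "Dog," and dog, now share one entry
```

Because only the ends are stripped, hyphenated or apostrophized words stay intact, which is usually what you want for word counts.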

str.strip Docs

Answer 2

The hardest part, of course, is identifying the sentences. You could use a regular expression for this, but some ambiguity may remain, e.g. with names and titles that have a dot followed by an upper-case letter. For words, too, you can use a simple regex instead of split. The exact expression to use depends on what qualifies as a "word". Finally, you can use collections.Counter to count all of these instead of doing it by hand. Use str.lower to convert either the whole text or the individual words to lowercase.

This should help you get started:

import re, collections

text = """Sentences start with an upper-case letter. Do they always end 
with a dot? No! Also, not each dot is the end of a sentence, e.g. these two, 
but this is. Still, some ambiguity remains with names, like Mr. Miller here."""

# A sentence: upper-case letter, then the shortest run up to . ! or ?
# that is followed by whitespace plus an upper-case letter, or by the end of the text
sentence = re.compile(r"[A-Z].*?[.!?](?=\s+[A-Z]|$)", re.S)
sentences = collections.Counter(sentence.findall(text))
for s, n in sentences.most_common():
    print(n, s)

# A word: one or more letters, digits or underscores, matched on the lower-cased text
word = re.compile(r"\w+")
words = collections.Counter(word.findall(text.lower()))
for w, n in words.most_common():
    print(n, w)
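The same two patterns can also yield the plain totals asked for in the question. The following is a rough sketch, assuming the regex definitions of "word" and "sentence" above are acceptable; the sample text and the variable names are made up for illustration.

```python
import re

text = "The dog barks. The cat sleeps! Does the dog sleep too?"

# Same patterns as in the answer; what counts as a word or sentence is an assumption.
sentence_re = re.compile(r"[A-Z].*?[.!?](?=\s+[A-Z]|$)", re.S)
word_re = re.compile(r"\w+")

sentences = sentence_re.findall(text)
words = word_re.findall(text.lower())

total_words = len(words)
unique_words = len(set(words))  # a set keeps one copy of each word
# Guard against texts in which no sentence was recognized.
avg_words_per_sentence = total_words / len(sentences) if sentences else 0.0

print("Total words:", total_words)
print("Unique words:", unique_words)
print("Sentences:", len(sentences))
print("Average words per sentence:", round(avg_words_per_sentence, 2))
```

Note that the average divides all words found by the number of recognized sentences, so text the sentence pattern fails to match still contributes to the word total.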

For "more power", you could use a natural language toolkit, but that might be a bit much for this task.
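Without any toolkit, collections.Counter can probably also cover the repeated-phrase item from the question by counting sliding windows of words. This is only a sketch: the function name common_phrases and its parameters are hypothetical, it counts phrases of exactly n words (call it again with a larger n for longer phrases), and it lets phrases run across sentence boundaries.

```python
import re
from collections import Counter

def common_phrases(text, n=3, min_count=4):
    """Return (phrase, count) pairs for n-word phrases seen at least min_count times."""
    words = re.findall(r"\w+", text.lower())
    # Slide a window of n words over the token list and join each window into a phrase.
    ngrams = (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    counts = Counter(ngrams)
    return [(p, c) for p, c in counts.most_common() if c >= min_count]

sample = "the quick brown fox. " * 4 + "The lazy dog."
phrases = common_phrases(sample)
for phrase, count in phrases:
    print(count, phrase)
```

The default min_count=4 reflects "used more than 3 times" from the question; splitting the text into sentences first would keep phrases from spanning two sentences.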
