Search of sentences ending with specific marks and frequency histogram

Question

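I tried to make a frequency histogram of the sentences in a text that end with an exclamation mark, a question mark, or a period (I simply counted how many times each of these characters occurs in the text). The text is read from a file. The code I have so far looks like this: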

import matplotlib.pyplot as plt
 
text_file = 'text.txt'
 
marks = '?!.'
lcount = dict([(l, 0) for l in marks])
 
 
for l in open(text_file, encoding='utf8').read():
    try:
        lcount[l.upper()] += 1
    except KeyError:
        pass
norm = sum(lcount.values())
 
fig = plt.figure()
ax = fig.add_subplot(111)
x = range(3)
ax.bar(x, [lcount[l]/norm * 100 for l in marks], width=0.8,
       color='g', alpha=0.5, align='center')
ax.set_xticks(x)
ax.set_xticklabels(marks)
ax.tick_params(axis='x', direction='out')
ax.set_xlim(-0.5, 2.5)
ax.yaxis.grid(True)
ax.set_ylabel('Sentences with different ending frequency, %')
plt.show()

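However, I cannot count sentences that end with an ellipsis (that is, ...): my code counts three characters and therefore three sentences. Moreover, it counts symbols rather than actual sentences. How can I improve it so that it counts sentences, including those ending with an ellipsis, instead of counting marks? Example file contents: Wanna play? Let's go! It will be definitely good! My friend also think so. However, I don't know... I don't like this.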

Answer 1

You can try splitting the text into sentences with a regex. The re.split() function works well here. Example code:

import re
string = "Wanna play? Let's go! It will be definitely good! My friend also think so. However, I don't know... I don't like this."
print(re.split(r'\.+\s*|!+\s*|\?+\s*', string))

Output:

['Wanna play', "Let's go", 'It will be definitely good', 'My friend also think so', "However, I don't know", "I don't like this", '']
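Note that re.split() as used above throws the ending marks away, so the result cannot tell you which mark closed each sentence. A minimal sketch of one way around this, assuming the same sample text: wrapping the marks in a capturing group makes re.split() return the delimiters too. The names text, parts, endings and pairs are illustrative only.

```python
import re

# Sample text taken from the answer above.
text = ("Wanna play? Let's go! It will be definitely good! "
        "My friend also think so. However, I don't know... I don't like this.")

# The capturing group makes re.split() keep the matched marks in the result.
# The list alternates: sentence text, ending mark, sentence text, ending mark, ...
parts = re.split(r'([.!?]+)\s*', text)

# Even positions hold the sentence text, odd positions hold the ending marks.
endings = parts[1::2]

# zip() stops at the shorter sequence, which drops the trailing empty string.
pairs = list(zip(parts[0::2], parts[1::2]))

for sentence, mark in pairs:
    print(repr(sentence), repr(mark))
```

A run such as ... stays together as one ending because the pattern uses [.!?]+, so an ellipsis counts as a single sentence end.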

Edited answer: re.findall(r'([^.?!]+\.+|[^.?!]+!+|[^.?!]+\?+)\s*', string)

Explanation:

Parentheses () capture and group the result of the sequence inside them. For example:

import re

a = 'Hello World'
print(re.findall('l+o', a))   # Match = llo, Output = llo
print(re.findall('(l+)o', a)) # Match = llo, output = ll

Output:

['llo']
['ll']  # With parentheses, only the part inside them is returned

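[^.?!]+ means one or more characters other than ., ? and !. This matches all the words; as soon as one of the three punctuation marks is encountered, the criterion fails and the match stops. E.g.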

import re

a = 'Hello World! My name is Anshumaan. What is Your name?'
print(re.findall(r'[^.?!]+', a))
print(re.findall(r'([^.?!]+)\!+\s*', a))
print(re.findall(r'([^.?!]+\!+)\s*', a))

Output:

['Hello World', ' My name is Anshumaan', ' What is Your name']
['Hello World']
['Hello World!']

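In the first case, the search starts from the left; every character up to the ! matches, so those characters are returned. It then resumes from the space, since a space also satisfies the criterion, and continues until the next punctuation mark (the .).
In the second case, the ! is matched too, but because only the word-matching part is inside the parentheses, the ! is not returned (\s* matches zero or more whitespace characters). In the third case, \! is also inside the parentheses, so the ! is returned as well.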

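Finally, the or block. Since there are three punctuation marks, there are three alternatives: a word/phrase followed by ., a word/phrase followed by !, and a word/phrase followed by ?. They are joined with the or character (|), and to filter out the whitespace, \s* is placed outside the parentheses.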

So,

re.findall(r'([^.?!]+\.+|[^.?!]+!+|[^.?!]+\?+)\s*', string)

can be read as:

find: ('(<any character except [.!?], once or more> and <. once or more>)' or '(<any character except [.!?], once or more> and <! once or more>)' or '(<any character except [.!?], once or more> and <? once or more>)') <also match trailing whitespace, but do not return it, since it is outside the parentheses>
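To connect this back to the question, the sentences returned by the pattern above can be grouped by their ending mark and turned into percentages. The following is only a sketch under the assumption that the sample text is representative; ending_of is a hypothetical helper, not part of the re module.

```python
import re
from collections import Counter

# Sample text taken from the answer above.
text = ("Wanna play? Let's go! It will be definitely good! "
        "My friend also think so. However, I don't know... I don't like this.")

# Same pattern as in the answer: each match is one sentence including its mark(s).
sentences = re.findall(r'([^.?!]+\.+|[^.?!]+!+|[^.?!]+\?+)\s*', text)


def ending_of(sentence):
    """Return '...' for an ellipsis, otherwise the last character (hypothetical helper)."""
    if sentence.endswith('...'):
        return '...'
    return sentence[-1]


# One count per sentence, not per punctuation character.
counts = Counter(ending_of(s) for s in sentences)
total = sum(counts.values())

# Share of each ending in percent, ready to be used as bar heights.
percentages = {mark: 100 * n / total for mark, n in counts.items()}
print(counts)
print(percentages)
```

The keys and values of percentages can then be passed to ax.bar() in place of the per-character counts used in the question.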

Answer 2

Based on your example, and on the usual way of processing words in NLP (lemmatization and so on), you could first split the sentences on spaces:

marks = ['?', '!', '...', '.']  # The ellipsis comes before the single dot so it is checked first
mark_count = {}
sentences = 0

with open('text.txt', encoding='utf8') as file:
    for example in file:
        words = example.split()  # All words, with any punctuation still attached

        # Look for an end-of-sentence mark in each word
        for word in words:
            for mark in marks:
                if mark in word:
                    # Count the end of a sentence and stop at the first mark found:
                    # for a word like "know..." the ellipsis is matched first, and
                    # the break keeps the single dot from being counted as well.
                    sentences += 1
                    mark_count[mark] = mark_count.get(mark, 0) + 1
                    break
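One caveat with the loop above: `mark in word` is true when the mark appears anywhere in the word, so a token such as 3.14 would be counted as a sentence end. A possible variation, sketched here with an in-memory list of lines instead of a file, checks only the end of each word. The function name count_endings and the reordered marks list are assumptions made for illustration.

```python
# Ellipsis first, so "know..." is classified as '...' and not as '.'.
marks = ['...', '?', '!', '.']


def count_endings(lines):
    """Count words that end with one of the marks (hypothetical helper)."""
    counts = {}
    for line in lines:
        for word in line.split():
            for mark in marks:
                if word.endswith(mark):
                    counts[mark] = counts.get(mark, 0) + 1
                    break  # one sentence end per word at most
    return counts


lines = [
    "Wanna play? Let's go! It will be definitely good!",
    "My friend also think so. However, I don't know... I don't like this.",
]
counts = count_endings(lines)
print(counts)
```

This still treats abbreviations that end in a dot as sentence ends, so it remains a heuristic rather than a full sentence splitter.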

Edit:

To make sure the labels and the values are paired correctly:

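import matplotlib.pyplot as plt

# Build the labels and the heights in the same order so each bar gets the right count
columns = []
values = []
for key, value in mark_count.items():
    columns.append(key)
    values.append(value)

plt.bar(columns, values, color='green')
plt.show()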
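If the bars should always appear in the same order, and marks that never occur should still show up with a height of zero, the labels can be fixed up front instead of being taken from the dictionary. A small sketch, assuming example counts; the mark_count values and the order list here are made up for illustration.

```python
# Example counts, assumed for illustration; '?' is deliberately missing.
mark_count = {'!': 2, '.': 2, '...': 1}

# Fixed display order for the bars (an arbitrary choice).
order = ['.', '...', '!', '?']

# Missing marks get a count of 0 instead of raising KeyError.
values = [mark_count.get(mark, 0) for mark in order]

# Convert to percentages, guarding against an empty text.
total = sum(values)
heights = [100 * v / total if total else 0.0 for v in values]

for mark, height in zip(order, heights):
    print(f"{mark!r}: {height:.1f}%")
```

Here order and heights play the role of columns and values in the plt.bar() call above.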
