Searching for sentences ending with specific marks and plotting a frequency histogram

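I tried to build a frequency histogram of sentences ending with an exclamation mark, a question mark, or a period (I simply counted how many times these characters occur in the text). The text is read from a file. My code is shown below: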

import matplotlib.pyplot as plt
 
text_file = 'text.txt'
 
marks = '?!.'
lcount = dict([(l, 0) for l in marks])
 
 
for l in open(text_file, encoding='utf8').read():
    try:
        lcount[l.upper()] += 1
    except KeyError:
        pass
norm = sum(lcount.values())
 
fig = plt.figure()
ax = fig.add_subplot(111)
x = range(3)
ax.bar(x, [lcount[l]/norm * 100 for l in marks], width=0.8,
       color='g', alpha=0.5, align='center')
ax.set_xticks(x)
ax.set_xticklabels(marks)
ax.tick_params(axis='x', direction='out')
ax.set_xlim(-0.5, 2.5)
ax.yaxis.grid(True)
ax.set_ylabel('Sentences with different ending frequency, %')
plt.show()

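However, I can't correctly count sentences that end with an ellipsis (i.e. ...): my code counts three characters and therefore three sentences. Besides, this counts symbols rather than actual sentences. How can I improve it so that it counts sentences instead of marks, including sentences that end with an ellipsis? Sample file: Wanna play? Let's go! It will be definitely good! My friend also think so. However, I don't know... I don't like this.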

You can try splitting the sentences with a regex. The re.split() function works well here. Sample code:

import re
string = "Wanna play? Let's go! It will be definitely good! My friend also think so. However, I don't know... I don't like this."
print(re.split(r'\.+\s*|!+\s*|\?+\s*', string))  # raw string avoids invalid-escape warnings

Output:

['Wanna play', "Let's go", 'It will be definitely good', 'My friend also think so', "However, I don't know", "I don't like this", '']
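The trailing empty string appears because the text itself ends with a delimiter. Here is a minimal sketch that drops empty pieces before counting, assuming that is the desired behaviour. The helper name split_sentences is made up for illustration, and the pattern is a slightly condensed character-class form of the one above. Note that re.split() discards the marks, so this only gives the total number of sentences, not which mark ended each one.

```python
import re

def split_sentences(text):
    """Split text on runs of '.', '!' or '?' and drop empty pieces."""
    parts = re.split(r'[.!?]+\s*', text)
    return [p for p in parts if p]

string = "Wanna play? Let's go! It will be definitely good! My friend also think so. However, I don't know... I don't like this."
sentences = split_sentences(string)
print(len(sentences))  # number of sentences, the ellipsis counted once
```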

Edited answer: re.findall(r'([^.?!]+\.+|[^.?!]+!+|[^.?!]+\?+)\s*', string)

Explanation:

  1. () captures and groups the result of the sequence inside it. For example:
import re

a = 'Hello World'
print(re.findall('l+o', a))   # Match = llo, Output = llo
print(re.findall('(l+)o', a)) # Match = llo, output = ll

Output:

['llo']
['ll']  # With parenthesis, only the part inside them is returned
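  2. [^.?!]+ matches one or more characters other than ., ? and !. It therefore matches all the words, and as soon as one of the three punctuation marks is encountered, the condition fails and the match ends. E.g.: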
import re

a = 'Hello World! My name is Anshumaan. What is Your name?'
print(re.findall('[^.?!]+', a))
print(re.findall(r'([^.?!]+)\!+\s*', a))
print(re.findall(r'([^.?!]+\!+)\s*', a))

Output:

['Hello World', ' My name is Anshumaan', ' What is Your name']
['Hello World']
['Hello World!']

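In the first case, matching starts from the left, and every character up to the ! matches, so those characters are returned. It then resumes from the space, which also satisfies the condition, and continues until the . is reached.
In the next case, ! is also matched, but since only the word-matching part is inside the parentheses, ! is not returned (\s* matches zero or more whitespace characters). In the third case, since \! is also inside the parentheses, ! is returned as well.

  3. Finally, the or block. Since we have 3 punctuation marks, there are 3 conditions: word/phrase followed by ., word/phrase followed by !, and word/phrase followed by ?. They are joined with the or character (|), and to filter out whitespace, \s* is placed outside the parentheses.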

So,

re.findall(r'([^.?!]+\.+|[^.?!]+!+|[^.?!]+\?+)\s*', string)

can be read as:

find: ('(<character except [.!?] once or more>and<. once or more>)' or '(<character except [.!?] once or more>and<! once or more>)' or '(<character except [.!?] once or more>and<? once or more>)')<also look for whitespace but don't return it since it is not in the parentheses>
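To get from the matched sentences to per-mark counts, the results can be tallied by how each one ends. The following is only a sketch: the function name count_endings, the use of collections.Counter, and the rule that two or more trailing dots count as one ellipsis are my assumptions, not part of the answer above.

```python
import re
from collections import Counter

# Same pattern as above: the sentence body plus its ending marks are captured.
SENTENCE_RE = re.compile(r'([^.?!]+\.+|[^.?!]+!+|[^.?!]+\?+)\s*')

def count_endings(text):
    """Count sentences by ending mark; two or more dots count as an ellipsis."""
    counts = Counter()
    for sentence in SENTENCE_RE.findall(text):
        if sentence.endswith('..'):
            counts['...'] += 1           # assumption: '..' or longer is an ellipsis
        else:
            counts[sentence[-1]] += 1    # '.', '!' or '?'
    return counts

string = "Wanna play? Let's go! It will be definitely good! My friend also think so. However, I don't know... I don't like this."
lcount = count_endings(string)
print(lcount)
```

For the sample string this yields one '?', two '!', two '.' and one '...', i.e. six sentences in total, with the ellipsis counted once instead of three times.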

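Based on your example, and following a common way of handling words in NLP (lemmatization and so on), you could first split the sentences on spaces: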

marks = ['?', '!', '...', '.']  # The ellipsis comes before '.', so it is checked first
mark_count = {}
sentences = 0

# Assumption: the text lives in 'text.txt', as in the question
with open('text.txt', encoding='utf8') as file:
    for example in file:
        words = example.split()  # All words, each still carrying its punctuation

        # Find the marks in each word
        for word in words:
            for mark in marks:
                if mark in word:
                    # A word such as "know..." matches the ellipsis before the
                    # single dot is tested, so we count one sentence and break.
                    sentences += 1
                    mark_count[mark] = mark_count.get(mark, 0) + 1
                    break

Edit:

To make sure the labels and values are paired correctly:

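import matplotlib.pyplot as plt

# Build labels and values in the same loop so each bar matches its mark
columns = []
values = []
for key in mark_count.keys():
    columns.append(key)
    values.append(mark_count[key])

plt.bar(columns, values, color='green')
plt.show()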
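The question plots percentages rather than raw counts. If that is still wanted, the counts can be normalised in a fixed label order before plotting. This is a hedged sketch: the helper to_percentages and its default order are assumptions for illustration, and the input dictionary is just the sample counts, not real data from a file.

```python
def to_percentages(mark_count, order=('?', '!', '.', '...')):
    """Return (labels, percentages) in a fixed order; missing marks count as 0."""
    total = sum(mark_count.get(mark, 0) for mark in order)
    labels = list(order)
    if total == 0:
        # Avoid dividing by zero when no sentence was found
        return labels, [0.0] * len(labels)
    values = [100 * mark_count.get(mark, 0) / total for mark in order]
    return labels, values

# Example counts for the sample text from the question
labels, values = to_percentages({'?': 1, '!': 2, '.': 2, '...': 1})
print(labels, values)
```

The two lists can then be passed to plt.bar(labels, values) in place of columns and values, which keeps the bars in the same order from one run to the next.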