Search for sentences ending with specific marks and plot a frequency histogram
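I tried to make a frequency histogram of the sentences in a text that end with an exclamation mark, a question mark, or a period (so far I have simply counted how many times these characters occur in the text). The text is read from a file. My code looks like this: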
import matplotlib.pyplot as plt

text_file = 'text.txt'
marks = '?!.'
lcount = dict([(l, 0) for l in marks])
for l in open(text_file, encoding='utf8').read():
    try:
        lcount[l.upper()] += 1
    except KeyError:
        pass
norm = sum(lcount.values())

fig = plt.figure()
ax = fig.add_subplot(111)
x = range(3)
ax.bar(x, [lcount[l]/norm * 100 for l in marks], width=0.8,
       color='g', alpha=0.5, align='center')
ax.set_xticks(x)
ax.set_xticklabels(marks)
ax.tick_params(axis='x', direction='out')
ax.set_xlim(-0.5, 2.5)
ax.yaxis.grid(True)
ax.set_ylabel('Sentences with different ending frequency, %')
plt.show()
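But I can't count the sentences that end with an ellipsis (meaning ...): my code counts three characters, and therefore three sentences. Moreover, this counts symbols, not actual sentences. How can I improve it so that it counts sentences rather than marks, including the sentences that end with an ellipsis? Example of the file: Wanna play? Let's go! It will be definitely good! My friend also think so. However, I don't know... I don't like this.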
You can try splitting the sentences with a regex. The re.split() function works well here:
Sample code:
import re
string = "Wanna play? Let's go! It will be definitely good! My friend also think so. However, I don't know... I don't like this."
print(re.split(r'\.+\s*|!+\s*|\?+\s*', string))
Output:
['Wanna play', "Let's go", 'It will be definitely good', 'My friend also think so', "However, I don't know", "I don't like this", '']
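Note that re.split() discards the marks themselves, so the list above no longer says how each sentence ended. One possible way around that, sketched here as an assumption rather than as part of the original answer, is to wrap the marks in a capturing group, in which case re.split() also returns the matched separators. The variable names parts, sentences and endings are made up for illustration.

```python
import re

string = ("Wanna play? Let's go! It will be definitely good! "
          "My friend also think so. However, I don't know... I don't like this.")

# A capturing group in the pattern makes re.split() keep the separators.
parts = re.split(r'([.!?]+)\s*', string)

# parts alternates: sentence, mark, sentence, mark, ..., trailing remainder.
# The remainder (text after the last mark) is dropped here.
sentences = parts[0:-1:2]
endings = parts[1::2]

for sentence, ending in zip(sentences, endings):
    print(repr(ending), sentence)
```

With the ending of every sentence available as its own list item, an ellipsis shows up as a single '...' entry instead of three separate dots.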
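Edited answer, which keeps the ending mark with each sentence: re.findall(r'([^.?!]+\.+|[^.?!]+!+|[^.?!]+\?+)\s*', string)
Explanation:
() is used to capture and group the result of the sequence inside it. For example: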
import re
a = 'Hello World'
print(re.findall('l+o', a)) # Match = llo, Output = llo
print(re.findall('(l+)o', a)) # Match = llo, output = ll
Output:
['llo']
['ll']  # With parentheses, only the part inside them is returned
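[^.?!]+ matches a run of one or more characters other than '.', '?' and '!'. It matches all the words, and as soon as one of the three punctuation marks is encountered, the criterion fails and the match ends. E.g.
import re
a = 'Hello World! My name is Anshumaan. What is Your name?'
print(re.findall(r'[^.?!]+', a))
print(re.findall(r'([^.?!]+)\!+\s*', a))
print(re.findall(r'([^.?!]+\!+)\s*', a))
Output: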
['Hello World', ' My name is Anshumaan', ' What is Your name']
['Hello World']
['Hello World!']
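In the first case, matching starts from the left and every character up to the '!' matches, so those characters are returned. It then starts again from the space, because the space also satisfies the criterion, and continues up to the '.'.
In the next case the '!' is matched as well, but since only the word-matching part is inside the parentheses, the '!' is not returned (\s* matches zero or more whitespace characters).
In the third case, since \! is also inside the parentheses, the '!' is returned too.
Finally, there is the or block. Since we have 3 punctuation marks, there are 3 alternatives: a word/phrase followed by '.', a word/phrase followed by '!', and a word/phrase followed by '?'. They are all joined with the or character (|), and then, to filter out the whitespace, \s* is placed outside the parentheses.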
So,
re.findall(r'([^.?!]+\.+|[^.?!]+!+|[^.?!]+\?+)\s*', string)
can be read as:
find: ('(<character except [.!?] once or more>and<. once or more>)' or '(<character except [.!?] once or more>and<! once or more>)' or '(<character except [.!?] once or more>and<? once or more>)')<also look for whitespace but don't return it since it is not in the parentheses>
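To get from the matched sentences to the counts the question asks for, the endings can be fed into a counter. The following is a minimal sketch, not the answerer's code: the helper name count_endings is hypothetical, the pattern is a simplified variant that merges the three alternatives into one character class, and treating any run of two or more dots as an ellipsis is an assumption. Mixed runs such as ?! are simply classified by their first mark.

```python
import re
from collections import Counter

def count_endings(text):
    """Count sentences by their ending mark: '?', '!', '.' or '...'."""
    # The capture group holds only the run of marks that ends each sentence.
    endings = re.findall(r'[^.?!]+([.?!]+)', text)
    counts = Counter()
    for run in endings:
        if run.startswith('..'):
            counts['...'] += 1   # assumption: two or more dots are an ellipsis
        else:
            counts[run[0]] += 1  # classify by the first mark in the run
    return counts

text = ("Wanna play? Let's go! It will be definitely good! "
        "My friend also think so. However, I don't know... I don't like this.")
counts = count_endings(text)
total = sum(counts.values())
for mark in ('?', '!', '.', '...'):
    print(mark, counts[mark], f'{counts[mark] / total * 100:.1f}%')
```

Each sentence is counted once, and the ellipsis gets its own bucket. The resulting counts could stand in for lcount in the plotting code from the question, provided the number of bars and the x-axis limits are widened from three marks to four.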
Based on your example, and on the usual way of handling words in NLP (lemmatization and so on), you can first split the sentences on spaces:
marks = ['?', '!', '...', '.']  # '...' comes before '.', so the ellipsis is checked first
mark_count = {}
sentences = 0
with open('text.txt', encoding='utf8') as file:
    for example in file:
        words = example.split(" ")  # all words, together with any attached marks
        # Look for an ending mark in each word
        for word in words:
            for mark in marks:
                if mark in word:
                    # Count the end of a sentence and break as soon as a mark is
                    # found: for a word like "know..." the ellipsis is matched
                    # first, and we stop so that its dots are not counted again.
                    sentences += 1
                    mark_count[mark] = mark_count.get(mark, 0) + 1
                    break
Edit:
To make sure the labels and the values are paired correctly:
import matplotlib.pyplot as plt

columns = []
values = []
for key in mark_count.keys():
    columns.append(key)
    values.append(mark_count[key])
plt.bar(columns, values, color='green')
plt.show()
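One caveat with the "mark in word" test is that it also fires on marks in the middle of a token, such as the dot in 3.14. A possible refinement, sketched under the assumption that a sentence always ends at the end of a whitespace-separated token, is to look only at the end of each token with str.endswith(). The names MARKS and count_by_token_ending are hypothetical, and abbreviations such as "e.g." or marks followed by a closing quote are still not handled.

```python
from collections import Counter

MARKS = ('...', '?', '!', '.')  # ellipsis first, so it wins over a single dot

def count_by_token_ending(text):
    """Count tokens that end with one of MARKS, one ending per token."""
    counts = Counter()
    for word in text.split():       # split on any whitespace
        for mark in MARKS:
            if word.endswith(mark):
                counts[mark] += 1
                break               # stop at the first (highest-priority) mark
    return counts

text = "Pi is 3.14 roughly. Wanna play? I don't know... Let's go!"
counts = count_by_token_ending(text)
print(dict(counts))
```

Here the token 3.14 is ignored because it does not end with a mark, while know... is counted once as an ellipsis.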