Join together sentences from a parsed PDF
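I scraped some text from a PDF and parsed it, so everything is currently a list of strings. I would like to join together sentences that were returned as separate strings because of breaks on the PDF page. For example,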
list = ['I am a ', 'sentence.', 'Please join me toge-', 'ther. Thanks for your help.']
I would like:
list = ['I am a sentence.', 'Please join me together. Thanks for your help.']
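I currently have the following code, which joins some of the sentences, but the second fragment that was joined onto the first is still returned on its own. I know this is caused by the indexing, but I am not sure how to fix it.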
new = []
def cleanlist(dictlist):
    for i in range(len(dictlist)):
        if i>0:
            if dictlist[i-1][-1:] != ('.') or dictlist[i-1][-1:] != ('. '):
                new.append(dictlist[i-1]+dictlist[i])
            elif dictlist[i-1][-1:] == '-':
                new.append(dictlist[i-1]+dictlist[i])
            else:
                new.append[dict_list[i]]
You could use a generator approach:
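def cleanlist(dictlist):
    current = []  # fragments of the sentence being assembled
    for line in dictlist:
        if line.endswith("-"):
            # word hyphenated across a break: drop the hyphen and keep going
            current.append(line[:-1])
        elif line.endswith(" "):
            # trailing space: the sentence continues in the next fragment
            current.append(line)
        else:
            # anything else ends the run: emit the joined text and start over
            current.append(line)
            yield "".join(current)
            current = []
    if current:
        # flush a trailing fragment that ended with "-" or " "
        yield "".join(current)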
Use it like this:
dictlist = ['I am a ', 'sentence.', 'Please join me toge-', 'ther. Thanks for your help.']
print(list(cleanlist(dictlist)))
# ['I am a sentence.', 'Please join me together. Thanks for your help.']
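As for why the original attempt misbehaves, here is a small sketch of what I believe is going on. `dictlist[i-1][-1:]` is at most one character long, so it can never equal the two-character string `'. '`, and `x != a or x != b` is true for any `x` whenever `a` and `b` differ. The first branch therefore always runs, and each fragment gets appended once as the tail of one pair and again as the head of the next. The helper names `original_condition` and `intended_condition` below are mine, purely for illustration, and the second one is only a guess at what the check was meant to do.

```python
fragments = ['I am a ', 'sentence.', 'Please join me toge-', 'ther. Thanks for your help.']

def original_condition(text):
    # Same test as in the question: [-1:] yields at most one character.
    last = text[-1:]
    return last != '.' or last != '. '

def intended_condition(text):
    # Hypothetical rewrite (an assumption about the intent):
    # "this fragment does not end a sentence", ignoring trailing whitespace.
    return not text.rstrip().endswith('.')

results_original = [original_condition(f) for f in fragments]
results_intended = [intended_condition(f) for f in fragments]

print(results_original)  # [True, True, True, True]
print(results_intended)  # [True, False, True, False]
```

So the condition in the question cannot tell the fragments apart, which is why the generator version tracks the pieces of the current sentence instead of comparing neighbours by index.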