NLTK PunktSentenceTokenizer ellipsis splitting
I'm using the NLTK PunktSentenceTokenizer and I'm facing a situation where a text contains multiple sentences separated by the ellipsis character (...). Here is the example I'm working with:
>>> from nltk.tokenize import PunktSentenceTokenizer
>>> pst = PunktSentenceTokenizer()
>>> pst.sentences_from_text("Horrible customer service... Cashier was rude... Drive thru took hours... The tables were not clean...")
['Horrible customer service... Cashier was rude... Drive thru took hours... The tables were not clean...']
As you can see, the sentences are not split. Is there a way to make this work as I expect (i.e. return a list containing four items)?
Additional information: I tried using the debug_decisions method to understand why these decisions are made, and I got the following results:
>>> g = pst.debug_decisions("Horrible customer service... Cashier was rude... Drive thru took hours... The tables were not clean...")
>>> [x for x in g]
[{'break_decision': None,
'collocation': False,
'period_index': 27,
'reason': 'default decision',
'text': 'service... Cashier',
'type1': '...',
'type1_in_abbrs': False,
'type1_is_initial': False,
'type2': 'cashier',
'type2_is_sent_starter': False,
'type2_ortho_contexts': set(),
'type2_ortho_heuristic': 'unknown'},
{'break_decision': None,
'collocation': False,
'period_index': 47,
'reason': 'default decision',
'text': 'rude... Drive',
'type1': '...',
'type1_in_abbrs': False,
'type1_is_initial': False,
'type2': 'drive',
'type2_is_sent_starter': False,
'type2_ortho_contexts': set(),
'type2_ortho_heuristic': 'unknown'},
{'break_decision': None,
'collocation': False,
'period_index': 72,
'reason': 'default decision',
'text': 'hours... The',
'type1': '...',
'type1_in_abbrs': False,
'type1_is_initial': False,
'type2': 'the',
'type2_is_sent_starter': False,
'type2_ortho_contexts': set(),
'type2_ortho_heuristic': 'unknown'}]
Unfortunately, I can't make sense of these dictionaries. The tokenizer does seem to detect the ellipses, but for some reason it decides not to split the sentences at them. Any ideas?
Thanks!
Why don't you just use the split function?
str.split('...')
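For instance, a minimal sketch using plain string handling (no NLTK at all). Note that split discards the separator, so the ellipsis has to be re-attached and the leading whitespace stripped:

text = ("Horrible customer service... Cashier was rude... "
        "Drive thru took hours... The tables were not clean...")

# Split on the ellipsis, drop the empty trailing piece,
# and re-attach the "..." that split() discards.
sentences = [part.strip() + "..." for part in text.split("...") if part.strip()]
print(sentences)
# ['Horrible customer service...', 'Cashier was rude...',
#  'Drive thru took hours...', 'The tables were not clean...']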
Edit: I got it to work by training the tokenizer on the Reuters corpus; I imagine you could train it on your own data instead:
from nltk.tokenize import PunktSentenceTokenizer
from nltk.corpus import reuters
# Pass the training text to the constructor so the learned parameters
# are actually applied; a bare pst.train(...) call only returns them.
pst = PunktSentenceTokenizer(train_text=reuters.raw())
text = "Batts did not take questions or give details of the report's findings... He did say that the city's police department would continue to work on the case under the direction of the prosecutor's office. Gray was injured around the time he was arrested by Baltimore police and put in a police van on 12 April."
print(pst.sentences_from_text(text))
Result:
>>> ["Batts did not take questions or give details of the report's findings...", "He did say that the city's police department would continue to work on the case under the direction of the prosecutor's office.", 'Gray was injured around the time he was arrested by Baltimore police and put in a police van on 12 April.']