NLTK PunktSentenceTokenizer ellipsis splitting

I'm using the NLTK PunktSentenceTokenizer, and I have a text containing multiple sentences separated by an ellipsis (...) that it does not split. Here is the example I'm working with:

>>> from nltk.tokenize import PunktSentenceTokenizer
>>> pst = PunktSentenceTokenizer()
>>> pst.sentences_from_text("Horrible customer service... Cashier was rude... Drive thru took hours... The tables were not clean...")
['Horrible customer service... Cashier was rude... Drive thru took hours... The tables were not clean...']


Additional information: I tried using debug_decisions to understand why it makes this decision. I get the following result:

>>> g = pst.debug_decisions("Horrible customer service... Cashier was rude... Drive thru took hours... The tables were not clean...")

>>> [x for x in g]
[{'break_decision': None,
  'collocation': False,
  'period_index': 27,
  'reason': 'default decision',
  'text': 'service... Cashier',
  'type1': '...',
  'type1_in_abbrs': False,
  'type1_is_initial': False,
  'type2': 'cashier',
  'type2_is_sent_starter': False,
  'type2_ortho_contexts': set(),
  'type2_ortho_heuristic': 'unknown'},
 {'break_decision': None,
  'collocation': False,
  'period_index': 47,
  'reason': 'default decision',
  'text': 'rude... Drive',
  'type1': '...',
  'type1_in_abbrs': False,
  'type1_is_initial': False,
  'type2': 'drive',
  'type2_is_sent_starter': False,
  'type2_ortho_contexts': set(),
  'type2_ortho_heuristic': 'unknown'},
 {'break_decision': None,
  'collocation': False,
  'period_index': 72,
  'reason': 'default decision',
  'text': 'hours... The',
  'type1': '...',
  'type1_in_abbrs': False,
  'type1_is_initial': False,
  'type2': 'the',
  'type2_is_sent_starter': False,
  'type2_ortho_contexts': set(),
  'type2_ortho_heuristic': 'unknown'}]



Why don't you just use the split function? str.split('...')
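One caveat with a plain split: on the review text from the question, text.split('...') drops the ellipses, leaves a leading space on each piece, and produces a trailing empty string. A minimal sketch of a regex variant that keeps the ellipsis with each sentence is below. The helper name split_on_ellipsis is hypothetical, and it assumes every "..." followed by whitespace ends a sentence, which will not hold for all text.

```python
import re


def split_on_ellipsis(text):
    """Split after each '...' that is followed by whitespace.

    Hypothetical helper: it assumes an ellipsis followed by whitespace
    always marks a sentence boundary.
    """
    # The lookbehind keeps '...' attached to the preceding sentence;
    # only the whitespace after it is consumed by the split.
    return re.split(r"(?<=\.\.\.)\s+", text.strip())


review = ("Horrible customer service... Cashier was rude... "
          "Drive thru took hours... The tables were not clean...")

for sentence in split_on_ellipsis(review):
    print(sentence)
```

This only handles the ellipsis case. Text that also contains ordinary periods, abbreviations, and so on would still need a real sentence tokenizer.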


from nltk.tokenize import PunktSentenceTokenizer
from nltk.corpus import reuters  # needs nltk.download('reuters')

# Assumption: the snippet as posted never uses the reuters import and shows
# no training step. Passing the corpus text to the constructor trains the
# tokenizer, which is presumably what was intended.
pst = PunktSentenceTokenizer(reuters.raw())
text = "Batts did not take questions or give details of the report's findings... He did say that the city's police department would continue to work on the case under the direction of the prosecutor's office. Gray was injured around the time he was arrested by Baltimore police and put in a police van on 12 April."

# Assumption: the tokenizing call was also missing from the snippet.
print(pst.sentences_from_text(text))

# Output as reported:
["Batts did not take questions or give details of the report's findings...", "He did say that the city's police department would continue to work on the case under the direction of the prosecutor's office.", 'Gray was injured around the time he was arrested by Baltimore police and put in a police van on 12 April.']