NLP: Within Sentence Segmentation / Boundary Detection

Question

I would like to know whether there are any libraries that split a sentence into smaller pieces based on its content.

For example:

input: sentence: "During our stay at the hotel we had a clean room, very nice bathroom, breathtaking view out the window and a delicious breakfast in the morning."

output: list of sentence segments: ["During our stay at the hotel" , "we had a clean room" , "very nice bathroom" , "breathtaking view out the window" , "and a delicious breakfast in the morning."]

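So basically I am looking for meaning-based boundary detection/segmentation within a sentence. My goal is to split a sentence into pieces that each carry their own 'meaning' without the rest of the sentence.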

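I am definitely not looking for sentence boundary detection: a dozen tools for that are easy to find, but they do not apply to segmentation within a sentence.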

Thanks in advance.

Answer 1

The problem of extracting phrases from a sentence is usually called "chunking" in the NLP literature.

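It seems you want to divide a sentence into chunks such that every word is in exactly one chunk. You can do this with a parser; Stanford's is a popular one. Its output is called a "parse tree" and looks like this: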

(ROOT
  (S
    (S
      (NP
        (NP (DT The) (JJS strongest) (NN rain))
        (VP
          (ADVP (RB ever))
          (VBN recorded)
          (PP (IN in)
            (NP (NNP India)))))
      (VP
        (VP (VBD shut)
          (PRT (RP down))
          (NP
            (NP (DT the) (JJ financial) (NN hub))
            (PP (IN of)
              (NP (NNP Mumbai)))))
[rest omitted]

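The capital letters here are Penn Treebank tags. S means "sentence", NP "noun phrase", VP "verb phrase", and so on. By extracting phrase units such as VPs and NPs from the parse tree, you can build the phrases you are asking for.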
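As a rough illustration of pulling phrase units out of a parse tree, here is a minimal sketch in plain Python. It assumes the parser's output is available as a bracketed Penn Treebank string, and it uses one complete NP subtree taken from the tree above as input. The helper names (`parse_tree`, `leaves`, `phrases`) are made up for this example and are not part of any parser's API. If you use NLTK, `nltk.Tree.fromstring` and `Tree.subtrees` cover the same ground.

```python
import re


def parse_tree(text):
    """Parse a bracketed Penn Treebank string into nested (label, children) tuples.

    Leaves (words) are kept as plain strings.
    """
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())       # nested constituent
            else:
                children.append(tokens[pos])  # a word
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    return read()


def leaves(node):
    """Return the words covered by a node, in sentence order."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1]:
        words.extend(leaves(child))
    return words


def phrases(node, wanted=("NP", "VP")):
    """Yield (label, text) for every subtree whose label is in `wanted`."""
    if isinstance(node, str):
        return
    label, children = node
    if label in wanted:
        yield label, " ".join(leaves(node))
    for child in children:
        yield from phrases(child, wanted)


# One complete subtree copied from the parse tree above.
EXAMPLE = """
(NP
  (NP (DT the) (JJ financial) (NN hub))
  (PP (IN of)
    (NP (NNP Mumbai))))
"""

tree = parse_tree(EXAMPLE)
for label, text in phrases(tree, wanted=("NP",)):
    print(label, "->", text)
```

Note that nested phrases overlap: the outer NP contains both inner NPs. To get a segmentation in which every word belongs to exactly one chunk, you would have to pick a single level of the tree to cut at, for example only the innermost or only the outermost matching phrases, and which level works best depends on the application.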

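This is not exactly what you asked for, but depending on your application, extracting key phrases (such as "social security" or "foreign affairs") may be useful. This is sometimes called keyword extraction. A good paper I read recently on the topic is "Bag of What?", and an implementation is available here. Here is an example of their output on a corpus about US politics (labeled NPSFT):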

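There are many techniques for splitting sentences like this, varying in complexity and accuracy, and the best approach depends on what you want to do with the phrases once you have them. In any case, I hope this helps.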

nlp

nltk

sentence

text-segmentation