How to split a sentence at each specified character/string?

Question

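I have chunked some base noun phrases, but base noun phrases alone are not enough for me. What I want to do next is split the sentence at the end of each chunked noun phrase.

For example: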

sentence = 'protection of system resources against bad behavior'

The chunked noun phrases (obtained with doc.noun_chunks in spaCy) are:

protection, system resources, bad behavior

The result I want:

protection, of system resources, against bad behavior

That is, I need to split the sentence at the end of each chunked phrase, for example after 'protection' and after 'system resources'.

--Can split() work this way?

--Or can I keep using rule-based matching in spaCy to find the .head or the immediate left/right words and match them?

Has anyone dealt with this before?

Thanks!

Answer 1

--Can the split() work in this way?

No.

--Or maybe I can continue to use the rule-based match in spaCy to find .head or immediate left/right words and match them?

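According to its documentation, noun_chunks returns an iterator of Span objects. A Span has start/end indices (span.end_char is the character offset where the span ends), so you can use that information to slice the source string, e.g.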

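output = []
prev_end = 0
for span in doc.noun_chunks:
    # Take everything from the end of the previous chunk to the end of this one
    output.append(sentence[prev_end:span.end_char].strip())
    prev_end = span.end_char
# Keep any text that follows the last noun chunk
if sentence[prev_end:].strip():
    output.append(sentence[prev_end:].strip())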

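Or something along those lines (you may need to adjust the code, since I have never actually used spaCy; this is just based on my reading of the docs).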
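To try the slicing idea without installing spaCy, here is a minimal sketch that works on plain (start, end) character offsets. The helper name split_after_spans is made up for illustration, and the offsets are computed from the chunk strings given in the question. With spaCy they would presumably come from (span.start_char, span.end_char) for each span in doc.noun_chunks.

```python
def split_after_spans(text, spans):
    """Split text so that each piece ends where a span ends.

    spans: iterable of (start_char, end_char) pairs, in order of appearance.
    (Hypothetical helper, not part of spaCy.)
    """
    pieces = []
    prev_end = 0
    for _start, end in spans:
        # Everything from the end of the previous span to the end of this one
        piece = text[prev_end:end].strip()
        if piece:
            pieces.append(piece)
        prev_end = end
    # Keep any text left over after the last span
    tail = text[prev_end:].strip()
    if tail:
        pieces.append(tail)
    return pieces


sentence = 'protection of system resources against bad behavior'

# Stand-in for doc.noun_chunks: locate each chunk string in the sentence
chunks = ['protection', 'system resources', 'bad behavior']
chunk_offsets = []
pos = 0
for chunk in chunks:
    start = sentence.index(chunk, pos)
    pos = start + len(chunk)
    chunk_offsets.append((start, pos))

pieces = split_after_spans(sentence, chunk_offsets)
print(pieces)
# ['protection', 'of system resources', 'against bad behavior']
```

Stripping each piece drops the space that separates it from the next one, and the tail check keeps any words that come after the final noun chunk.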
