Chunking sentences using the word 'but' with RegEx

Question

I am trying to chunk a sentence at the word 'but' (or any other coordinating conjunction) using RegEx. It isn't working...

import nltk
from nltk import word_tokenize

sentence = nltk.pos_tag(word_tokenize("There are no large collections present but there is spinal canal stenosis."))
result = nltk.RegexpParser(grammar).parse(sentence)  # `grammar` is not defined in the question
DigDug = nltk.RegexpParser(r'CHUNK: {.*<CC>.*}')
for subtree in DigDug.parse(sentence).subtrees():
    if subtree.label() == 'CHUNK':
        print(subtree)  # was subtree.node(), which NLTK 3 replaced with label()

I need to split the sentence "There are no large collections present but there is spinal canal stenosis." into two parts:

1. "There are no large collections present"
2. "there is spinal canal stenosis."

I would also like to use the same code to split sentences at 'and' and other coordinating conjunction (CC) words. But my code isn't working. Please help.

Answer 1

I think you can simply do

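import re
# re.split needs the raw string, not the POS-tagged list from the question
sentence = "There are no large collections present but there is spinal canal stenosis."
result = re.split(r"\s+(?:but|and)\s+", sentence)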

where

`\s`        Match a single character that is a "whitespace character" (spaces, tabs, line breaks, etc.)
`+`         Between one and unlimited times, as many times as possible, giving back as needed (greedy)
`(?:`       Match the regular expression below, do not capture
            Match either the regular expression below (attempting the next alternative only if this one fails)
  `but`     Match the characters "but" literally
  `|`       Or match regular expression number 2 below (the entire group fails if this one fails to match)
  `and`     Match the characters "and" literally
`)`         End of the non-capturing group
`\s`        Match a single character that is a "whitespace character" (spaces, tabs, line breaks, etc.)
`+`         Between one and unlimited times, as many times as possible, giving back as needed (greedy)

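You can add more conjunctions in there, separated by the pipe character |. Take care, though, that these words do not contain characters that have special meaning in a regex. If in doubt, escape them first with re.escape(word).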
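As a rough sketch of that advice, the alternation can be built from a list of words with each one passed through re.escape. The list CONJUNCTIONS and the helper name split_on_conjunctions are made up for illustration and are not part of the original answer:

```python
import re

# Hypothetical word list for illustration; extend it as needed.
CONJUNCTIONS = ["but", "and", "or", "yet"]

def split_on_conjunctions(text, words=CONJUNCTIONS):
    """Split text at any of the given words when surrounded by whitespace."""
    # re.escape neutralises regex metacharacters that a word might contain.
    alternation = "|".join(re.escape(w) for w in words)
    pattern = r"\s+(?:" + alternation + r")\s+"
    return re.split(pattern, text)

parts = split_on_conjunctions(
    "There are no large collections present but there is spinal canal stenosis."
)
print(parts)
# ['There are no large collections present', 'there is spinal canal stenosis.']
```

Because the pattern requires whitespace on both sides, a word such as "and" is not matched inside a longer word like "sand". It still splits on every listed word regardless of its grammatical role.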

Answer 2

If you want to avoid hard-coding conjunctions such as 'but' and 'and', try chinking along with chunking:

import nltk

Digdug = nltk.RegexpParser(r"""
CHUNK_AND_CHINK:
    {<.*>+}    # Chunk everything
    }<CC>+{    # Chink sequences of CC
""")
sentence = nltk.pos_tag(nltk.word_tokenize("There are no large collections present but there is spinal canal stenosis."))

result = Digdug.parse(sentence)

# Print each chunk that remains after the CC tokens are chinked out
for subtree in result.subtrees(filter=lambda t: t.label() == 'CHUNK_AND_CHINK'):
    print(subtree)

Chinking basically excludes what we don't need from a chunk phrase, 'but' in this case. For more details, see: http://www.nltk.org/book/ch07.html
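To get the two halves back as plain strings, the same idea (cut wherever the tag is CC) can also be sketched without the chunk parser. This is a swapped-in technique using itertools.groupby, not the RegexpParser approach above. The (word, tag) pairs are hand-written for illustration rather than real nltk.pos_tag output, and the helper name split_at_cc is hypothetical:

```python
from itertools import groupby

# Hand-written (word, tag) pairs for illustration only; in practice this
# list would come from nltk.pos_tag(nltk.word_tokenize(...)).
tagged = [
    ("There", "EX"), ("are", "VBP"), ("no", "DT"), ("large", "JJ"),
    ("collections", "NNS"), ("present", "JJ"), ("but", "CC"),
    ("there", "EX"), ("is", "VBZ"), ("spinal", "JJ"), ("canal", "NN"),
    ("stenosis", "NN"), (".", "."),
]

def split_at_cc(tagged_tokens):
    """Return runs of consecutive non-CC tokens, each as a list of words."""
    parts = []
    for is_cc, group in groupby(tagged_tokens, key=lambda wt: wt[1] == "CC"):
        if not is_cc:
            parts.append([word for word, _ in group])
    return parts

for words in split_at_cc(tagged):
    print(" ".join(words))
```

Joining with spaces leaves a space before the final period, since the tokenizer treats punctuation as its own token. Adjust the join step if exact original spacing matters.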
