Chunking sentences using the word 'but' with RegEx
I am trying to chunk a sentence at the word 'but' (or any other coordinating conjunction) using RegEx. It's not working...
sentence = nltk.pos_tag(word_tokenize("There are no large collections present but there is spinal canal stenosis."))
result = nltk.RegexpParser(grammar).parse(sentence)
DigDug = nltk.RegexpParser(r'CHUNK: {.*<CC>.*}')
for subtree in DigDug.parse(sentence).subtrees():
    if subtree.label() == 'CHUNK': print(subtree.node())
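I need to split the sentence "There are no large collections present but there is spinal canal stenosis."
into two parts:
1. "There are no large collections present"
2. "there is spinal canal stenosis."
I would also like to use the same code to split sentences at 'and' and other coordinating conjunction (CC) words. But my code isn't working. Please help.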
I think you can simply do:
import re
result = re.split(r"\s+(?:but|and)\s+", sentence)
where:
`\s` Match a single character that is a "whitespace character" (spaces, tabs, line breaks, etc.)
`+` Between one and unlimited times, as many times as possible, giving back as needed (greedy)
`(?:` Open a non-capturing group: match the regular expression inside it, but do not capture. The group matches either of the alternatives below (the next alternative is attempted only if the previous one fails)
`but` Match the characters "but" literally
`|` Or match the second alternative below (the entire group fails if this one also fails to match)
`and` Match the characters "and" literally
`)` Close the non-capturing group
`\s` Match a single character that is a "whitespace character" (spaces, tabs, line breaks, etc.)
`+` Between one and unlimited times, as many times as possible, giving back as needed (greedy)
You can add more conjunctions in there, separated by a pipe character `|`.
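Take care, though, that these words don't contain characters that have a special meaning within a regex. If in doubt, escape them first with re.escape(word).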
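As a minimal sketch of the two points above (adding more conjunctions and escaping them), the pattern can be built from a list. The list `conjunctions` and the extra word "or" are assumptions made for illustration, and `sentence` here is the raw string rather than the POS-tagged list from the question:

```python
import re

# Hypothetical list of conjunctions to split on; extend as needed.
conjunctions = ["but", "and", "or"]

# Escape each word in case it contains regex metacharacters, then join with "|".
pattern = r"\s+(?:" + "|".join(re.escape(w) for w in conjunctions) + r")\s+"

# re.split works on a plain string, not on the output of nltk.pos_tag.
sentence = "There are no large collections present but there is spinal canal stenosis."
parts = re.split(pattern, sentence)
print(parts)
# ['There are no large collections present', 'there is spinal canal stenosis.']
```

Because the pattern requires whitespace on both sides, a conjunction at the very start or end of the string, or one followed directly by punctuation, will not trigger a split.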
If you want to avoid hardcoding conjunctions like 'but' and 'and', try chinking together with chunking:
import nltk
Digdug = nltk.RegexpParser(r"""
    CHUNK_AND_CHINK:
        {<.*>+}    # Chunk everything
        }<CC>+{    # Chink sequences of CC
""")
sentence = nltk.pos_tag(nltk.word_tokenize("There are no large collections present but there is spinal canal stenosis."))
result = Digdug.parse(sentence)
for subtree in result.subtrees(filter=lambda t: t.label() == 'CHUNK_AND_CHINK'):
    print(subtree)
Chinking essentially excludes from the chunk whatever we don't need, which is 'but' in this case.
For more details, see: http://www.nltk.org/book/ch07.html
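To see what chinking on CC amounts to, here is a rough sketch that does the same split without the parser, using `itertools.groupby` on already tagged tokens. The `tagged` list is hand-written to stand in for `nltk.pos_tag` output (the tags are assumed, not actual tagger output), and `split_on_cc` is a hypothetical helper name:

```python
from itertools import groupby

# Hand-written (word, tag) pairs standing in for nltk.pos_tag output (assumed tags).
tagged = [("There", "EX"), ("are", "VBP"), ("no", "DT"), ("large", "JJ"),
          ("collections", "NNS"), ("present", "JJ"), ("but", "CC"),
          ("there", "EX"), ("is", "VBZ"), ("spinal", "JJ"), ("canal", "NN"),
          ("stenosis", "NN"), (".", ".")]

def split_on_cc(tagged_tokens):
    """Return the runs of words that lie between tokens tagged CC."""
    # Group consecutive tokens by whether their tag is CC, then drop the CC runs.
    groups = groupby(tagged_tokens, key=lambda pair: pair[1] == "CC")
    return [" ".join(word for word, _ in run) for is_cc, run in groups if not is_cc]

chunks = split_on_cc(tagged)
print(chunks)
# ['There are no large collections present', 'there is spinal canal stenosis .']
```

Note that the words are joined with single spaces, so the final period comes out as a separate token; with the parser version you would join `subtree.leaves()` in the same way.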