How do I extract patterns from lists of POS-tagged words? NLTK
I have a text file containing multiple lists; each list contains tuples of word/POS-tag pairs, like so:
[('reviewtext', 'IN'), ('this', 'DT'), ('movie', 'NN'), ('was', 'VBD'), ('great', 'JJ'), ('and', 'CC'), ('fun', 'NN'), ('i', 'PRP'), ('really', 'RB'), ('enjoyed', 'VBD'), ('this', 'DT'), ('awesome', 'NN'), ('movie', 'NN')]
[('reviewtext', 'IN'), ('it', 'PRP'), ('was', 'VBD'), ('fun', 'VBN'), ('but', 'CC'), ('long', 'RB')]
[('reviewtext', 'IN'), ('i', 'PRP'), ('loved', 'VBD'), ('the', 'DT'), ('new', 'JJ'), ('movie', 'NN'), ('my', 'PRP$'), ('brother', 'NN'), ('got', 'VBD'), ('sad', 'JJ'), ('and', 'CC'), ('unhappy', 'JJ'), ('at', 'IN'), ('the', 'DT'), ('end', 'NN')]
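I need to extract all adjective-conjunction-adjective sequences, i.e. all JJ-CC-JJ patterns (only the words, not the POS tags). For this example, the final output should be a list containing all the patterns: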
['great and fun', 'sad and unhappy']
I used the following code to tag the text:
import re
import nltk

# Raw strings keep the backslashes in the Windows paths from being read as escapes.
with open(r"C:\Users\M\Desktop\sample dataset.txt") as fileobject:
    for line in fileobject:
        line = line.lower()  # lowercase
        line = re.sub(r'[^\w\s]', '', line)  # remove punctuation
        line = nltk.word_tokenize(line)  # tokenize
        line = nltk.pos_tag(line)  # POS tag
        fo = open(r"C:\Users\M\Desktop\movies1_complete.txt", "a")
        fo.write(str(line))  # one tagged list per output line
        fo.write("\n")
        fo.close()
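But how do I extract the words that match the pattern above? I checked here and here, but neither explains how to extract a specific POS pattern. Thanks in advance.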
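from itertools import islice

# l is the list of tagged sentences from the question
for sub in l:
    # slide a window of three consecutive (word, tag) pairs over the sentence
    for a, b, c in zip(islice(sub, 0, None), islice(sub, 1, None), islice(sub, 2, None)):
        if all((a[-1] == "JJ", b[-1] == "CC", c[-1] == "JJ")):
            print("{} {} {}".format(a[0], b[0], c[0]))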
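This prints sad and unhappy. It does not include 'great and fun' because fun is tagged NN in that sentence, so the sequence is JJ-CC-NN and does not match the pattern JJ-CC-JJ.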
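If sequences like 'great and fun' should count as well, one option is to allow more than one tag per position. The following is only a sketch under that assumption: find_patterns is a hypothetical helper (not part of NLTK), the two sentences are shortened versions of the question's data, and the loosened last position (JJ or NN) is an illustrative choice.

```python
def find_patterns(sentences, pattern):
    """Return the words of every window whose tags fit `pattern`.

    `pattern` is a sequence of sets; position i of a window matches
    when its tag is a member of pattern[i].
    """
    n = len(pattern)
    found = []
    for sent in sentences:
        # zipping n staggered slices yields every run of n consecutive pairs
        for window in zip(*(sent[i:] for i in range(n))):
            if all(tag in allowed for (_, tag), allowed in zip(window, pattern)):
                found.append(" ".join(word for word, _ in window))
    return found


# Shortened versions of the sentences from the question
tagged = [
    [('this', 'DT'), ('movie', 'NN'), ('was', 'VBD'),
     ('great', 'JJ'), ('and', 'CC'), ('fun', 'NN')],
    [('got', 'VBD'), ('sad', 'JJ'), ('and', 'CC'), ('unhappy', 'JJ'), ('at', 'IN')],
]

# Assumed loosened pattern: adjective, conjunction, adjective-or-noun
print(find_patterns(tagged, [{"JJ"}, {"CC"}, {"JJ", "NN"}]))
# prints ['great and fun', 'sad and unhappy']
```

Passing [{"JJ"}, {"CC"}, {"JJ"}] instead keeps the strict behaviour and returns only 'sad and unhappy'. Whether a noun should be accepted in the last slot depends on how much you trust the tagger on words like fun.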
Or just use enumerate and a generator:
l = [[('reviewtext', 'IN'), ('this', 'DT'), ('movie', 'NN'), ('was', 'VBD'), ('great', 'JJ'), ('and', 'CC'),
      ('fun', 'NN'), ('i', 'PRP'), ('really', 'RB'), ('enjoyed', 'VBD'), ('this', 'DT'), ('awesome', 'NN'),
      ('movie', 'NN')],
     [('reviewtext', 'IN'), ('it', 'PRP'), ('was', 'VBD'), ('fun', 'VBN'), ('but', 'CC'), ('long', 'RB')],
     [('reviewtext', 'IN'), ('i', 'PRP'), ('loved', 'VBD'), ('the', 'DT'), ('new', 'JJ'), ('movie', 'NN'), ('my', 'PRP$'), ('brother', 'NN'), ('got', 'VBD'), ('sad', 'JJ'), ('and', 'CC'), ('unhappy', 'JJ'), ('at', 'IN'), ('the', 'DT'), ('end', 'NN')]]


def match(l, p1, p2, p3):
    for sub in l:
        # avoid index error and catch last three elements
        end = len(sub) - 1
        # ind starts at 1, so sub[ind] is always the pair after (a, b)
        for ind, (a, b) in enumerate(sub, 1):
            # >= rather than == so a one-element sentence also stops here
            if ind >= end:
                break
            if b == p1 and sub[ind][1] == p2 and sub[ind + 1][1] == p3:
                yield ("{} {} {}".format(a, sub[ind][0], sub[ind + 1][0]))


print(list(match(l, "JJ", "CC", "JJ")))
Output (based on the example):
['sad and unhappy']
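Both snippets assume l is already a list of tagged sentences in memory. Since the question's script writes each list with str(), every line of the saved file is a Python literal, so one way to load it back is ast.literal_eval. This is a minimal sketch assuming exactly one list per line; load_tagged is a hypothetical helper, and the temporary file only stands in for the real output file.

```python
import ast
import os
import tempfile


def load_tagged(path):
    """Parse one str()-formatted list of (word, tag) tuples per line."""
    sentences = []
    with open(path) as handle:
        for raw in handle:
            raw = raw.strip()
            if raw:  # skip blank lines
                sentences.append(ast.literal_eval(raw))
    return sentences


# Demo: write two tagged sentences the way the question's script does,
# then read them back. The temp file is a stand-in for the real file.
sample = [[('it', 'PRP'), ('was', 'VBD'), ('fun', 'VBN')],
          [('got', 'VBD'), ('sad', 'JJ'), ('and', 'CC'), ('unhappy', 'JJ')]]
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    for sent in sample:
        tmp.write(str(sent) + "\n")

loaded = load_tagged(tmp.name)
os.remove(tmp.name)
print(loaded == sample)  # prints True
```

The result of load_tagged can then be passed wherever l is used above. ast.literal_eval only evaluates literals, which makes it a safer choice than eval for a file like this.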
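Even though an answer has already been accepted (and it is a good one), I think you will find this useful. You can use the refo library to match regular expressions over streams of objects.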
import re

from refo import finditer, Predicate


class Word(object):
    """Plain container for one token and its POS tag."""
    def __init__(self, token, pos):
        self.token = token
        self.pos = pos


class W(Predicate):
    """Predicate that matches a Word by regexes on its token and its tag."""
    def __init__(self, token=".*", pos=".*"):
        # the trailing "$" makes each regex match the whole string
        self.token = re.compile(token + "$")
        self.pos = re.compile(pos + "$")
        super(W, self).__init__(self.match)

    def match(self, word):
        m1 = self.token.match(word.token)
        m2 = self.pos.match(word.pos)
        return m1 and m2


originals = [
    [('reviewtext', 'IN'), ('this', 'DT'), ('movie', 'NN'), ('was', 'VBD'),
     ('great', 'JJ'), ('and', 'CC'), ('fun', 'NN'), ('i', 'PRP'),
     ('really', 'RB'), ('enjoyed', 'VBD'), ('this', 'DT'),
     ('awesome', 'NN'), ('movie', 'NN')],
    [('reviewtext', 'IN'), ('it', 'PRP'),
     ('was', 'VBD'), ('fun', 'VBN'), ('but', 'CC'), ('long', 'RB')],
    [('reviewtext', 'IN'), ('i', 'PRP'), ('loved', 'VBD'), ('the', 'DT'),
     ('new', 'JJ'), ('movie', 'NN'), ('my', 'PRP$'), ('brother', 'NN'),
     ('got', 'VBD'), ('sad', 'JJ'), ('and', 'CC'), ('unhappy', 'JJ'),
     ('at', 'IN'), ('the', 'DT'), ('end', 'NN')]]

sentences = [[Word(*x) for x in original] for original in originals]
Here is the interesting bit: it says to look for a sequence of objects whose pos attribute is JJ, followed by CC, followed by either JJ or NN.
pred = W(pos="JJ") + W(pos="CC") + (W(pos="JJ") | W(pos="NN"))
for k, s in enumerate(sentences):
    for match in finditer(pred, s):
        x, y = match.span()  # the match spans x to y inside the sentence s
        print(originals[k][x:y])
Output:
[('great', 'JJ'), ('and', 'CC'), ('fun', 'NN')]
[('sad', 'JJ'), ('and', 'CC'), ('unhappy', 'JJ')]
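If adding a dependency is not an option, a rough equivalent for tag-only patterns can be sketched with the standard re module: encode each sentence's tags as a string like <JJ><CC><NN>, run an ordinary regex over it, and map the match back to token positions. This is an illustrative alternative, not how refo works internally. find_tag_regex is a hypothetical helper, and it assumes that tags contain no angle brackets and that the pattern is written in whole <TAG> units.

```python
import re


def find_tag_regex(sentence, tag_regex):
    """Yield the words for each non-overlapping match of tag_regex.

    The regex runs over the tags encoded as "<TAG1><TAG2>...", so it
    should be built from whole "<TAG>" units.
    """
    encoded = "".join("<{}>".format(tag) for _, tag in sentence)
    for m in re.finditer(tag_regex, encoded):
        start = encoded.count("<", 0, m.start())  # tokens before the match
        length = m.group().count("<")             # tokens inside the match
        yield " ".join(word for word, _ in sentence[start:start + length])


first = [('reviewtext', 'IN'), ('this', 'DT'), ('movie', 'NN'), ('was', 'VBD'),
         ('great', 'JJ'), ('and', 'CC'), ('fun', 'NN'), ('i', 'PRP')]
third = [('my', 'PRP$'), ('brother', 'NN'), ('got', 'VBD'), ('sad', 'JJ'),
         ('and', 'CC'), ('unhappy', 'JJ'), ('at', 'IN')]

# Same pattern as above: JJ, then CC, then JJ or NN
pattern = r"<JJ><CC><(?:JJ|NN)>"
for sent in (first, third):
    print(list(find_tag_regex(sent, pattern)))
# prints ['great and fun'] and then ['sad and unhappy']
```

Unlike the predicate approach, this can only test tags, not tokens, and re.finditer skips overlapping matches.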