Match POS tag and sequence of words
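I have the following two strings with their POS tags:
Sent1: "something like how writer pro or phraseology works would be really cool."
[('something', 'NN'), ('like', 'IN'), ('how', 'WRB'), ('writer', 'NN'), ('pro', 'NN'), ('or', 'CC'), ('phraseology', 'NN'), ('works', 'NNS'), ('would', 'MD'), ('be', 'VB'), ('really', 'RB'), ('cool', 'JJ'), ('.', '.')]
Sent2: "more options like the syntax editor would be nice"
[('more', 'JJR'), ('options', 'NNS'), ('like', 'IN'), ('the', 'DT'), ('syntax', 'NN'), ('editor', 'NN'), ('would', 'MD'), ('be', 'VB'), ('nice', 'JJ')]
I am looking for a way to detect (return True) whether the sequence "would" + "be" + adjective occurs in these strings, regardless of where the adjective sits, as long as it comes after "would" "be". In the second string the adjective "nice" immediately follows "would be", but in the first string it does not.
The trivial case (no other word before the adjective; "would be nice") was solved in an earlier question of mine.
I am now looking for a more general solution where optional words may occur before the adjective. I am new to NLTK and Python.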
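It seems you only need to search consecutive tokens for "would" followed by "be", and then look for the first instance of the tag "JJ" after them. Something like this: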
import nltk

def has_would_be_adj(S):
    # make pos tags; word_tokenize splits the trailing period into its own token
    pos = nltk.pos_tag(nltk.word_tokenize(S))
    # Search consecutive tokens for "would", "be"
    j = None  # index of found "would"
    for i, (x, y) in enumerate(zip(pos[:-1], pos[1:])):
        if x[0] == "would" and y[0] == "be":
            j = i
            break
    if j is None:
        return False
    a = None  # index of found adjective
    for i, (word, tag) in enumerate(pos[j + 2:]):
        if tag == "JJ":
            a = i + j + 2  # offset back into the full list
            break
    if a is None:
        return False
    print("Found adjective {} at {}".format(pos[a], a))
    return True

S = "something like how writer pro or phraseology works would be really cool."
print(has_would_be_adj(S))
I'm sure this could be written more compactly and cleanly, but it does what it says on the box :)
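For what it's worth, here is one way the same idea might be written more compactly. It is only a sketch: it assumes the sentence has already been tagged (so it takes the list of (word, tag) pairs from the question instead of calling NLTK), and the function name would_be_then_adj is made up for illustration.

```python
def would_be_then_adj(tagged):
    """Return True if the first "would be" is followed, at any distance, by a JJ-tagged word."""
    words = [word.lower() for word, _ in tagged]
    for i in range(len(words) - 1):
        if words[i] == "would" and words[i + 1] == "be":
            # Scan every token after "be" for an adjective tag.
            return any(tag == "JJ" for _, tag in tagged[i + 2:])
    return False


sent1 = [('something', 'NN'), ('like', 'IN'), ('how', 'WRB'), ('writer', 'NN'),
         ('pro', 'NN'), ('or', 'CC'), ('phraseology', 'NN'), ('works', 'NNS'),
         ('would', 'MD'), ('be', 'VB'), ('really', 'RB'), ('cool', 'JJ'), ('.', '.')]
sent2 = [('more', 'JJR'), ('options', 'NNS'), ('like', 'IN'), ('the', 'DT'),
         ('syntax', 'NN'), ('editor', 'NN'), ('would', 'MD'), ('be', 'VB'), ('nice', 'JJ')]

print(would_be_then_adj(sent1))  # True
print(would_be_then_adj(sent2))  # True
```

Like the longer version, it only considers the first "would be" in the sentence and only the plain "JJ" tag; comparatives and superlatives (JJR, JJS) would need `tag.startswith("JJ")` instead.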
Check this:
import nltk
from nltk.tokenize import word_tokenize

def would_be(tagged):
    # True only when "would", "be" and a JJ-tagged word are directly adjacent
    return any(['would', 'be', 'JJ'] == [tagged[i][0], tagged[i+1][0], tagged[i+2][1]] for i in range(len(tagged) - 2))

S = "more options like the syntax editor would be nice."
pos = nltk.pos_tag(word_tokenize(S))
print(would_be(pos))
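Also check this code:
from nltk.tokenize import word_tokenize
import nltk

def checkTag(S):
    pos = nltk.pos_tag(word_tokenize(S))
    # First make sure the sentence contains an adjective at all
    flag = 0
    for tag in pos:
        if tag[1] == 'JJ':
            flag = 1
    if flag:
        # Then look for "would" immediately followed by "be".
        # Note: this does not check that the adjective comes after "would be".
        for ind, tag in enumerate(pos[:-1]):
            if tag[0] == 'would' and pos[ind+1][0] == 'be':
                return True
        return False
    return False

S = "something like how writer pro or phraseology works would be really cool."
print(checkTag(S))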
from itertools import tee, dropwhile
import nltk

def check_sentence(S):
    def pairwise(iterable):
        "s -> (s0,s1), (s1,s2), (s2, s3), ..."
        a, b = tee(iterable)
        next(b, None)
        return zip(a, b)

    def consecutive_would_be(word_group):
        # Predicate for dropwhile: True while the pair is NOT ("would", "be")
        first, second = word_group
        (would_word, _) = first
        (be_word, _) = second
        return would_word.lower() != "would" or be_word.lower() != "be"

    # Skip pairs until "would be" is reached, then look for an adjective tag
    for word_groups in dropwhile(consecutive_would_be, pairwise(nltk.pos_tag(nltk.word_tokenize(S)))):
        first, second = word_groups
        (_, pos1) = first
        (_, pos2) = second
        if pos1 == "JJ" or pos2 == "JJ":
            return True
    return False

You can then use the function like this:
S = "more options like the syntax editor would be nice."
print(check_sentence(S))
First install nltk_cli by following the instructions at https://github.com/alvations/nltk_cli
Then, here is a secret function in nltk_cli that you may find useful:
alvas@ubi:~/git/nltk_cli$ cat infile.txt
something like how writer pro or phraseology works would be really cool .
more options like the syntax editor would be nice
alvas@ubi:~/git/nltk_cli$ python senna.py --chunk2 VP+ADJP infile.txt
would be really cool
would be nice
To illustrate other possible usages:
alvas@ubi:~/git/nltk_cli$ python senna.py --chunk2 VP+VP infile.txt
!!! NO CHUNK of VP+VP in this sentence !!!
!!! NO CHUNK of VP+VP in this sentence !!!
alvas@ubi:~/git/nltk_cli$ python senna.py --chunk2 NP+VP infile.txt
how writer pro or phraseology works would be
the syntax editor would be
alvas@ubi:~/git/nltk_cli$ python senna.py --chunk2 VP+NP infile.txt
!!! NO CHUNK of VP+NP in this sentence !!!
!!! NO CHUNK of VP+NP in this sentence !!!
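Then, if you want to check for the phrase in a sentence and output True/False, just read and iterate through the output of nltk_cli and check it with an if-else condition.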
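A minimal sketch of that if-else check, assuming the output looks exactly like the transcript above: one line per input sentence, holding either the matched chunk or a line starting with "!!! NO CHUNK". The helper name chunk_flags is hypothetical, and the lines are hard-coded here instead of being captured from the command.

```python
def chunk_flags(output_lines):
    """Map each non-empty output line to True (chunk found) or False (no chunk)."""
    flags = []
    for line in output_lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        # Assumption: misses are reported with this marker, as in the transcript.
        flags.append(not line.startswith("!!! NO CHUNK"))
    return flags


vp_adjp_output = ["would be really cool", "would be nice"]
vp_vp_output = ["!!! NO CHUNK of VP+VP in this sentence !!!",
                "!!! NO CHUNK of VP+VP in this sentence !!!"]

print(chunk_flags(vp_adjp_output))  # [True, True]
print(chunk_flags(vp_vp_output))    # [False, False]
```

If the tool's output format differs from the transcript, the marker check would need to be adjusted accordingly.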
Would this help?
s1 = [('something', 'NN'), ('like', 'IN'), ('how', 'WRB'), ('writer', 'NN'), ('pro', 'NN'), ('or', 'CC'), ('phraseology', 'NN'), ('works', 'NNS'), ('would', 'MD'), ('be', 'VB'), ('really', 'RB'), ('cool', 'JJ'), ('.', '.')]
flag = False  # becomes True once "would be" has been seen
for i, j in zip(s1[:-1], s1[1:]):
    if i[0] + " " + j[0] == "would be":
        flag = True
    if flag and (i[-1] == "JJ" or j[-1] == "JJ"):
        print("would be adjective found in the tagged string")
        break  # report the match once