Extracting only nouns from a list-of-lists pos_tag sequence?
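I am trying to extract only the nouns from a list-of-lists text sequence using nltk.pos_tag(). I can extract all the nouns from the nltk.pos_tag() output, but not while preserving the nested list structure. How can I do this and keep the list-of-lists structure? Any help is greatly appreciated.
Here, a list-of-lists text sequence means a collection of tokenized words grouped into separate lists: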
[[('icosmos', 'JJ'), ('cosmology', 'NN'), ('calculator', 'NN'), ('with', 'IN'), ('graph', 'JJ')], [('generation', 'NN'), ('the', 'DT'), ('expanding', 'VBG'), ('universe', 'JJ')], [('american', 'JJ'), ('institute', 'NN')]]
The output should look like this:
[['cosmology', 'calculator'], ['generation'], ['institute']]
Here is what I tried:
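import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

def function1():
    # tokenized_raw_data is assumed to be the raw input text, defined elsewhere
    tokens_sentences = sent_tokenize(tokenized_raw_data.lower())
    unfiltered_tokens = [word_tokenize(sentence) for sentence in tokens_sentences]
    # Keep only alphabetic tokens, one list per sentence
    word_list = []
    for i in range(len(unfiltered_tokens)):
        word_list.append([])
    for i in range(len(unfiltered_tokens)):
        for word in unfiltered_tokens[i]:
            if word.isalpha():
                word_list[i].append(word)
    # Tag each sentence separately: a list of lists of (word, tag) pairs
    tagged_tokens = []
    for token in word_list:
        tagged_tokens.append(nltk.pos_tag(token))
    # This is the step that does not work: tagged_tokens is nested,
    # so its items are whole sentences, not (word, tag) pairs
    noun_tagged = [(word, tag) for word, tag in tagged_tokens
                   if tag.startswith('NN') or tag.startswith('NNPS')]
    print(noun_tagged)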
If I use the code snippet below in my original code, after the tagged_tokens list has been built, the output comes out as a single flat list, which is not what I want.
only_tagged_nouns = []
for sentence in tagged_tokens:
    for word, pos in sentence:
        if pos == 'NN' or pos == 'NNPS':
            only_tagged_nouns.append(word)
You can do it like this:
words = [[('icosmos', 'JJ'), ('cosmology', 'NN'), ('calculator', 'NN'), ('with', 'IN'), ('graph', 'JJ')], [('generation', 'NN'), ('the', 'DT'), ('expanding', 'VBG'), ('universe', 'JJ')], [('american', 'JJ'), ('institute', 'NN')]]
new_list = []
for sentence in words:
    # Keep the word from each (word, tag) pair whose tag starts with "NN"
    nouns = [word for word, tag in sentence if tag.startswith("NN")]
    new_list.append(nouns)
print(new_list)
Output:
[['cosmology', 'calculator'], ['generation'], ['institute']]
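For what it's worth, the second loop in the question flattens its result because every noun is appended to the same outer list. A minimal sketch of that loop with one inner list per sentence follows. Here `tagged_tokens` is a small hand-written stand-in for the `nltk.pos_tag()` output, and `nouns_per_sentence` and `sentence_nouns` are names made up for illustration. The tag test is kept exactly as in the question, so only `NN` and `NNPS` are matched.

```python
# Hand-written stand-in for the nested nltk.pos_tag() output (assumed data).
tagged_tokens = [
    [('icosmos', 'JJ'), ('cosmology', 'NN'), ('calculator', 'NN')],
    [('generation', 'NN'), ('the', 'DT')],
    [('american', 'JJ'), ('institute', 'NN')],
]

nouns_per_sentence = []  # hypothetical name, not from the question
for sentence in tagged_tokens:
    sentence_nouns = []  # start a fresh list for every sentence
    for word, pos in sentence:
        if pos == 'NN' or pos == 'NNPS':
            sentence_nouns.append(word)
    # Append the whole inner list so the nesting is preserved.
    nouns_per_sentence.append(sentence_nouns)

print(nouns_per_sentence)
```

The only change from the question's loop is the per-sentence list, which is appended once per sentence instead of appending each word to the outer list.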
For a one-line solution, use a list comprehension:
inputList = [[('icosmos', 'JJ'), ('cosmology', 'NN'), ('calculator', 'NN'), ('with', 'IN'), ('graph', 'JJ')], [('generation', 'NN'), ('the', 'DT'), ('expanding', 'VBG'), ('universe', 'JJ')], [('american', 'JJ'), ('institute', 'NN')]]
[[k[0] for k in j if k[1].startswith("NN")] for j in inputList]
#[['cosmology', 'calculator'], ['generation'], ['institute']]
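If this is needed in more than one place, the comprehension could be wrapped in a small helper. This is only a sketch: `extract_by_tag_prefix` and its `prefix` parameter are hypothetical names, not part of NLTK, and the sample tags are written by hand rather than produced by a tagger. In the Penn Treebank tag set that `nltk.pos_tag()` uses by default for English, the noun tags are `NN`, `NNS`, `NNP` and `NNPS`, so the prefix `"NN"` matches all four. That also means the extra `tag.startswith('NNPS')` check in the question is redundant.

```python
def extract_by_tag_prefix(tagged_sentences, prefix="NN"):
    """Return, per sentence, the words whose tag starts with `prefix`."""
    return [
        [word for word, tag in sentence if tag.startswith(prefix)]
        for sentence in tagged_sentences
    ]

# Hand-written sample with assumed tags, only to exercise the helper.
sample = [
    [('dogs', 'NNS'), ('bark', 'VBP')],
    [('london', 'NNP'), ('is', 'VBZ'), ('big', 'JJ')],
    [('run', 'VB')],
]

print(extract_by_tag_prefix(sample))        # nouns of any kind
print(extract_by_tag_prefix(sample, "VB"))  # verbs of any kind
```

A sentence with no matching tag yields an empty inner list, so the result stays aligned position by position with the input sentences.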