Apply nltk.pos_tag to entire dataframe
I have the following dataframe:
0 1 2 3 4 5 6
0 i love eating spicy hand pulled noodles
1 i also like to game alot
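I want to apply a function that creates a new dataframe in which each word is replaced by its part-of-speech tag. I am using nltk.pos_tag, and I called df.apply(nltk.pos_tag).
My expected output should look like this: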
0 1 2 3 4 5 6
0 NN NN VB JJ NN VB NN
1 NN DT NN NN VB DT
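However, I get IndexError: ('string index out of range', 'occurred at index 6').
I also understand that nltk.pos_tag returns tuples of the form ('word', 'pos_tag'), so some further manipulation may be needed to keep only the tags. Any suggestions on how to do this efficiently?
Traceback: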
Traceback (most recent call last):
  File "PartsOfSpeech.py", line 71, in <module>
    FilteredTrees = pos.run_pos(data.lower())
  File "PartsOfSpeech.py", line 59, in run_pos
    df = df.apply(pos_tag)
  File "/anaconda3/envs/customer_sentiment/lib/python3.6/site-packages/pandas/core/frame.py", line 6487, in apply
    return op.get_result()
  File "/anaconda3/envs/customer_sentiment/lib/python3.6/site-packages/pandas/core/apply.py", line 151, in get_result
    return self.apply_standard()
  File "/anaconda3/envs/customer_sentiment/lib/python3.6/site-packages/pandas/core/apply.py", line 257, in apply_standard
    self.apply_series_generator()
  File "/anaconda3/envs/customer_sentiment/lib/python3.6/site-packages/pandas/core/apply.py", line 286, in apply_series_generator
    results[i] = self.f(v)
  File "/anaconda3/envs/customer_sentiment/lib/python3.6/site-packages/nltk/tag/__init__.py", line 162, in pos_tag
    return _pos_tag(tokens, tagset, tagger, lang)
  File "/anaconda3/envs/customer_sentiment/lib/python3.6/site-packages/nltk/tag/__init__.py", line 119, in _pos_tag
    tagged_tokens = tagger.tag(tokens)
  File "/anaconda3/envs/customer_sentiment/lib/python3.6/site-packages/nltk/tag/perceptron.py", line 157, in tag
    context = self.START + [self.normalize(w) for w in tokens] + self.END
  File "/anaconda3/envs/customer_sentiment/lib/python3.6/site-packages/nltk/tag/perceptron.py", line 157, in <listcomp>
    context = self.START + [self.normalize(w) for w in tokens] + self.END
  File "/anaconda3/envs/customer_sentiment/lib/python3.6/site-packages/nltk/tag/perceptron.py", line 242, in normalize
    elif word[0].isdigit():
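You can use applymap, which calls the function on each cell individually (in pandas 2.1 and later the same method is named DataFrame.map, and applymap is deprecated).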
df.fillna('').applymap(lambda x: nltk.pos_tag([x])[0][1] if x!='' else '')
0 1 2 3 4 5 6
0 NN NN VBG NN NN VBD NNS
1 NN RB IN TO NN NN
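The fillna('') call and the x != '' guard are what avoid the original error. df.apply(nltk.pos_tag) hands each whole column to the tagger as a list of tokens, and the last frame of the traceback indexes word[0] on every token. The sketch below reproduces just that check in plain Python. It assumes that the empty cell in column 6 holds an empty string, which is what the "string index out of range" message suggests; first_char_is_digit and column_6 are illustrative names, not part of NLTK.

```python
def first_char_is_digit(word):
    # Same expression as the failing line in perceptron.py: word[0].isdigit()
    return word[0].isdigit()


# Assumed contents of column 6: a word in row 0 and an empty string in row 1.
column_6 = ["noodles", ""]

failing = []
for token in column_6:
    try:
        first_char_is_digit(token)
    except IndexError:
        # An empty string has no first character to index.
        failing.append(token)

print(failing)
```

Skipping empty cells before they reach the tagger, as the applymap lambda does, sidesteps the failure.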
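Note: if your dataframe is large, it is more efficient to tag each whole sentence in one call and then convert the tags into a dataframe. The cell-by-cell approach above will be slow on large datasets.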
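A minimal sketch of the sentence-level approach, under some assumptions: each row is one sentence, missing cells are '', None, or NaN, and the tagger returns exactly one (word, tag) pair per token, as nltk.pos_tag does. The helper names tag_rows and is_word are hypothetical. The tagger is passed in as an argument so the row handling can be checked without NLTK installed.

```python
def is_word(cell):
    # True only for non-empty strings; '', None and NaN (a float) are skipped.
    return isinstance(cell, str) and cell != ""


def tag_rows(rows, tagger):
    """Tag each row as one sentence; return rows of tags with gaps kept as ''."""
    tagged = []
    for row in rows:
        tokens = [cell for cell in row if is_word(cell)]
        # One tagger call per row instead of one call per cell.
        pairs = tagger(tokens) if tokens else []
        tags = iter([tag for _word, tag in pairs])
        # Put each tag back in the column its word came from.
        tagged.append([next(tags) if is_word(cell) else "" for cell in row])
    return tagged


# Usage with pandas and NLTK (assumes the NLTK tagger model is downloaded):
#   import nltk
#   import pandas as pd
#   tags = pd.DataFrame(tag_rows(df.values.tolist(), nltk.pos_tag),
#                       index=df.index, columns=df.columns)
```

Because the tagger now sees each word in the context of its sentence, the tags may differ from those produced by tagging every word in isolation.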