Split text into tokens on different rows in a dataframe
I'm new to this, but I'm trying to split the text in a pandas dataframe into separate rows, one per token, each with its respective POS and tag. For example:

   Text
1  Police officers arrest teen.
2  Man agrees to help.

What I want to achieve here is this:
Sentence#  Token     POS  Tag
1          Police    NNS  B-NP
           officers  NNS  I-NP
           arrest    VBP  B-VP
           teen      NN   B-NP
2          Man       NNP  B-NP
           agrees    VBZ  B-VP
           to        TO   B-VP
           help      VB   B-VP
The nltk module can do what you want. The code below uses nltk to create a new DataFrame whose output is similar to the one you want. To get tags that match your desired output exactly, you may need to supply your own chunk parser; I'm not an expert on POS and IOB tags.
import pandas as pd
from nltk import word_tokenize, pos_tag, tree2conlltags, RegexpParser

# orig data
d = {'Text': ["Police officers arrest teen.", "Man agrees to help."]}
# orig DataFrame
df = pd.DataFrame(data=d)
# new data
new_d = {'Sentence': [], 'Token': [], 'POS': [], 'Tag': []}
# grammar taken from nltk.org
grammar = r"NP: {<[CDJNP].*>+}"
parser = RegexpParser(grammar)
for idx, row in df.iterrows():
    temp = tree2conlltags(parser.parse(pos_tag(word_tokenize(row["Text"]))))
    new_d['Token'].extend(i[0] for i in temp)
    new_d['POS'].extend(i[1] for i in temp)
    new_d['Tag'].extend(i[2] for i in temp)
    new_d['Sentence'].extend([idx + 1] * len(temp))
# new DataFrame
new_df = pd.DataFrame(data=new_d)
print(f"***Original DataFrame***\n\n{df}\n")
print(f"***New DataFrame***\n\n{new_df}")
Output:
***Original DataFrame***

                           Text
0  Police officers arrest teen.
1           Man agrees to help.

***New DataFrame***

   Sentence     Token  POS   Tag
0         1    Police  NNP  B-NP
1         1  officers  NNS  I-NP
2         1    arrest  VBP     O
3         1      teen   NN  B-NP
4         1         .    .     O
5         2       Man   NN  B-NP
6         2    agrees  VBZ     O
7         2        to   TO     O
8         2      help   VB     O
9         2         .    .     O
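To move the verb tags closer to your desired B-VP output, one option is to extend the grammar with a VP rule. This is only a sketch (the VP pattern below is my own guess, not something from nltk.org), and the IOB tags it produces still differ slightly from your target (e.g. "help" comes out as I-VP rather than B-VP):

```python
from nltk import RegexpParser, tree2conlltags

# Two chunk rules: the NP rule from above, plus a guessed VP rule
grammar = r"""
NP: {<[CDJNP].*>+}
VP: {<TO>?<VB.*>+}
"""
parser = RegexpParser(grammar)

# Pre-tagged tokens for "Man agrees to help." (pre-tagging avoids the
# nltk.download steps; normally you'd use pos_tag(word_tokenize(...)))
tagged = [("Man", "NN"), ("agrees", "VBZ"), ("to", "TO"),
          ("help", "VB"), (".", ".")]
for token, pos, tag in tree2conlltags(parser.parse(tagged)):
    print(token, pos, tag)
```

The rules are applied in order, so tokens already inside an NP chunk are skipped when the VP rule runs.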
Note: after you pip install nltk, you may need to call nltk.download a few times before the code above can run. The error messages you get should tell you what to execute. For example, you may need to do this:

>>> import nltk
>>> nltk.download('punkt')
>>> nltk.download('averaged_perceptron_tagger')
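As an aside, the dict-of-lists bookkeeping in the loop can also be expressed with pandas' DataFrame.explode, which turns a column of lists into one row per element. A minimal sketch using pre-computed (token, POS, tag) triples (copied from the output above, so no nltk calls are needed here):

```python
import pandas as pd

# One list of (token, POS, tag) triples per sentence
rows = {
    'Sentence': [1, 2],
    'triples': [
        [("Police", "NNP", "B-NP"), ("officers", "NNS", "I-NP"),
         ("arrest", "VBP", "O"), ("teen", "NN", "B-NP"), (".", ".", "O")],
        [("Man", "NN", "B-NP"), ("agrees", "VBZ", "O"),
         ("to", "TO", "O"), ("help", "VB", "O"), (".", ".", "O")],
    ],
}

# explode: one row per triple, repeating the Sentence number on each row
new_df = pd.DataFrame(rows).explode('triples', ignore_index=True)
# Unpack each triple into its own column, then drop the helper column
new_df[['Token', 'POS', 'Tag']] = new_df['triples'].tolist()
new_df = new_df.drop(columns='triples')
print(new_df)
```

This produces the same shape of DataFrame as the loop version, with one row per token.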