Pattern table to Pandas DataFrame
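I am using the Python package "Pattern.en", which gives me the subject, object and other details of a given sentence.
I want to store this output in another variable or a DataFrame for further processing, but I have not been able to do so.
Any input on this would be helpful.
Sample code is given below for reference.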
from pattern.en import parse
from pattern.en import pprint
import pandas as pd
input = parse('I want to go to the Restaurant as I am hungry very much')
print(input)
I/PRP/B-NP/O want/VBP/B-VP/O to/TO/I-VP/O go/VB/I-VP/O to/TO/O/O the/DT/B-NP/O Restaurant/NNP/I-NP/O as/IN/B-PP/B-PNP I/PRP/B-NP/I-PNP am/VBP/B-VP/O hungry/JJ/B-ADJP/O very/RB/I-ADJP/O much/JJ/I-ADJP/O
pprint(input)
          WORD   TAG    CHUNK    ROLE   ID     PNP    LEMMA
             I   PRP    NP       -      -      -      -
          want   VBP    VP       -      -      -      -
            to   TO     VP ^     -      -      -      -
            go   VB     VP ^     -      -      -      -
            to   TO     -        -      -      -      -
           the   DT     NP       -      -      -      -
    Restaurant   NNP    NP ^     -      -      -      -
            as   IN     PP       -      -      PNP    -
             I   PRP    NP       -      -      PNP    -
            am   VBP    VP       -      -      -      -
        hungry   JJ     ADJP     -      -      -      -
          very   RB     ADJP ^   -      -      -      -
          much   JJ     ADJP ^   -      -      -      -
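Note the output of both the print and pprint statements. I am trying to store either of them in a variable. It would be even better to store the pprint output in a DataFrame, since it is printed in tabular format.
However, when I try to do so I get the error shown below.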
df = pd.DataFrame(input)
ValueError: DataFrame constructor not properly called!
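The error arises because parse() returns a tagged string rather than rows of data, and pd.DataFrame cannot build a table from a single string. As a rough sketch of one workaround that skips Pattern's tree classes, the string can be split by hand. This assumes every token has exactly the four slash-separated fields seen in the print output above (word, part-of-speech tag, chunk, PNP). The names tagged_string_to_rows and to_dataframe are hypothetical helpers, not part of Pattern.

```python
def tagged_string_to_rows(tagged, fields=("word", "tag", "chunk", "pnp")):
    """Split a 'word/TAG/CHUNK/PNP ...' string into one dict per token."""
    rows = []
    for token in tagged.split():
        parts = token.split("/")
        # Assumption: each token carries exactly len(fields) tags.
        if len(parts) != len(fields):
            raise ValueError("unexpected token format: %r" % token)
        rows.append(dict(zip(fields, parts)))
    return rows


def to_dataframe(rows):
    """Build a DataFrame from the row dicts (pandas is imported lazily)."""
    import pandas as pd
    return pd.DataFrame(rows)


# Sample copied from the start of the print() output in the question.
sample = "I/PRP/B-NP/O want/VBP/B-VP/O to/TO/I-VP/O go/VB/I-VP/O"
rows = tagged_string_to_rows(sample)
```

Calling to_dataframe(rows) should then give one row per token, with the raw IOB-prefixed chunk tags (B-NP, I-VP) rather than the ^ markers that pprint shows.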
Starting from the source code of the table function, I came up with this:
from pattern.en import parse
from pattern.text.tree import WORD, POS, CHUNK, PNP, REL, ANCHOR, LEMMA, IOB, ROLE, MBSP, Text
import pandas as pd

def sentence2df(sentence, placeholder="-"):
    # Standard tags first, then any custom tags present in the sentence.
    tags = [WORD, POS, IOB, CHUNK, ROLE, REL, PNP, ANCHOR, LEMMA]
    tags += [tag for tag in sentence.token if tag not in tags]

    def format(token, tag):
        # Returns the token tag as a string.
        if   tag == WORD   : s = token.string
        elif tag == POS    : s = token.type
        elif tag == IOB    : s = token.chunk and (token.index == token.chunk.start and "B" or "I")
        elif tag == CHUNK  : s = token.chunk and token.chunk.type
        elif tag == ROLE   : s = token.chunk and token.chunk.role
        elif tag == REL    : s = token.chunk and token.chunk.relation and str(token.chunk.relation)
        elif tag == PNP    : s = token.chunk and token.chunk.pnp and token.chunk.pnp.type
        elif tag == ANCHOR : s = token.chunk and token.chunk.anchor_id
        elif tag == LEMMA  : s = token.lemma
        else               : s = token.custom_tags.get(tag)
        return s or placeholder

    # One column per tag, one entry per token.
    columns = [[format(token, tag) for token in sentence] for tag in tags]
    # Mark tokens that continue a chunk (IOB tag "I") with " ^", then drop the IOB column.
    columns[3] = [columns[3][i]+(iob == "I" and " ^" or "") for i, iob in enumerate(columns[2])]
    del columns[2]
    header = ['word', 'tag', 'chunk', 'role', 'id', 'pnp', 'anchor', 'lemma']+tags[9:]
    # The anchor column is only useful with MBSP, so drop it otherwise.
    if not MBSP:
        del columns[6]
        del header[6]
    # Transpose the columns into rows, one row per token.
    return pd.DataFrame(
        [[x[i] for x in columns] for i in range(len(columns[0]))],
        columns=header,
    )
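The last expression in sentence2df turns the per-tag columns into per-token rows by indexing. A small sketch with made-up data suggests that this is a plain transpose, which zip(*columns) expresses more compactly. The toy values are illustrative only, and this is an alternative spelling rather than a required change.

```python
# Toy column-major data: one inner list per tag, one entry per token.
columns = [
    ["I", "want", "to"],      # word
    ["PRP", "VBP", "TO"],     # tag
    ["NP", "VP", "VP ^"],     # chunk
]

# Index-based transpose, as written in sentence2df.
rows_by_index = [[x[i] for x in columns] for i in range(len(columns[0]))]

# Equivalent transpose using zip; each tuple is converted back to a list.
rows_by_zip = [list(row) for row in zip(*columns)]
```

Both forms produce the same list of rows, so either can be handed to pd.DataFrame together with the header list.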
Usage
>>> string = parse('I want to go to the Restaurant as I am hungry very much')
>>> sentence = Text(string, token=[WORD, POS, CHUNK, PNP])[0]
>>> df = sentence2df(sentence)
>>> print(df)
          word  tag   chunk  role  id  pnp  lemma
0            I  PRP      NP     -   -    -      -
1         want  VBP      VP     -   -    -      -
2           to   TO    VP ^     -   -    -      -
3           go   VB    VP ^     -   -    -      -
4           to   TO       -     -   -    -      -
5          the   DT      NP     -   -    -      -
6   Restaurant  NNP    NP ^     -   -    -      -
7           as   IN      PP     -   -  PNP      -
8            I  PRP      NP     -   -  PNP      -
9           am  VBP      VP     -   -    -      -
10      hungry   JJ    ADJP     -   -    -      -
11        very   RB  ADJP ^     -   -    -      -
12        much   JJ  ADJP ^     -   -    -      -
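For further processing, the " ^" suffix in the chunk column (which marks a token that continues the previous chunk) may be awkward to filter on. A minimal sketch, assuming rows shaped like the output above, groups consecutive tokens into whole chunks. group_chunks is a hypothetical helper, and the sample pairs are copied from the first seven rows of the table.

```python
def group_chunks(rows, placeholder="-"):
    """Group (word, chunk) pairs into (chunk_type, [words]) phrases.

    A chunk value ending in " ^" continues the previous chunk;
    the placeholder marks a token that is outside any chunk.
    """
    phrases = []
    for word, chunk in rows:
        continues = chunk.endswith(" ^")
        if continues and phrases:
            # Append the word to the phrase opened by an earlier token.
            phrases[-1][1].append(word)
        else:
            base = chunk[:-2] if continues else chunk
            phrases.append((None if base == placeholder else base, [word]))
    return phrases


# (word, chunk) pairs for the first seven tokens of the example sentence.
rows = [("I", "NP"), ("want", "VP"), ("to", "VP ^"), ("go", "VP ^"),
        ("to", "-"), ("the", "NP"), ("Restaurant", "NP ^")]
phrases = group_chunks(rows)
```

With the DataFrame returned by sentence2df, the pairs could be taken from zip(df['word'], df['chunk']).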