Parsing Python asciitree output and printing 'labelled bracket notation'
Google SyntaxNet gives output like this:
saw VBD ROOT
 +-- Alice NNP nsubj
 |   +-- , , punct
 |   +-- reading VBG rcmod
 |       +-- who WP nsubj
 |       +-- had VBD aux
 |       +-- been VBN aux
 |       +-- about IN prep
 |           +-- SyntaxNet NNP pobj
 +-- , , punct
 +-- Bob NNP dobj
 +-- in IN prep
 |   +-- hallway NN pobj
 |       +-- the DT det
 +-- yesterday NN tmod
 +-- . . punct
I want to read and parse this output (string data) with Python and print it in 'labelled bracket notation', like this:
[saw [Alice [,] [reading [who][had][been][about [SyntaxNet]]]][,][Bob][in [hallway [the]]][yesterday][.]]
Can you help me?
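Instead of parsing the tree, you can have SyntaxNet output everything in conll format, which is easier to parse. The conll format for your sentence looks like this: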
1 Alice _ NOUN NNP _ 10 nsubj _ _
2 , _ . , _ 1 punct _ _
3 who _ PRON WP _ 6 nsubj _ _
4 had _ VERB VBD _ 6 aux _ _
5 been _ VERB VBN _ 6 aux _ _
6 reading _ VERB VBG _ 1 rcmod _ _
7 about _ ADP IN _ 6 prep _ _
8 SyntaxNet _ NOUN NNP _ 7 pobj _ _
9 , _ . , _ 10 punct _ _
10 saw _ VERB VBD _ 0 ROOT _ _
11 Bob _ NOUN NNP _ 10 dobj _ _
12 in _ ADP IN _ 10 prep _ _
13 the _ DET DT _ 14 det _ _
14 hallway _ NOUN NN _ 12 pobj _ _
15 yesterday _ NOUN NN _ 10 tmod _ _
16 . _ . . _ 10 punct _ _
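The meaning of each column is described in the CoNLL data format documentation (http://ilk.uvt.nl/conll/#dataformat). The only columns we care about for now are the first (the ID of the word), the second (the word itself) and the seventh (the head, in other words the parent). The root node has a parent of 0.
To get the conll format, we just need to comment out the last few lines of demo.sh (which I assume is what you used to get your output):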
$PARSER_EVAL \
--input=$INPUT_FORMAT \
--output=stdout-conll \
--hidden_layer_sizes=64 \
--arg_prefix=brain_tagger \
--graph_builder=structured \
--task_context=$MODEL_DIR/context.pbtxt \
--model_path=$MODEL_DIR/tagger-params \
--slim_model \
--batch_size=1024 \
--alsologtostderr \
| \
$PARSER_EVAL \
--input=stdin-conll \
--output=stdout-conll \
--hidden_layer_sizes=512,512 \
--arg_prefix=brain_parser \
--graph_builder=structured \
--task_context=$MODEL_DIR/context.pbtxt \
--model_path=$MODEL_DIR/parser-params \
--slim_model \
--batch_size=1024 \
--alsologtostderr #\
# | \
# bazel-bin/syntaxnet/conll2tree \
# --task_context=$MODEL_DIR/context.pbtxt \
# --alsologtostderr
(Don't forget to comment out the backslash on the previous line.)
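When I run demo.sh myself, I get a lot of information I don't need. How to get rid of it I leave for you to figure out (let me know :)).
For now I copied the relevant part to a file so that I can pipe it into the Python program I'm about to write. If you can get rid of the extra information, you should be able to pipe demo.sh directly into the Python program as well.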
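One possible way to drop the unwanted lines inside the Python program itself is to keep only lines that look like conll rows. The sketch below is an assumption, not something demo.sh provides: the helper name conll_rows and the sample log line are made up for illustration, and the rule "exactly 10 columns, with a numeric ID and HEAD" is a heuristic that a log line could in principle satisfy by accident.

```python
def conll_rows(lines):
    """Yield the column lists of lines that look like conll rows.

    Assumption: a row has exactly 10 whitespace-separated columns, and
    both the ID (column 1) and the HEAD (column 7) are plain integers.
    Anything else (log messages, blank lines) is skipped.
    """
    for line in lines:
        columns = line.split()
        if len(columns) == 10 and columns[0].isdigit() and columns[6].isdigit():
            yield columns


# Hypothetical input: one made-up log line, two real rows, one blank line.
sample = [
    "INFO: some log message that is not part of the parse",
    "1\tAlice\t_\tNOUN\tNNP\t_\t10\tnsubj\t_\t_",
    "",
    "10\tsaw\t_\tVERB\tVBD\t_\t0\tROOT\t_\t_",
]
rows = list(conll_rows(sample))
print([(row[0], row[1], row[6]) for row in rows])
```

If the extra output turns out to be written to stderr rather than stdout, redirecting stderr in the shell would be a simpler option; I have not checked which stream demo.sh uses for it.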
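Note: I'm fairly new to Python, so feel free to improve my code.
First of all, we simply want to read the conll file from the input. I like to put each word in a nice class.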
#!/usr/bin/env python
import sys

# Conll data format:
# http://ilk.uvt.nl/conll/#dataformat
#
# The only parts we need:
# 1: ID
# 2: FORM (The original word)
# 7: HEAD (The ID of its parent)

class Word:
    "A class containing the information of a single line from a conll file."

    def __init__(self, columns):
        self.id = int(columns[0])
        self.form = columns[1]
        self.head = int(columns[6])
        self.children = []

# Read the conll input and put it in a list of words.
words = []
for line in sys.stdin:
    # Split on any whitespace (tabs or spaces); this also drops the newline.
    columns = line.split()
    # Skip blank lines, such as the one that ends a sentence.
    if not columns:
        continue
    words.append(Word(columns))
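Nice, but it isn't a tree structure yet. We have to do a bit more work.
I could loop over the whole list several times to find the children of each word, but that would be inefficient. Instead I sort them by their parent, so that getting every child of a given parent is just a quick lookup.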
# Sort the words by their head (parent).
lookup = [[] for _ in range(len(words) + 1)]
for word in words:
    lookup[word.head].append(word)
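As a variation on the lookup list above, the same grouping can be sketched with a dictionary keyed by the head ID. This is only an illustration: the name children_of and the hand-written (id, form, head) triples are assumptions, with the triples taken from a fragment of the example sentence.

```python
from collections import defaultdict

# (id, form, head) triples for a fragment of the example sentence.
rows = [(10, "saw", 0), (12, "in", 10), (13, "the", 14), (14, "hallway", 12)]

# Map each head ID to the forms of the words that point at it.
children_of = defaultdict(list)
for word_id, form, head in rows:
    children_of[head].append(form)

print(children_of[10])  # ['in']
print(children_of[12])  # ['hallway']
```

A dictionary does not rely on the head IDs being smaller than or equal to the number of words read, which the list-based lookup does.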
Create the tree structure:
# Build a tree
def buildTree(head):
    "Find the children for the given head in the lookup, recursively"
    # Get all the children of this parent.
    children = lookup[head]
    # Get the children of the children.
    for child in children:
        child.children = buildTree(child.id)
    return children

# Get the root's child. There should only be one child. The function returns an
# array of children so just get the first one.
tree = buildTree(0)[0]  # Start with head = 0 (which is the ROOT node)
To be able to print the tree in the new format, you can add a few method overloads to the Word class:
    def __str__(self):
        if len(self.children) == 0:
            return "[" + self.form + "]"
        else:
            return "[" + self.form + " " + "".join(str(child) for child in self.children) + "]"

    def __repr__(self):
        return self.__str__()
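To see what this rendering produces without running SyntaxNet, here is a small self-contained sketch. The Node class is a hypothetical stand-in for Word that keeps only the form and the children, and the tree is built by hand from the "in the hallway" part of the sentence.

```python
class Node:
    """Hypothetical stand-in for Word: a form plus a list of child nodes."""

    def __init__(self, form, children=()):
        self.form = form
        self.children = list(children)

    def __str__(self):
        # A leaf is just "[form]"; otherwise the children follow the form.
        if not self.children:
            return "[" + self.form + "]"
        return "[" + self.form + " " + "".join(str(c) for c in self.children) + "]"


# in -> hallway -> the
tree = Node("in", [Node("hallway", [Node("the")])])
print(tree)  # [in [hallway [the]]]
```

Note that siblings are joined without a separator, so the output for the full sentence has "[,][reading" where the example in the question has "[,] [reading".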
Now you can do this:
print(tree)
And pipe it like this:
cat input.conll | ./my_parser.py
Or directly from syntaxnet:
echo "Alice, who had been reading about SyntaxNet, saw Bob in the hallway yesterday." | syntaxnet/demo.sh | ./my_parser.py
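If switching to conll output is not an option, the tree text itself can probably be parsed by looking at how far each "+--" marker is indented. The following is a rough sketch under assumptions: the function names parse_asciitree and to_brackets are made up, and the layout (root on the first line, four more columns of indentation per level) is assumed from the sample string below rather than taken from any documentation.

```python
def parse_asciitree(text):
    """Turn asciitree text into nested [form, children] lists.

    Assumptions: the root is the first non-empty line, every other node
    is introduced by "+-- ", and each nesting level moves that marker
    four columns further to the right.
    """
    root = None
    stack = []  # stack[d] is the most recent node seen at depth d
    for line in text.splitlines():
        if not line.strip():
            continue
        marker = line.find("+-- ")
        if marker == -1:
            depth, label = 0, line.strip()
        else:
            depth, label = marker // 4 + 1, line[marker + 4:]
        node = [label.split()[0], []]  # keep the word, drop tag and relation
        if depth == 0:
            root = node
        else:
            stack[depth - 1][1].append(node)  # attach to the latest parent
        del stack[depth:]
        stack.append(node)
    return root


def to_brackets(node):
    """Render a [form, children] node in labelled bracket notation."""
    form, children = node
    if not children:
        return "[" + form + "]"
    return "[" + form + " " + "".join(to_brackets(c) for c in children) + "]"


sample = "\n".join([
    "saw VBD ROOT",
    " +-- Bob NNP dobj",
    " +-- in IN prep",
    " |   +-- hallway NN pobj",
    " |       +-- the DT det",
    " +-- . . punct",
])
print(to_brackets(parse_asciitree(sample)))
# [saw [Bob][in [hallway [the]]][.]]
```

This depends on the whitespace of the tree being preserved exactly, which is one reason the conll route above is less fragile.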