How to parse a lisp-readable file of property lists in Python
I am trying to parse an English verb lexicon in order to build an NLP application with Python, so I have to merge it with my NLTK scripts. The lexicon is a lisp-readable file of property lists, but I need a simpler format such as a JSON file or a pandas DataFrame.
An example from the lexicon database is:
;; Grid: 51.2#1#_th,src#abandon#abandon#abandon#abandon+ingly#(1.5,01269572,01188040,01269413,00345378)(1.6,01524319,01421290,01524047,00415625)###AD
(
:DEF_WORD "abandon"
:CLASS "51.2"
:WN_SENSE (("1.5" 01269572 01188040 01269413 00345378)
("1.6" 01524319 01421290 01524047 00415625))
:PROPBANK ("arg1 arg2")
:THETA_ROLES ((1 "_th,src"))
:LCS (go loc (* thing 2)
(away_from loc (thing 2) (at loc (thing 2) (* thing 4)))
(abandon+ingly 26))
:VAR_SPEC ((4 :optional) (2 (animate +)))
)
;; Grid: 45.4.a#1#_ag_th,instr(with)#abase#abase#abase#abase+ed#(1.5,01024949)(1.6,01228249)###AD
(
:DEF_WORD "abase"
:CLASS "45.4.a"
:WN_SENSE (("1.5" 01024949)
("1.6" 01228249))
:PROPBANK ("arg0 arg1 arg2(with)")
:THETA_ROLES ((1 "_ag_th,instr(with)"))
:LCS (cause (* thing 1)
(go ident (* thing 2)
(toward ident (thing 2) (at ident (thing 2) (abase+ed 9))))
((* with 19) instr (*head*) (thing 20)))
:VAR_SPEC ((1 (animate +)))
)
Here is the complete data: https://raw.githubusercontent.com/ihmc/LCS/master/verbs-English.lcs
I have already tried the idea from this post, Parsing a lisp file with Python, using something like the following, but the format I get is not the one I am looking for:
inputdata = '''
(
:DEF_WORD "abandon"
:CLASS "51.2"
:WN_SENSE (("1.5" 01269572 01188040 01269413 00345378)
("1.6" 01524319 01421290 01524047 00415625))
:PROPBANK ("arg1 arg2")
:THETA_ROLES ((1 "_th,src"))
:LCS (go loc (* thing 2)
(away_from loc (thing 2) (at loc (thing 2) (* thing 4)))
(abandon+ingly 26))
:VAR_SPEC ((4 :optional) (2 (animate +)))
)
(
:DEF_WORD "abase"
:CLASS "45.4.a"
:WN_SENSE (("1.5" 01024949)
("1.6" 01228249))
:PROPBANK ("arg0 arg1 arg2(with)")
:THETA_ROLES ((1 "_ag_th,instr(with)"))
:LCS (cause (* thing 1)
(go ident (* thing 2)
(toward ident (thing 2) (at ident (thing 2) (abase+ed 9))))
((* with 19) instr (*head*) (thing 20)))
:VAR_SPEC ((1 (animate +)))
)'''
from pyparsing import OneOrMore, nestedExpr
data = OneOrMore(nestedExpr()).parseString(inputdata)
print (data)
I get output like this:
[
[ ':DEF_WORD', '"abandon"',
':CLASS', '"51.2"',
':WN_SENSE', [
['"1.5"', '01269572', '01188040', '01269413', '00345378'],
['"1.6"', '01524319', '01421290', '01524047', '00415625']
],
':PROPBANK', ['"arg1 arg2"'],
':THETA_ROLES', [['1', '"_th,src"']],
':LCS', ['go', 'loc', ['*', 'thing', '2'],
['away_from', 'loc', ['thing', '2'],
['at', 'loc', ['thing', '2'], ['*', 'thing', '4']]], ['abandon+ingly', '26']],
':VAR_SPEC', [['4', ':optional'], ['2', ['animate', '+']]]]
,
[':DEF_WORD', '"abase"',
':CLASS', '"45.4.a"',
':WN_SENSE', [
['"1.5"', '01024949'],
['"1.6"', '01228249']
],
':PROPBANK', ['"arg0 arg1 arg2(with)"'],
':THETA_ROLES', [['1', '"_ag_th,instr(with)"']],
':LCS', ['cause', ['*', 'thing', '1'],
['go', 'ident', ['*', 'thing', '2'],
['toward', 'ident', ['thing', '2'],
['at', 'ident', ['thing', '2'],
['abase+ed', '9']]]],
[['*', 'with', '19'], 'instr', ['*head*'], ['thing', '20']]],
':VAR_SPEC', [['1', ['animate', '+']]]
]
]
I am not sure how to work with this output format to get, for example, the THETA_ROLES values or the other verb features in this lexicon. All my sentences are in arrays that I process with pandas and NLTK, so the idea is to look for sentences containing verb types that have a specific THETA_ROLES value, or other features present in this lexicon.
The data you are getting back is a flat sequence of key-value pairs. That is, you have something of the form ["A", 1, "B", 2], but you want a dictionary like {"A": 1, "B": 2}.
Here is a generator that returns the flattened sequence as a sequence of pairs:
def pairs(seq):
    for x, y in zip(seq[::2], seq[1::2]):
        yield (x, y)

print(dict(pairs(["A", 1, "B", 2])))
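(The same pairing could also be written inline as dict(zip(seq[::2], seq[1::2])), which builds the dictionary directly from the even- and odd-indexed elements, but the generator keeps the intent explicit.)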
Use that to convert each parsed group into a Python dict; from there you can easily pull out the pieces you want by name:
for group in data:
    groupdict = dict(pairs(group))
    print(groupdict[":THETA_ROLES"])
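If the end goal is a pandas DataFrame over the whole verbs-English.lcs file, a minimal sketch along these lines could work. It assumes the pairs() generator above is in scope, that the ";; Grid:" comment lines are stripped before parsing (nestedExpr only expects parenthesised groups, so those lines would likely trip it up), that the file decodes as Latin-1, and that every entry carries the DEF_WORD, CLASS and THETA_ROLES keys; the column names are just illustrative choices.

import pandas as pd
from pyparsing import OneOrMore, nestedExpr

# Read the lexicon and drop the ";;" comment lines before parsing
# (assumption: every comment line starts with ";").
with open("verbs-English.lcs", encoding="latin-1") as f:  # encoding is an assumption
    text = "\n".join(line for line in f if not line.lstrip().startswith(";"))

entries = OneOrMore(nestedExpr()).parseString(text)

rows = []
for group in entries:
    d = dict(pairs(group))
    rows.append({
        "verb": d[":DEF_WORD"].strip('"'),   # parsed tokens keep their quote marks
        "class": d[":CLASS"].strip('"'),
        # each THETA_ROLES entry looks like [index, '"_th,src"']
        "theta_roles": [role[1].strip('"') for role in d[":THETA_ROLES"]],
    })

df = pd.DataFrame(rows)

# Example query: all verbs whose theta grid contains "_th,src"
print(df[df["theta_roles"].apply(lambda roles: "_th,src" in roles)])

From a DataFrame like this, df.to_json() or an ordinary merge against your sentence arrays should get you the rest of the way.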