Take three tab-separated tokens and make a Prolog "fact"
Essentially, go from each input line of the form foo\tbar\tbaz
to 'bar'('foo', 'baz').
If any token contains a single quote, it needs to be escaped with a backslash:
don't
--> 'don\'t'
Details:
I have a file of 'semi-structured' sentence constituents, in the following format:
the grand hall of the hong kong convention attend by some # guests
principal representatives of both countries seat on the central dais
representing china be mr jiang
britain be hrh
the principal representatives be more than # distinguished guests
hong kong end with the playing of the british national anthem
this follow at the stroke of midnight
both countries take part in the ceremony
the ceremony start at about # pm
the ceremony end about # am
# royal hong kong police officers lower the british hong kong flag
another # raise the sar flag
the # leave for the royal yacht britannia
the handover of hong kong hold by the chinese and british governments
the world cast eye on hong kong
the # governments hold on schedule
this be festival for the chinese nation
july # , # go in the annals of history
the hong kong compatriots become master of this chinese land
hong kong enter era of development
history remember mr deng xiaoping
it be along the course
we resolve the hong kong question
i wish to express thanks to all the personages
both china and britain contribute to the settlement of the hong kong
the world support hong kong 's return
i wish to extend my cordial greetings and best wishes
As you can see, they are tab-separated. What I want to do is create plain definite clauses from this data, rendering them as:
'attend by'('some # guests','the grand hall of the hong kong convention').
'take part in'('the ceremony','both countries').
be('representing china', 'mr jiang').
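So in the current data there is a verb phrase in the middle, which should become the functor of the new structure; the entity being acted upon should then be the first argument, followed by the main actor.
I would like these to be usable in Prolog eventually.
I imagine not all of the data is complete, so perhaps I can just throw the incomplete lines away.
I imagine there is some fancy Perl script, or a regex or sed type of operation, that would do this most efficiently. I need to run this on a large file, so I am hoping to optimize for efficiency, which is why I am posting it here.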
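Using sed (the \t escape in the pattern assumes GNU sed):
sed "s/\(.*\)\t\(.*\)\t\(.*\)/'\2'('\1', '\3')./" filename
To leave tokens without spaces unquoted, it is simpler with awk:
awk -F'\t' -v q="'" 'function quote(token) { if (index(token, " ")) { return q token q }; return token } { print quote($2) "(" quote($1) ", " quote($3) ")." }' filename
As for performance, I suspect the bottleneck is I/O, not this program. If it really does turn out to be a problem, though, you will want to stop messing around with scripting languages and knock together 20 lines of C++ to do it.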
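Edit: in response to the comments (what do I know about Prolog, eh? :P), to always quote and to escape apostrophes inside the quotes, awk is again the easier option:
awk -F'\t' -v q="'" 'function quote(token) { gsub(q, "\\" q, token); return q token q } { print quote($2) "(" quote($1) ", " quote($3) ")." }' filename
It is possible with sed as well, though:
sed "s/'/\\\\'/g;s/\(.*\)\t\(.*\)\t\(.*\)/'\2'('\1', '\3')./" filename
This replaces ' with \' before performing the original operation. The shell gets involved in the quoting, which is why it needs so many backslashes.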
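To see how many of those backslashes actually reach sed, the script can be printed before it is run. This is only a small sketch, assuming a POSIX shell; the variable name script is made up for illustration.

```shell
# The escaping part of the sed script, exactly as typed inside double quotes.
script="s/'/\\\\'/g"

# Inside double quotes the shell collapses each \\ into \,
# so sed receives only two of the four backslashes.
printf '%s\n' "$script"

# sed then reads \\ in the replacement as one literal backslash,
# so every ' in the input comes out as \'
printf '%s\n' "don't" | sed "$script"
```

The first command prints the script as sed sees it; the second shows the effect on the don't example from the question.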
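Note that the sed solution requires every line to contain two tabs. Looking at the test input, I am not entirely sure that is the case, so awk may suit you better.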
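If incomplete lines should simply be thrown away, awk can check the field count before printing anything. The following is a rough sketch rather than a definitive solution: the function name tsv_to_facts is made up for illustration, it always quotes every token, and it assumes tokens contain no backslashes (only apostrophes are escaped).

```shell
# Hypothetical helper: turn "first<TAB>middle<TAB>last" lines into
# Prolog facts of the form 'middle'('first', 'last'). and skip
# every line that does not have exactly three non-empty fields.
tsv_to_facts() {
    awk -F'\t' -v q="'" '
        # Always quote the token, escaping embedded single quotes.
        function quote(token) {
            gsub(q, "\\" q, token)
            return q token q
        }
        # Incomplete lines fail this test and produce no output.
        NF == 3 && $1 != "" && $2 != "" && $3 != "" {
            print quote($2) "(" quote($1) ", " quote($3) ")."
        }
    ' "$@"
}
```

It could then be run as tsv_to_facts filename > facts.pl, where facts.pl is just an example output name.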
In SWI-Prolog, consider tokenize_atom/2 (you need a recent version to be able to put arbitrarily long text constants in the source, and to quote the '):
t :- Text = '
the grand hall of the hong kong convention attend by some # guests
principal representatives of both countries seat on the central dais
... rest of text...
the world support hong kong \'s return
i wish to extend my cordial greetings and best wishes',
tokenize_atom(Text,L), maplist(writeln,L).
yields
?- t.
the
grand
hall
of
the
hong
kong
...
So you can use a DCG to 'understand' the text. That is far easier than going through external tools, I guess...
Edit: let's code Boris's comment:
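:- use_module(library(readutil)).   % read_line_to_codes/2
:- use_module(library(lists)).      % append/2

% Read File line by line and assert one fact per line of the form
% First<TAB>Middle<TAB>Last, using the middle token as the functor.
file_2_statements(File) :-
    atom_codes('\t', Tab),
    open(File, read, S),
    repeat,
    read_line_to_codes(S, L),
    (   L \= end_of_file
    ->  append([A1,Tab,H,Tab,A2], L),   % fails on lines without two tabs
        maplist(atom_codes, [Hc,Ac1,Ac2], [H,A1,A2]),
        P =.. [Hc,Ac1,Ac2], assert(P),
        fail                            % backtrack to process the next line
    ;   true
    ),
    !,                                  % drop the choice point left by repeat
    close(S).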