Take three tab separated tokens and make a Prolog "fact"

Question

Essentially, I want to go from foo\tbar\tbaz on each line of input to 'bar'('foo', 'baz').

If any token contains a single quote, it needs to be escaped with a backslash:

don't --> 'don\'t'

Details:

I have a file containing 'semi-structured' sentence components in this form:

the grand hall of the hong kong convention  attend by   some # guests
principal representatives of both countries seat on the central dais
representing china  be  mr jiang
britain be  hrh
the principal representatives   be more than    # distinguished guests
hong kong   end with    the playing of the british national anthem
this    follow at   the stroke of midnight
both countries  take part in    the ceremony
the ceremony    start at about  # pm
the ceremony    end about   # am
# royal hong kong police officers   lower   the british hong kong flag
another #   raise   the sar flag
the #   leave for   the royal yacht britannia
the handover of hong kong   hold by the chinese and british governments
the world   cast eye on hong kong
the # governments   hold on schedule
this    be festival for the chinese nation
july # , #  go in   the annals of history
the hong kong compatriots   become master of    this chinese land
hong kong   enter era of    development
history remember    mr deng xiaoping
it  be along    the course
we  resolve the hong kong question
i   wish to express thanks to   all the personages
both china and britain  contribute to   the settlement of the hong kong
the world   support hong kong 's return
i   wish to extend  my cordial greetings and best wishes

As you can see, they are separated by tabs. What I want to do is create plain definite clauses from this data, rendering them as:

'attend by'('some # guests','the grand hall of the hong kong convention').
'take part in'('the ceremony','both countries').
be('representing china', 'mr jiang').

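So in the data as it is now, there is a verb phrase in the middle. This should be the foundation of the new construct; the entity that is acted upon should then be the first argument, followed by the main actor.

I want these to be eventually usable in Prolog.

I imagine not all of the data will be complete, so perhaps I can just throw those lines away.

I imagine there is some kind of fancy perl script or regex/sed type operation that could achieve this most efficiently. I need to do this on a large file, so I'm hoping to optimise for efficiency, which is why I have put it here.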

Answer 1

With sed:

sed "s/\(.*\)\t\(.*\)\t\(.*\)/'\2'('\1', '\3')./" filename

To leave tokens without spaces unquoted, awk would be simpler:

awk -F'\t' -v q="'" 'function quote(token) { if(index(token, " ")) { return q token q }; return token } { print quote($2) "(" quote($1) ", " quote($3) ")." }' filename

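As for performance, I suspect the bottleneck is I/O rather than this program. If it really does turn out to be a problem, though, you'll not want to mess around with scripting languages; cobble together 20 lines of C++ to do it instead.

EDIT: In response to the comments (what do I know about Prolog, eh? :P), to always quote and to escape apostrophes inside the quotes, awk is again easier:

awk -F'\t' -v q="'" 'function quote(token) { gsub(q, "\\" q, token); return q token q } { print quote($2) "(" quote($1) ", " quote($3) ")." }' filename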

Although it is also possible with sed:

sed "s/'/\\\\'/g;s/\(.*\)\t\(.*\)\t\(.*\)/'\2'('\1', '\3')./" filename

This replaces ' with \' before doing the original operation. The shell's own quoting is involved as well, which is why it needs so many backslashes.
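To make the backslash counting less mysterious, here is a small sketch of the two layers of unescaping. It assumes a POSIX shell, and escape_quotes is a made-up helper name for this illustration only. Inside double quotes the shell turns \\\\ into \\, and sed then reads \\ in the replacement as one literal backslash.

```shell
# escape_quotes is a hypothetical helper name used only for this illustration.
escape_quotes() {
  sed "s/'/\\\\'/g"
}

# Show the script text exactly as sed receives it from the shell.
printf '%s\n' "s/'/\\\\'/g"

# Run one sample token through the substitution.
printf '%s\n' "don't" | escape_quotes
```

The first printf prints s/'/\\'/g, which is what sed actually parses, and the second pipeline prints don\'t.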

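Note that the sed solutions require each line to have two tabs. Looking at the test input, I'm not entirely sure that is the case, so awk may be the better fit for you.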
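If incomplete lines should simply be thrown away, as the question suggests, something along these lines might do. It is only a sketch: to_facts is a hypothetical function name, the rule "exactly three non-empty fields" is my assumption about what counts as complete, and only single quotes are escaped (a token containing a backslash would need extra handling).

```shell
# to_facts is a hypothetical helper name, not part of any tool.
# Reads tab-separated lines on stdin and prints one Prolog fact per
# complete line; anything without exactly three non-empty fields is skipped.
to_facts() {
  awk -F'\t' -v q="'" '
    # Escape every single quote with a backslash, then wrap in quotes.
    function quote(token) {
      gsub(q, "\\" q, token)
      return q token q
    }
    # Assumption: a usable line has exactly three non-empty fields.
    NF == 3 && $1 != "" && $2 != "" && $3 != "" {
      print quote($2) "(" quote($1) ", " quote($3) ")."
    }
  '
}
```

Usage would be to_facts < filename > facts.pl. The middle field becomes the functor and the first and third fields become the arguments, matching the foo/bar/baz example at the top of the question; swap quote($1) and quote($3) if the acted-upon entity should come first.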

Answer 2

In SWI-Prolog, consider tokenize_atom/2 (you need a recent version to put arbitrarily long text constants in the source, and you have to escape the '):

t :- Text = '
the grand hall of the hong kong convention  attend by   some # guests
principal representatives of both countries seat on the central dais
... rest of text...
the world   support hong kong \'s return
i   wish to extend  my cordial greetings and best wishes',
tokenize_atom(Text,L), maplist(writeln,L).

yields

?- t.
the
grand
hall
of
the
hong
kong
...

So you could use a DCG to 'understand' the text. That's much easier than going through external tools, I guess...

Edit: let's code Boris' comment:

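% Reads File line by line and asserts one fact per line containing two tabs.
% Note: the first field becomes the functor here; use [Ac1,Hc,Ac2] in the
% =.. call to make the middle field the functor, as the question asks.
file_2_statements(File) :-
  atom_codes('\t', Tab),
  open(File, read, S),
  repeat,                           % failure-driven loop over the lines
   read_line_to_codes(S, L),
   (  L \= end_of_file
   -> append([H,Tab,A1,Tab,A2], L), % split at the tabs; fails on short lines
      maplist(atom_codes, [Hc,Ac1,Ac2], [H,A1,A2]),
      P =.. [Hc,Ac1,Ac2], assert(P),
      fail                          % backtrack to process the next line
   ;  true
   ),
  !,                                % drop the choice point left by repeat
  close(S).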
