TXR: Parsing summary reports containing unicode with a more complicated syntax using functions

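I am trying to parse the "summary" area of a batch of computer-generated reports, where the report names and their associated variables vary from file to file. Here is a made-up example in the same format: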

 Summary Report


       Bath Tub

  Temperature:    30 °C       

  Water ready                 
       volume:    200000 cm³  


    Bath Room

   Floor Area:    40 ft²      

  Door Height:    9 ± 0.1 ft  



Full Report Set

It is hard to tell from the above what the whitespace looks like, so here is a screenshot of my text editor with whitespace made visible.

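The region of interest starts at Summary Report and ends at Full Report Set. A property name may span two lines. Property names are aligned so that the colon : stays at the same character position within each sub-report.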

Judging by the diagnostic output, my attempt to take advantage of this fact does not seem to have worked.

txr: (src/generic-micrometrics-report.txr:36) chr mismatch (position 11 vs. k)
txr: (src/generic-micrometrics-report.txr:36) variable k binding mismatch (13 vs. 12)
txr: (src/generic-micrometrics-report.txr:36) chr mismatch (position 12 vs. k)
txr: (src/generic-micrometrics-report.txr:36) string matched, position 13-18 (data/dummy-generic-report.txt:6)
txr: (src/generic-micrometrics-report.txr:36)   Temperature:    30 °C
txr: (src/generic-micrometrics-report.txr:36)              ^    ^
txr: (src/generic-micrometrics-report.txr:23) spec ran out of data
txr: (source location n/a) function (capture (nil (k . 13) (report . "Bath Tub"))) failed

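I have included the code below. Can you explain why it doesn't work? Is the colon_position function doing what I intend it to do? If so, why does it fail? How would you write the capture function? Is this the general approach you would take, or is there a better way? Thank you very much for your help and advice.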

@; This output format always starts with or ends with at least 2 blank spaces.
@; Fully blank spaced lines follow each property value pair line.
@(define blank_spaces)
  @/[ ]+/@(eol)
@(end)
@; All colons align at the same column position within the body of a report.
@; If that doesn't happen, that means there is nothing to capture,
@; which shouldn't happen.
@; This function should bind the appropriate position without updating
@; the line position.
@; Reports end when there is an empty line, so don't look past that.
@(define colon_position (column))
@(trailer)
@(gather :vars (column))
@(skip)@(chr column):@(skip)
@(until)

@(end)
@(end)
@; Capture values for a property. Values are always given on a single line.
@; If there is error information, it will be indicated by a ± character.#\x00B1
@(define capture (value error units))
@(cases)@value@\ ±@\ @error@\ @units@/[ ]+/@(eol)@\
@(or)@value@\ @units@/[ ]+/@(eol)@(bind error "")@\
@(end)
@(end)
 Summary Report

@(collect :vars (report property value error units))

 @report

@(forget k)
@(colon_position k)
@(cases)
 @property@(chr k):    @(capture value error units)@(blank_spaces)
@(ord)
@; Properties can span two lines. I have not seen any that span more.
 @property_head@(chr k)     @(blank_spaces)
 @property_tail@(chr k):    @(capture value error units)@(blank_spaces)
 @(merge property property_head property_tail)
 @(cat property " ")
@(end)
@(blank_spaces)
@(end)


Full Report Set
@(output)
report,property,value,error,units
@(repeat)
@report,@property,@value,@error,@units
@(end)
@(end)

After making a few changes here and there, I now get the following output:

report,property,value,error,units
Bath Tub,Temperature,30,,°C
Bath Tub,Water ready volume,200000,,cm³
Bath Room,Floor Area,40,,ft²
Bath Room,Door Height,9,0.1,ft

Code:

@; This output format always starts with or ends with at least 2 blank spaces.
@; Fully blank spaced lines follow each property value pair line.
@(define blank_spaces)@\
@/[ ]*/@(eol)@\
@(end)
@; All colons align at the same column position within the body of a report.
@; If that doesn't happen, that means there is nothing to capture,
@; which shouldn't happen.
@; This function should bind the appropriate position without updating
@; the line position.
@; Reports end when there is an empty line, so don't look past that.
@(define colon_position (column))
@  (trailer)
@  (gather :vars (column))
@  (skip)@(chr column):@(skip)
@(until)

@(end)
@(end)
@; Capture values for a property. Values are always given on a single line.
@; If there is error information, it will be indicated by a ± character.#\x00B1
@(define capture (value error units))@\
  @(cases)@value@\ ±@\ @error@\ @units @(eol)@\
  @(or)@value@\ @units@/[ ]+/@(eol)@(bind error "")@\
  @(end)@\
@(end)
 Summary Report

@(collect :vars (report property value error units))

 @report

@  (colon_position k)
@  (collect)
@    (cases)
 @property@(chr k):    @(capture value error units)@(blank_spaces)
@    (or)
@; Properties can span two lines. I have not seen any that span more.
 @property_head@(chr k)     @(blank_spaces)
 @property_tail@(chr k):    @(capture value error units)@(blank_spaces)
@      (merge property property_head property_tail)
@      (cat property " ")
@    (end)
@  (until)


@  (end)
@(until)
Full Report Set
@(end)
@(output)
report,property,value,error,units
@  (repeat)
@    (repeat)
@report,@property,@value,@error,@units
@    (end)
@  (end)
@(end)

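The colon trick does work (a nice application of trailer and chr). What tripped the code up was an assortment of small details: @(or) was misspelled as @(ord), the pattern functions that are meant to be horizontal were not written with proper @\ line continuations, a mistake in @(blank_spaces) made it unconditionally want to consume some spaces, there were spurious spaces before @(merge), and so on.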
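As a minimal sketch of the colon trick in isolation (not part of the original code): when k is unbound, @(chr k) binds it to the current character position, and the @(skip) in front of it searches for the first position at which the rest of the line matches. The snippet assumes the first line of the input contains a colon, and the output wording is made up for illustration.

```txr
@; Find the first colon on the line and record its column in k.
@(skip)@(chr k):@(skip)
@(output)
colon at column @k
@(end)
```

Once k is bound, a later @(chr k) only matches at that same column, which is what the property patterns rely on. For the Temperature line of the sample, the diagnostics in the question report k as 13.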

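Beyond that, the main problem is that the data is doubly nested, so we need a collect inside a collect. We also need proper @(until) termination patterns. For the inner collect I chose two blank lines; that seems to be what terminates those sections (it works for the data sample). The outer collect terminates at Full Report Set, though that is not strictly necessary.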

To go along with the nested collect, we use a nested repeat in the output.
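Here is a small self-contained sketch of how the nesting levels line up. The bindings are written by hand with @(bind) to stand in for what the two collects produce (values taken from the sample data): report is a flat list, while property is a list of lists. It assumes @(bind) accepts a nested list literal the same way it accepts a flat one.

```txr
@; Hand-made stand-ins for the collected variables (assumption: nested
@; list literals are accepted by bind just like flat ones).
@(bind report ("Bath Tub" "Bath Room"))
@(bind property (("Temperature" "Water ready volume") ("Floor Area" "Door Height")))
@(output)
@  (repeat)
@    (repeat)
@report,@property
@    (end)
@  (end)
@(end)
```

The outer repeat steps through report and the outer level of property together; the inner repeat then steps through one report's properties while report stays fixed.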

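I applied some indentation. Horizontal functions can be indented with whitespace, because leading whitespace after a line continuation is ignored.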
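For instance, the blank_spaces function from the listing above could equally be written with its body indented. This is a sketch of the same definition with nothing changed except the leading whitespace on the continuation line:

```txr
@; Same horizontal function as before; the two leading spaces on the
@; second line are swallowed by the preceding @\ continuation.
@(define blank_spaces)@\
  @/[ ]*/@(eol)@\
@(end)
```

Without the @\ continuations the same text would define a vertical function, and the two leading spaces would become pattern text that has to match in the input.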

The @(forget k) is gone; there is no k in scope at that point. Each iteration of the surrounding collect binds k afresh, in an environment that has no k.

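Addendum: here is a diff against the code that makes it more robust against unexpected data. As it stands, the inner @(collect) silently skips elements that do not match, which means that if the file contains elements that do not fit the expected cases, they are ignored. This behavior is already being exploited: it is why the blank lines between data items are ignored. We can tighten this up with :gap 0 (the collected regions must be contiguous) and handle blank lines as a case of their own. A fallback case can then diagnose an input line as unrecognized: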

diff --git a/extract.txr b/extract.txr
index 8c93d89..3d1fac6 100644
--- a/extract.txr
+++ b/extract.txr
@@ -24,6 +24,7 @@
   @(or)@value@\ @units@/[ ]+/@(eol)@(bind error "")@\
   @(end)@\
 @(end)
+@(name file)
  Summary Report

 @(collect :vars (report property value error units))
@@ -31,7 +32,7 @@
  @report

 @  (colon_position k)
-@  (collect)
+@  (collect :gap 0)
 @    (cases)
  @property@(chr k):    @(capture value error units)@(blank_spaces)
 @    (or)
@@ -40,6 +41,12 @@
  @property_tail@(chr k):    @(capture value error units)@(blank_spaces)
 @      (merge property property_head property_tail)
 @      (cat property " ")
+@    (or)
+
+@    (or)
+@      (line ln)
+@      badline
+@      (throw error `@file:@ln unrecognized syntax: @badline`)
 @    (end)
 @  (until)