Ruby ARGF & RegEx: How to split on paragraph carriage return "\r\n" but not end of line "\r\n"

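I'm trying to preprocess some text with a regular expression in Ruby before feeding it to a mapper job, and I would like to split on the carriage return that marks a paragraph break.

The text will come into the mapper through ARGF.each as part of a Hadoop streaming job: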

"\"Walter Elliot, born March 1, 1760, married, July 15, 1784, Elizabeth,\r\n"
"daughter of James Stevenson, Esq. of South Park, in the county of\r\n"
"Gloucester, by which lady (who died 1800) he has issue Elizabeth, born\r\n"
"June 1, 1785; Anne, born August 9, 1787; a still-born son, November 5,\r\n"
"1789\"\r\n"
"\r\n"    # <----- this is where I would like to split
"Precisely such had the paragraph originally stood from the printer's\r\n"

Once that is done, I will strip the newline/carriage return from the end of each line.

The code would look something like this:

ARGF.each do |text|

  paragraph = text.split(INSERT_REGEX_HERE)

  #some more blah will happen beyond here
end

Update:

The desired output is an array like the following:

[
  [0]  "\"Walter Elliot, born March 1, 1760, married, July 15, 1784, Elizabeth,\r\n"
    "daughter of James Stevenson, Esq. of South Park, in the county of\r\n"
    "Gloucester, by which lady (who died 1800) he has issue Elizabeth, born\r\n"
    "June 1, 1785; Anne, born August 9, 1787; a still-born son, November 5,\r\n"
    "1789\"\r\n"
  [1] "Precisely such had the paragraph originally stood from the printer's\r\n"
]

Ultimately, what I want is the following array, with no carriage returns in it:

[
  [0]  "\"Walter Elliot, born March 1, 1760, married, July 15, 1784, Elizabeth,"
    "daughter of James Stevenson, Esq. of South Park, in the county of"
    "Gloucester, by which lady (who died 1800) he has issue Elizabeth, born"
    "June 1, 1785; Anne, born August 9, 1787; a still-born son, November 5,"
    "1789\""
  [1] "Precisely such had the paragraph originally stood from the printer's"
]

Thanks in advance for any insight.

To split the text, use:

result = text.gsub(/(?<!\")\r\n|(?<=\\")\r\n/, '').split(/[\r\n]+\"\r\n\".*?[\r\n]+/)

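Note that when you write ARGF.each do |text|, text is a single line each time, not the whole block of text.

You can pass ARGF.each a custom line separator. With your sample input it will then yield two "lines", which are your two paragraphs.

Try this: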
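To see the line-by-line behaviour without piping anything in, here is a minimal sketch that uses a StringIO as a stand-in for ARGF (an assumption made only so the example runs on its own; the sample text and the helper name chunks_seen are made up for illustration). With the default separator, each_line breaks after every "\n", so the blank "\r\n" line arrives as its own element:

```ruby
require "stringio"

# Made-up sample text; StringIO stands in for ARGF in this sketch.
SAMPLE_LINES = "first line\r\nsecond line\r\n\r\nthird line\r\n"

# Hypothetical helper: collects whatever each_line yields with the default separator.
def chunks_seen(io)
  io.each_line.to_a
end

lines = chunks_seen(StringIO.new(SAMPLE_LINES))

# Print each chunk with its terminator visible.
lines.each { |line| puts line.inspect }
```

Because every chunk is one line, a split inside that block never sees two consecutive "\r\n" sequences together.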

paragraphs = ARGF.each("\r\n\r\n").map{|p| p.gsub("\r\n","")}

This first splits the input into paragraphs, then uses gsub to remove the unwanted line breaks.
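One thing to be aware of: replacing "\r\n" with an empty string glues the last word of one line directly to the first word of the next. If a space between them is wanted, a variation along the following lines might work. This is only a sketch under some assumptions: StringIO stands in for ARGF so it runs without piped input, paragraphs_from is a hypothetical helper name, and the sample text is shortened.

```ruby
require "stringio"

CRLF = "\r\n"

# Hypothetical helper: reads chunks terminated by a blank CRLF line,
# then joins the lines of each chunk with a single space.
def paragraphs_from(io)
  io.each_line(CRLF * 2).map do |para|
    para.split(CRLF).reject(&:empty?).join(" ")
  end.reject(&:empty?) # drop chunks that held only blank lines
end

sample = "Walter Elliot, born March 1, 1760,\r\n" \
         "married, July 15, 1784\r\n" \
         "\r\n" \
         "Precisely such had the paragraph\r\n"

paragraphs = paragraphs_from(StringIO.new(sample))
paragraphs.each { |para| puts para.inspect }
```

In the mapper itself, ARGF would take the place of the StringIO. To get an array of lines per paragraph instead of one string, drop the join(" ") call.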