Unable to split data properly with Ruby Regex Rubular

Question

I'm trying to organize and break apart the content of emails that I fetch with Net::POP3. In my code, when I use

p mail.pop

I get

****************************\r\n>>=20\r\n>>11) <> Summary: Working with Vars on Social Influence =\r\nplatform=20\r\n>>=20\r\n>> Name: Megumi Lindon \r\n>>=20\r\n>> Category: Social Psychology=20\r\n>>=20\r\n>> Email: information@example.com =\r\n<mailto:information@example.com>=20\r\n>>=20\r\n>> Journal News: Saving Grace \r\n>>=20\r\n>> Deadline: 10:00 PM EST - 15 February=20\r\n>>=20\r\n>> Query:=20\r\n>>=20\r\n>> Lorem ipsum dolor sit amet \r\n>> consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.\r\n>>=20\r\n>> Duis aute irure dolor in reprehenderit in voluptate \r\n>> velit esse cillum dolore eu fugiat nulla pariatur. =20\r\n>>=20\r\n>> Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.=20\r\n>> Requirements:=20\r\n>>=20\r\n>> Psychologists; anyone with good knowdledge\r\n>> with sociology and psychology.=20\r\n>>=20\r\n>> Please do send me your article and profile\r\n>> you want to be known as well. Thank you!=20\r\n>> Back to Top <x-msg://30/#top> Back to Category Index =\r\n<x-msg://30/#SocialPsychology>\r\n>>-----------------------------------\r\n>>=20\r\n>>

I'm trying to break it apart and organize it into

11) Summary: Working with Vars on Social Influence 

Name: Megumi Lindon 

Category: Social Psychology 

Email: information@example.com 

Journal News: Saving Grace 

Deadline: 10:00 PM EST - 15 February

Questions:Lorem ipsum dolor sit amet consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

Requirements: Psychologists; anyone with good knowdledge with sociology and psychology.

So far I've been using Rubular, but my results vary because I'm still learning how to use regex, gsub and split properly. This is the code I have so far.

  p mail.pop.scan(/Summary: (.+) Name:/)
  p mail.pop.scan(/Name: (.+) Category:/)
  p mail.pop.scan(/Category: (.+) Email:/) 
  p mail.pop.scan(/Email: (.+) Journal News:/)     
  p mail.pop.scan(/Journal News: (.+) Deadline:/)       
  p mail.pop.scan(/Deadline: (.+) Questions:/)    
  p mail.pop.scan(/Questions:(.+) Requirements:/) 
  p mail.pop.scan(/Requirements:(.+) Back to Top/)

But all I get are empty arrays.

[]
[]
[]
[]
[]
[]
[]
[]

I'd like to know how I could do this better. Thanks in advance.

Answer 1

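Gosh! What a mess!

There are of course many ways to approach this, but I expect they all involve multiple steps and a fair amount of trial and error. I can only say how I went about it.

Lots of small steps are a good thing, for two reasons. First, they break the problem into manageable tasks whose solutions can be tested individually. Second, the parsing rules may change in the future. With several steps you may only need to change and/or add one or two operations. With few steps and complex regular expressions you may as well start over, especially if the code was written by someone else.

Suppose text is a variable that holds your string.

First, I don't like all those newlines, because they complicate the regular expressions, so the first thing I'd do is get rid of them: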

s1 = text.gsub(/\n/, '')
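To see why the newlines matter: in Ruby, "." does not match "\n" unless the regex carries the /m flag, so a pattern cannot run from one line of the message into the next. That is one likely reason the scan calls in the question come back empty. A minimal sketch on a made-up two-field sample (SAMPLE and strip_newlines are illustrative names, not part of the original code):

```ruby
# Hypothetical two-field sample shaped like the quoted email (not the real message).
SAMPLE = ">> Name: Megumi Lindon \r\n>>=20\r\n>> Category: Social Psychology=20\r\n"

# Illustrative helper: the same call as the s1 step above.
def strip_newlines(text)
  text.gsub(/\n/, '')
end

# "." stops at "\n", so a pattern that has to span two lines finds nothing.
p SAMPLE.scan(/Name: (.+)Category:/)
# => []

# With the newlines gone, the same pattern can reach across the old line break.
p strip_newlines(SAMPLE).scan(/Name: (.+)Category:/)
# => [["Megumi Lindon \r>>=20\r>> "]]
```

The capture still contains the "\r" and ">>=20" debris, which is what the following steps deal with.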

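Next, there are a lot of "20\r" sequences, which could be troublesome. Since we may want to keep other text that contains digits, we remove just these (as well as "7941\r"):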

s2 = s1.gsub(/\d+\r/, '')
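A small check of what that substitution touches, on an invented fragment: digits immediately followed by "\r" are dropped, while digits elsewhere (such as the time in the deadline) are left alone. FRAGMENT and drop_digits_before_cr are illustrative names.

```ruby
# Invented fragment with the newlines already removed, as in s1.
FRAGMENT = ">> Deadline: 10:00 PM EST - 15 February=20\r>>=20\r>> Query:=20\r"

# Illustrative helper: the same call as the s2 step above.
def drop_digits_before_cr(text)
  text.gsub(/\d+\r/, '')
end

p drop_digits_before_cr(FRAGMENT)
# => ">> Deadline: 10:00 PM EST - 15 February=>>=>> Query:="
```

One caveat: a line of real text that happens to end in digits right before "\r" would lose those digits as well.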

Now let's look at the fields you want, together with the text immediately before and after each of them:

puts s2.scan(/.{4}(?:\w+\s+)*\w+:.{15}/)
  # <> Summary: Working with V
  #=>> Name: Megumi Lindon 
  #=>> Category: Social Psychol
  #=>> Email: information@ex
  #<mailto:information@exa
  #=>> Journal News: Saving Grace 
  #=>> Deadline: 10:00 PM EST -
  #=>> Query:=>>=>> Lorem ip
  #=>> Requirements:=>>=>> Psycholo
  # <x-msg://30/#top> Back
  #<x-msg://30/#SocialPsy
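That inspection pattern reads as: any 4 characters, then one or more words separated by whitespace, then a colon, then any 15 characters. A quick check on an invented fragment (CONTEXT_SAMPLE is an illustrative name); p is used instead of puts so that the "\r" stays visible:

```ruby
# Invented fragment in the shape of s2.
CONTEXT_SAMPLE = "=>> Journal News: Saving Grace \r>>=>> Deadline: 10:00 PM EST - 15 February="

# 4 characters of leading context, the field name and colon, 15 characters of trailing context.
p CONTEXT_SAMPLE.scan(/.{4}(?:\w+\s+)*\w+:.{15}/)
# => ["=>> Journal News: Saving Grace \r", "=>> Deadline: 10:00 PM EST -"]
```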

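We see that the fields of interest begin with "> " and that each field name is followed by ": " or ":=". Let's simplify by changing ":=" after a field name to ": " and "> " before a field name to " :":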

s3 = s2.gsub(/(?<=\w):=/, ": ")
s4 = s3.gsub(/>\s+(?=(?:\w+\s+)*\w+: )/, " :")

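In the regex for s3, (?<=\w) is a "positive lookbehind": the match must be immediately preceded by a word character (which is not part of the match). In the regex for s4, (?=(?:\w+\s+)*\w+: ) is a "positive lookahead": the match must be immediately followed by one or more words, then a colon, then a space. Note that s3 and s4 must be computed in the order given, because the lookahead for s4 only recognizes field names that are already followed by ": ".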
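The two substitutions can be tried on a short invented string to see what the lookarounds do and do not consume. Only the ":=" and the "> " are replaced; the word character before ":=" and the field name after "> " stay where they are. S2_SAMPLE and mark_fields are illustrative names.

```ruby
# Invented fragment in the shape of s2 (newlines and "20\r" already removed).
S2_SAMPLE = "=>> Name: Megumi Lindon \r>>=>> Query:=>>=>> Lorem ipsum"

# Illustrative helper combining the s3 and s4 steps, in that order.
def mark_fields(text)
  text.gsub(/(?<=\w):=/, ": ")                  # ":=" after a word character becomes ": "
      .gsub(/>\s+(?=(?:\w+\s+)*\w+: )/, " :")   # "> " before "Field name: " becomes " :"
end

p mark_fields(S2_SAMPLE)
# => "=> :Name: Megumi Lindon \r>>=> :Query: >>=>> Lorem ipsum"

# Reversed order: "Query:=" is not yet "Query: ", so the lookahead misses it.
reversed = S2_SAMPLE.gsub(/>\s+(?=(?:\w+\s+)*\w+: )/, " :").gsub(/(?<=\w):=/, ": ")
p reversed.include?(" :Query: ")
# => false
```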

We can now remove all non-word characters other than punctuation and spaces:

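# Note: inside the brackets, "!-(" is a range ("!" through "("), so "-" is removed and "#" is kept.
s5 = s4.gsub(/[^a-zA-Z0-9 :;.?!-()\[\]{}]/, "")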
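One detail of that character class is worth checking by hand. Between "!" and "(" the hyphen acts as a range operator, so the class keeps every character from "!" through "(" (which includes "#") and does not keep a literal "-". That is why the results further down lose the "-" in the deadline yet keep the "#" in "#top". A minimal sketch on an invented line, with an escaped-hyphen variant for comparison; the variant is only an assumption about what may have been intended, not what the code above does.

```ruby
# Character class as written above: "!-(" is a range from "!" to "(".
AS_WRITTEN = /[^a-zA-Z0-9 :;.?!-()\[\]{}]/
# Variant with the hyphen escaped so that it is a literal "-" (assumed intent).
ESCAPED = /[^a-zA-Z0-9 :;.?!\-()\[\]{}]/

# Invented line containing both a hyphen and a "#".
LINE = "Deadline: 10:00 PM EST - 15 February <x-msg://30/#top>"

p LINE.gsub(AS_WRITTEN, "")
# => "Deadline: 10:00 PM EST  15 February xmsg:30#top"
p LINE.gsub(ESCAPED, "")
# => "Deadline: 10:00 PM EST - 15 February x-msg:30top"
```

Switching to the escaped form would change the outputs shown in the rest of this answer, so treat it as an option to test rather than a drop-in replacement.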

Then (finally) split on the fields:

a1 = s5.split(/((?<= :)(?:\w+\s+)*\w+:\s+)/)
  # => ["11)  :", "Summary: ", "Working with Vars on Social Influence platform :",
  #     "Name: ", "Megumi Lindon  :",
  #     "Category: ", "Social Psychology :",
  #     "Email: ", "informationexample.com mailto:informationexample.com :",
  #     "Journal News: ", "Saving Grace  :",
  #     "Deadline: ", "10:00 PM EST  15 February :",
  #     "Query:  ", "Lorem ipsum ...laborum. :",
  #     "Requirements:  ", "Psychologists; anyone...psychology...Top xmsg:30#top...Psychology"]

Note that I've enclosed (?<= :)(?:\w+\s+)*\w+:\s+ in a capture group, so that String#split includes the bits it splits on in the resulting array.
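The effect of the capture group is easy to see on a tiny invented string (MARKED is an illustrative name): without the group the separators are discarded, with it they come back as elements of their own.

```ruby
# Invented string already in the " :Field: value" shape produced by the steps above.
MARKED = "11)  :Summary: Working with Vars :Name: Megumi Lindon"

# Without a capture group the matched field names are thrown away.
p MARKED.split(/(?<= :)(?:\w+\s+)*\w+:\s+/)
# => ["11)  :", "Working with Vars :", "Megumi Lindon"]

# With the capture group, split keeps each matched field name in the result.
p MARKED.split(/((?<= :)(?:\w+\s+)*\w+:\s+)/)
# => ["11)  :", "Summary: ", "Working with Vars :", "Name: ", "Megumi Lindon"]
```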

All that remains is some cleanup:

a2 = a1.map { |s| s.chomp(':') }
a2[0] = a2.shift + a2.first
  #=> "11)  Summary: "
a3 = a2.each_slice(2).to_a
  #=> [["11)  Summary: ", "Working with Vars on Social Influence platform "],
  #    ["Name: ", "Megumi Lindon  "],
  #    ["Category: ", "Social Psychology "],
  #    ["Email: ", "informationexample.com mailto:informationexample.com "],
  #    ["Journal News: ", "Saving Grace  "],
  #    ["Deadline: ", "10:00 PM EST  15 February "],
  #    ["Query:  ", "Lorem...est laborum. "],
  #    ["Requirements:  ", "Psychologists;...psychology. Please...xmsg:30#SocialPsychology"]] 

idx = a3.index { |n,_| n =~ /Email: / }
  #=> 3 
a3[idx][1] = a3[idx][1][/.*?\s/] if idx
  #=> "informationexample.com "
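The idioms used in this cleanup can be tried in isolation. In the sketch below (invented data, illustrative helper name pair_up), chomp(':') drops a single trailing colon, a.shift + a.first glues the leading "11)  " onto the first field name, each_slice(2) pairs names with values, and str[/.*?\s/] keeps everything up to and including the first whitespace character.

```ruby
# Invented pieces in the shape returned by the split step above.
PIECES = ["11)  :", "Summary: ", "Working with Vars :",
          "Email: ", "informationexample.com mailto:informationexample.com"]

# Illustrative helper mirroring the a2 and a3 steps above.
def pair_up(pieces)
  parts = pieces.map { |s| s.chomp(':') }   # remove the trailing ":" marker
  parts[0] = parts.shift + parts.first      # "11)  " + "Summary: "
  parts.each_slice(2).to_a                  # [[name, value], ...]
end

pairs = pair_up(PIECES)
p pairs
# => [["11)  Summary: ", "Working with Vars "],
#     ["Email: ", "informationexample.com mailto:informationexample.com"]]

# Non-greedy match up to the first whitespace character.
p pairs[1][1][/.*?\s/]
# => "informationexample.com "
```

Note that str[/.*?\s/] returns nil when the value contains no whitespace at all, so an email value without trailing text would need a fallback.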

Join the strings and remove the extra spaces:

a4 = a3.map { |b| b.join(' ').split.join(' ') }
  #=> ["11) Summary: Working with Vars on Social Influence platform",
  #    "Name: Megumi Lindon",
  #    "Category: Social Psychology",
  #    "Email: informationexample.com",
  #    "Journal News: Saving Grace",
  #    "Deadline: 10:00 PM EST 15 February",
  #    "Query: Lorem...laborum.",
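  #    "Requirements: Psychologists...psychology. Please...well. Thank...Psychology"]

"Requirements" is still a problem, but without additional rules nothing more can be done. We cannot restrict every field value to a single sentence, because "Query" can have more than one. If you want to limit "Requirements" to one sentence: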

idx = a4.index { |n,_| n =~ /Requirements: / }
  #=> 7
a4[idx] = a4[idx][/.*?[.!?]/] if idx
  # => "Requirements: Psychologists; anyone with good knowdledge with sociology and psychology."
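The sentence cut-off relies on a non-greedy match that stops at the first ".", "!" or "?". A minimal sketch with an illustrative helper name (first_sentence) and invented input:

```ruby
# Illustrative helper: keep the text up to and including the first ".", "!" or "?".
def first_sentence(str)
  str[/.*?[.!?]/]
end

p first_sentence("Requirements: Psychologists; anyone welcome. Please send your article.")
# => "Requirements: Psychologists; anyone welcome."

# With no sentence terminator there is no match, and String#[] returns nil.
p first_sentence("No terminator here")
# => nil
```

Two things to keep in mind: an abbreviation such as "e.g." would end the "sentence" early, and a value with no terminator at all would be replaced by nil.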

If you wish to combine these operations:

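def parse_it(text)
  a1 = text.gsub(/\n/, '')                            # drop newlines
           .gsub(/\d+\r/, '')                         # drop "20\r" and the like
           .gsub(/(?<=\w):=/, ": ")                   # "Field:=" becomes "Field: "
           .gsub(/>\s+(?=(?:\w+\s+)*\w+: )/, " :")    # mark the start of each field with " :"
           .gsub(/[^a-zA-Z0-9 :;.?!-()\[\]{}]/, "")   # strip the remaining unwanted characters
           .split(/((?<= :)(?:\w+\s+)*\w+:\s+)/)      # split on field names, keeping them
           .map { |s| s.chomp(':') }                  # drop the trailing ":" marker

  # Merge the leading "11)  " into the first field name.
  a1[0] = a1.shift + a1.first

  # Pair each field name with its value; keep only the first token of the email value.
  a2 = a1.each_slice(2).to_a
  idx = a2.index { |n,_| n =~ /Email: / }
  a2[idx][1] = a2[idx][1][/.*?\s/] if idx

  # Join each pair and collapse runs of whitespace.
  a3 = a2.map { |b| b.join(' ').split.join(' ') }
  # Limit "Requirements" to its first sentence.
  idx = a3.index { |n,_| n =~ /Requirements: / }
  a3[idx] = a3[idx][/.*?[.!?]/] if idx

  a3
end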
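If you would rather look fields up by name, the array of "Name: value" strings can be turned into a Hash. This is an optional extra rather than part of the approach above: to_fields is a hypothetical helper, and LINES is a hand-written array in the shape shown earlier, not real output.

```ruby
# Hand-written lines in the shape parse_it returns (values shortened).
LINES = [
  "11) Summary: Working with Vars on Social Influence platform",
  "Name: Megumi Lindon",
  "Deadline: 10:00 PM EST 15 February"
]

# Hypothetical helper: split each line at the first ": " only, so that the
# rest of the value is kept whole even if it contains another ": ".
def to_fields(lines)
  lines.map { |line| line.split(': ', 2) }.to_h
end

fields = to_fields(LINES)
p fields["Name"]
# => "Megumi Lindon"
p fields["Deadline"]
# => "10:00 PM EST 15 February"
```

The first key comes out as "11) Summary" because the item number is still attached to it; strip that prefix first if you need a plain "Summary" key.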
