Parse a call transcript into an array of hashes - Ruby

I'm parsing a call transcript. The transcript's content comes back as a string in the following format:

"Operator: Hi, please welcome Bob Smith to the call. Bob Smith: Hello there, thank you for inviting me...Now I will turn the call over to Stacy. Stacy White: Thanks Bob. As he was saying...."

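There are no line breaks where a new speaker starts talking.

I would like to turn the string above into an array of hashes, something like this: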

[ { speaker: "Operator",
    content: "Hi, please welcome Bob Smith to the call" },
  { speaker: "Bob Smith",
    content: "Hello there, thank you for inviting me...Now I will turn the call over to Stacy." }, 
  { speaker: "Stacy White",
    content: "Thanks Bob. As he was saying...." }
]

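I assume I need some kind of regex to parse this, but after reading up on regexes all morning I still can't work out how. Any help would be appreciated.

Thanks

Update:

For anyone else who might find this useful, here is what I ended up with, using the solution suggested below: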

def display_transcript
  transcript_pretty = []
  transcript = self.content
  transcript_split = transcript.split(/\W*([A-Z]\w*\W*\w+):\W*/)[1..-1]
  transcript_split_2d = transcript_split.each_slice(2).to_a
  transcript_split_2d.each do |row|
    blurb = { speaker: row[0], content: row[1]}
    transcript_pretty << blurb
  end

  return transcript_pretty
end
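The method above can probably be condensed. The following is only a sketch: `parse_transcript` and `SPEAKER_SPLIT` are hypothetical names, and the transcript is passed in as an argument instead of being read from `self.content`. Using `drop(1)` rather than `[1..-1]` also returns an empty array for an empty transcript, where `[1..-1]` would return nil.

```ruby
# Hypothetical standalone variant: the original is an instance method that
# reads self.content; here the transcript is passed in as an argument.
SPEAKER_SPLIT = /\W*([A-Z]\w*\W*\w+):\W*/ # same pattern as in the method above

def parse_transcript(transcript)
  transcript
    .split(SPEAKER_SPLIT) # ["", speaker, content, speaker, content, ...]
    .drop(1)              # discard the leading empty string
    .each_slice(2)        # pair each speaker with the content that follows
    .map { |speaker, content| { speaker: speaker, content: content } }
end

sample = "Operator: Hi, please welcome Bob Smith to the call. " \
         "Bob Smith: Hello there. Stacy White: Thanks Bob."
parse_transcript(sample).each do |turn|
  puts "#{turn[:speaker]} said: #{turn[:content]}"
end
```

As with the original pattern, the period that ends a turn is consumed by the split when another speaker follows, so only the last turn keeps its closing punctuation.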

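I can give you an expression you can use to break up the string. You can take it from there yourself; I'm sure you wouldn't want me to take away the fun of reaching the goal, would you? :>)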

string = "Operator: Hi, please welcome Bob Smith to the call. Bob Smith: Hello there, thank you for inviting me...Now I will turn the call over to Stacy. Stacy White: Thanks Bob. As he was saying...."
split_up = string.split(/\W*(\w*\W*\w+):\W*/)[1..-1]
Hash[*split_up]
# {"Operator"=>"Hi, please welcome Bob Smith to the call", "Bob Smith"=>"Hello there, thank you for inviting me...Now I will turn the call over to Stacy", "Stacy White"=>"Thanks Bob. As he was saying...."}

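Some explanation: the regular expression looks for one or two words (\w*\W*\w+), optionally preceded by non-word characters such as a period and a space (\W*), and followed by a colon and any non-word characters after it (:\W*). The expression is used to split the string into an array; because the pattern contains a capture group, the speaker names are kept in the result. Since the string starts with a speaker label, the result begins with an empty string, which [1..-1] removes. The array is then converted to a hash: the first element is a key, the second its value, and so on until the array ends.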
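One thing worth checking before settling on a hash: keys are unique, so if the same speaker talks more than once, only that speaker's last turn survives. The sketch below uses a made-up transcript and hypothetical helper names (`turns_as_hash`, `turns_as_array`) to compare the hash with the array of hashes asked for in the question.

```ruby
SPLIT_PATTERN = /\W*(\w*\W*\w+):\W*/ # the pattern from the answer above

# Hypothetical helper names, used only for this illustration.
def turns_as_hash(string)
  Hash[*string.split(SPLIT_PATTERN).drop(1)]
end

def turns_as_array(string)
  string.split(SPLIT_PATTERN).drop(1).each_slice(2).map do |speaker, content|
    { speaker: speaker, content: content }
  end
end

# Made-up transcript in which Bob Smith speaks twice.
repeat = "Bob Smith: Over to you, Stacy. Stacy White: Thanks Bob. " \
         "Bob Smith: You are welcome."

p turns_as_hash(repeat).size  # 2: Bob's first turn is overwritten by his second
p turns_as_array(repeat).size # 3: every turn is kept, in order
```

For a real call transcript, where speakers alternate, the array of hashes is therefore likely the safer shape.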

R = /(\S[^:]*):\s*([^:]*[.?!])/

def str_to_hash(str)
  # $1 is the speaker and $2 the content captured by the current match.
  str.gsub(R).with_object({}) { |_, h| h[$1] = $2 }
end
str = "Operator: Hi, please welcome Bob Smith to the call. Bob Smith: Hello there, thank you for inviting me...Now I will turn the call over to Stacy. Stacy White: Thanks Bob. As he was saying...."
str_to_hash(str)
  #=> {"Operator"=>"Hi, please welcome Bob Smith to the call.",
  #    "Bob Smith"=>"Hello there, thank you for inviting me...Now I will turn the call over to Stacy.",
  #    "Stacy White"=>"Thanks Bob. As he was saying...."}
str = "Operator: Bob Smith, what's the value?   Bob Smith: $10,000 or so. Stacy? Stacy White: Thanks Bob. I agree...."
str_to_hash(str)
  #=> {"Operator"=>"Bob Smith, what's the value?",
  #    "Bob Smith"=>"$10,000 or so. Stacy?",
  #    "Stacy White"=>"Thanks Bob. I agree...."}

You can see the regular expression in action here.

This uses the form of String#gsub that takes one argument (here a regular expression) and no block, returning an enumerator that is chained to Enumerator#with_object (see note 1).
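To see what that enumerator yields, here is a small sketch on a shortened, made-up string. `PATTERN` is the same regular expression under a different name, and `speaker_pairs` is a hypothetical helper that swaps in String#scan, which returns the two capture groups directly, as an alternative to the gsub enumerator.

```ruby
PATTERN = /(\S[^:]*):\s*([^:]*[.?!])/ # same regular expression as R above

# Hypothetical helper: returns [speaker, content] pairs via String#scan,
# swapped in here as an alternative to the gsub enumerator.
def speaker_pairs(str)
  str.scan(PATTERN)
end

short = "Operator: Hello. Bob Smith: Hi there!"

enum = short.gsub(PATTERN)
p enum.class # Enumerator
p enum.to_a  # ["Operator: Hello.", "Bob Smith: Hi there!"]

p speaker_pairs(short) # [["Operator", "Hello."], ["Bob Smith", "Hi there!"]]

# The pairs map straight onto the array of hashes asked for in the question.
p speaker_pairs(short).map { |speaker, content| { speaker: speaker, content: content } }
```

Whether scan or the gsub enumerator reads better is a matter of taste; both walk the string match by match.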

We can write the regular expression in free-spacing mode to make it self-documenting.

R = /
    (           # begin capture group 1
      \S        # match a character other than a whitespace
      [^:]*     # match 0+ characters other than a colon
    )           # end capture group 1
    :           # match a colon
    \s*         # match 0+ whitespaces
    (           # begin capture group 2
      [^:]*     # match 0+ characters other than a colon
      [.?!]     # match a period, question mark or exclamation mark
    )           # end capture group 2
    /x          # free-spacing regex definition mode 

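Because [^:]* is greedy, [.?!] matches the last period, question mark, or exclamation mark that precedes the next colon or the end of the string.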
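The effect of greediness can be checked on a made-up line with several sentences in one turn. In this sketch `GREEDY` is the regular expression from the answer, while `LAZY` is a hypothetical variant (with `*?`) included only for contrast.

```ruby
GREEDY = /(\S[^:]*):\s*([^:]*[.?!])/  # the regular expression from the answer
LAZY   = /(\S[^:]*):\s*([^:]*?[.?!])/ # hypothetical variant, for contrast only

line = "Bob Smith: One. Two? Three! Stacy White: Go."

# str[regexp, n] returns capture group n of the first match.
p line[GREEDY, 2] # "One. Two? Three!"
p line[LAZY, 2]   # "One."
```

The greedy version first runs up to the next colon and then backtracks to the nearest sentence terminator, which is why the whole turn is captured and the next speaker's name is left for the following match.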

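1 Note that this form of String#gsub has nothing to do with character replacement. It merely returns the matches, each held by the first block variable, _. That block variable is an underscore to signal that it is not used in the block.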