Parse a call transcript into an array of hashes - Ruby
I am parsing a call transcript. The transcript content comes back as a string in the following format:
"Operator: Hi, please welcome Bob Smith to the call. Bob Smith: Hello there, thank you for inviting me...Now I will turn the call over to Stacy. Stacy White: Thanks Bob. As he was saying...."
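There is no line break where a new speaker starts talking.
I would like to turn the string above into an array of hashes, something like the following: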
[ { speaker: "Operator",
content: "Hi, please welcome Bob Smith to the call" },
{ speaker: "Bob Smith",
content: "Hello there, thank you for inviting me...Now I will turn the call over to Stacy." },
{ speaker: "Stacy White",
content: "Thanks Bob. As he was saying...." }
]
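I think I need some kind of regular expression to parse this, but after reading up on them this morning I still have no idea how. Any help would be appreciated.
Thanks
Update:
For anyone else who may find this useful, here is what I ended up with, using the solution suggested below: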
def display_transcript
  transcript_pretty = []
  transcript = self.content
  transcript_split = transcript.split(/\W*([A-Z]\w*\W*\w+):\W*/)[1..-1]
  transcript_split_2d = transcript_split.each_slice(2).to_a
  transcript_split_2d.each do |row|
    blurb = { speaker: row[0], content: row[1] }
    transcript_pretty << blurb
  end
  return transcript_pretty
end
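The method above can be condensed by pairing speakers and content with each_slice and map. This is only a minimal sketch: it assumes the transcript is passed in as an argument instead of being read from self.content, and the names SPEAKER_SPLIT and parse_transcript are hypothetical, not part of the original code.

```ruby
# Hypothetical constant: the same regex as in the method above.
SPEAKER_SPLIT = /\W*([A-Z]\w*\W*\w+):\W*/

# Hypothetical helper: returns an array of { speaker:, content: } hashes.
def parse_transcript(transcript)
  # split keeps the captured speaker names in the result; drop the leading
  # empty string (assumes the transcript starts with a speaker label).
  parts = transcript.split(SPEAKER_SPLIT).drop(1)
  # Pair each speaker with the text that follows it.
  parts.each_slice(2).map { |speaker, content| { speaker: speaker, content: content } }
end

p parse_transcript("Operator: Hi. Bob Smith: Hello there.")
```

Like the original, this relies on every speaker label starting with a capital letter and being one or two words long.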
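I can give you an expression that you can use to break up the string.
From there you can take it further yourself; I'm sure you wouldn't want me to take away the fun of reaching your goal? :>)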
string = "Operator: Hi, please welcome Bob Smith to the call. Bob Smith: Hello there, thank you for inviting me...Now I will turn the call over to Stacy. Stacy White: Thanks Bob. As he was saying...."
split_up = string.split(/\W*(\w*\W*\w+):\W*/)[1..-1]
Hash[*split_up]
# {"Operator"=>"Hi, please welcome Bob Smith to the call", "Bob Smith"=>"Hello there, thank you for inviting me...Now I will turn the call over to Stacy", "Stacy White"=>"Thanks Bob. As he was saying...."}
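Some explanation: the regular expression looks for one or two words, (\w*\W*\w+), optionally preceded by a period and a space, \W*, and followed by a colon and any trailing space, :\W*.
The expression is used to split the string into an array. Because the speaker name sits in a capture group, split keeps it in the result.
The string begins with a speaker label, so the result starts with an empty string; you can get rid of it with [1..-1].
Next the array is converted to a hash: the first element is a key, the second is its value, and so on to the end of the array.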
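To make the steps described above concrete, here is a small sketch that prints each intermediate value. The shortened input string and the variable names pieces and pairs are made up for illustration.

```ruby
string = "Operator: Hi. Bob Smith: Hello there."

# Splitting on a regex with a capture group keeps the captured text.
pieces = string.split(/\W*(\w*\W*\w+):\W*/)
p pieces  # ["", "Operator", "Hi", "Bob Smith", "Hello there."]

# Drop the leading empty string produced by the match at position 0.
pairs = pieces[1..-1]

# Hash[*pairs] treats the flat array as key, value, key, value, ...
by_speaker = Hash[*pairs]
p by_speaker

# each_slice(2).to_h builds the same hash without splatting.
p pairs.each_slice(2).to_h == by_speaker  # true
```

One consequence of using a hash keyed by speaker: if the same speaker talks more than once, later entries overwrite earlier ones, which may be a reason to prefer an array of hashes.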
R = /(\S[^:]*):\s*([^:]*[.?!])/
def str_to_hash(str)
  # $1 is the speaker (capture group 1), $2 the content (capture group 2)
  str.gsub(R).with_object({}) { |_,h| h[$1] = $2 }
end
str = "Operator: Hi, please welcome Bob Smith to the call. Bob Smith: Hello there, thank you for inviting me...Now I will turn the call over to Stacy. Stacy White: Thanks Bob. As he was saying...."
str_to_hash(str)
  #=> {"Operator"=>"Hi, please welcome Bob Smith to the call.",
  #    "Bob Smith"=>"Hello there, thank you for inviting me...Now I will turn the call over to Stacy.",
  #    "Stacy White"=>"Thanks Bob. As he was saying...."}
str = "Operator: Bob Smith, what's the value? Bob Smith: 0,000 or so. Stacy? Stacy White: Thanks Bob. I agree...."
str_to_hash(str)
#=> {"Operator"=>"Bob Smith, what's the value?",
# "Bob Smith"=>"0,000 or so. Stacy?",
# "Stacy White"=>"Thanks Bob. I agree...."}
You can see the regular expression in action here.
This uses the form of String#gsub that takes one argument (here a regular expression) and no block, returning an enumerator that is chained to Enumerator#with_object. [1]
We can write the regular expression in free-spacing mode to make it self-documenting.
R = /
    (        # begin capture group 1
      \S     # match a character other than a whitespace
      [^:]*  # match 0+ characters other than a colon
    )        # end capture group 1
    :        # match a colon
    \s*      # match 0+ whitespaces
    (        # begin capture group 2
      [^:]*  # match 0+ characters other than a colon
      [.?!]  # match a period, question mark or exclamation mark
    )        # end capture group 2
    /x       # free-spacing regex definition mode
Because [^:]* is greedy, [.?!] matches the last period, question mark or exclamation mark that precedes the next colon or the end of the string.
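[1] Note that this form of String#gsub has nothing to do with character replacement. It merely returns matches, which are held by the first block variable, _. That block variable is written as an underscore to signal that it is not used in the block.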
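The question asked for an array of hashes rather than a single hash. One possible way to get that from the same regular expression is String#scan, which returns the two capture groups of every match as a pair. This is a sketch, not part of the answer above; the names TURN and str_to_array are hypothetical.

```ruby
# Hypothetical constant: the same pattern as R above.
TURN = /(\S[^:]*):\s*([^:]*[.?!])/

# Hypothetical helper: one { speaker:, content: } hash per match.
def str_to_array(str)
  # scan yields [group 1, group 2] for each match of the pattern.
  str.scan(TURN).map { |speaker, content| { speaker: speaker, content: content } }
end

p str_to_array("Operator: Hi. Bob Smith: Hello there.")
```

Unlike the hash version, this keeps every turn even when the same speaker appears more than once.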