Parse HTML string with Nokogiri
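I'm trying to write a Ruby script that parses an HTML string and gets some values from specific nodes.
Right now I'm stuck on simply reading the string into a Nokogiri document.
This code: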
#!/usr/bin/ruby
html_doc = Nokogiri::HTML("<html> <meta content="text/html; charset=UTF-8"/> <body style='margin:20px'> <p>The following user has registered a device, click on the link below to review the user and make any changes if necessary.</p> <ul style='list-style-type:none; margin:25px 15px;'> <li><b>User name:</b> Test User</li> <li><b>User email:</b> test@abc.com</li> <li><b>Identifier:</b> abc123def132afd1213afas</li> <li><b>Description:</b> Tom's iPad</li> <li><b>Model:</b> iPad 3</li> <li><b>Platform:</b> </li> <li><b>App:</b> Test app name</li> <li><b>UserID:</b> </li> </ul> <p>Review user: https://cirrus.app47.com/users?search=test@abc.com</p> <hr style='height=2px; color:#aaa'/> <p>We hope you enjoy the app store experience!</p> <p style='font-size:18px; color:#999'>Powered by App47</p> <img src='https://cirrus.app47.com/notifications/562506219ac25b1033000904/img' alt=''/></body></html>")
produces this error:
$ ruby emailParser.rb
emailParser.rb:3: syntax error, unexpected tIDENTIFIER, expecting ')'
...ML("<html> <meta content="text/html; charset=UTF-8"/> <bod...
... ^
emailParser.rb:3: syntax error, unexpected tSTRING_BEG, expecting end-of-input
...tent="text/html; charset=UTF-8"/> <body style='margin:20px'...
... ^
Note that I have already tried the solution here, with the same result:
"syntax error, unexpected tIDENTIFIER, expecting $end"
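You have to change the quotes around the HTML string from " to ' and change the quotes inside the HTML to ". Any apostrophe left in the text (as in Tom's) must then be escaped as \'. Something like this should work: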
#!/usr/bin/ruby
require 'nokogiri'
html_doc = Nokogiri::HTML('<html> <meta content="text/html; charset=UTF-8"/> <body style="margin:20px"> <p>The following user has registered a device, click on the link below to review the user and make any changes if necessary.</p> <ul style="list-style-type:none; margin:25px 15px;"> <li><b>User name:</b> Test User</li> <li><b>User email:</b> test@abc.com</li> <li><b>Identifier:</b> abc123def132afd1213afas</li> <li><b>Description:</b> Tom\'s iPad</li> <li><b>Model:</b> iPad 3</li> <li><b>Platform:</b> </li> <li><b>App:</b> Test app name</li> <li><b>UserID:</b> </li> </ul> <p>Review user: https://cirrus.app47.com/users?search=test@abc.com</p> <hr style="height=2px; color:#aaa"/> <p>We hope you enjoy the app store experience!</p> <p style="font-size:18px; color:#999">Powered by App47</p> <img src="https://cirrus.app47.com/notifications/562506219ac25b1033000904/img" alt=""/></body></html>')
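The single-quoted version above only works because the apostrophe in Tom's is written as \'. As a small sketch (the variable names here are illustrative, not from the original script), this shows what a single-quoted Ruby string does and does not treat as an escape:

```ruby
# Inside single quotes, only \' and \\ are escape sequences.
# Double quotes need no escaping at all.
description = 'Tom\'s "iPad"'
puts description            # Tom's "iPad"

# Other backslash sequences such as \n are kept literally:
# this string is the four characters a, \, n, b.
not_a_newline = 'a\nb'
puts not_a_newline.length   # 4
```

So the approach works as long as every apostrophe in the HTML text is found and escaped, which is easy to miss in a long string.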
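The problem is that your string contains double quotes, which confuses the parser because you are also using double quotes to delimit the string. To illustrate: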
puts "foo"bar"
# => SyntaxError: unexpected tIDENTIFIER, expecting end-of-input
# puts "foo"bar"
# ^
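You probably meant to print foo"bar, but when the parser reaches the second " (the one after foo) it thinks the string has ended, so whatever comes after it causes a syntax error. (Stack Overflow's syntax highlighting even gives you a hint: see how "foo" on the first line is a different color from bar"? A good syntax-highlighting text editor will do the same.)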
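If you want to confirm this without crashing a script, one option is to ask Ruby to compile a snippet and catch the SyntaxError. The sketch below assumes MRI (the standard Ruby interpreter), since RubyVM::InstructionSequence is specific to it; parses? is a hypothetical helper name, not a library method.

```ruby
# Returns true when Ruby can parse the given source text, false otherwise.
# The source is only compiled, never executed.
def parses?(source)
  RubyVM::InstructionSequence.compile(source)
  true
rescue SyntaxError
  false
end

# The snippets are held in single-quoted strings so this file itself parses.
broken  = 'puts "foo"bar"'   # the second " closes the string too early
control = 'puts "foo"'       # a well-formed double-quoted string

puts parses?(broken)   # false
puts parses?(control)  # true
```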
One solution is to use single quotes instead:
puts 'bar"baz'
# => bar"baz
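That fixes the problem in this case, but it doesn't really help you, because your string has single quotes in it too!
Another solution is to escape your quotes by putting a \ in front of them, like this: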
puts "foo\"bar"
# => foo"bar
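...but that is a bit tedious (and sometimes tricky) for a long string like yours. A better solution is to use a special kind of string called a "heredoc" (short for "here document," for what it's worth):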
str = <<-END_OF_HTML
<html> <meta content="text/html; charset=UTF-8"/> <body style='margin:20px'> <p>The following user has registered a device, click on the link below to review the user and make any changes if necessary.</p> <ul style='list-style-type:none; margin:25px 15px;'> <li><b>User name:</b> Test User</li> <li><b>User email:</b> test@abc.com</li> <li><b>Identifier:</b> abc123def132afd1213afas</li> <li><b>Description:</b> Tom's iPad</li> <li><b>Model:</b> iPad 3</li> <li><b>Platform:</b> </li> <li><b>App:</b> Test app name</li> <li><b>UserID:</b> </li> </ul> <p>Review user: https://cirrus.app47.com/users?search=test@abc.com</p> <hr style='height=2px; color:#aaa'/> <p>We hope you enjoy the app store experience!</p> <p style='font-size:18px; color:#999'>Powered by App47</p> <img src='https://cirrus.app47.com/notifications/562506219ac25b1033000904/img' alt=''/></body></html>
END_OF_HTML
html_doc = Nokogiri::HTML(str)
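The delimiter END_OF_HTML is arbitrary. You could use EOF or XYZZY or whatever suits your fancy, although it's a good idea to use something meaningful. (You'll notice that Stack Overflow's syntax highlighting is a bit wonky for heredocs; most code editors handle them fine, though.)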
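One thing to keep in mind: with a bare delimiter, the heredoc body behaves like a double-quoted string, so #{...} is interpolated and backslash escapes are processed. Putting the delimiter in single quotes turns both off. A short sketch, using a made-up one-line fragment and a hypothetical name variable rather than the full email HTML:

```ruby
name = "Test User"

# Bare delimiter: acts like a double-quoted string.
# Interpolation is active, and neither ' nor " needs escaping.
interpolated = <<-END_OF_HTML
<li class="user" title='who'>#{name}</li>
END_OF_HTML

# Single-quoted delimiter: the body is taken literally.
literal = <<-'END_OF_HTML'
<li class="user" title='who'>#{name}</li>
END_OF_HTML

puts interpolated  # <li class="user" title='who'>Test User</li>
puts literal       # <li class="user" title='who'>#{name}</li>
```

The HTML in the question contains # only in color values such as #aaa, which is not interpolation syntax, so either form should leave it unchanged. The quoted form is the safer choice if the text might ever contain #{ or backslashes.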
You can make it a little more compact like this:
Nokogiri::HTML <<-END_OF_HTML
<html> <meta content="text/html; charset=UTF-8"/> <body style='margin:20px'> <p>The following user has registered a device, click on the link below to review the user and make any changes if necessary.</p> <ul style='list-style-type:none; margin:25px 15px;'> <li><b>User name:</b> Test User</li> <li><b>User email:</b> test@abc.com</li> <li><b>Identifier:</b> abc123def132afd1213afas</li> <li><b>Description:</b> Tom's iPad</li> <li><b>Model:</b> iPad 3</li> <li><b>Platform:</b> </li> <li><b>App:</b> Test app name</li> <li><b>UserID:</b> </li> </ul> <p>Review user: https://cirrus.app47.com/users?search=test@abc.com</p> <hr style='height=2px; color:#aaa'/> <p>We hope you enjoy the app store experience!</p> <p style='font-size:18px; color:#999'>Powered by App47</p> <img src='https://cirrus.app47.com/notifications/562506219ac25b1033000904/img' alt=''/></body></html>
END_OF_HTML
Or with parentheses (it looks a little odd, but it works and is sometimes necessary):
Nokogiri::HTML(<<-END_OF_HTML)
<html> <meta content="text/html; charset=UTF-8"/> <body style='margin:20px'> <p>The following user has registered a device, click on the link below to review the user and make any changes if necessary.</p> <ul style='list-style-type:none; margin:25px 15px;'> <li><b>User name:</b> Test User</li> <li><b>User email:</b> test@abc.com</li> <li><b>Identifier:</b> abc123def132afd1213afas</li> <li><b>Description:</b> Tom's iPad</li> <li><b>Model:</b> iPad 3</li> <li><b>Platform:</b> </li> <li><b>App:</b> Test app name</li> <li><b>UserID:</b> </li> </ul> <p>Review user: https://cirrus.app47.com/users?search=test@abc.com</p> <hr style='height=2px; color:#aaa'/> <p>We hope you enjoy the app store experience!</p> <p style='font-size:18px; color:#999'>Powered by App47</p> <img src='https://cirrus.app47.com/notifications/562506219ac25b1033000904/img' alt=''/></body></html>
END_OF_HTML
You can read more about heredocs and other ways to represent strings in the Literals section of the Ruby documentation.
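One of those other ways is the percent literal, which lets you pick a delimiter that does not occur in the text at all. A minimal sketch, with short made-up fragments and a hypothetical device variable standing in for the real HTML:

```ruby
device = "iPad 3"

# %q behaves like a single-quoted string, %Q like a double-quoted one.
# The | delimiter is chosen because it does not appear in the HTML.
plain      = %q|<li title="Description">Tom's iPad</li>|
with_value = %Q|<li title="Model" class='m'>#{device}</li>|

puts plain       # <li title="Description">Tom's iPad</li>
puts with_value  # <li title="Model" class='m'>iPad 3</li>
```

For a string as long as the one in the question a heredoc is probably still easier to read, but percent literals are handy for short one-line fragments that mix both kinds of quotes.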