Get href value inside HTML using Groovy

I have a JSON response that contains an HTML page. I want to get the href value from that HTML page.

JSON response

    {
       "@odata.context": "https://graph.microsoft.com/v1.0/$metadata#users('fbd22ce4-XXXX-4d87-XXXX-6c74983b96fa')/messages(body)",
       "value": [   {
          "@odata.etag": "W/\"CQAAABYAAADuJXXX2LXBOZirXXXAAId0Uh\"",
          "id": "AAMkADk0ZGFihiMTIyZmJlYQBGAAAAAACOeACKvLOwQqTkIvTYg8kAAAAAAEMAA8kAAAIebouAAA=",
          "body":       {
             "contentType": "html",
             "content": "<html><head>\r\n<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\"><\/head><body><link href=\"https://fonts.googleapis.com/css2?family=Source+Sans+Pro:wght@400;600&amp;display=swap\"><table border=\"0\" cellspacing=\"0\" cellpadding=\"0\" style=\"width:100%; border-collapse:collapse; padding:0; margin:0\"><tbody><tr><td><div align=\"center\"><table border=\"0\" cellspacing=\"0\" cellpadding=\"0\" style=\"background-color:white; width:100%; border-spacing:0; border-collapse:collapse; max-width:600pt\"><tbody><tr><td style=\"padding:30pt 33pt\"><span style=\"display:none\">Hello<\/span> <div style=\"border-bottom:1pt solid rgb(231,231,231); vertical-align:middle; padding-bottom:21pt\"><img src=\"https://abc.xyz.com/_images/roomfinder_icon64.png\" alt=\"logo\" style=\"width:28pt; height:30pt; display:inline-block; vertical-align:middle\"> <span style=\"vertical-align:middle; display:inline-block; font-family:'Source Sans Pro',sans-serif; font-weight:600; color:rgb(87,107,118); font-size:22pt; line-height:30pt; margin-left:6pt\">Demo<\/span> <\/div><div style=\"font-family:'Source Sans Pro',sans-serif; font-weight:600; font-size:32pt; line-height:36pt; color:rgb(39,39,39); margin-top:26pt; margin-bottom:8pt\">Hello<\/div><div style=\"font-family:'Source Sans Pro',sans-serif; font-weight:400; font-size:16pt; line-height:22pt; color:rgb(87,107,118); margin-bottom:16pt\">Log in to Condeco by pressing the button below on your mobile device.<\/div><div style=\"margin-bottom:15pt\"><div style=\"font-family:'Source Sans Pro',sans-serif; font-weight:400; font-size:18pt; line-height:36pt; letter-spacing:-0.5pt\"><a href=\"https://abc.xyz123.com?key=GBsG3gBoI4YV+fSfejXCbw6vgG6m4OCU7Czfn3PAKXtxVI9Ex\" style=\"font-family:'Source Sans Pro',sans-serif; font-weight:400; font-size:18pt; line-height:36pt; letter-spacing:-0.5pt; background-color:rgb(0,183,241); border-radius:6pt; color:rgb(255,255,255); display:inline-block; 
text-align:center; text-decoration:none; width:97pt\">Log me in<\/a> <\/div><\/div><div style=\"font-family:'Source Sans Pro',sans-serif; font-weight:400; font-size:14pt; line-height:22pt; color:rgb(87,107,118); margin-bottom:45pt\">This link will expire in 15 minutes.<\/div><div style=\"font-family:'Source Sans Pro',sans-serif; font-weight:600; font-size:20pt; line-height:22pt; color:rgb(39,39,39); margin-bottom:8pt\">On your desktop?<\/div><div style=\"font-family:'Source Sans Pro',sans-serif; font-weight:400; font-size:16pt; line-height:22pt; color:rgb(87,107,118); margin-top:6pt; margin-bottom:8pt\">You can also log in by scanning the QR code below in the app.<\/div><div style=\"margin-bottom:8pt\"><img alt=\"QR Code\" height=\"136\" width=\"136\" src=\"data:image/png;base64,iVBORw0KGzqP1BODMzMzMz86j9QTgzMzMzM/Oo/UE4MzMzMzPzqP1BODMzMzMz86j9QTgzMzMzM/Oo/UE4MzMzMzPzqP1BODMzMzMz86uDcGZmZmZm5lH7g3BmZmZmZuZR+4NwZmZmZmbmUfuDcGZmZmZm5lH7g3BmZmZmZuZR+4NwZmZmZmbmUfuDcGZmZmZm5kn/93//D/OYHJISst1mAAAAAElFTkSuQmCC\" style=\"width:136pt; height:136pt\"><\/div><div style=\"font-family:'Source Sans Pro',sans-serif; font-weight:400; font-size:14pt; line-height:22pt; color:rgb(87,107,118); margin-bottom:46pt\">This QR code will expire in 15 minutes.<\/div><div style=\"font-family:'Source Sans Pro',sans-serif; font-weight:400; font-size:14pt; line-height:22pt; color:rgb(87,107,118); padding-top:15pt; border-top:1pt solid rgb(231,231,231); margin-bottom:15pt\">This email is sent from an unmonitored account - do not reply.<\/div><div><img src=\"https://abc.xyz.com/_images/login/logo-color.png\" alt=\"logo\" style=\"width:113pt; height:31pt\"><\/div><\/td><\/tr><\/tbody><\/table><\/div><\/td><\/tr><\/tbody><\/table><\/body><\/html>"
          }
       }]
    }

Groovy

    import groovy.json.JsonSlurper
    import org.cyberneko.html.parsers.SAXParser

    def ResponseMessage = messageExchange.response.responseContent
    def object = new JsonSlurper().parseText(ResponseMessage)
    def html = object.value[0].body.content
    log.info "HTML 1 : " + html // the HTML page is logged correctly here
    def content = new XmlSlurper( new SAXParser() ).parse( html ) // this line throws an error

Expected

I want the key value from the href

key: GBsG3gBoI4YV+fSfejXCbw6vgG6m4OCU7Czfn3PAKXtxVI9Ex
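A likely cause of the error on the last line of the script: `XmlSlurper.parse(String)` treats its argument as a URI to open, not as markup, so passing the HTML itself fails. `parseText(String)` parses the string content. The following is only a sketch under assumptions: the `html` value is a shortened, hypothetical stand-in for `object.value[0].body.content`, the NekoHTML version in `@Grab` is an assumption, and the element name is compared case-insensitively because NekoHTML may change the case of tag names.

```groovy
// Outside SoapUI the parser has to be fetched; inside SoapUI, drop this @Grab line
// (assumption: NekoHTML is already on the classpath, as the import in the question implies).
@Grab('net.sourceforge.nekohtml:nekohtml:1.9.22')
import org.cyberneko.html.parsers.SAXParser

// Hypothetical, shortened stand-in for object.value[0].body.content.
def html = '<html><body><a href="https://abc.xyz123.com?key=GBsG3gBoI4YV+fSfejXCbw6vgG6m4OCU7Czfn3PAKXtxVI9Ex">Log me in</a></body></html>'

// parseText() reads the markup itself; parse(String) would treat it as a URI to open.
// On Groovy 4 the class lives in groovy.xml, so it needs: import groovy.xml.XmlSlurper
def page = new XmlSlurper(new SAXParser()).parseText(html)

// Walk the whole tree and collect the href attribute of every anchor element.
def hrefs = page.'**'.findAll { it.name().equalsIgnoreCase('a') }.collect { it.@href.text() }

// Take whatever follows "key=" up to the next "&" (no URL-decoding, so "+" is kept).
def key = hrefs[0].split(/[?&]key=/, 2)[1].tokenize('&')[0]
```

In the SoapUI script this would amount to changing `.parse( html )` to `.parseText( html )` and then reading the anchors from `content`.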

I recently used similar regex-based code for a simple web scraper:

    def content = "<lotsoftags..><a href=\"https://abc.xyz123.com?key=GBsG3gBoI4YV+fSfejXCbw6vgG6m4OCU7Czfn3PAKXtxVI9Ex\" style=\"font-family:'Source Sans Pro',sans-serif; font-weight:400; font-size:18pt;\"<lotsoftags..>"

    def keys = ( content =~ /<a href="[^"]+[?&]?key=([\w+]+)&?[^"]*"/ ).findAll()*.last()

    assert keys[ 0 ] == 'GBsG3gBoI4YV+fSfejXCbw6vgG6m4OCU7Czfn3PAKXtxVI9Ex'
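Note that `[\w+]+` only accepts letters, digits, `_` and `+`, so a key containing `/` or `=` would be cut short. A slightly more general sketch, starting from the JSON text rather than a hand-written string, first pulls the whole href and then splits the query string. The `responseText` value is a trimmed-down, hypothetical stand-in for the real response, and `queryParam` is an illustrative helper name, not part of any library.

```groovy
import groovy.json.JsonSlurper

// Hypothetical, trimmed-down stand-in for the response shown in the question.
def responseText = '''{
  "value": [
    { "body": { "contentType": "html",
                "content": "<html><body><a href=\\"https://abc.xyz123.com?key=GBsG3gBoI4YV+fSfejXCbw6vgG6m4OCU7Czfn3PAKXtxVI9Ex\\" style=\\"width:97pt\\">Log me in</a></body></html>" } }
  ]
}'''

def html = new JsonSlurper().parseText(responseText).value[0].body.content

// Collect the href of every <a> tag, wherever the attribute sits inside the tag.
def hrefs = html.findAll(/<a\b[^>]*?\bhref="([^"]*)"/) { full, href -> href }

// Illustrative helper: return the raw (not URL-decoded) value of one query parameter.
def queryParam = { String url, String name ->
    def query = url.contains('?') ? url.substring(url.indexOf('?') + 1) : ''
    def pairs = query.split('&').findAll { it.contains('=') }
    def params = pairs.collectEntries { it.split('=', 2) as List }
    params[name]
}

def key = queryParam(hrefs[0], 'key')
```

The value is deliberately left undecoded: running it through `URLDecoder` would turn the `+` characters in the key into spaces.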