Remove HTML tag associated with a class
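I'm forcing myself to learn how to script solely in AppleScript, but I'm currently stuck trying to remove a particular tag that has a class. I've looked for solid documentation and examples, but they seem to be very limited at the moment.
Here is the HTML I have: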
<p>Bacon ipsum dolor amet pork chop landjaeger short ribs boudin short loin jowl <span class="foo">shoulder</span> biltong shankle capicola drumstick pork loin rump spare ribs ham hock. <span class="bar">Pig brisket</span> jowl ham pastrami <span class="foo">jerky</span> strip steak bacon doner. Short loin leberkas jowl, filet mignon turducken chicken ribeye shank tail swine strip steak pork loin sausage. Frankfurter ground round porchetta, pork short ribs jowl alcatra flank sausage.</p>
What I want to do is remove a specific class, so that the <span class="foo"> tags are removed, resulting in:
<p>Bacon ipsum dolor amet pork chop landjaeger short ribs boudin short loin jowl shoulder biltong shankle capicola drumstick pork loin rump spare ribs ham hock. <span class="bar">Pig brisket</span> jowl ham pastrami jerky strip steak bacon doner. Short loin leberkas jowl, filet mignon turducken chicken ribeye shank tail swine strip steak pork loin sausage. Frankfurter ground round porchetta, pork short ribs jowl alcatra flank sausage.</p>
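I know how to do this with do shell script and through the terminal, but I want to learn what is available through the AppleScript dictionaries.
While researching, I found a way to strip all HTML tags: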
on removeMarkupFromText(theText)
	set tagDetected to false
	set theCleanText to ""
	repeat with a from 1 to length of theText
		set theCurrentCharacter to character a of theText
		if theCurrentCharacter is "<" then
			set tagDetected to true
		else if theCurrentCharacter is ">" then
			set tagDetected to false
		else if tagDetected is false then
			set theCleanText to theCleanText & theCurrentCharacter as string
		end if
	end repeat
	return theCleanText
end removeMarkupFromText
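But that removes all HTML tags, which isn't what I want. Searching SO, I found how to extract the text between tags in "Parsing HTML source code using AppleScript", but I don't want to parse the file.
I'm familiar with BBEdit's Balance Tags (called Balance in the drop-down menu), but when I run: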
tell application "BBEdit"
	activate
	find "<span class=\"foo\">" searching in text 1 of text document "test.html" options {search mode:grep, wrap around:true} with selecting match
	balance tags
end tell
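it gets greedy and grabs the entire line from the first tag to the second-to-last closing tag, with the text in between, instead of confining itself to the first tag and its text.
Researching the dictionary further, under tag I ran across find tag, with which I can do: set spanTarget to (find tag "span" start_offset counter), then target the tag with the class using |class| of attributes of tag of spanTarget and use balance tags, but I still run into the same problem as before.
So in pure AppleScript, how can I remove a tag associated with a class without being greedy?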
This is a job for regular expressions, which are available through the now-supported AppleScriptObjC bridge. Paste this code into Script Editor and run it:
use AppleScript version "2.5" -- for El Capitan or later
use framework "Foundation"
use scripting additions

on stringByMatching:thePattern inString:theString replacingWith:theTemplate
	set theNSString to current application's NSString's stringWithString:theString
	-- Options for compiling the pattern: "." also matches line breaks, and ^ and $ match at line boundaries
	set theOptions to (current application's NSRegularExpressionDotMatchesLineSeparators as integer) + (current application's NSRegularExpressionAnchorsMatchLines as integer)
	set theExpression to current application's NSRegularExpression's regularExpressionWithPattern:thePattern options:theOptions |error|:(missing value)
	-- Matching options are a different set of flags from the compile options, so pass 0 here
	set theResult to theExpression's stringByReplacingMatchesInString:theNSString options:0 range:{location:0, |length|:theNSString's |length|()} withTemplate:theTemplate
	return theResult as text
end stringByMatching:inString:replacingWith:

set theHTML to "<p>Bacon ipsum dolor amet pork chop landjaeger short ribs boudin short loin jowl <span class='foo'>SHOULDER</span> biltong shankle capicola drumstick pork loin rump spare ribs ham hock. <span class='bar'>PIG BRISKET</span> jowl ham pastrami <span class='foo'>JERKY</span> strip steak bacon doner. Short loin leberkas jowl, filet mignon turducken chicken ribeye shank tail swine strip steak pork loin sausage. Frankfurter ground round porchetta, pork short ribs jowl alcatra flank sausage.</p>"
-- Match only spans with class foo (either quote style); "$1" keeps the captured inner text
set modifiedHTML to its stringByMatching:"<span class=['\"]foo['\"]>(.*?)</span>" inString:theHTML replacingWith:"$1"
This works for well-formed HTML, but as user foo points out above, browsers can cope with malformed HTML, whereas you may not be able to.
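As a variation on the same idea, the class name can be passed in as a parameter instead of being written into the pattern. The following is only a sketch: the handler name removeSpansWithClass is hypothetical, and it assumes that class is the span's only attribute, that it holds exactly one class name, and that the spans are not nested.

```applescript
use AppleScript version "2.5"
use framework "Foundation"
use scripting additions

-- Hypothetical helper: unwrap every <span class="className">...</span>, keeping the inner text
on removeSpansWithClass(className, theHTML)
	set theNSString to current application's NSString's stringWithString:theHTML
	-- Escape the class name so that regex metacharacters in it are matched literally
	set safeName to (current application's NSRegularExpression's escapedPatternForString:className) as text
	-- Accept single or double quotes; (.*?) is non-greedy, so each span is matched on its own
	set thePattern to "<span class=['\"]" & safeName & "['\"]>(.*?)</span>"
	set theExpression to current application's NSRegularExpression's regularExpressionWithPattern:thePattern options:(current application's NSRegularExpressionDotMatchesLineSeparators) |error|:(missing value)
	set theRange to {location:0, |length|:theNSString's |length|()}
	-- "$1" re-inserts the captured inner text, so only the tags disappear
	set theResult to theExpression's stringByReplacingMatchesInString:theNSString options:0 range:theRange withTemplate:"$1"
	return theResult as text
end removeSpansWithClass

removeSpansWithClass("foo", "<p>a <span class='foo'>b</span> c <span class='bar'>d</span></p>")
-- result: "<p>a b c <span class='bar'>d</span></p>"
```

The span with class bar is left alone because the class name is part of the pattern, and the non-greedy group stops at the first closing tag it reaches.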
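I believe Ron's answer is a good approach, but if you'd rather not use regular expressions, the code below will do it. After seeing Ron's answer I wasn't going to post this, but I had already written it, so I figured I would at least give you a second option, since you're trying to learn.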
on run
	set theHTML to "<p>Bacon ipsum dolor amet pork chop landjaeger short ribs boudin short loin jowl <span class=\"foo\">shoulder</span> biltong shankle capicola drumstick pork loin rump spare ribs ham hock. <span class=\"bar\">Pig brisket</span> jowl ham pastrami <span class=\"foo\">jerky</span> strip steak bacon doner. Short loin leberkas jowl, filet mignon turducken chicken ribeye shank tail swine strip steak pork loin sausage. Frankfurter ground round porchetta, pork short ribs jowl alcatra flank sausage.</p>"
	set theHTML to removeTag(theHTML, "<span class=\"foo\">", "</span>")
end run

-- Removes each startTag together with the first endTag that follows it, keeping the text in between
on removeTag(theText, startTag, endTag)
	if theText contains startTag then
		-- Split the text at every occurrence of startTag
		set AppleScript's text item delimiters to startTag
		set tempText to text items of (theText as string)
		set AppleScript's text item delimiters to {""}
		set middleText to item 2 of tempText as string
		if middleText contains endTag then
			-- Drop only the first endTag after the first startTag
			set AppleScript's text item delimiters to endTag
			set tempText2 to text items of (middleText as string)
			set AppleScript's text item delimiters to {""}
			set newString to implode(tempText2, endTag)
			set item 2 of tempText to newString
		end if
		set newString to implode(tempText, startTag)
		return removeTag(newString, startTag, endTag) -- recursive
	else
		return theText
	end if
end removeTag

-- Joins items 1 and 2 with no separator, then re-joins any remaining items with tag
on implode(parts, tag)
	set newString to items 1 thru 2 of parts as string
	if (count of parts) > 2 then
		set newList to {newString, items 3 thru -1 of parts}
		set AppleScript's text item delimiters to tag
		set newString to (newList as string)
		set AppleScript's text item delimiters to {""}
	end if
	return newString
end implode
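The same split-and-rejoin idea can also be written as a single loop without recursion. This is a sketch rather than a replacement for the code above: the handler name stripTag is hypothetical, it relies on the offset command from Standard Additions, and it assumes the tags are not nested. Unlike removeTag, it drops a start tag even when no end tag follows it.

```applescript
-- Hypothetical helper: remove every startTag and the first endTag after it, keeping the text between
on stripTag(theText, startTag, endTag)
	set savedDelimiters to AppleScript's text item delimiters
	-- Split at every start tag; item 1 is the text before the first one
	set AppleScript's text item delimiters to startTag
	set pieces to text items of theText
	set AppleScript's text item delimiters to savedDelimiters
	set resultText to item 1 of pieces
	repeat with i from 2 to (count of pieces)
		set piece to item i of pieces
		-- Position of the first end tag in this piece (0 if there is none)
		set cut to offset of endTag in piece
		if cut > 0 then
			set beforeText to ""
			if cut > 1 then set beforeText to text 1 thru (cut - 1) of piece
			set afterText to ""
			set afterStart to cut + (length of endTag)
			if afterStart <= (length of piece) then set afterText to text afterStart thru -1 of piece
			set piece to beforeText & afterText
		end if
		set resultText to resultText & piece
	end repeat
	return resultText
end stripTag

stripTag("a <span class=\"foo\">b</span> c <span class=\"bar\">d</span>", "<span class=\"foo\">", "</span>")
-- result: "a b c <span class=\"bar\">d</span>"
```

Saving and restoring the text item delimiters keeps the handler from changing them for the rest of the script.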
You can use a regular expression in the find command of BBEdit or TextWrangler.
To select the tag (non-greedy), use this command:
find "<span class=\"foo\">.+?</span>" searching in text 1 of text document 1 options {search mode:grep, wrap around:true} with selecting match
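About the .+?</span> pattern:
- . matches any character (except a line break)
- + means the preceding . is repeated one or more times
- ? makes the quantifier non-greedy
So the pattern matches an opening span tag, followed by one or more characters other than a return, followed by a closing span tag. The non-greedy quantifier gives the result we want: it stops BBEdit from running past the closing </span> tag and matching across several tags.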
To match the pattern across lines, just put (?s) at the start of the pattern, like this:
find "(?s)<span class=\"foo\">.+?</span>" searching in text 1 of text document 1 options {search mode:grep, wrap around:true} with selecting match
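- The command matches a tag with no line break:
<span class="foo">shoulder</span>
- Or, the command matches a tag containing a line break:
<span class="foo">shoulder
</span>
- Or, the command matches a tag spanning several lines:
<span class="foo">shoulder
xxxx
yyyy
zzzz</span>
From AppleScript, you can use the replace command (BBEdit or TextWrangler) to find a pattern and delete all of the matching strings, like this: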
replace "(?s)<span class=\"foo\">.+?</span>" using "" searching in text 1 of text document 1 options {search mode:grep, wrap around:true}
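Note that replacing with an empty string deletes the text inside the span as well as the tags. If the goal is the result shown in the question, where the tags go and the text stays, one option is to capture the inner text and write it back. This is a sketch that assumes BBEdit's grep replacement syntax, where \1 refers to the first capture group (written \\1 inside an AppleScript string), and that the first text document holds the HTML:

```applescript
tell application "BBEdit"
	-- (.+?) captures the text inside the span; \\1 writes it back without the surrounding tags
	replace "(?s)<span class=\"foo\">(.+?)</span>" using "\\1" searching in text 1 of text document 1 options {search mode:grep, wrap around:true}
end tell
```

Because .+? needs at least one character, an empty <span class="foo"></span> would not be matched by this pattern; using .*? instead would cover that case.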