Extract string between two strings from text document using AppleScript

Question

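I'm very new to writing code. I've been looking everywhere for a way to find a string in a text document and then return part of the string on the next line. Ideally, the end goal is to place this extracted string into an Excel file, but I'm nowhere near that step. I've tried a lot of different options, but I can't for the life of me get it to work. I feel like I'm close, which is killing me, because I just can't figure out where I'm going wrong.

Goal: extract the name of the person who posted the job from the text below, without knowing the person's name in advance. I know the string "Job posted by" immediately precedes the name I'm looking for, and I know "·" immediately follows the name. These surrounding strings appear nowhere else in the text document.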

I'm running OS X El Capitan
file name for this example is ExtractedTextOutput.txt
file location for this example is "/Users/RaquelBianca/Desktop/ExtractTextOutput2.txt"

My attempt so far is below (my problem is that it seems to just return the entire text document rather than the name I'm looking for):

set theFile to ("/Users/RaquelBianca/Desktop/ExtractTextOutput2.txt")
set theFileContents to read theFile

set output to {}
set od to AppleScript's text item delimiters
set AppleScript's text item delimiters to {"
"}

set all_lines to every text item of theFileContents
repeat with the_line in all_lines
if "Job posted by" is not in the_line then
    set output to output & the_line
else
    set AppleScript's text item delimiters to {"Job posted by"}
    set latter_part to last text item of the_line
    set AppleScript's text item delimiters to {" "}
    set last_word to last text item of latter_part
    set output to output & ("$ " & last_word as string)
end if
end repeat

set AppleScript's text item delimiters to {"
"}

set output to output as string
set AppleScript's text item delimiters to od
return output

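Any and all help and ideas are greatly appreciated.

Sample text from the file: 9/2/16 Application Security Engineer job at Datadog in the Greater New York City Area | LinkedIn 60 Home Profile Job Description My Network Jobs  Search for people, jobs, companies, and more...  Advanced  Business Services  Go to Lynda.c Application Security Engineer Datadog Greater New York City Area Posted 15 days ago 93 views 1 alum works here Apply on company website Our mission is to bring sanity to cloud operations and we need you to build resilient and secure applications on our platform. What you will do Perform code and design reviews, contribute code that improves security throughout Datadog's products Educate your fellow engineers about security in code and infrastructure Monitor production applications for anomalous activity Prioritize and track application security issues across the company Help improve our security policies and processes Job posted by Ryan Elberg · 2nd Head of Technical Talent Acquisition at Datadog Greater New York City Area Send InMail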

Answer 1

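I just have difficulty working out exactly what your second separator is. Your text sample shows "·", but when I check what comes after "Elberg" and before "2nd...", I find 4 characters: code 32 (space), code 194 (¬), code 183 (∑), code 32 (space).

In the script below I used code 194. It works when I cut and paste your text sample into a file. Here is the script: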

set theFile to ("/Users/RaquelBianca/Desktop/ExtractTextOutput2.txt")
-- your separator seems to be code 32 (space), code 194 (¬), code 183 (∑), code 32 (space)
set Separator to ASCII character 194 -- is it correct ?

set theFileContents to read theFile
set myAuthor to ""
set AppleScript's text item delimiters to {"Job posted by "}
if (count of text items of theFileContents) is 2 then -- exactly one "Job posted by " was found
    set Part2 to text item 2 of theFileContents -- this part starts just after "Job posted by "
    set AppleScript's text item delimiters to {Separator}
    set myAuthor to text item 1 of Part2
end if

log "result=//" & myAuthor & "//" -- show the result in variable myAuthor

Note: if the text does not contain "Job posted by ", then myAuthor is "".
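
The same two-step split can be wrapped in a reusable handler that also restores the delimiters afterwards. The sketch below is only an illustration: textBetween and its parameter names are hypothetical (they are not part of AppleScript), and the sample string assumes the separator arrives as the literal " ·".

```applescript
-- Hypothetical helper: returns the text between startMarker and endMarker,
-- or "" when either marker is missing.
on textBetween(sourceText, startMarker, endMarker)
    set savedDelimiters to AppleScript's text item delimiters
    set foundText to ""
    try
        set AppleScript's text item delimiters to {startMarker}
        if (count of text items of sourceText) > 1 then
            -- everything after the first occurrence of startMarker
            set afterStart to (text items 2 thru -1 of sourceText) as text
            set AppleScript's text item delimiters to {endMarker}
            if (count of text items of afterStart) > 1 then
                set foundText to text item 1 of afterStart
            end if
        end if
    end try
    -- always put the delimiters back, even if something failed above
    set AppleScript's text item delimiters to savedDelimiters
    return foundText
end textBetween

-- Shortened sample built from the text in the question
set sampleText to "Help improve our security policies and processes Job posted by Ryan Elberg · 2nd Head of Technical Talent Acquisition"
textBetween(sampleText, "Job posted by ", " ·")
--> "Ryan Elberg"
```

Returning "" when a marker is missing mirrors the behaviour of myAuthor in the script above.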

Answer 2

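You had the right idea using AppleScript's text item delimiters, but the way you tried to extract the name is what gave you trouble. First, though, I'll go over a few ways your script can be improved: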

set all_lines to every text item of theFileContents
repeat with the_line in all_lines
    if "Job posted by" is not in the_line then
        set output to output & the_line
    else
        …
end repeat

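There's no need to split the file contents into lines; AppleScript can operate on whole paragraphs, or more, if need be.

Removing these unnecessary steps (and adding new ones so it works on the entire file) shrinks the script considerably: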

set theFile to ("/Users/RaquelBianca/Desktop/ExtractTextOutput2.txt")
set theFileContents to read theFile

set output to {}
set od to AppleScript's text item delimiters

if "Job posted by" is in theFileContents then
    set AppleScript's text item delimiters to {"Job posted by"}
    set latter_part to last text item of theFileContents
    set AppleScript's text item delimiters to {" "}
    set last_word to last text item of latter_part
    set output to output & ("$ " & last_word as string)
else
    display alert "Poster of job listing not found"
    set output to theFileContents
end if

set AppleScript's text item delimiters to od
return output

This is what's giving you the wrong output:

set last_word to last text item of latter_part
set output to output & ("$ " & last_word as string)

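This is incorrect. It's not the last word you want; that's the last word of the file! To extract the poster of the job listing, change it to the following: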

repeat with theWord in latterPart
    if the first character in theWord is "¬" then exit repeat
    set output to output & theWord
end repeat

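Because of AppleScript's odd Unicode handling, the dot (·) that separates the name from the rest of the text is, for whatever reason, converted to "¬∑" when run through the script. So we look for "¬" instead.

A few final code fixes:

Some of your variable names use the_snake_case while others use theCamelCase. It's generally a good idea to stick to one convention or the other, so I fixed that too.

I assumed you wanted the dollar sign in the output for some reason, so I kept it. If you don't want it, just replace set output to "$ " with set output to "".

So your final, working script looks like this: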
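
The "¬∑" pair is most likely an encoding mismatch: bytes 194 and 183 are the UTF-8 encoding of the middle dot (U+00B7), and a plain read interprets the file in the Mac's primary encoding (typically Mac OS Roman), where those two bytes display as ¬ and ∑. If the file really is UTF-8, which is an assumption here, reading it as «class utf8» keeps the dot as a single character. A minimal sketch of that alternative, where readPosterName is a hypothetical handler name:

```applescript
-- Assumption: the file on disk is UTF-8 encoded.
-- readPosterName is a hypothetical handler, not a built-in command.
on readPosterName(filePath)
    set theFileContents to read (POSIX file filePath) as «class utf8»
    set od to AppleScript's text item delimiters
    set posterName to ""
    if theFileContents contains "Job posted by " then
        set AppleScript's text item delimiters to {"Job posted by "}
        set latterPart to last text item of theFileContents
        -- with a UTF-8 read the separator is the literal " ·"
        set AppleScript's text item delimiters to {" ·"}
        set posterName to first text item of latterPart
    end if
    set AppleScript's text item delimiters to od
    return posterName
end readPosterName

-- Example call, using the path from the question:
-- readPosterName("/Users/RaquelBianca/Desktop/ExtractTextOutput2.txt")
```

With this approach there is no need to search for "¬"; the script can split on the same "·" that appears in the text.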

set theFile to "/Users/RaquelBianca/Desktop/ExtractTextOutput2.txt"
set theFileContents to read theFile as text

set output to "$ "
set od to AppleScript's text item delimiters

if "Job posted by" is in theFileContents then
    set AppleScript's text item delimiters to {"Job posted by"}
    set latterPart to last text item of theFileContents
    set AppleScript's text item delimiters to {" "}
    repeat with theWord in latterPart
        if the first character in theWord is "¬" then exit repeat
        set output to output & theWord
    end repeat
else
    display alert "Poster of job listing not found"
    set output to theFileContents
end if

set AppleScript's text item delimiters to od
return output
