One parameter for multiple patterns - grep

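I am trying to search PDF files from the terminal, supplying the search string on the command line. The search string can be a single word, multiple words combined with AND/OR, or an exact phrase. I want to keep just one parameter for all search queries. I will save the command below as a shell script and call that script through an alias defined in .aliases for zsh or bash.

Based on sjr's answer here: search multiple pdf files

I used sjr's answer like this: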

find $1 -name '*.pdf' -exec sh -c 'pdftotext "{}" - |
      grep -E -m'$2' --line-buffered --label="{}" '"$3"' '$4'' \;

$1 takes the path

$2 limits the number of results

$3 is the context parameter (it accepts -A, -B, -C, individually or combined)

$4 takes the search string

The problem I am facing is with the $4 value. As I said earlier, I want this parameter to carry my search string, which can be a phrase, a single word, or multiple words in an AND/OR relationship.

I could not get the desired results at all until I followed Robin Green's comment, after which phrase searches returned results. But the phrase results are still not accurate.

Edit: text from judgments:

The original rule was that you could not claim for psychiatric injury in negligence. There was no liability for psychiatric injury unless there was also physical injury (Victorian Rly Commrs v Coultas [1888]). The courts were worried both about fraudulent claims and that if they allowed claims, the floodgates would open. The claimant was 15 metres away behind a tram and did not see the accident but later saw blood on the road. She suffered nervous shock and had a miscarriage. She sued for negligence. The court held that it was not reasonably foreseeable that someone so far away would suffer shock and no duty of care was owed. White v Chief Constable of South Yorkshire [1998] The claimants were police officers who all had some part in helping victims at Hillsborough and suffered psychiatric injury. The House of Lords held that rescuers did not have a special position and had to follow the normal rules for primary and secondary victims. They were not in physical danger and not therefore primary victims. Neither could they establish they had a close relationship with the injured so failed as secondary victims. It is necessary to define `nervous shock' which is the rather quaint term still sometimes used by lawyers for various kinds of psychiatric injury...rest of para

word1 can be: shock, (nervous shock)

word2 can be: psychiatric

exact phrase: (nervous)

Commands

alias s='sh /path/shell/script.sh'
export p='path/pdf/files'

In the terminal:

s "$p" 10 -5 "word1/|word2"          #for OR search
s "$p" 10 -5 "word1.*word2.*word3"   #for AND search
s "$p" 10 -5  ""exact phrase""       #for phrase search

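Second test sample: a sample PDF file, since the command runs on PDF documents: Test-File. It has 4 pages (part of a 361-page file).

If we run the command below, it works as described in the solution: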

s "$p" 10 -5 'doctrine of basic structure' > ~/desktop/BSD.txt && open ~/desktop/BSD.txt

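We get the relevant text and avoid going through the whole file. I think this is a neat way to read only what we want, rather than the traditional approach.

You need to:

  • Pass a double-quoted command string to sh -c so that the embedded shell-variable references are expanded (embedded " instances then need to be escaped as \").

  • Quote the regex with printf %q so it can be safely included in the command string. Note that this requires bash, ksh, or zsh as the shell.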

dir=$1
numMatches=$2
context=$3
regexQuoted=$(printf %q "$4")

find "${dir}" -type f -name '*.pdf' -exec sh -c "pdftotext \"{}\" - |
  grep -E -m${numMatches} --with-filename --label=\"{}\" ${context} ${regexQuoted}" \;
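
As a hedged aside, one way to sidestep the quoting step is to leave the command string single-quoted and hand the values to sh -c as positional parameters. The sketch below is not part of the command above; the sample regex and the "sh" placeholder for $0 are illustrative assumptions. It only shows that a pattern containing spaces, a quote and | arrives intact in the child shell.

```shell
# Hypothetical pattern with characters that are awkward to embed in a command string.
regex="it's|nervous shock"

# The command string stays single-quoted, so "$1" is expanded by the child shell.
# The first argument after the command string becomes $0 (here the placeholder "sh").
received=$(sh -c 'printf "%s\n" "$1"' sh "$regex")

printf '%s\n' "$received"
```

The same idea carries over to find: with -exec sh -c '...' sh {} \; the file path shows up as $1 inside the command string.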

The 3 invocation scenarios would be:

s "$p" 10 -5 'word1|word2'          #for OR search
s "$p" 10 -5 'word1.*word2.*word3'  #for AND search
s "$p" 10 -5 'exact phrase'         #for phrase search

Note that there is no need to escape |, nor to add an extra layer of double quotes around exact phrase.
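
To make the three pattern styles concrete, here is a small sketch that runs them against two sample lines instead of a PDF. The sample text and variable names are made up for illustration. Note that word1.*word2 only matches when both words occur on the same line in that order; chaining two grep calls is one possible way to match them in either order.

```shell
# Two made-up sample lines standing in for pdftotext output.
sample='She suffered nervous shock and psychiatric injury.
The psychiatric report did not mention shock.'

# OR: either word on a line (| needs no backslash with grep -E).
or_count=$(printf '%s\n' "$sample" | grep -E -c 'nervous|report')

# AND, in this order only: "shock" followed later on the line by "psychiatric".
ordered_count=$(printf '%s\n' "$sample" | grep -E -c 'shock.*psychiatric')

# AND, in either order: filter twice.
any_order_count=$(printf '%s\n' "$sample" | grep -E 'shock' | grep -E -c 'psychiatric')

# Exact phrase: the regex is just the phrase, with no extra quoting layer.
phrase_count=$(printf '%s\n' "$sample" | grep -E -c 'nervous shock')

# prints: OR=2 ordered=1 any-order=2 phrase=1
printf 'OR=%s ordered=%s any-order=%s phrase=%s\n' \
  "$or_count" "$ordered_count" "$any_order_count" "$phrase_count"
```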

Also note that I have replaced --line-buffered with --with-filename, because I assume that is what you meant (prefixing matching lines with the PDF file path).
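
A quick way to see what --with-filename and --label do together, assuming a grep that supports --label (GNU and BSD grep do; it is not in POSIX): when grep reads from a pipe there is no real file name to report, so --label supplies one. The name doc.pdf below is hypothetical.

```shell
# grep reads from a pipe here, so there is no real file name to print.
# --label supplies one; doc.pdf is a made-up name for illustration.
labelled=$(printf 'nervous shock\n' | grep -E --with-filename --label='doc.pdf' 'shock')

printf '%s\n' "$labelled"   # prints: doc.pdf:nervous shock
```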


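Note that the approach above has to create a shell instance for every input path, which is inefficient, so consider rewriting your command as follows, which also removes the need for printf %q (assuming regex=$4):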

find "${dir}" -type f -name '*.pdf' |
  while IFS= read -r f; do
    pdftotext "$f" - |
      grep -E -m${numMatches} --with-filename --label="$f" ${context} "${regex}"
  done

The above assumes that your file names have no embedded newlines, which is rarely a real-world concern. If it is, there are ways to solve the problem.
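
One such way, sketched here under assumptions, is to let find hand the paths directly to a single sh -c invocation with -exec ... {} +, so file names never travel through a pipe. The function name search_tree is hypothetical, and cat stands in for pdftotext "$f" - so that the sketch can be tried on plain text files; swap the real converter back in for actual PDFs.

```shell
# search_tree DIR NUM_MATCHES CONTEXT REGEX   (hypothetical helper name)
search_tree() {
  # The values travel as positional parameters, so nothing is embedded in
  # the command string; "sh" fills $0 and find appends the paths after them.
  find "$1" -type f -name '*.pdf' -exec sh -c '
    num=$1 ctx=$2 re=$3
    shift 3
    for f do
      # cat is a stand-in for: pdftotext "$f" -
      cat "$f" |
        grep -E -m"$num" --with-filename --label="$f" $ctx -e "$re"
    done
  ' sh "$2" "$3" "$4" {} +
}

# Demo on throwaway files, one of which has a newline in its name.
tmp=$(mktemp -d)
nl='
'
printf 'nervous shock\n' > "$tmp/plain.pdf"
printf 'nervous shock\n' > "$tmp/odd${nl}name.pdf"
match_count=$(search_tree "$tmp" 10 -A0 'nervous shock' | grep -c 'nervous shock')
printf 'files matched: %s\n' "$match_count"
rm -rf "$tmp"
```

Because {} + batches many paths into one sh invocation, this variant also avoids starting a shell per file.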

Another advantage of this solution is that it uses only POSIX-compliant shell features, though note that the grep command uses nonstandard options.