Extract all links between ' and ' in a text file, using CLI (Linux)

Question

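I have a very large text (.sql) file, and I want to put all the links it contains into a clean text file, with one link per line.

I found the following command in an answer by anubhava, and it almost works for me: grep -Eo "https?://\S+?\.html" filename.txt > newFile.txt

Unfortunately, it does not quite work. Problem 1: in that answer, the web pages end with .html. That is not the case for me. My links have no common ending, so I need the match to stop just before the second ' character.

Problem 2: I don't want the ' characters copied into the output.

Here is an example (because I feel I am not explaining this well):

Say my file reads like this: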

Not him old music think his found enjoy merry. Listening acuteness dependent at or an. 'https://I_want_this' Apartments thoroughly unsatiable terminated sex how themselves. She are ten hours wrong walls stand early. 'https://I_want_this_too'. Domestic perceive on an ladyship extended received do. Why jennings our whatever his learning gay perceive. Is against no he without subject. Bed connection unreserved preference partiality not unaffected. Years merit trees so think in hoped we as.

I want

https://I_want_this
https://I_want_this_too

as the output file.

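Sorry for the simple question, but I am new to all of this, and grep/sed etc. are not easy for me to understand, especially when I want to search for special characters such as /, ', " etc.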

Answer 1

You can use a GNU grep command such as

grep -Po "'\Khttps?://[^\s']+" file

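Details:

P - enables the PCRE regex engine
o - outputs only the matched text, not the whole matching line
'\Khttps?://[^\s']+ - matches ', then \K drops it from the reported match, then matches http, an optional s, ://, and finally one or more characters other than whitespace and '.

See the online demo: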

#!/bin/bash
s="Not him old music think his found enjoy merry. Listening acuteness dependent at or an. 'https://I_want_this' Apartments thoroughly unsatiable terminated sex how themselves. She are ten hours wrong walls stand early. 'https://I_want_this_too'. Domestic perceive on an ladyship extended received do. Why jennings our whatever his learning gay perceive. Is against no he without subject. Bed connection unreserved preference partiality not unaffected. Years merit trees so think in hoped we as."
grep -Po "'\Khttps?://[^\s']+" <<< "$s"

Output:

https://I_want_this
https://I_want_this_too
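
Since the question asks for the links to end up in a file, here is a minimal sketch of the same command reading from a file and redirecting its output. The temporary files and the sample rows are assumptions made up for illustration; replace them with your real .sql dump and the output path you want. It assumes a GNU grep built with PCRE support (the -P option).

```shell
#!/bin/bash
# Hypothetical stand-ins for the real .sql dump and the output file.
input=$(mktemp)
output=$(mktemp)

# Write two sample lines that contain quoted links.
printf '%s\n' \
  "INSERT INTO t VALUES ('https://I_want_this', 'x');" \
  "Some text 'https://I_want_this_too'. More text." > "$input"

# Extract every link that follows a single quote, one per line,
# and write the result to the output file instead of the terminal.
grep -Po "'\Khttps?://[^\s']+" "$input" > "$output"

cat "$output"
```

Because -o prints each match on its own line, several links on one input line still end up as separate lines in the output file.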

Answer 2

With your shown samples, please try the following awk code. It was written and tested in GNU awk and should work in any awk.

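awk '
{
  # \047 is the octal escape for a single quote, which cannot be
  # written literally inside this single-quoted awk program.
  while(match($0,/\047https?:\/\/[^\047]*/)){
    print substr($0,RSTART+1,RLENGTH-1)   # skip the leading quote
    $0=substr($0,RSTART+RLENGTH)          # keep scanning the rest of the line
  }
}
'  Input_file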

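Explanation: In short, the main program runs awk's match function inside a while loop. The match function uses the regex \047https?:\/\/[^\047]* (which matches 'http or 'https followed by :// and everything up to the next occurrence of '), and the program then prints the substring of the matched value (located via the RSTART and RLENGTH variables that match sets), leaving out the leading '. The line is then cut down to what follows the match, so the loop finds the next link.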
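
To see the loop described above in action, here is a small sketch that feeds a shortened version of the sample text to the awk program through a pipe. The variable names sample and result are chosen only for this illustration.

```shell
#!/bin/bash
# Shortened sample text with two quoted links on one line.
sample="Listening at or an. 'https://I_want_this' Apartments. 'https://I_want_this_too'. Domestic."

result=$(printf '%s\n' "$sample" | awk '
{
  # Each pass finds the next quote followed by http:// or https://,
  # prints the match without its leading quote, then drops the
  # consumed part of the line so the next pass finds the next link.
  while (match($0, /\047https?:\/\/[^\047]*/)) {
    print substr($0, RSTART + 1, RLENGTH - 1)
    $0 = substr($0, RSTART + RLENGTH)
  }
}')

printf '%s\n' "$result"
```

The loop ends when match returns 0, meaning no further quoted link remains on the current line.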
