Extract JSON Format from Large Text File

Question

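Introduction

Hi! I am trying to extract JSON from a 300K-line text file that mixes plain-text output with JSON from HTTP results. With that many lines, picking out the JSON by hand is not feasible.

Problem

Without many options, I probably need to fix it from the command line. This is what the file looks like inside: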

[2K  100.00% - C: 164148 / 164149 - S: 263 - F: 3686 - dhcp-140-247-148-215.fas.harvard.edu:443 - id3.sshws.me
[2K  100.00% - C: 164149 / 164149 - S: 263 - F: 3686 - public-1300503051.cos.ap-shanghai.myqcloud.com:443 - id3.sshws.me
[2K
[
  {
    "Request": {
      "ProxyHost": "pro.ant.design",
      "ProxyPort": 443,
      "Bug": "pro.ant.design",
      "Method": "HEAD",
      "Target": "id3.sshws.me",
      "Payload": "GET wss://pro.ant.design/ HTTP/1.1[crlf]Host: [host][crlf]Upgrade: websocket[crlf][crlf]"
    },
    "ResponseLine": [
      "HTTP/1.1 101 Switching Protocol",
      "Server: cloudflare"
    ]
  },
  {
    "Request": {
      "ProxyHost": "industrialtech.ft.com",
      "ProxyPort": 443,
      "Bug": "industrialtech.ft.com",
      "Method": "HEAD",
      "Target": "id3.sshws.me",
      "Payload": "GET wss://industrialtech.ft.com/ HTTP/1.1[crlf]Host: [host][crlf]Upgrade: websocket[crlf][crlf]"
    },
    "ResponseLine": [
      "HTTP/1.1 101 Switching Protocol",
      "Server: cloudflare"
    ]
  }
]

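Using a regex runs into several problems:

The file contains multiple JSON objects
The text lines that are not part of the JSON also contain [ and :

I realized this while trying a sed regular expression: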

sed '/^[/,/^]/!d'

Answer 1

You can delete every line that starts with [ immediately followed by a non-whitespace character:

sed '/^\[[^[:space:]]/d' file > newfile

Details:

^ - start of the line
\[ - a literal [ character
[^[:space:]] - any single non-whitespace character.
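
To see the effect on a small input, here is a minimal sketch. The sample lines, the example.com host name and the temporary file names are made up for illustration; only the sed expression comes from the answer above. It assumes the JSON is pretty-printed, so the opening [ of the array sits alone on its line.

```shell
#!/bin/sh
# Hypothetical sample: two progress lines followed by a small JSON array.
tmpdir=$(mktemp -d)
printf '%s\n' \
  '[2K  100.00% - C: 1 / 2 - S: 0 - F: 0 - example.com:443 - id3.sshws.me' \
  '[2K' \
  '[' \
  '  {' \
  '    "ProxyPort": 443' \
  '  }' \
  ']' > "$tmpdir/file"

# Delete every line that starts with "[" followed by a non-whitespace character.
# The lone "[" that opens the JSON array has nothing after it, so it is kept.
sed '/^\[[^[:space:]]/d' "$tmpdir/file" > "$tmpdir/newfile"

cat "$tmpdir/newfile"
```

One caveat: a JSON line that begins in the first column with [ directly followed by another character (for example a compact [] or [{) would be deleted too, so this relies on the indented layout shown in the question.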

Answer 2

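Another approach is to take advantage of the control characters. If you want to strip the progress bar from the output and keep only the actual output:

Open the file with nano <output_file>.
You will see that the first line of text contains control characters displayed as ^M^[ (^M is a carriage return, ^[ is the escape character). I assume this is related to [crlf].
Use sed -e "/^M^[/d" to delete the lines that contain those characters. Here ^M and ^[ stand for the control characters themselves, not for a caret followed by a letter; in most shells they can be typed as Ctrl-V Ctrl-M and Ctrl-V Esc.

If you need to match a literal ^ instead, escape it with \ in the regex.

Always look for the pattern from the terminal rather than in a text editor, because some editors cannot display these characters.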
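
The steps above can be sketched as a small script. This is only an illustration under stated assumptions: the sample file is made up, and it assumes each progress line begins with a carriage return followed by ESC and [2K, which is one plausible reading of the ^M^[ shown by nano. The control characters are produced with printf so that nothing has to be typed by hand.

```shell
#!/bin/sh
tmpdir=$(mktemp -d)

# Hypothetical sample: progress lines start with CR + ESC + "[2K" (assumed layout),
# followed by a small pretty-printed JSON array.
printf '\r\033[2K  100.00%% - C: 1 / 2 - example.com:443\n'  > "$tmpdir/output.txt"
printf '\r\033[2K  100.00%% - C: 2 / 2 - example.org:443\n' >> "$tmpdir/output.txt"
printf '[\n  {\n    "ProxyPort": 443\n  }\n]\n'             >> "$tmpdir/output.txt"

# Inspect the first line from the terminal: od -c prints CR as \r and ESC as 033.
head -n 1 "$tmpdir/output.txt" | od -c | head -n 2

# Store the two control characters in variables, then delete every line
# that contains a carriage return immediately followed by ESC.
cr=$(printf '\r')
esc=$(printf '\033')
sed "/${cr}${esc}/d" "$tmpdir/output.txt" > "$tmpdir/clean.json"

cat "$tmpdir/clean.json"
```

If the real file uses a different sequence, check it with od -c first and adjust the pattern to match what is actually there.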

regex

json

sed