How to do multi string replace in a multiline text file that has almost a million lines using PowerShell?

Question

I have some log files (mosquitto broker logs) in the following format:

1652855102: New connection from xx.xx.xx.xx on port xxxx.
1652855102: Socket error on client <unknown>, disconnecting.
1652855102: Received PUBLISH from 16838547124974742985 (d0, q1, r0, m235, 's/uc/mpd/16838547124974742985', ... (30 bytes))
1652855102: Sending PUBACK to 16838547124974742985 (m235, rc0)
1652855102: Sending PUBLISH to mqtt_7811e829.e2e5e8 (d0, q1, r0, m42277, 's/uc/mpd/16838547124974742985', ... (30 bytes))
1652855102: Sending PUBLISH to dad3d73-013c-4274-a782-cdd6f2ebbc77 (d0, q0, r0, m0, 's/uc/mpd/16838547124974742985', ... (30 bytes))
1652855102: Received PUBACK from mqtt_7811e829.e2e5e8 (Mid: 42277, RC:0)
1652855103: Received DISCONNECT from 16838547082470259932
1652855103: Client 16838547082470259932 disconnected.

The format is as follows: UTC time in seconds since epoch|:| Message body

I want to convert the log file to:

1652855102|New connection from xx.xx.xx.xx on port xxxx.
1652855102|Socket error on client <unknown>, disconnecting.
1652855102|Received PUBLISH from 16838547124974742985 (d0, q1, r0, m235, 's/uc/mpd/16838547124974742985', ... (30 bytes))
1652855102|Sending PUBACK to 16838547124974742985 (m235, rc0)

As I understand it, there are 3 groups to capture here:
Group 1 - the first 10 digits
Group 2 - a colon and a space
Group 3 - everything after the space, which is the message

$RegEx = '(?ms)^([\d]{10})(:\s)(.+)'
(Get-Content -Raw ..\sample-mosquitto.log) -Replace $RegEx, '$1|$3'

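My script only replaces the first line and does not work for the remaining lines.

Is there a way to capture and replace on all lines without actually running a for-each?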

Answer 1

I suggest using

$RegEx = '^(\d{10}):\s+(.+)'
(Get-Content ..\sample-mosquitto.log) -Replace $RegEx, '$1|$2'

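That is, remove -Raw because you want to operate line by line, and capture only what you need to keep (:\s+ does not need to be kept, so do not capture it). Because there are now only two groups, the replacement string becomes $1|$2.

See the regex demo.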
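As a minimal sketch of the line-by-line form, assuming a small hypothetical array of sample lines in place of the real file (Get-Content without -Raw returns one string per line), -replace is applied to each element in turn:

```powershell
# Hypothetical sample lines standing in for (Get-Content ..\sample-mosquitto.log).
$lines = @(
    '1652855102: New connection from xx.xx.xx.xx on port xxxx.',
    '1652855102: Sending PUBACK to 16838547124974742985 (m235, rc0)'
)

# -replace on an array processes every element and returns an array.
# $1 is the 10-digit timestamp, $2 is the message body.
$converted = $lines -replace '^(\d{10}):\s+(.+)', '$1|$2'
$converted
```

No (?m) option is needed here, because each array element is a single line and ^ already anchors at its start.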

If loading the whole file into memory is not a problem, you can keep using the -Raw option together with a multiline-matching regex:

$RegEx = '(?m)^(\d{10}):\s+(.+)'
(Get-Content ..\sample-mosquitto.log -Raw) -Replace $RegEx, '$1|$2'

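Note that the s flag has been removed: it makes . match line break characters (LF) as well, which is what let .+ consume the rest of the file in a single match.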
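The effect of the s flag can be reproduced on a tiny input. The following is only a sketch, assuming a hypothetical two-line string joined with LF in place of the real -Raw file content:

```powershell
# Hypothetical two-line sample standing in for Get-Content -Raw output.
$text = "1652855102: first message`n1652855103: second message"

# With (?ms), .+ also matches the line break, so one match spans the
# whole string and only the first separator is rewritten.
$withS = $text -replace '(?ms)^(\d{10}):\s+(.+)', '$1|$2'

# With (?m) only, .+ stops at the line break, so every line gets its own match.
$withoutS = $text -replace '(?m)^(\d{10}):\s+(.+)', '$1|$2'

$withS
$withoutS
```

In the first result the second line still contains ": ", which matches the symptom described in the question.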

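Also note that if you do not care whether there are any characters on the line after the : and the whitespace, i.e. if you only need to replace the : and whitespace after the ten-digit number with |, you can use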

$RegEx = '(?m)(?<=^\d{10}):\s+'
(Get-Content ..\sample-mosquitto.log -Raw) -Replace $RegEx, '|'

See this regex demo.

Finally, if you do not want line breaks after the : to be removed, replace \s with [^\S\r\n], [\s-[\r\n]] or [\p{Zs}\t]:

$RegEx = '(?m)(?<=^\d{10}):[\p{Zs}\t]+'

This will match horizontal whitespace characters only.
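The difference only shows up when a line ends right after the separator. A small sketch, assuming a hypothetical input whose first line has a timestamp, a colon and a space but no message:

```powershell
# Hypothetical input: the first line has no message body after ": ".
$text = "1652855102: `n1652855103: msg"

# \s+ also consumes the line break, so the two lines are merged.
$merged = $text -replace '(?m)(?<=^\d{10}):\s+', '|'

# [\p{Zs}\t]+ matches only spaces and tabs, so the line break survives.
$kept = $text -replace '(?m)(?<=^\d{10}):[\p{Zs}\t]+', '|'

$merged
$kept
```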

Answer 2

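Your only problem was the use of the inline regex option s, i.e. the SingleLine option, which makes the metacharacter . match any character, including newlines. In effect, this causes your regex to match the entire string, across all lines, in a single match. Without this option, . matches any character except newlines, which is what you are looking for here.

Also note that the character class \s matches any whitespace character and therefore also newlines. While that is not a problem in your case, it is better to match a space verbatim if you know that only a space can occur at that position.

Finally, note that a single capture group is enough: you do not need to capture the separator string, because you are replacing it with a fixed character, and you do not need to capture the rest of the line, because it should remain unchanged.

Therefore (note that the inline option m (Multiline) is still needed so that ^ matches at the start of every line):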

(Get-Content -Raw ..\sample-mosquitto.log) -replace '(?m)^(\d{10}): ', '$1|'

Re performance:

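Assuming that the file as a whole fits into memory (which is usually true even for large text files), using the -Raw switch to read the file into a single, multi-line string is indeed the fastest way to process it. By contrast, using Get-Content without -Raw would result in line-by-line streaming, which is not only inherently slower, but is slowed down further by each line getting decorated with metadata; see GitHub issue #7537 for a discussion of a potential future opt-out.

However, again assuming that the file as a whole fits into memory, there is a way to speed up Get-Content's line-by-line processing, namely via -ReadCount 0, which causes all lines to be read as a single array (so only a single output object is needed, and only the array as a whole is decorated with metadata):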

# Slower than -Raw, but reasonably fast line-by-line processing,
# thanks to -ReadCount 0
(Get-Content -ReadCount 0 ..\sample-mosquitto.log) -replace '^(\d{10}): ', '$1|'
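To check the performance remarks against your own data, something like the following sketch could be used. It assumes a generated, hypothetical 1000-line file in the temp directory (the path and contents are made up for illustration), and it makes no claim about the timings you will see:

```powershell
# Hypothetical sample file; point $path at the real log instead.
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'sample-mosquitto.log'
1..1000 | ForEach-Object { "1652855102: message $_" } | Set-Content -Path $path

# Whole file as one string: needs (?m) so ^ matches at every line start.
$viaRaw = (Get-Content -Raw $path) -replace '(?m)^(\d{10}): ', '$1|'

# All lines as one array: each element is a single line, so no (?m).
$viaArray = (Get-Content -ReadCount 0 $path) -replace '^(\d{10}): ', '$1|'

# Measure-Command returns a TimeSpan for each variant.
$tRaw   = Measure-Command { (Get-Content -Raw $path) -replace '(?m)^(\d{10}): ', '$1|' }
$tArray = Measure-Command { (Get-Content -ReadCount 0 $path) -replace '^(\d{10}): ', '$1|' }
'-Raw:         {0:N1} ms' -f $tRaw.TotalMilliseconds
'-ReadCount 0: {0:N1} ms' -f $tArray.TotalMilliseconds

Remove-Item -Path $path
```

Both variants produce the same converted lines; the -Raw form yields one multi-line string, the -ReadCount 0 form an array of strings.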
