Clean improperly positioned CR+LF in texts
I have a TXT file that I want to import into Excel for analysis. Before importing it, however, I am struggling with the text formatting. It is a complete mess, as you can see:
| 1020941333 | 569|SP |500000343 | 9|18.05.2011|15:27:00|18.05.2011|
18.05.2011|Y-0444871-ENCR | 1,93 |BRL |8000800000 |
Juros, Comissões e T | | |
| | | |
|CLB082902 | | | |COEL |COEL |
Y-0444871 |
| 1020941586 | 43|SP |500000344 |43|18.05.2011|15:41:43|18.05.2011|
18.05.2011|B-0447039-ENCR | 9,02 |BRL |8000800000 |
Juros, Comissões e T | | |
| | | |
|CLB082902 | | | |COEL |COEL |
B-0447039 |
| 1021245920 | 956|SP |500000489 | 6|14.06.2011|15:24:02|14.06.2011|
14.06.2011|B-0447039-ENCR | 8,95 |BRL |8000800000 |
Juros, Comissões e T | | |
| | | |
|CLB082902 | | | |COEL |COEL |
B-0447039
|
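So I went looking for the reason the text is so strange. I found that it is because some CR+LF (Carriage Return + Line Feed) sequences are in the wrong place. I corrected a few of them by hand, and from those I can see that the text can be organized much better, like this: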
--------------------------------------------------------------------------------
| Nº documento | LL.|TpDoc.|Nº doc.ref|LL|Entrado em|Hora |Data doc. |Dt.lçto. |Elemento PEP | Valor/moeda ACC|MdACC|Cl.custo |Denom.classe custo |Material | Qtd.entr.|Texto breve material |UML |Doc.compra| Item|Texto do pedido |Usuário |DEs |Est |Nº ref.estorno |Empr. |EmFI |Definição do projeto
--------------------------------------------------------------------------------
| 1016939462 | 1|WE |5000058364| 1|22.02.2010|10:52:43|22.02.2010|22.02.2010|Y0444871PROJELMC | 540,93 |BRL |8000124000 |Serviço de Terceiro | | 1,000 | |UR |4501328844| 1|ESTUDOS E PROJ. REDE |CLB055760 | | | |COEL |COEL |Y-0444871 |
| 1020016002 | 1|WE |5000053667| 1|15.02.2011|11:56:05|15.02.2011|15.02.2011|B0447039PROJELMC | 2.011,84 |BRL |8000124000 |Serviço de Terceiro | | 1,000 | |UR |4501633481| 1|ESTUDOS E PROJ. REDE |CLB093440 | | | |COEL |COEL |B-0447039 |
| 1020258918 | 798|SP |500000121 | 8|15.03.2011|18:06:18|15.03.2011|15.03.2011|B-0447039-ENCR | 6,92 |BRL |8000800000 |Juros, Comissões e T | | | | | | | |CLB107395 | | | |COEL |COEL |B-0447039 |
| 1020585116 | 761|SP |500000225 | 1|15.04.2011|14:13:44|15.04.2011|15.04.2011|Y-0444871-ENCR | 1,88 |BRL |8000800000 |Juros, Comissões e T | | | | | | | |CLB145327 | | | |COEL |COEL |Y-0444871 |
| 1020586939 | 184|SP |500000230 | 4|15.04.2011|16:22:41|15.04.2011|15.04.2011|B-0447039-ENCR | 7,03 |BRL |8000800000 |Juros, Comissões e T | | | | | | | |CLB145327 | | | |COEL |COEL |B-0447039 |
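I can also see a pattern in the text: every record begins with the character '|'. So every line that does not begin with '|' should be joined to the previous line.
The problem as it is: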
| 1020941333 | 569|SP |500000343 | 9|18.05.2011|15:27:00|18.05.2011|
18.05.2011|Y-0444871-ENCR | 1,93 |BRL |8000800000 |
Juros, Comissões e T | | |
| | | |
|CLB082902 | | | |COEL |COEL |
Y-0444871 |
| 1020941586 | 43|SP |500000344 |43|18.05.2011|15:41:43|18.05.2011|
18.05.2011|B-0447039-ENCR | 9,02 |BRL |8000800000 |
Juros, Comissões e T | | |
| | | |
|CLB082902 | | | |COEL |COEL |
B-0447039 |
Desired output:
| 1020941333 | 569|SP |500000343 | 9|18.05.2011|15:27:00|18.05.2011|18.05.2011|Y-0444871-ENCR | 1,93 |BRL |8000800000 |Juros, Comissões e T | | | | | | | |CLB082902 | | | |COEL |COEL |Y-0444871 |
| 1020941586 | 43|SP |500000344 |43|18.05.2011|15:41:43|18.05.2011|18.05.2011|B-0447039-ENCR | 9,02 |BRL |8000800000 |Juros, Comissões e T | | | | | | | |CLB082902 | | | |COEL |COEL |B-0447039 |
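I am having a hard time doing this in Notepad++. I cannot do it by hand because the file has more than 4.9 million lines. If anyone can show me how to solve this with Notepad++, or with other software better suited to the job, I would be very grateful.
You can use a regular expression to find a pipe followed by a line break, and use a negative lookahead (?! to assert that what comes directly after the line break is not the pattern that starts a new record. Then replace with the first capture group to keep the pipe.
Find what: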
(\|)\R(?!\|[ \t]+\d+[ \t]+\|)
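Replace with:
$1
Explanation:
(\|)                  # capture a pipe in group 1
\R                    # match any Unicode line break sequence
(?!                   # negative lookahead: assert that what follows is not
\|[ \t]+\d+[ \t]+\|   # a pipe, 1+ spaces or tabs, 1+ digits, 1+ spaces or tabs, and a pipe
)                     # close the negative lookahead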
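If you would rather run the same replacement outside the editor, here is a minimal Python sketch of the pattern above. It assumes the whole text fits in memory; the function name join_broken_records and the shortened sample rows are made up for illustration. Python's re module has no \R, so the line-break alternatives are spelled out by hand.

```python
import re

# Equivalent of (\|)\R(?!\|[ \t]+\d+[ \t]+\|) for Python's re module.
# \r(?!\n) keeps the engine from matching only the \r of a \r\n pair.
JOIN_BREAK = re.compile(
    r"(\|)(?:\r\n|\r(?!\n)|\n)(?!\|[ \t]+\d+[ \t]+\|)"
)


def join_broken_records(text: str) -> str:
    """Drop a line break after a pipe unless the next line starts a record."""
    return JOIN_BREAK.sub(r"\1", text)


# Shortened, made-up rows in the same shape as the question's data.
sample = (
    "| 1020941333 | 569|SP |\r\n"
    "18.05.2011|Y-0444871-ENCR |\r\n"
    "| | |\r\n"
    "Y-0444871 |\r\n"
    "| 1020941586 | 43|SP |\r\n"
    "18.05.2011|B-0447039 |\r\n"
)
fixed = join_broken_records(sample)
print(fixed)
```

Note that, like the editor replacement, this also removes the final line break when the text ends with a pipe followed by a line break.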
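This replaces any kind of line break that is not followed by a pipe with nothing:
- Ctrl+H
- Find what: \R(?!\|)
- Replace with: LEAVE EMPTY
- check Wrap around
- check Regular expression
- Replace all
Explanation: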
\R # any kind of linebreak (ie. \r, \n, \r\n)
(?! # negative lookahead, zero length assertion that makes sure we do not have after:
\| # a pipe character
) # end lookahead
Result for the given example:
| 1020941333 | 569|SP |500000343 | 9|18.05.2011|15:27:00|18.05.2011|18.05.2011|Y-0444871-ENCR | 1,93 |BRL |8000800000 |Juros, Comissões e T | | | | | | | |CLB082902 | | | |COEL |COEL |Y-0444871 |
| 1020941586 | 43|SP |500000344 |43|18.05.2011|15:41:43|18.05.2011|18.05.2011|B-0447039-ENCR | 9,02 |BRL |8000800000 |Juros, Comissões e T | | | | | | | |CLB082902 | | | |COEL |COEL |B-0447039 |
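With 4.9 million lines, loading the file into an editor may be slow, so the same rule (a line that does not start with a pipe belongs to the previous line) can also be applied as a stream. The following is only a sketch: merge_records, fix_file, the default predicate, the utf-8 encoding and the CR+LF output are all assumptions, not something taken from the question.

```python
import io
from typing import Callable, Iterable, Iterator


def merge_records(
    lines: Iterable[str],
    starts_record: Callable[[str], bool] = lambda line: line.startswith("|"),
) -> Iterator[str]:
    """Yield one string per record, joining continuation lines."""
    current = None
    for raw in lines:
        line = raw.rstrip("\r\n")  # drop the misplaced line break
        if current is None or starts_record(line):
            if current is not None:
                yield current
            current = line
        else:
            current += line  # continuation: glue onto the open record
    if current is not None:
        yield current


def fix_file(src: str, dst: str, encoding: str = "utf-8") -> None:
    """Stream src to dst one record at a time (encoding is an assumption)."""
    with open(src, encoding=encoding, newline="") as fin, \
            open(dst, "w", encoding=encoding, newline="") as fout:
        for record in merge_records(fin):
            fout.write(record + "\r\n")


# Tiny made-up sample to show the behaviour.
sample = io.StringIO("| 1 | a|\nb |\nc |\n| 2 | d|\ne |\n")
result = list(merge_records(sample))
print(result)
```

If some continuation lines themselves begin with a pipe, a stricter predicate can be passed as starts_record, for example one built from the \|[ \t]+\d+[ \t]+\| pattern.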