How to parse certain text data?

Question

I have a text file in the following format:

B2100 Door Driver Key Cylinder Switch Failure B2101 Head Rest Switch Circuit Failure B2102 Antenna Circuit Short to Ground
(plus about 1000 more lines)

This is what I want it to look like:

B2100*Door Driver Key Cylinder Switch Failure B2101*Head Rest Switch Circuit Failure B2102*Antenna Circuit Short to Ground B2103*Antenna Not Connected B2104*Door Passenger Key Cylinder Switch Failure

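That way I can paste the data into LibreOffice Calc, and it will be split into two columns: the code and its meaning.

My thought process:
Apply a regular expression to Bxxxx, put an asterisk after it (as the delimiter) and a \n before the meaning (I don't know whether that would work?), and remove the white-space up to the next character.

I am trying to isolate B2100, but so far I have failed. My naive attempt: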

import re

text = """B2100 Door Driver Key Cylinder Switch Failure B2101   Head Rest Switch Circuit Failure B2102  Antenna Circuit Short to Ground B2103   Antenna Not Connected B2104 Door Passenger Key Cylinder Switch Failure B2105    Throttle Position Input Out of Range Low B2106  Throttle Position Input Out of Range High B2107 Front Wiper Motor Relay Circuit Short to Vbatt B2108    Trunk Key Cylinder Switch Failure"""
# text_arr = text.split("\^B[0-9][0-9][0-9][0-9]$\gi");
l = re.compile('\^B[0-9][0-9][0-9][0-9]$\gi').split(text)
print(l)

This outputs:

['B2100\tDoor Driver Key Cylinder Switch Failure B2101\tHead Rest Switch Circuit Failure B2102\tAntenna Circuit Short to Ground B2103\tAntenna Not Connected B2104\tDoor Passenger Key Cylinder Switch Failure B2105\tThrottle Position Input Out of Range Low B2106\tThrottle Position Input Out of Range High B2107\tFront Wiper Motor Relay Circuit Short to Vbatt B2108\tTrunk Key Cylinder Switch Failure']

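How can I get the desired result?

To break it down further, what I want to do is:
Split everything into arrays of codes (B1001) and meanings (the text that follows), then apply each operation (the \n thing) to them separately. If you have a better idea of how to do the whole thing, even better. I'd love to hear it.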

Answer 1

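First of all, your regex '\^B[0-9][0-9][0-9][0-9]$\gi' is wrong:

Modifiers such as /gi are not written as a suffix of the pattern in Python
^ and $ denote the start and end of a line, which do not match anything in your text
The repeated [0-9] can be replaced with '[0-9]{4}'
If you want to ignore case, pass the re.IGNORECASE flag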
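The points above can be checked with a short sketch. The `sample` string and the lowercase `b2101` in it are made up for illustration; they are not part of the asker's data.

```python
import re

# Hypothetical sample; the second code is lowercase on purpose.
sample = "B2100 Door Driver Key Cylinder Switch Failure b2101   Head Rest Switch Circuit Failure"

# Anchored with ^ and $, the pattern has to match the entire string,
# so no code inside the text is found.
anchored = re.compile(r'^B[0-9]{4}$')
print(anchored.findall(sample))   # []

# Without anchors, and with the flag passed as an argument rather than a /gi suffix.
codes = re.compile(r'B[0-9]{4}', re.IGNORECASE)
print(codes.findall(sample))      # ['B2100', 'b2101']
```

Python has no counterpart to the `g` modifier: `findall`, `split` and `sub` already process every match by default.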

With that in mind, simple code that does what you want looks like this:

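# The capturing group keeps each code in the result; l[0] is the (empty) text before the first code.
l = [x.strip() for x in re.compile(r'\s*(B\d{4})\s*', re.IGNORECASE).split(text)]
# Start at index 1 so that each code is paired with the description that follows it.
lines = ['*'.join(l[i:i+2]) for i in range(1, len(l), 2)]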

Answer 2

import re
text = """B2100 Door Driver Key Cylinder Switch Failure B2101   Head Rest Switch Circuit Failure B2102  Antenna Circuit Short to Ground B2103   Antenna Not Connected B2104 Door Passenger Key Cylinder Switch Failure B2105    Throttle Position Input Out of Range Low B2106  Throttle Position Input Out of Range High B2107 Front Wiper Motor Relay Circuit Short to Vbatt B2108    Trunk Key Cylinder Switch Failure"""

# Split on the codes, keep them via the capturing group, and drop empty strings.
l = [i for i in re.split(r'(B[0-9]{4}\s+)', text) if i]
print('\n'.join(['{}*{}'.format(id_.strip(), label.strip()) for id_, label in zip(l[0::2], l[1::2])]))

If you wrap the regex in (), .split keeps the delimiters in the result. The code above produces this output:

B2100*Door Driver Key Cylinder Switch Failure
B2101*Head Rest Switch Circuit Failure
B2102*Antenna Circuit Short to Ground
B2103*Antenna Not Connected
B2104*Door Passenger Key Cylinder Switch Failure
B2105*Throttle Position Input Out of Range Low
B2106*Throttle Position Input Out of Range High
B2107*Front Wiper Motor Relay Circuit Short to Vbatt
B2108*Trunk Key Cylinder Switch Failure
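To illustrate the effect of the parentheses, here is a small comparison of `re.split` with and without a capturing group. The shortened `sample` string is invented for the example.

```python
import re

# Hypothetical, shortened input.
sample = "B2100 Door Switch Failure B2101   Head Rest Failure"

# Without a capturing group the codes are consumed by the split.
without_group = re.split(r'B[0-9]{4}\s+', sample)
print(without_group)
# ['', 'Door Switch Failure ', 'Head Rest Failure']

# With a capturing group the matched codes are kept as separate items.
with_group = re.split(r'(B[0-9]{4}\s+)', sample)
print(with_group)
# ['', 'B2100 ', 'Door Switch Failure ', 'B2101   ', 'Head Rest Failure']
```

The leading empty string appears because the input starts with a match, which is why the answer filters with `if i` before pairing the items.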

Answer 3

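Basically, you want to:

Find every Bxxxx string in the input.
Replace any whitespace before it with a newline.
Replace any whitespace after it with *.

All of this can be done with a single re.sub():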

re.sub(r'\s*(B\d{4})\s*', r'\n\1*', text).strip()

Matching pattern:

\s*              # Any amount of whitespace
   (B\d{4})      # "B" followed by exactly 4 digits
           \s*   # Any amount of whitespace

Replacement pattern:

\n               # Newline
  \1             # The first parenthesized sequence from the matching pattern (B####)
    *            # Literal "*"

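The purpose of strip() is to remove any leading or trailing whitespace, including the newline that the substitution produces in front of the first B#### sequence.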
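As a rough sketch, the same substitution can be wrapped in a small helper so that the separator is configurable, for example a tab instead of `*`. The names `CODE`, `one_code_per_line` and `sample` are illustrative and not part of the answer above.

```python
import re

# Same matching pattern as above, compiled once.
CODE = re.compile(r'\s*(B\d{4})\s*')

def one_code_per_line(text, sep='*'):
    """Put each Bxxxx code on its own line, followed by sep and its description."""
    # A function replacement is used so that sep is inserted literally,
    # without being interpreted as a replacement template.
    return CODE.sub(lambda m: '\n' + m.group(1) + sep, text).strip()

# Hypothetical, shortened input.
sample = "B2100 Door Driver Key Cylinder Switch Failure B2101   Head Rest Switch Circuit Failure"
print(one_code_per_line(sample))
print(one_code_per_line(sample, sep='\t'))
```

With `sep='\t'` the result is tab-separated text, which spreadsheet programs commonly split into columns on paste.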

Answer 4

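import re
import pandas as pd

pat = r"(B\d+)"

# kkk holds the input text; drop the empty strings that re.split() produces
zzz = [i for i in re.split(pat, kkk) if i != '']

# Codes sit at the even positions; everything that is not a code is a description
pd.DataFrame({'Col1': zzz[::2], 'Col2': [i.strip() for i in zzz if re.match(pat, i) is None]})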

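    Col1   Col2
0   B2100  Door Driver Key Cylinder Switch Failure
1   B2101  Head Rest Switch Circuit Failure
2   B2102  Antenna Circuit Short to Ground
3   B2100  Door Driver Key Cylinder Switch Failure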
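If pandas is not available, a similar two-column result can be sketched with the standard library's `csv` module. This is an alternative to the approach above, not a description of it; the function name `codes_to_csv`, the `sample` string, and the reuse of the `Col1`/`Col2` headers are assumptions made for the example.

```python
import csv
import io
import re

def codes_to_csv(text):
    """Return CSV text with one (code, description) row per Bxxxx entry."""
    # parts[0] is whatever precedes the first code; after that, codes and
    # descriptions alternate.
    parts = re.split(r'\s*(B\d{4})\s*', text)
    rows = zip(parts[1::2], parts[2::2])
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    writer.writerow(['Col1', 'Col2'])
    writer.writerows(rows)
    return buffer.getvalue()

# Hypothetical, shortened input.
sample = "B2100 Door Driver Key Cylinder Switch Failure B2101   Head Rest Switch Circuit Failure"
print(codes_to_csv(sample))
```

Writing the returned text to a `.csv` file gives something a spreadsheet program can open directly, and `csv.writer` takes care of quoting if a description ever contains a comma.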


python

text-processing