Python: decode partial hex strings

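I'm using Beautiful Soup to parse email invoices, but I keep running into the same problem with special characters.

The text I'm trying to parse is shown in the image.

But what I get from Beautiful Soup after finding the element and calling elem.text is this: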

'Hi Mike, It=E2=80=\r\n=99s probably not a big drama if you are having problems separating product=\r\ns from classes. It is not uncommon to receive an order for pole classes and=\r\n a bottle of Dry Hands.\r\nAlso, remember that we will have just straight up product orders that your =\r\nsystem will not be able to place into a class list, hence having the extra =\r\nsheet for any =E2=80=9Cerroneous=E2=80=9D orders will be handy.'

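As you can see, the apostrophe is now represented by "=E2=80=99", the double quotes by "=E2=80=9C" and "=E2=80=9D", and there are seemingly random line breaks in the text, e.g. "product=\r\ns". There are no such line breaks in the image.

Apparently "E2 80 99" is the hex form of the UTF-8 bytes for ’ (U+2019), but I don't understand why I still see it in this form after calling email.decode('utf-8') and passing the result to Beautiful Soup.

Here is the element: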

<td border:="" class='3D"td"' left="" middle="" padding:="" solid="" style='3D"color:' text-align:="" v="ertical-align:">Hi Mike, It=E2=80=
=99s probably not a big drama if you are having problems separating product=
s from classes. It is not uncommon to receive an order for pole classes and=
 a bottle of Dry Hands.
Also, remember that we will have just straight up product orders that your =
system will not be able to place into a class list, hence having the extra =
sheet for any =E2=80=9Cerroneous=E2=80=9D orders will be handy.</td>

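I can post my code if needed, but I think I must be making a simple mistake.

I looked at the answers to the question "Decode Hex String in Python 3", but I think they assume the entire string is hex, rather than just scattered hex parts. Honestly, I'm not even sure how to search for "decode partial hex string".

So my questions are:

Q1 How do I convert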

'Hi Mike, It=E2=80=\r\n=99s probably not a big drama if you are having problems separating product=\r\ns from classes. It is not uncommon to receive an order for pole classes and=\r\n a bottle of Dry Hands.\r\nAlso, remember that we will have just straight up product orders that your =\r\nsystem will not be able to place into a class list, hence having the extra =\r\nsheet for any =E2=80=9Cerroneous=E2=80=9D orders will be handy.'

into

'Hi Mike, It's probably not a big drama if you are having problems separating products from classes. It is not uncommon to receive an order for pole classes and a bottle of Dry Hands.Also, remember that we will have just straight up product orders that your system will not be able to place into a class list, hence having the extra sheet for any "erroneous" orders will be handy.'

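using Python 3, without manually fixing each string or writing a replace method for every possible character?

Q2 Why does this "=\r\n" appear in my string but not in the rendered HTML?

@JosefZ's comment led me to the answer.

Q1 is answered by the quopri module: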

>>> import quopri
>>> print(quopri.decodestring(mystring).decode('utf-8'))
Hi Mike, It’s probably not a big drama if you are having problems separating products from classes. It is not uncommon to receive an order for pole classes and a bottle of Dry Hands.
Also, remember that we will have just straight up product orders that your system will not be able to place into a class list, hence having the extra sheet for any “erroneous” orders will be handy.
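
The session above assumes mystring is already defined. As a self-contained sketch of the same call, using a shortened sample of the text from the question (the variable names raw and text are mine, chosen for illustration):

```python
import quopri

# Shortened sample of the quoted-printable text from the question.
raw = (
    "Hi Mike, It=E2=80=\r\n"
    "=99s probably not a big drama if you are having problems separating product=\r\n"
    "s from classes. Keep a sheet for any =E2=80=9Cerroneous=E2=80=9D orders."
)

# The encoded text is pure ASCII, so turning it into bytes is lossless.
decoded_bytes = quopri.decodestring(raw.encode("ascii"))

# Each =XX escape stood for one raw byte; together they form UTF-8 sequences.
text = decoded_bytes.decode("utf-8")
print(text)
```

The decode happens in two steps: quopri turns the =XX escapes back into bytes and drops the "=\r\n" soft breaks, and only then does .decode('utf-8') turn those bytes into characters such as ’ and “.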

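Q2 Thanks to @snakecharmerb, I now know that the seemingly random line endings that never show up in the rendered text are soft line breaks: the encoding inserts "=" followed by a line break to keep each encoded line within a maximum length (76 characters for quoted-printable).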
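
To see where the soft breaks come from, here is a small sketch that goes the other way and encodes a long line with quopri. The sample sentence is made up for illustration, and the exact break positions depend on the encoder, so the code only prints the line lengths rather than assuming them:

```python
import quopri

# One long line with no line breaks of its own (sample text, for illustration).
sentence = "It is not uncommon to receive an order for pole classes and a bottle of Dry Hands."
long_line = " ".join([sentence] * 3)

encoded = quopri.encodestring(long_line.encode("utf-8"))

# The encoder splits the long line; every piece except the last ends with "=".
encoded_lines = encoded.split(b"\n")
for piece in encoded_lines:
    print(len(piece), piece)

# Decoding removes the soft breaks again and restores the original line.
round_trip = quopri.decodestring(encoded).decode("utf-8")
print(round_trip == long_line)
```

Because the trailing "=" marks the break as soft, a decoder joins the pieces back together, which is why the rendered HTML shows no line break there.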

@snakecharmerb has written a better answer than this one for anyone who has the same problem I did.
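
I am not reproducing that answer here. As a separate sketch of one standard-library route, the email package can undo the transfer encoding before the HTML ever reaches Beautiful Soup, assuming you have the raw message bytes and the part declares Content-Transfer-Encoding: quoted-printable. The message below is a made-up minimal example, not my real invoice email:

```python
from email import message_from_bytes, policy

# Hypothetical minimal message; a real invoice email would have more headers
# and is often multipart.
raw_email = (
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: text/html; charset="utf-8"\r\n'
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b'<td class=3D"td">Hi Mike, It=E2=80=\r\n'
    b"=99s probably not a big drama if you are having problems separating product=\r\n"
    b"s from classes.</td>\r\n"
)

# policy.default returns the modern EmailMessage API (get_body, get_content).
msg = message_from_bytes(raw_email, policy=policy.default)

# Pick the HTML part if there is one, otherwise fall back to plain text.
part = msg.get_body(preferencelist=("html", "plain"))

# get_content() undoes quoted-printable and decodes using the declared charset.
html = part.get_content()
print(html)

# html is now an ordinary str that can be handed to Beautiful Soup.
```

This would also explain the odd attributes in the element above, such as class='3D"td"': "=3D" is the quoted-printable escape for "=", so the parser was reading markup that was still encoded.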