如何从txt文件中删除奇怪的编码

How to remove weird encoding from txt file

我正在尝试处理这样的文本文件:

http://www.sec.gov/Archives/edgar/data/789019/000119312514289961/0001193125-14-289961.txt

如果您在文件中间看到类似以下内容:

</TEXT>
</DOCUMENT>
<DOCUMENT>
<TYPE>EXCEL
<SEQUENCE>21
<FILENAME>Financial_Report.xlsx
<DESCRIPTION>IDEA: XBRL DOCUMENT
<TEXT>
begin 644 Financial_Report.xlsx
M4$L#!!0`!@`(````(0!):[_C#0,``+!)```3``@"6T-O;G1E;G1?5'EP97-=
M+GAM;""B!`(HH``"````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M``````````````````````````````````````#,W,M.VT`4QO%]I;Z#Y6V5
M>([OK@@L>EFV2*4/,+4GQ,(W>08*;]^)N0BA%(2*U/^&B,2>\6+G[+YSM')
M==\%5V:V[3AL0EFK,#!#/3;M<+X)?YY]795A8)T>&MV-@]F$-\:&)\?OWQV=
MW4S&!O[NP6["G7/3QRBR]<[TVJ['R0S^D^TX]]KY?^?S:-+UA3XW4:Q4'M7C
MX,S@5FY_1GA\]-EL]67G@B_7_NW;)+/I;!A\NKUP/VL3ZFGJVEH[GS2Z&IHG
M4U9W$];^SN4:NVLG^\''"*.#$_:?_'W`W7W?_:.9V$IWIVWW3O8T377?1[
MG"]^C>/%^OE##J0<M]NV-LU87_;^":SM-!O=V)TQKN_6R^NZU^UPG_N9^<O%
M-EI>Y(V#[+_?<O`K<'`DD1PK)D4%RY)`<!21'"<E107*(H@2AB"H44H5B
MJE!0%8JJ0F%5**X*!5:AR!I39(TILL8466.*K#%%UI@B:TR1-:;(&E-DC2FR
M)A19$XJL"476A")K0I$UH<B:4&1-*+(F%%D3BJPI1=:4(FM*D36ER)I29$TI
MLJ8465.*K"E%UI0B:T:1-:/(FE%DS2BR9A19,XJL&476C")K1I$UH\B:4V3-
M*;+F%%ESBJPY1=:<(FM.D36GR)I39,TILA8460N*K`5%UH(B:T&1M:#(6E!D
M+2BR%A19"XJL)476DB)K29&UI,A:4F0M*;*6%%E+BJPE1=:2(FM%D;6BR%I1
M9*THLE8462N*K!5%UHHB:T61M:+(*HI"JRB*K:(HN(JBZ"J*PJLHBJ^B*,"*
MH@@KBD*L*(RQH#H6QEA.(8O3R.)4LCB=+$XIB]/*XM2R,+TLP12S!-/,$DPU
M2S#=+,&4LP33SA),/4LP_2S!%+0$T]"2_U;1<GX?CHF6O__^`W8YYH6%+-;=
M=,:^*%VT-?FKS3LVE^N-EO#GKS`(_/?BZ'WZMS.H^3]1N&9O/ZIW"_0FA_
M]VKR!YG9M>9AB="A93P/$_UVHM</?+(-R.SW'S6F.3`[6O8M'?\!``#__P,`
M4$L#!!0`!@`(````(0"U53`C]0```$P"```+``@"7W)E;',O+G)E;',@H@0"

这似乎是一个 excel 文件?还是 XBRL 文档?那是什么 ?我如何摆脱它(或以某种方式 "process" 它??)这持续了数千行所以我猜它是某些附加文件的某些 link 的某种编码?知道如何处理吗?

我正在尝试在 Python 中使用 BeautifulSoup:

from bs4 import BeautifulSoup

with open("textWithHtml.txt") as markup:
    soup = BeautifulSoup(markup.read())

with open("processedText.txt", "w") as f: 
    f.write(soup.get_text().encode('utf-8'))

但并非所有内容都被删除,我还注意到在某些情况下甚至没有删除所有 html 标签。有时 运行 代码两次删除的内容比第一次删除的内容更多BeautifulSoup 代码是 运行..

的时间

您正在查看的编码是 uuencode. In Python, you would use the uu 解码此 blob 的模块,或者只是 stringdata.decode('uu').

uuencode 是一种遗留格式,最初用于在电子邮件中嵌入二进制文件(当时只允许 7 位 US-ASCII;该格式在与使用他们自己令人困惑的字符编码的那一天)。这些天,您会希望看到 base64 扮演这个角色。

它显示了如何在从文件句柄读取或迭代一堆文本行时删除 uuencode blob。

使用此处提供的 sed 命令可以有效地解决问题: