How to perform data cleaning for a text file?
I have a text file with many lines containing words and numbers. Here is an example:
2021-12-06 05:07:09.266 INFO: Additional ID 1638301749791
2021-12-06 05:07:09.266 INFO: Found
2021-12-06 05:07:09.267 INFO: ObjectStatus-ok factor 1163 factor five and six computed as it was before best weight ID 1638301749796
2021-12-06 05:07:09.267 INFO: disabled; computing power weight factor factor 19025.
2021-12-06 05:07:10.041 INFO: Wrote big factor 0.3568357342, Classificationfactortype-fail
2021-12-06 05:07:10.042 DEBUG: Duiu.0.0.2588650814
2021-12-06 05:07:10.743 INFO: Wrote .3254806495
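My question is: how can I keep only the lines that contain either of the specific words "Classificationfactortype-fail" or "ObjectStatus-ok" and delete all other lines? I want to save the result as a new text file in the directory.
Here is the code I wrote: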
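ans = []
# Collect only the lines that contain one of the two keywords
with open('test.txt') as rf:
    for line in rf:
        line = line.strip()
        if "Classificationfactortype-fail" in line or "ObjectStatus-ok" in line:
            ans.append(line)

# Write the kept lines out, one per line (strip() removed the newlines above)
with open('extracted_data.txt', 'w') as wf:
    for line in ans:
        wf.write(line + '\n')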
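If every line starts with a timestamp, str.startswith() will not work, because the keyword never appears at the very beginning of the line. A substring test with "in" matches the keyword anywhere in the line.
You can simply do: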
if "Classificationfactortype-fail" in line or "ObjectStatus-ok" in line:
    ans.append(line)
in your first loop.
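For reference, here is a minimal sketch of the same idea that streams the file instead of building a list first. The function name filter_log and the KEYWORDS tuple are hypothetical names chosen for illustration, and the UTF-8 encoding is an assumption about the log file. The first lines also show why startswith() fails on the sample data while "in" succeeds.

```python
# Keywords taken from the question; the tuple name is illustrative only.
KEYWORDS = ("Classificationfactortype-fail", "ObjectStatus-ok")

# Why startswith() fails here: the line begins with the timestamp,
# so the keyword is found only by a substring test.
sample = "2021-12-06 05:07:09.267 INFO: ObjectStatus-ok factor 1163"
starts = sample.startswith("ObjectStatus-ok")  # False
contains = "ObjectStatus-ok" in sample         # True


def filter_log(src, dst, keywords=KEYWORDS):
    """Copy every line of src that contains any keyword to dst.

    Returns the number of lines kept. Assumes both files are UTF-8 text.
    """
    kept = 0
    with open(src, encoding="utf-8") as rf, open(dst, "w", encoding="utf-8") as wf:
        for line in rf:
            if any(keyword in line for keyword in keywords):
                # Keep the original line; add a newline only if the
                # last line of the file was missing one.
                wf.write(line if line.endswith("\n") else line + "\n")
                kept += 1
    return kept

# Example call, using the file names from the question:
# filter_log("test.txt", "extracted_data.txt")
```

Because the lines are written as they are read, this version does not hold the whole file in memory, which may matter for large logs.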