CSV file with random double quotes
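I have a CSV file that contains double-quote characters in some fields. When I parse it with Python, the parser starts ignoring the delimiters that fall between those quotes. For example: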
5695|258|03/21/2012| 15:16:02.000|info|Microsoft-Windows-Defrag|shrink estimation, (C:)|36|"6ybSr: c{q6: |Application|WKS-WIN732test.test.local|http://schemas.microsoft.com/win/2004/08/events/event|0x0080000000000000|0|0||0|0|C:\Users\test\EventLog\win7-32-test-c-drive\Application.evtx
5770|258|03/24/2012| 04:21:02.000|info|Microsoft-Windows-Defrag|boot optimization, (C:)|36|00 00 00 00 d3 03 00 00 ae 03 00 00 00 00 00 00 22 b6 30 df 64 79 c7 f6 e2 6c 1c 00 00 00 00 00 00 00 00 00|Application|WKS-WIN732test.test.local|http://schemas.microsoft.com/win/2004/08/events/event|0x0080000000000000|0|0||0|0|C:\Users\test\EventLog\win7-32-test-c-drive\Application.evtx
5843|258|03/27/2012| 07:38:36.000|info|Microsoft-Windows-Defrag|boot optimization, (C:)|36|jbg54t5t"gfb:*&hgfh|Application|WKS-WIN732test.test.local|http://schemas.microsoft.com/win/2004/08/events/event|0x0080000000000000|0|0||0|0|C:\Users\test\EventLog\win7-32-test-c-drive\Application.evtx
As a result, it reads everything between the two double quotes as a single field:
5695|258|03/21/2012| 15:16:02.000|info|Microsoft-Windows-Defrag|shrink estimation, (C:)|36|"6ybSr: c{q6: |Application|WKS-WIN732test.test.local|http://schemas.microsoft.com/win/2004/08/events/event|0x0080000000000000|0|0||0|0|C:\Users\test\EventLog\win7-32-test-c-drive\Application.evtx
^
5770|258|03/24/2012| 04:21:02.000|info|Microsoft-Windows-Defrag|boot optimization, (C:)|36|00 00 00 00 d3 03 00 00 ae 03 00 00 00 00 00 00 22 b6 30 df 64 79 c7 f6 e2 6c 1c 00 00 00 00 00 00 00 00 00|Application|WKS-WIN732test.test.local|http://schemas.microsoft.com/win/2004/08/events/event|0x0080000000000000|0|0||0|0|C:\Users\test\EventLog\win7-32-test-c-drive\Application.evtx
5843|258|03/27/2012| 07:38:36.000|info|Microsoft-Windows-Defrag|boot optimization, (C:)|36|jbg54t5t"gfb:*&hgfh|Application|WKS-WIN732test.test.local|http://schemas.microsoft.com/win/2004/08/events/event|0x0080000000000000|0|0||0|0|C:\Users\test\EventLog\win7-32-test-c-drive\Application.evtx
^
(See the carets (^) in the example above.)
How can I make it ignore the double quotes?
Caveat: I don't want to read the entire file into RAM and replace the characters. The solution must work while iterating over the rows in reader.
The delimiter is the pipe character. I read the file with the standard csv technique and decode each field using a known encoding:
import csv
known_encoding = 'utf-8'  # for mwe, real code fetches for each file
with open(self.current_file.file_path, 'rb') as f:
    reader = csv.reader(f, delimiter='|')
    for row in reader:
        row = [s.decode(known_encoding) for s in row]
        # do stuff with data in row
My guess is that your CSV file never contains quoted fields, so you can turn quote handling off with the quoting parameter:
csv.reader(f, delimiter='|', quoting=csv.QUOTE_NONE)
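To make the difference visible, here is a minimal sketch that parses the same text twice, once with the default dialect and once with quoting=csv.QUOTE_NONE. The sample rows are shortened, made-up versions of the lines in the question, the parse helper is a hypothetical name, and the sketch assumes Python 3, where csv.reader consumes text rather than bytes.

```python
import csv
import io

# Shortened, hypothetical sample modelled on the rows in the question:
# a stray '"' opens the third field of row 1, another sits inside row 3.
SAMPLE = (
    '5695|36|"6ybSr: c{q6: |Application|0\n'
    '5770|36|00 00 22 b6|Application|0\n'
    '5843|36|jbg54t5t"gfb:*&hgfh|Application|0\n'
)

def parse(text, **kwargs):
    """Parse pipe-delimited text and return the rows as a list of lists."""
    return list(csv.reader(io.StringIO(text), delimiter='|', **kwargs))

# Default quoting: the '"' in row 1 starts a quoted field that only ends
# at the '"' in row 3, so the three lines collapse into one record.
default_rows = parse(SAMPLE)

# QUOTE_NONE: '"' is an ordinary character, so every '|' splits a field.
unquoted_rows = parse(SAMPLE, quoting=csv.QUOTE_NONE)

print(len(default_rows), len(unquoted_rows))
```

With QUOTE_NONE each input line comes back as its own five-field row, and the quote characters stay in the field values unchanged.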
You can set quoting to csv.QUOTE_NONE like this:
import csv
with open('my_file', 'r') as f:
    csvreader = csv.reader(f, delimiter='|', quoting=csv.QUOTE_NONE)
    ....
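As a sketch of how this fits the streaming requirement from the question: csv.reader pulls lines from the file object lazily, so the file is never loaded whole. The version below assumes Python 3, where the file is opened in text mode with an encoding instead of decoding each field afterwards. The name iter_rows and the path in the usage note are hypothetical, chosen only for illustration.

```python
import csv

def iter_rows(path, encoding='utf-8'):
    """Yield one row (a list of str) at a time from a pipe-delimited file."""
    # newline='' lets the csv module handle line endings itself,
    # as the csv documentation recommends for file objects.
    with open(path, 'r', encoding=encoding, newline='') as f:
        # QUOTE_NONE makes the reader treat '"' as an ordinary character.
        for row in csv.reader(f, delimiter='|', quoting=csv.QUOTE_NONE):
            yield row
```

A caller would then loop with something like for row in iter_rows('my_file'): and handle each row as it arrives, so only the current line is held in memory.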