Troubleshoot invalid rows in CSV
I'm working with a very large CSV file (almost 6 GB), and it is full of errors. For example, suppose I have the following CSV file/table:
+------------+-------------+------------+
| ID | Date | String |
+------------+-------------+------------+
| 123456 | 09-20-2019 | ABCDEFG |
| 123abc456 | 10-30-2019 | HIJKLMN |
| 7891011 | jdqhouehwf | OPQRSTU |
| 1010101 | 03-15-2018 | 8473737 |
| 4823.00 | 02-11-2015 | VWXYZ |
| 2348813.0 | 01-23-2016 | BAZ |
+------------+-------------+------------+
Or:
"ID","Date","String"
123456,"09-20-2019","ABCDEFG"
123abc456,"10-30-2019","HIJKLMN"
7891011,"jdqhouehwf","OPQRSTU"
1010101,"03-15-2018",8473737
4823.00,"02-11-2015","VWXYZ"
"2348813.0","01-23-2016","BAZ"
I'd like a good way to troubleshoot and repair the file. With pandas, I can read the file:
import pandas as pd
df = pd.read_csv(inputfile)
Pandas always complains:
sys:1: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False
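So I want to clean each column. But since the file is so large, I can't just print the whole table with a mask applied and expect to read through it. I'd like a simple way to take a column and check whether it conforms to a type. Also, if possible, I'd like a way to delete bad rows and/or convert rows to the correct format. All in all, I'd like the file to end up looking like this (without the inline comments):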
"ID","Date","String"
123456,"09-20-2019","ABCDEFG"
# 123abc456,"10-30-2019","HIJKLMN" was deleted because the ID wasn't a number
# 7891011,"jdqhouehwf","OPQRSTU" was deleted because the data was not a date
1010101,"03-15-2018","8473737" # The last number could be converted to string
4823,"02-11-2015","VWXYZ" # The first number could be converted to integer
2348813,"01-23-2016","BAZ" # The ID number could be converted to int
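import csv
import datetime as dt
import sys
from pathlib import Path


def main():
    # newline="" lets the csv module handle line endings itself
    with Path("thing.csv").open("r", newline="") as file:
        for row in csv.DictReader(file):
            try:
                # "4823.00" -> 4823; a non-numeric ID raises ValueError
                row["ID"] = int(float(row["ID"]))
                # A malformed or impossible date raises ValueError
                row["Date"] = dt.datetime.strptime(row["Date"], "%m-%d-%Y")
            except (KeyError, ValueError):
                # Skip rows that cannot be converted
                continue
            print(*row.values())
    return 0


if __name__ == "__main__":
    sys.exit(main())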
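To get a repaired file rather than printed rows, the same loop can write through `csv.writer`. The following is only a sketch under a few assumptions: the function name `clean_rows` is made up for illustration, the three column names are taken from the sample above, and the date is parsed only to validate it, so the original `MM-DD-YYYY` text is written back unchanged. `csv.QUOTE_NONNUMERIC` is what leaves the integer ID bare while quoting the string fields.

```python
import csv
import datetime as dt


def clean_rows(src, dst):
    """Copy valid rows from src to dst, writing the ID column as an integer.

    src and dst are open text files (or any file-like objects).
    Returns a (kept, dropped) tuple of row counts.
    """
    reader = csv.DictReader(src)
    # QUOTE_NONNUMERIC quotes strings and leaves ints unquoted.
    writer = csv.writer(dst, quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n")
    writer.writerow(reader.fieldnames)
    kept = dropped = 0
    for row in reader:
        try:
            # Going through float accepts "4823.00"; note that IDs above 2**53
            # could lose precision this way.
            row_id = int(float(row["ID"]))
            # Parse only to validate; the original text is written back.
            dt.datetime.strptime(row["Date"], "%m-%d-%Y")
        except (KeyError, TypeError, ValueError, OverflowError):
            # TypeError covers short rows (missing fields are None);
            # OverflowError covers IDs such as "inf".
            dropped += 1
            continue
        writer.writerow([row_id, row["Date"], row["String"]])
        kept += 1
    return kept, dropped
```

Because the rows are streamed one at a time, memory use should stay flat even for a 6 GB input. When calling it on real files, open both with `newline=""`, as the `csv` module documentation recommends.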
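Since you tagged the question with sed, here is a command that should do the job in a very efficient and portable way, though it is somewhat unreadable: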
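sed -n '1p;s/^"\{0,1\}\([0-9]\+\)\(\.[0-9]*\)\{0,1\}"\{0,1\}\(,"\(0[0-9]\|1[0-2]\)-\([0-2][0-9]\|3[01]\)-2[0-9]\{3\}",\)"\{0,1\}\([^"]*\)"\{0,1\}$/\1\3"\6"/p' file
What it does:
- prints the header, i.e. the first line (`1p`),
- attempts the substitution (`s`) command on every line and prints the result only if the substitution succeeded, that is, only if the line matches the search pattern (`s/…/…/p`).
Regarding the replacement pattern `\1\3"\6"`, each escaped number refers to the corresponding capturing group (`\(…\)`; remember that groups are numbered in the order in which their opening `\(` appears). Specifically:
- `\1` refers to the leading number (`[0-9]\+`), with or without (`\{0,1\}`) the following three items, which are matched but not kept:
  - a leading `"`,
  - a trailing decimal part `\.[0-9]*`,
  - and a following `"`;
- `\3` refers to the date, including the surrounding `"` (`"\(0[0-9]\|1[0-2]\)-\([0-2][0-9]\|3[01]\)-2[0-9]\{3\}"`; note that I am not being precise in this regex, as it also matches non-existent dates such as February 31st);
- `"\6"` refers to (and puts between `"`) the final alphanumeric string, about which I make almost no assumptions (`[^"]*`).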
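For readers who find the capturing-group numbering hard to follow in sed's escaped syntax, here is a rough Python translation of the same idea. It is only an illustration, not part of the sed answer: the names `ROW` and `fix_line` are made up, and the pattern is a transcription of the command above into Python's `re` syntax, with the groups numbered the same way.

```python
import re

# Groups are numbered by the position of their opening parenthesis,
# exactly as in the sed command.
ROW = re.compile(
    r'^"?([0-9]+)'        # group 1: integer part of the ID
    r'(\.[0-9]*)?"?'      # group 2: optional decimal part (matched, not kept)
    r'(,"(0[0-9]|1[0-2])-([0-2][0-9]|3[01])-2[0-9]{3}",)'  # group 3: date; 4 and 5 nest inside
    r'"?([^"]*)"?$'       # group 6: final string, with or without quotes
)


def fix_line(line):
    """Return the normalised line, or None when the line does not match."""
    match = ROW.match(line)
    if match is None:
        return None
    # Same replacement as the sed command: ID, date, then the quoted string.
    return match.expand(r'\1\3"\6"')
```

Calling `fix_line('4823.00,"02-11-2015","VWXYZ"')` should give `4823,"02-11-2015","VWXYZ"`, while a line with a non-numeric ID or a malformed date yields `None`. Strip the trailing newline from each line before passing it in.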
This should match dates better (except that it always matches February 29th, regardless of the year):
sed -n '1p;s/^"\{0,1\}\([0-9]\+\)\(\.[0-9]*\)\{0,1\}"\{0,1\}\(,"\(\(0[0-9]\|1[0-2]\)-[0-2][0-9]\|\(0[13-9]\|1[0-2]\)-30\|\(0[13578]\|1[02]\)-31\)-2[0-9]\{3\}",\)"\{0,1\}\([^"]*\)"\{0,1\}$/\1\3"\8"/p' file
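A regular expression cannot easily know which years are leap years, which is why February 29th is always accepted above. If that matters, one option is to let a date parser decide. This is a minimal sketch, assuming the `MM-DD-YYYY` layout from the sample data; the helper name `is_real_date` is made up for illustration.

```python
import datetime as dt
import re

# strptime also accepts unpadded values such as "2-9-2016", so the shape
# is checked first to keep the two-digit month and day requirement.
SHAPE = re.compile(r"[0-9]{2}-[0-9]{2}-[0-9]{4}")


def is_real_date(text):
    """Return True only when text is MM-DD-YYYY and exists on the calendar."""
    if SHAPE.fullmatch(text) is None:
        return False
    try:
        dt.datetime.strptime(text, "%m-%d-%Y")
    except ValueError:
        # Raised for impossible dates, e.g. February 29th in a non-leap year.
        return False
    return True
```

With this check, `02-29-2016` is accepted and `02-29-2019` is rejected. Unlike the sed pattern, it does not restrict the year to 2000 through 2999.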