Clean a CSV file with Python
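I have a CSV file that I want to clean with Python.
Its rows are separated by << \n >> or by blank lines.
I want every line that does not end with << " >> to be cut and pasted onto the previous line.
Here is a concrete example to make this clearer!
The CSV file I have: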
*"id","name","age","city","remark"\
"1","kevin","27","paris","This is too bad"\
"8","angel","18","london","Incredible !!!"\
"14","maria","33","madrid","i can't believe it."\
"16","john","28","new york","hey men,\
\nhow do you did this"\
"22","naima","35","istanbul","i'm sure it's false,\
\
\nit can't be real"
"35","marco","26","roma","you'r my hero!"\
"39","lili","37","tokyo","all you need to knows.\
\n\nthe best way to upgrade easely"\
...*
The CSV file I want:
*"id","name","age","city","remark"\
"1","kevin","27","paris","This is too bad"\
"8","angel","18","london","Incredible !!!"\
"14","maria","33","madrid","i can't believe it."\
"16","john","28","new york","hey men,how do you did this"\
"22","naima","35","istanbul","i'm sure it's false, it can't be real"\
"35","marco","26","roma","you'r my hero!"\
"39","lili","37","tokyo","all you need to knows. the best way to upgrade easely"\
...*
How would someone go about this?
Thanks in advance for your help!
This is the Python code I am currently trying:
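# Attempt: read the whole file and strip every newline.
# Note that this also removes the newlines that separate records.
with open("input.csv", "r", encoding="utf-8") as f:
    text = f.read()
text = text.replace("\n", "")
with open("output.csv", "w", encoding="utf-8") as x:
    x.write(text)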
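with open("input.csv", encoding="utf-8") as read_file, \
        open("output.csv", "w", encoding="utf-8") as write_file:
    prev_row = ""
    for this_row in read_file:
        if not this_row.startswith('"'):
            # Continuation line: append it to the row being built.
            prev_row = prev_row.rstrip('\n') + this_row
        else:
            # A new record starts: write out the previous one.
            write_file.write(prev_row)
            prev_row = this_row
    write_file.write(prev_row)  # write out the last record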
This is just a draft.
As an enhancement, you could cache the pieces in a list and combine them with str.join.
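The str.join suggestion could look roughly like the sketch below. It keeps the draft's assumption that a line not starting with a double quote continues the previous record, so a continuation line that happens to begin with a quote would be misclassified. The function name merge_continuations and the sample lines are made up for illustration.

```python
def merge_continuations(lines):
    """Join lines that do not start with a double quote onto the previous record."""
    records = []   # finished records
    parts = []     # cached pieces of the record currently being built
    for line in lines:
        piece = line.rstrip("\n")
        if piece.startswith('"') and parts:
            # A new record begins: join the cached pieces in one go.
            records.append("".join(parts))
            parts = []
        if piece:  # ignore blank lines
            parts.append(piece)
    if parts:
        records.append("".join(parts))  # flush the last record
    return records


sample = [
    '"16","john","28","new york","hey men,\n',
    'how do you did this"\n',
    '\n',
    '"35","marco","26","roma","you\'r my hero!"\n',
]
for record in merge_continuations(sample):
    print(record)
```

Collecting the pieces in a list and joining once per record avoids repeatedly rebuilding a growing string.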
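There are a few points to note here:
Your CSV file contains , characters in the remark field. This means the field must be enclosed in quotes (which it is).
A CSV file is allowed to contain newlines within a single field. This does not result in extra data rows, but it does make the file odd for a human to read.
Python's CSV reader handles newlines inside fields automatically.
Finally, your data appears to be oddly encoded, and you want to remove all the extra newlines. Each line also has a trailing backslash character that should not be there.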
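To illustrate the point about newlines inside fields, here is a small sketch that feeds csv.reader an in-memory string instead of a file. The two-column sample is invented for the demonstration and only mimics the question's format.

```python
import csv
import io

# A quoted field that spans two physical lines (made-up sample).
raw = '"id","remark"\n"16","hey men,\nhow do you did this"\n'

rows = list(csv.reader(io.StringIO(raw)))
print(len(rows))  # number of logical rows, not physical lines
print(rows[1])    # the remark keeps its embedded newline
```

The text has three physical lines, but the reader yields two rows because the line break sits inside a quoted field.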
I suggest the following approach:
- Use Python's CSV reader to correctly read one row at a time (you have 7 rows + a header).
- Remove all newlines from the remark field.
For example:
import csv

with open('input.csv') as f_input, open('output.csv', 'w', newline='') as f_output:
    csv_input = csv.reader(f_input)
    csv_output = csv.writer(f_output)

    for row in csv_input:
        if len(row) == 5:   # skip blank lines
            # Drop literal "\n" text, turn real newlines into spaces, drop stray backslashes
            row[4] = row[4].replace('\\n', '').replace('\n', ' ').replace('\\', '')
            csv_output.writerow(row)
This would give you:
id,name,age,city,remark
1,kevin,27,paris,This is too bad
8,angel,18,london,Incredible !!!
14,maria,33,madrid,i can't believe it.
16,john,28,new york,"hey men, how do you did this"
22,naima,35,istanbul,"i'm sure it's false, it can't be real"
35,marco,26,roma,you'r my hero!
39,lili,37,tokyo,all you need to knows. the best way to upgrade easely
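By default csv.writer quotes a field only when it has to, which is why most quotes disappear in the output above. If the fully quoted style of the desired file matters, passing quoting=csv.QUOTE_ALL to the writer is one option. The sketch below writes to an in-memory buffer with two rows taken from the example; the variable names are illustrative only.

```python
import csv
import io

rows = [
    ["id", "name", "age", "city", "remark"],
    ["16", "john", "28", "new york", "hey men, how do you did this"],
]

buffer = io.StringIO()
# QUOTE_ALL wraps every field in double quotes, needed or not.
writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
writer.writerows(rows)

quoted_text = buffer.getvalue()
print(quoted_text)
```

In the file-based example, the same keyword argument would go into the existing csv.writer(f_output) call.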
Contents of input.csv:
"id","name","age","city","remark"
"1","kevin","27","paris","This is too bad"
"8","angel","18","london","Incredible !!!"
"14","maria","33","madrid","i can't believe it."
"16","john","28","new york","hey men,
how do you did this"
"22","naima","35","istanbul","i'm sure it's false,
nit can't be real"
"35","marco","26","roma","you'r my hero!"
"39","lili","37","tokyo","all you need to knows.
the best way to upgrade easely"
A possible (quick and simple) solution is the following:
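with open('input.csv', 'r', encoding='utf-8') as file:
    data = file.read()

# Protect newlines that sit between two quotes with a placeholder,
# remove all remaining newlines, then restore the protected ones.
clean_data = data.replace('"\n"', '"||"').replace("\n", "").replace('"||"', '"\n"')

with open('output.csv', 'w', encoding='utf-8') as file:
    file.write(clean_data)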
This produces output.csv with the following contents:
"id","name","age","city","remark"
"1","kevin","27","paris","This is too bad"
"8","angel","18","london","Incredible !!!"
"14","maria","33","madrid","i can't believe it."
"16","john","28","new york","hey men,how do you did this"
"22","naima","35","istanbul","i'm sure it's false,nit can't be real"
"35","marco","26","roma","you'r my hero!"
"39","lili","37","tokyo","all you need to knows.the best way to upgrade easely"