Fixing data in a CSV file
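I was given a CSV file containing purchase data, but it has a problem:
It has 4 columns, all separated by commas, but many values in the price column use a comma as the decimal separator. So when I try to read the file, those rows are read as having 5 columns and I get an error. Like this: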
transaction id,user id,purchase price,purchase date
1009497,490408,10,41674
1077573,490408,8,95,41676
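So pd.read_csv can read the header and the first row, but it stops at the second row because it thinks I am giving it 5 columns instead of 4. What is the most efficient way to fix my data? Changing every decimal separator from a comma to a dot by hand is not an option.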
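Update:
I am thinking of reading each line as a string and counting the commas in it. If a line has 4 commas, I would use a regular expression to replace the comma inside the price with "." instead of ",".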
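The idea in the update could be sketched roughly as follows. This is only an illustration, not a tested solution for the real file: the helper name fix_line and the pattern FIVE_FIELDS are made up here, and the sketch assumes that the price is always the third field and that no other field ever contains a comma.

```python
import re

# Matches a line made of exactly five comma-separated fields. The third and
# fourth fields are assumed to be the two halves of a price that was split
# by its decimal comma.
FIVE_FIELDS = re.compile(r'^([^,]*),([^,]*),([^,]*),([^,]*),([^,]*)$')


def fix_line(line):
    """Return the line with the price's decimal comma replaced by a dot."""
    body = line.rstrip('\r\n')      # work on the text without the line ending
    ending = line[len(body):]       # keep the original line ending, if any
    if body.count(',') == 4:        # 4 commas means 5 fields: one too many
        body = FIVE_FIELDS.sub(r'\1,\2,\3.\4,\5', body)
    return body + ending
```

Lines with 3 commas, including the header, pass through unchanged, so the function can be applied to every line of the file.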
If you are sure that only the purchase price field has this problem, you can do the following. It will take a while if your file is large, but it works:
import pandas as pd

# Read every line up front so the file can be rewritten in place.
with open('your_csv.csv', 'r') as f:
    file_text = f.readlines()

with open('your_csv.csv', 'w') as f:
    for line in file_text:
        parts = line.split(',')
        # Five fields means the price was split in two by its decimal comma.
        if len(parts) == 5:
            line = '%s,%s,%s.%s,%s' % tuple(parts)
        f.write(line)

csv = pd.read_csv('your_csv.csv')
print(csv)
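If overwriting the original file or holding all of it in memory is a concern, a streaming variant along these lines might be an option. It is a sketch under the same assumption (only the third field can contain the extra comma); repaired_lines, repair_file and the expected_fields parameter are names invented for this example.

```python
def repaired_lines(lines, expected_fields=4):
    """Yield each line, merging a price that was split by its decimal comma."""
    for line in lines:
        parts = line.rstrip('\r\n').split(',')
        if len(parts) == expected_fields + 1:
            # Fields 2 and 3 (0-based) are assumed to be the integer and
            # fractional parts of the price; join them with a dot.
            parts[2:4] = [parts[2] + '.' + parts[3]]
        yield ','.join(parts) + '\n'


def repair_file(src_path, dst_path):
    """Write a repaired copy of src_path to dst_path, one line at a time."""
    with open(src_path, 'r', newline='') as src, \
            open(dst_path, 'w', newline='') as dst:
        dst.writelines(repaired_lines(src))
```

The repaired copy can then be passed to pd.read_csv as usual, and the original file stays untouched in case the assumption about the price column turns out to be wrong.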
I would do it like this. When I tried to reproduce your problem, I got the following DataFrame:
   transaction id  user id  purchase price  purchase date  Unnamed: 4
0         1009497   490408              10          41674         nan
1         1077573   490408               8             95     41676.0
# So basically the extra field ends up in a new column, "Unnamed: 4".
df['Unnamed: 4'] = df['Unnamed: 4'].astype(str)  # NaN becomes the string 'nan'
df['purchase price'] = df['purchase price'].astype(str)
df['purchase date'] = df['purchase date'].astype(str)
shifted = df['Unnamed: 4'] != 'nan'  # rows whose price was split in two
# In shifted rows, "purchase date" really holds the decimals of the price.
# Caveat: decimals with a leading zero (8,05) were parsed as 5 and give 8.5.
df.loc[shifted, 'purchase price'] = df['purchase price'] + '.' + df['purchase date']
# In shifted rows, the real purchase date sits in "Unnamed: 4" (as '41676.0').
df.loc[shifted, 'purchase date'] = df['Unnamed: 4'].str.split('.').str[0]
# Just drop the last column; drop() returns a new DataFrame, so reassign it.
df = df.drop(columns=['Unnamed: 4'])
# You can convert the purchase price back to float.
df['purchase price'] = df['purchase price'].astype(float)