Import CSV to pandas with two delimiters
I have a CSV with two delimiters, (;) and (,), that looks like this:
vin;vorgangid;eventkm;D_8_lamsoni_w_time;D_8_lamsoni_w_value
V345578;295234545;13;-1000.0,-980.0;7.9921875,11.984375
V346670;329781064;13;-960.0,-940.0;7.9921875,11.984375
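I would like to import it into a pandas DataFrame, with (;) as the column delimiter and (,) as the delimiter for a list or array that uses float as its data type. So far I am using the method below, but I am sure there is an easier way.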
aa = 0
csv_import = pd.read_csv(folder + FileName, sep=';')
for col in csv_import.columns:
    aa = aa + 1
    if type(csv_import[col][0]) == str and aa > 3:
        # string to list of strings
        csv_import[col] = csv_import[col].apply(lambda x: x.split(','))
        # make the list of strings into a list of floats
        csv_import[col] = csv_import[col].apply(lambda x: [float(y) for y in x])
First read the CSV using ; as the delimiter:
df = pd.read_csv(filename, sep=';')
UPDATE:
In [67]: num_cols = df.columns.difference(['vin','vorgangid','eventkm'])
In [68]: num_cols
Out[68]: Index(['D_8_lamsoni_w_time', 'D_8_lamsoni_w_value'], dtype='object')
In [69]: df[num_cols] = (df[num_cols].apply(lambda x: x.str.split(',', expand=True)
   ....:                                                 .stack()
   ....:                                                 .astype(float)
   ....:                                                 .unstack()
   ....:                                                 .values.tolist())
   ....:                 )
In [70]: df
Out[70]:
vin vorgangid eventkm D_8_lamsoni_w_time D_8_lamsoni_w_value
0 V345578 295234545 13 [-1000.0, -980.0] [7.9921875, 11.984375]
1 V346670 329781064 13 [-960.0, -940.0] [7.9921875, 11.984375]
In [71]: type(df.loc[0, 'D_8_lamsoni_w_value'][0])
Out[71]: float
Old answer:
Now we can split the numbers in the "number" columns into lists:
In [20]: df[['D_8_lamsoni_w_time', 'D_8_lamsoni_w_value']] = \
   ....:     df[['D_8_lamsoni_w_time', 'D_8_lamsoni_w_value']].apply(lambda x: x.str.split(','))
In [21]: df
Out[21]:
vin vorgangid eventkm D_8_lamsoni_w_time D_8_lamsoni_w_value
0 V345578 295234545 13 [-1000.0, -980.0] [7.9921875, 11.984375]
1 V346670 329781064 13 [-960.0, -940.0] [7.9921875, 11.984375]
You can use the converters parameter of read_csv and define a custom function that does the splitting:
def f(x):
    # turn "a,b" into [float(a), float(b)]
    return [float(i) for i in x.split(',')]
# after testing, replace io.StringIO(temp) with filename
df = pd.read_csv(io.StringIO(temp),
                 sep=";",
                 converters={'D_8_lamsoni_w_time': f, 'D_8_lamsoni_w_value': f})
print (df)
vin vorgangid eventkm D_8_lamsoni_w_time D_8_lamsoni_w_value
0 V345578 295234545 13 [-1000.0, -980.0] [7.9921875, 11.984375]
1 V346670 329781064 13 [-960.0, -940.0] [7.9921875, 11.984375]
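One caveat with converters is that the function receives the raw field text, so an empty field reaches f as '' and float('') raises ValueError. The following is a minimal sketch of a converter that tolerates empty fields; the name to_floats, the SAMPLE data with an emptied field, and the choice to return an empty list are all assumptions made for illustration, not part of the answer above.

```python
import io

import pandas as pd

# Sample data from the question, with the 4th field of the second row
# emptied on purpose (an assumption made for this illustration).
SAMPLE = """vin;vorgangid;eventkm;D_8_lamsoni_w_time;D_8_lamsoni_w_value
V345578;295234545;13;-1000.0,-980.0;7.9921875,11.984375
V346670;329781064;13;;7.9921875,11.984375"""


def to_floats(cell):
    """Hypothetical converter: 'a,b' -> [float(a), float(b)], '' -> []."""
    # Converters get the raw field as a string; an empty field arrives as ''.
    if not cell.strip():
        return []
    return [float(part) for part in cell.split(',')]


df = pd.read_csv(
    io.StringIO(SAMPLE),
    sep=';',
    converters={'D_8_lamsoni_w_time': to_floats,
                'D_8_lamsoni_w_value': to_floats},
)
print(df)
```

Returning an empty list keeps every cell in the column the same type; returning float('nan') instead would be an equally reasonable choice if you want to use isna() later.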
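Another solution, for working with NaN in the 4th and 5th columns:
You can use read_csv with the separator ;, then apply str.split to the 4th and 5th columns, selected with iloc, and convert each value in the resulting list to float: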
import pandas as pd
import numpy as np
import io
temp=u"""vin;vorgangid;eventkm;D_8_lamsoni_w_time;D_8_lamsoni_w_value
V345578;295234545;13;-1000.0,-980.0;7.9921875,11.984375
V346670;329781064;13;-960.0,-940.0;7.9921875,11.984375"""
# after testing, replace io.StringIO(temp) with filename
df = pd.read_csv(io.StringIO(temp), sep=";")
print (df)
vin vorgangid eventkm D_8_lamsoni_w_time D_8_lamsoni_w_value
0 V345578 295234545 13 -1000.0,-980.0 7.9921875,11.984375
1 V346670 329781064 13 -960.0,-940.0 7.9921875,11.984375
# split the 4th and 5th columns and convert each to a list of floats
df.iloc[:,3] = df.iloc[:,3].str.split(',').apply(lambda x: [float(i) for i in x])
df.iloc[:,4] = df.iloc[:,4].str.split(',').apply(lambda x: [float(i) for i in x])
print (df)
vin vorgangid eventkm D_8_lamsoni_w_time D_8_lamsoni_w_value
0 V345578 295234545 13 [-1000.0, -980.0] [7.9921875, 11.984375]
1 V346670 329781064 13 [-960.0, -940.0] [7.9921875, 11.984375]
If you need numpy arrays instead of lists:
# split the 4th and 5th columns and convert each to a numpy array
df.iloc[:,3] = df.iloc[:,3].str.split(',').apply(lambda x: np.array([float(i) for i in x]))
df.iloc[:,4] = df.iloc[:,4].str.split(',').apply(lambda x: np.array([float(i) for i in x]))
print (df)
vin vorgangid eventkm D_8_lamsoni_w_time D_8_lamsoni_w_value
0 V345578 295234545 13 [-1000.0, -980.0] [7.9921875, 11.984375]
1 V346670 329781064 13 [-960.0, -940.0] [7.9921875, 11.984375]
print (type(df.iloc[0,3]))
<class 'numpy.ndarray'>
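Note that the list comprehensions above assume every cell in the 4th and 5th columns is a string. A missing field is read by read_csv as a float NaN, and iterating over that raises TypeError. Below is a small sketch of one way to guard against this, assuming that leaving NaN untouched is acceptable; the helper name split_floats and the emptied field in SAMPLE are hypothetical.

```python
import io

import pandas as pd

# Sample data from the question with one field emptied on purpose
# (an assumption, to show the missing-value case).
SAMPLE = """vin;vorgangid;eventkm;D_8_lamsoni_w_time;D_8_lamsoni_w_value
V345578;295234545;13;-1000.0,-980.0;7.9921875,11.984375
V346670;329781064;13;;7.9921875,11.984375"""


def split_floats(cell):
    """Hypothetical helper: split strings into floats, pass NaN through."""
    # Missing cells come back from read_csv as float NaN, not as strings.
    if isinstance(cell, str):
        return [float(part) for part in cell.split(',')]
    return cell


df = pd.read_csv(io.StringIO(SAMPLE), sep=';')

# Columns from position 3 onwards hold the comma-separated values.
for col in df.columns[3:]:
    df[col] = df[col].apply(split_floats)

print(df)
```

Wrapping the list in np.array(...) inside split_floats would give numpy arrays instead, as in the variant above.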
I tried to improve your solution:
a = 0
csv_import = pd.read_csv(folder + FileName, sep=';')
for col in csv_import.columns:
    a += 1
    # .ix was removed in pandas 1.0; .loc does the same label-based lookup here
    if type(csv_import.loc[0, col]) == str and a > 3:
        # string to list of floats
        csv_import[col] = csv_import[col].apply(lambda x: [float(y) for y in x.split(',')])
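In addition to the other good, more pandas-specific answers here, it should be noted that Python itself is pretty powerful when it comes to string processing. You can put the result of replacing ';' with ',' into a StringIO object and work normally from there:
In [8]: import pandas as pd
In [9]: from io import StringIO  # on Python 2 this was cStringIO.StringIO
In [10]: pd.read_csv(StringIO(''.join(l.replace(';', ',') for l in open('stuff.csv'))))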
Out[10]:
vin vorgangid eventkm D_8_lamsoni_w_time \
V345578 295234545 13 -1000.0 -980.0 7.992188
V346670 329781064 13 -960.0 -940.0 7.992188
D_8_lamsoni_w_value
V345578 295234545 11.984375
V346670 329781064 11.984375
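In the output above, the header still has five names while each data row now has seven fields, so pandas treats the first two fields as the index and the remaining values shift under the wrong column names. One possible way around this, sketched below, is to pass explicit column names. The names time_0, time_1, value_0 and value_1 are hypothetical, and the sketch assumes every list in the file holds exactly two values.

```python
import io

import pandas as pd

# Sample data from the question; in practice this text would come from the file.
raw = """vin;vorgangid;eventkm;D_8_lamsoni_w_time;D_8_lamsoni_w_value
V345578;295234545;13;-1000.0,-980.0;7.9921875,11.984375
V346670;329781064;13;-960.0,-940.0;7.9921875,11.984375"""

# Replace ';' with ',' so every number becomes its own field.
flat = raw.replace(';', ',')

# Hypothetical column names: one per field after flattening (assumes 2 values per list).
names = ['vin', 'vorgangid', 'eventkm',
         'time_0', 'time_1', 'value_0', 'value_1']

# header=0 discards the original five-name header row in favour of `names`.
df = pd.read_csv(io.StringIO(flat), header=0, names=names)
print(df)
```

This gives one float column per list element rather than a list per cell, which is often easier to work with in pandas, but it only applies when the lists have a fixed length.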