Split columns in a CSV file
I have a messy CSV file. The first column is fine, but all the remaining data sits in the second column, in the form VariableName1=Variable1, VariableName2=Variable2, VariableName3=Variable3, ...
var1 var2 \
1 SfgvbdvbUJ05-1 var3=10,var4=/a/n/anghelo_rujo_edited-...
2 OLBCANGR15 var3=10,var4=/c/a/cangrande_test.jpg,a...
3 ZAMdvFIA19 var3=10,var4=/p/i/pierluigi_zampaglion...
4 VINMUL18 var3=10,var4=/r/u/rudi_vindimian_mulle...
5 PRACLA16 var3=10,var4=/p/r/pracla16_podere_prad...
.. ... ...
175 WALLIM var3=25,var4=/w/a/walcher_limoncello_w...
239 SMROS20 var3=10,var4=/s/e/sella_e_mosca_rosato...
288 SAELAMB19 var3=10,var6=Modena,bottleml=750,box_size=1...
343 DILABB var3=40,var4=/d/i/dilabb_distillerie_l...
357 VANER19 var3=10,var4=/v/a/valdibella_kerasos_v...
var4 ... var9 var10 var11
1 NaN ... NaN NaN NaN
2 NaN ... NaN NaN NaN
3 NaN ... NaN NaN NaN
4 NaN ... NaN NaN NaN
5 NaN ... NaN NaN NaN
.. ... ... ... ... ...
175 NaN ... NaN NaN NaN
239 NaN ... NaN NaN NaN
288 NaN ... NaN NaN NaN
343 NaN ... NaN NaN NaN
357 NaN ... NaN NaN NaN
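I loaded the second column as a separate dataset and split it on commas, but I can't split the VariableName1=Variable1 items into one column per VariableName.
When I tried to do this with String Contains, I got stuck on the =... part.
Please help, I'm having trouble processing this CSV file.
What I want is for each value to sit under its own column name: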
var1 var2 var3 var4
ZAMffFIA19 10 2 /a/n/anghelo_rujo_edited...
VINMUfgvL18 25 1 /r/u/rudi_vindimian_mulle...
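Edit: use extract instead of replace:
keys = ['alchool', 'animal', 'alt_image']
for item in keys:
    # capture everything after "item=" up to the next comma or the end of the string
    df[item] = df['data'].str.extract(f'{item}=(.*?)(,|$)')[0]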
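For anyone who wants to try the extract approach in isolation, here is a minimal self-contained sketch. It assumes the key=value text lives in a column called data, as in the snippet above. The sample rows and sku values are made up; only the key names var3, var4 and var6 come from the printout in the question. The pattern is slightly stricter than the one above: it anchors the key at the start of the string or after a comma, so a key cannot match inside a longer key name.

```python
import re
import pandas as pd

# Hypothetical sample rows; only the key=value layout mirrors the question.
df = pd.DataFrame({
    "sku": ["AAA1", "BBB2"],
    "data": ["var3=10,var4=/a/b/c.jpg", "var3=25,var6=Modena"],
})

keys = ["var3", "var4", "var6"]
for key in keys:
    # Match the key only at the start of the string or right after a comma,
    # then capture everything up to the next comma.
    pattern = rf"(?:^|,){re.escape(key)}=([^,]*)"
    df[key] = df["data"].str.extract(pattern, expand=False)

print(df[["sku"] + keys])
```

Rows that do not contain a given key end up with NaN in that column, which is usually what you want for optional attributes.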
Suppose you have a file like this:
123 A=2,B=asdjhf,C=jhdkfhskdf,D=1254
54878754 A=45786,D=asgfd,C=1234
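and the file is not very large, you can build the dataframe row by row:
import pandas as pd

columns = ["sku", "A", "B", "C", "D"]
rows = []  # one dict per line; DataFrame.append was removed in pandas 2.0
with open("data_mangled.csv") as f:
    for line in f:
        if not line.strip():
            continue  # skip blank lines
        # the first whitespace-separated field is the sku, the rest is key=value pairs
        col1, col2 = line.split(maxsplit=1)
        d = dict.fromkeys(columns)  # potentially missing columns default to None
        d["sku"] = col1
        for item in col2.strip().split(","):
            k, v = item.split("=", 1)  # split on the first "=" only
            d[k] = v
        rows.append(d)
df = pd.DataFrame(rows)
print(df)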
This also handles the case where some keys in the second column are missing or appear in a different order.
Output:
sku A B C D
0 123 2 asdjhf jhdkfhskdf 1254
1 54878754 45786 None 1234 asgfd
Edit: for your specific data:
import pandas as pd

rows = []  # one dict per line; DataFrame.append was removed in pandas 2.0
with open("data_real.txt") as f:
    # use the first line as column names in the dataframe
    col_names = f.readline().strip().split(",")
    print(col_names)
    for line in f:
        d = dict.fromkeys(col_names)  # columns without a value stay None
        # lines have more than 2 columns, but the trailing values are empty
        # so the format is col1,large_col2,,,,,,,
        col1, *col2 = line.rstrip("\n").split(",")
        d["sku"] = col1
        for item in col2:
            if not item.strip():  # disregard the empty trailing columns
                continue
            try:
                k, v = item.split("=")
            except ValueError:
                # there is a column value with a missing key
                print("Could not assign to column:", d["sku"], item)
                continue
            # we split on comma, so there is an occasional stray "
            d[k.strip('"')] = v.strip('"')
        rows.append(d)
df = pd.DataFrame(rows)
print(df)
df.to_csv("data_parsed.csv")  # save
One of the column values is not in key=value format:
Could not assign to column: PRACLA16 16 months less
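If you would rather collect such stray values than just print them, one option is to parse each line with str.partition, which never raises and splits on the first = only. This is a sketch, not part of the code above: parse_pairs is a hypothetical helper name and the sample string is invented.

```python
def parse_pairs(text):
    """Split 'k1=v1,k2=v2' into a dict and return unparsable pieces separately."""
    pairs, leftovers = {}, []
    for item in text.split(","):
        item = item.strip().strip('"')
        if not item:
            continue  # skip empty trailing fields
        key, sep, value = item.partition("=")
        if sep and key:
            pairs[key] = value  # partition splits on the first "=" only
        else:
            leftovers.append(item)  # no "=" found, or the key is empty
    return pairs, leftovers


# Invented sample: one normal pair, one value containing "=", one stray value.
pairs, leftovers = parse_pairs('var3=10,note=a=b,16 months,,')
print(pairs)      # {'var3': '10', 'note': 'a=b'}
print(leftovers)  # ['16 months']
```

The leftovers can then be written to a separate column or a log, so nothing is silently lost.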
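Note: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0 (this is a pandas change, not a Python one). The usual fix is to convert each dict to a dataframe and join the two dataframes with pd.concat, or to collect the dicts in a list and build the dataframe once at the end.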
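A minimal sketch of that replacement, with invented sample values. pd.concat and the pd.DataFrame constructor are real pandas calls; the column names simply follow the small example file above.

```python
import pandas as pd

# Existing frame with one row (sample values are made up).
df = pd.DataFrame([{"sku": "123", "A": "2", "B": "asdjhf"}])

# New row as a dict; "B" is missing on purpose.
row = {"sku": "54878754", "A": "45786"}

# Convert the dict to a one-row frame and join it to the existing frame.
# Columns absent from the new row are filled with NaN.
df = pd.concat([df, pd.DataFrame([row])], ignore_index=True)
print(df)

# For many rows it is cheaper to collect the dicts and build the frame once.
rows = [{"sku": "123", "A": "2"}, {"sku": "54878754", "A": "45786"}]
df_all = pd.DataFrame(rows, columns=["sku", "A", "B"])
print(df_all)
```

Calling pd.concat inside a loop copies the whole frame on every iteration, so the list-of-dicts variant scales better for larger files.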