How to replace odd separators in a CSV file with something else using Python regex?
I have a CSV file. The third column can contain stray separators, like this:
Name,Last Name,Job,ID
John,Smith,Architect,ID2020
Taylor,Swift,Singer,Songwriter,ID2020-123
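I know that the third column is sometimes broken, and that the next column always starts with ID. The stray comma here is the one between Singer and Songwriter. How can I replace the stray comma with a tilde so that the file can be read with pandas without errors? The actual file has 30 columns, so a regex is probably needed. Thanks for your time.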
IIUC, try:
import pandas as pd

with open("original.csv") as infile:
    rows = infile.read().splitlines()

with open("output.csv", "w") as outfile:
    for row in rows:
        # The first two fields are clean; everything before the last field is Job.
        name, lname, *rest = row.split(",")
        job = "~".join(rest[:-1])
        ID = rest[-1]
        outfile.write(f"{name},{lname},{job},{ID}\n")

df = pd.read_csv("output.csv")
>>> df
     Name Last Name                Job          ID
0    John     Smith          Architect      ID2020
1  Taylor     Swift  Singer~Songwriter  ID2020-123
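Since the question asks for a regex, here is a rough sketch of the same tilde replacement done with `re`. It assumes two clean fields come before the messy one and that the field right after it starts with ID, as described in the question. The names `ROW` and `fix_row` are made up for this example.

```python
import re

# Assumed layout: two clean leading fields, the messy field, then a field starting with "ID".
# For a wider file, change {2} to the number of clean columns before the messy one.
ROW = re.compile(r"^((?:[^,]*,){2})(.*?),(ID[^,]*(?:,.*)?)$")

def fix_row(line: str) -> str:
    """Replace commas inside the messy field with '~'; return other lines unchanged."""
    m = ROW.match(line)
    if m is None:
        return line
    head, job, tail = m.groups()
    return f"{head}{job.replace(',', '~')},{tail}"

sample = [
    "Name,Last Name,Job,ID",
    "John,Smith,Architect,ID2020",
    "Taylor,Swift,Singer,Songwriter,ID2020-123",
]
for line in sample:
    print(fix_row(line))
```

The lazy `(.*?)` stops at the first field that starts with ID, so this would misfire if a job title itself contained a part beginning with "ID" right after a comma.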
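Another "standard" way to separate columns in a CSV file is to use a semicolon.
The logic below does some string processing to split each row and rejoin it, using a semicolon for the rejoin...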
import pandas as pd

with open("somefile.csv") as infile:
    data = infile.read().splitlines()

with open("someotherfile.csv", "w") as outfile:
    for row in data:
        splitrow = row.split(",")
        # Handles a single stray comma: fold field 3 back into field 2.
        if len(splitrow) > 4:
            splitrow[2] = f"{splitrow[2]},{splitrow.pop(3)}"
        outfile.write(";".join(splitrow) + "\n")

df = pd.read_csv("someotherfile.csv", sep=";")
print(df)
Output:
     Name Last Name                Job          ID
0    John     Smith          Architect      ID2020
1  Taylor     Swift  Singer,Songwriter  ID2020-123
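If you would rather keep the comma as the delimiter, one alternative is to let the `csv` module quote the merged field, since `pd.read_csv` understands quoted fields by default. This is only a sketch: `requote`, `n_cols` and `bad_idx` are names invented here, and it assumes the stray commas only ever occur in one known column.

```python
import csv
import io

def requote(lines, n_cols=4, bad_idx=2):
    """Merge surplus fields back into column bad_idx and write standard quoted CSV."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in lines:
        fields = line.split(",")
        extra = len(fields) - n_cols  # number of stray commas in this row
        if extra > 0:
            merged = ",".join(fields[bad_idx:bad_idx + extra + 1])
            fields[bad_idx:bad_idx + extra + 1] = [merged]
        # csv.writer quotes any field that contains the delimiter
        writer.writerow(fields)
    return out.getvalue()

sample = [
    "Name,Last Name,Job,ID",
    "John,Smith,Architect,ID2020",
    "Taylor,Swift,Singer,Songwriter,ID2020-123",
]
print(requote(sample))
```

The last row comes out as `Taylor,Swift,"Singer,Songwriter",ID2020-123`, so the result stays an ordinary comma-separated file.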
Try the following approach:
import pandas as pd
import csv

data = []

with open('input.csv', newline='') as f_input:
    csv_input = csv.reader(f_input)
    header = next(csv_input)

    for row in csv_input:
        # Keep the first two fields and the last one; join whatever is in between.
        data.append([*row[:2], ' '.join(row[2:-1]), row[-1]])

df = pd.DataFrame(data, columns=header)
print(df)
For your example, this gives:
     Name Last Name                Job          ID
0    John     Smith          Architect      ID2020
1  Taylor     Swift  Singer Songwriter  ID2020-123
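This assumes the unwanted commas appear only in the Job column. It takes the Name and Last Name fields, then merges all remaining fields up to the final ID field. So in practice the Job field can contain any number of commas.
For the real file, this needs to be adjusted to the positions of all the other columns.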
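For a wider file such as the 30-column one, that adjustment could look roughly like the sketch below. It assumes the header row has the correct number of columns and that you know the index of the one column that can contain stray commas. `merge_extra_fields` and `bad_idx` are hypothetical names, not part of any library.

```python
import csv
import io

def merge_extra_fields(row, n_cols, bad_idx, joiner=" "):
    """Fold surplus fields back into the column at bad_idx so the row has n_cols fields."""
    extra = len(row) - n_cols  # how many stray commas this row had
    if extra <= 0:
        return list(row)
    merged = joiner.join(row[bad_idx:bad_idx + extra + 1])
    return [*row[:bad_idx], merged, *row[bad_idx + extra + 1:]]

# In-memory stand-in for the real file
sample = (
    "Name,Last Name,Job,ID\n"
    "John,Smith,Architect,ID2020\n"
    "Taylor,Swift,Singer,Songwriter,ID2020-123\n"
)
reader = csv.reader(io.StringIO(sample))
header = next(reader)
fixed = [merge_extra_fields(row, len(header), bad_idx=2) for row in reader]
print(fixed)
```

The resulting list can be passed to `pd.DataFrame(fixed, columns=header)` in the same way as above.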
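IIUC, you can do it like this, going through the text file line by line.
You say the column always starts with ID; do you want it to contain only the number? In my solution I removed the ID prefix from every row.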
import pandas as pd
from collections import defaultdict

d = defaultdict(list)

with open("input_list.txt") as f:
    next(f)  # skip the header row
    for line in f:
        # First two fields, last field, and everything in between as the job.
        name, lname, *job, ID = line.strip().split(",")
        d["Name"].append(name)
        d["Last Name"].append(lname)
        d["Job"].append(" ".join(job))
        d["ID"].append(ID[2:])  # drop the leading "ID"

df = pd.DataFrame(d)
print(df)
     Name Last Name                Job        ID
0    John     Smith          Architect      2020
1  Taylor     Swift  Singer Songwriter  2020-123
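Note that `ID[2:]` drops the first two characters whatever they are. If some rows might be missing the prefix, a slightly more defensive sketch would be the following (`strip_id` is a name made up for this example):

```python
import re

def strip_id(value: str) -> str:
    # Remove a leading "ID" only when it is actually present.
    return re.sub(r"^ID", "", value.strip())

print(strip_id("ID2020-123"))
print(strip_id("2020"))
```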