How to replace stray separators in a CSV file with something else using Python regex?

I have a CSV file. The third column can contain stray separators like this:

Name,Last Name,Job,ID
John,Smith,Architect,ID2020
Taylor,Swift,Singer,Songwriter,ID2020-123

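I know that the third column is sometimes malformed, and that the next column always starts with ID. The stray comma here is between Singer and Songwriter. How can I replace the stray comma with a tilde so that the file can be read with pandas without errors? The actual file has 30 columns, so a regex might be needed. Thanks for your time.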

IIUC, try:

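import pandas as pd

with open("original.csv") as infile:
    rows = infile.read().splitlines()

with open("output.csv", "w") as outfile:
    for row in rows:
        # The first two fields are clean; the last field is the ID.
        name, lname, *rest = row.split(",")
        # Everything in between belongs to Job, so rejoin it with a tilde.
        job = "~".join(rest[:-1])
        ID = rest[-1]
        outfile.write(f"{name},{lname},{job},{ID}\n")

df = pd.read_csv("output.csv")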
>>> df
     Name Last Name                Job          ID
0    John     Smith          Architect      ID2020
1  Taylor     Swift  Singer~Songwriter  ID2020-123
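
Since the question asks about regex, the same fix could be sketched with the `re` module. This is only a sketch under the assumptions stated in the question: two clean columns come first, and the last column always starts with ID. The names `ROW_PATTERN` and `fix_row` are made up for illustration.

```python
import re

# Assumption: two clean leading columns, then Job (which may contain stray
# commas), then a final column that starts with "ID".
ROW_PATTERN = re.compile(r"^([^,]*),([^,]*),(.*),(ID[^,]*)$")


def fix_row(line: str) -> str:
    """Replace commas inside the Job field with a tilde."""
    match = ROW_PATTERN.match(line)
    if match is None:
        # Unexpected shape: leave the line untouched.
        return line
    name, last_name, job, id_field = match.groups()
    return ",".join([name, last_name, job.replace(",", "~"), id_field])


print(fix_row("Taylor,Swift,Singer,Songwriter,ID2020-123"))
# Taylor,Swift,Singer~Songwriter,ID2020-123
```

The header and rows without a stray comma pass through unchanged, because their Job field contains no commas to replace.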

Another "standard" way to separate columns in a CSV file is to use semicolons.

The logic below does some string processing to split each row and rejoin it, using semicolons for the rejoin...

import pandas as pd

with open("somefile.csv") as infile:
    data = infile.read().splitlines()

with open("someotherfile.csv", "w") as outfile:
    for row in data:
        splitrow = row.split(",")
        # Fold surplus fields back into the third column, keeping the comma.
        while len(splitrow) > 4:
            splitrow[2] = f"{splitrow[2]},{splitrow.pop(3)}"
        outfile.write(";".join(splitrow) + "\n")

df = pd.read_csv("someotherfile.csv", sep=";")
print(df)

Output:

     Name Last Name                Job          ID
0    John     Smith          Architect      ID2020
1  Taylor     Swift  Singer,Songwriter  ID2020-123
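
If changing the delimiter is not an option, a possible variation is to keep the comma and let the `csv` module quote the merged field instead. The sketch below assumes the same layout as the example (two clean columns first, ID last); `rewrite_with_quotes` is a hypothetical helper name, and it works on a string rather than a file so that it is self-contained.

```python
import csv
import io


def rewrite_with_quotes(text: str) -> str:
    """Re-emit CSV text with the Job field quoted wherever it contains commas."""
    out = io.StringIO()
    # The default QUOTE_MINIMAL quotes only fields that contain the delimiter.
    writer = csv.writer(out, lineterminator="\n")
    for line in text.splitlines():
        fields = line.split(",")
        # Assumption: fields 0-1 are clean and the last field is the ID.
        job = ",".join(fields[2:-1])
        writer.writerow([fields[0], fields[1], job, fields[-1]])
    return out.getvalue()


sample = "Name,Last Name,Job,ID\nTaylor,Swift,Singer,Songwriter,ID2020-123\n"
print(rewrite_with_quotes(sample))
# Name,Last Name,Job,ID
# Taylor,Swift,"Singer,Songwriter",ID2020-123
```

A file written this way should be readable by `pd.read_csv` with its default settings, since double quotes are the default quote character.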

Try the following approach:

import pandas as pd
import csv

data = []

with open('input.csv') as f_input:
    csv_input = csv.reader(f_input)
    header = next(csv_input)
    
    for row in csv_input:
        data.append([*row[:2], ' '.join(row[2:-1]), row[-1]])

df = pd.DataFrame(data, columns=header)
print(df)

For your example, this gives:

     Name Last Name                    Job          ID
0    John     Smith              Architect      ID2020
1  Taylor     Swift      Singer Songwriter  ID2020-123

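This assumes that the unwanted commas appear only in the Job column. It takes the Name and Last Name fields, then merges every remaining field up to the final ID field. So in effect the Job field can contain any number of commas.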

This would need to be adapted to the positions of all the other columns in the real file.
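
For a wider file, such as the 30-column one mentioned in the question, the same idea could be parameterised by the number of clean columns before and after the problem column. A minimal sketch, assuming only one column can contain stray commas; `merge_stray_fields`, `n_before` and `n_after` are names invented for illustration:

```python
def merge_stray_fields(fields, n_before, n_after, joiner=" "):
    """Collapse the surplus middle fields of a split row into a single field.

    n_before: number of clean columns before the problem column.
    n_after:  number of clean columns after the problem column.
    """
    split_at = len(fields) - n_after
    head = fields[:n_before]
    middle = fields[n_before:split_at]
    tail = fields[split_at:]
    return [*head, joiner.join(middle), *tail]


row = ["Taylor", "Swift", "Singer", "Songwriter", "ID2020-123"]
print(merge_stray_fields(row, n_before=2, n_after=1))
# ['Taylor', 'Swift', 'Singer Songwriter', 'ID2020-123']
```

In the loop above, the `data.append(...)` line could then call this helper with whatever counts match the real file.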

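IIUC, you can do it like this, going through the text file line by line. "Also every column starts with ID": do you want it to be just the number? In my solution I removed the "ID" prefix from every row.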

import pandas as pd
from collections import defaultdict

d = defaultdict(list)

with open("input_list.txt") as f:
    next(f)
    for line in f:
        name, lname, *job, ID = line.strip().split(",")
        d["Name"].append(name)
        d["Last Name"].append(lname)
        d["Job"].append(" ".join(job))
        d["ID"].append(ID[2:])

df = pd.DataFrame(d)

print(df)

    Name    Last Name   Job                ID
0   John    Smith       Architect          2020
1   Taylor  Swift       Singer Songwriter  2020-123