Remove certain characters from a line while reading a file and save the result to a file

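I have a problem. I have a broken csv file. The last column is free text and my delimiter is ; Unfortunately, some users put ; in the free text, e.g. This is a longer text and;ups that should not be. I now want to read the file line by line and replace every ; after the second one with ,. So far I print out which lines of the csv file are broken. How can I read the file and replace the characters at the same time? Or should I store the line and the output and do the replacement afterwards?

Unfortunately, I don't know how to approach this kind of problem.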

import pandas as pd

with open("sample.csv", encoding="UTF-8") as file:
    for i, line in enumerate(file):
      x = line.split(";")
      if(len(x) > 3):
        print(i, ": ", line)
        cleaned_x = (', '.join(x[2:]))
        # Add cleaned_x to x
        new_line = x[0] + ";" + x[1]  + ";" + cleaned_x
        print(new_line)

df = pd.read_csv("file.csv", encoding="utf-8", sep=";")

What I have

customerId;name;text
1;Josey;I want to go at 05pm
2;Mike;Check this out --> öl
2;Frank;This is a longer text and;ups that should not be
2;Max;okay;
3;Josey;here is everythink good

What I want

customerId;name;text
1;Josey;I want to go at 05pm
2;Mike;Check this out --> öl
2;Frank;This is a longer text and,ups that should not be
2;Max;okay,
3;Josey;here is everythink good

You can store the lines in a list and write them to a new file.

new_sample = []
with open("sample.csv", encoding="UTF-8") as file:
    for i, line in enumerate(file):
        x = line.split(";")
        if len(x) > 3:
            print(i, ": ", line)
            # Re-join everything after the second ";" using "," instead
            cleaned_x = ",".join(x[2:])
            new_line = x[0] + ";" + x[1] + ";" + cleaned_x
            print(new_line)
            new_sample.append(new_line)
        else:
            new_sample.append(line)

with open("new_sample.csv", "w", encoding="UTF-8") as new_file:  # Open in write mode.
    # Each stored line still ends with its newline, so write it back unchanged.
    new_file.writelines(new_sample)

Define a custom function to read the csv file, then create a new dataframe from the rows and cols:

import pandas as pd

def read_csv(path):
    with open(path, encoding="UTF-8") as file:
        for line in file:
            *v, t = line.strip().split(';', 2)
            yield [*v, t.replace(';', ',')]

cols, *rows = read_csv('sample.csv')
df = pd.DataFrame(rows, columns=cols)

print(df)
  customerId   name                                              text
0          1  Josey                              I want to go at 05pm
1          2   Mike                            Check this out --> öl
2          2  Frank  This is a longer text and,ups that should not be
3          2    Max                                             okay,
4          3  Josey                           here is everythink good

FYI, if you write the initial file with Python's csv library, it handles an embedded ; correctly:

import csv

with open("test.csv", "w", newline="") as f:
    writer = csv.writer(f, delimiter=";")
    writer.writerow(["hello", "world", "hello;world"])

# test.csv contains hello;world;"hello;world"
# which will be read as three fields using csv.reader
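
To check that claim without touching the disk, here is a small round-trip sketch. It uses an in-memory `io.StringIO` buffer in place of test.csv, which is an assumption made only to keep the example self-contained:

```python
import csv
import io

# In-memory stand-in for test.csv (an assumption for this sketch).
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=";")
writer.writerow(["hello", "world", "hello;world"])

# The raw text the writer produced, including the quotes around the third field.
raw_text = buffer.getvalue()

# Rewind and parse the same text back with csv.reader.
buffer.seek(0)
fields = next(csv.reader(buffer, delimiter=";"))

print(repr(raw_text))
print(fields)  # ['hello', 'world', 'hello;world']
```

The quotes around the third field are what let the reader tell a delimiter apart from a ; that belongs to the text.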

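Here is one way to solve the problem. I will write to a new file. It is possible to open a file in read/write mode, but that is more complicated: you need to read a line, move the position in the file, and write the new data, all while making sure you don't overwrite the bytes of the next line... Using a new file and then renaming it is much easier.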

import csv

with open("input.csv", newline="") as in_file, open("output.csv", "w", newline="") as out_file:

    reader = csv.reader(in_file, delimiter=";")
    writer = csv.writer(out_file, delimiter=";")

    for line in reader:  # line is a list containing the fields
        if len(line) > 3:
            # Merge the surplus fields back into the text column, using "," as in the desired output
            line = line[:2] + [",".join(line[2:])]
        writer.writerow(line)
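
The "write a new file, then rename it" step could look roughly like the sketch below. The function name `fix_csv_in_place`, its parameters, the UTF-8 encoding, and the `"\n"` line terminator are all assumptions for illustration; `os.replace` and `tempfile.NamedTemporaryFile` are standard-library calls.

```python
import csv
import os
import tempfile


def fix_csv_in_place(path, expected_fields=3, delimiter=";"):
    """Rewrite `path` so surplus delimiters in the last column become commas.

    Hypothetical helper: the fixed rows go to a temporary file in the same
    directory, which then replaces the original file.
    """
    directory = os.path.dirname(os.path.abspath(path))
    with open(path, encoding="UTF-8", newline="") as in_file, \
            tempfile.NamedTemporaryFile(
                "w", encoding="UTF-8", newline="", dir=directory,
                suffix=".tmp", delete=False) as out_file:
        reader = csv.reader(in_file, delimiter=delimiter)
        # lineterminator="\n" is an assumption; csv.writer defaults to "\r\n".
        writer = csv.writer(out_file, delimiter=delimiter, lineterminator="\n")
        for fields in reader:
            if len(fields) > expected_fields:
                keep = expected_fields - 1
                fields = fields[:keep] + [",".join(fields[keep:])]
            writer.writerow(fields)
    # Both files are closed here, so the rename is safe on every platform.
    os.replace(out_file.name, path)
```

Calling `fix_csv_in_place("sample.csv")` would then leave a corrected sample.csv behind. Creating the temporary file in the same directory keeps the final rename on one filesystem.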

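If you don't need to save the fixed file, you don't need to open "output.csv" or create the writer. Print line after the correction to see the list of fields, e.g. ["hello", "world", "hello;world"]

If you want to print the string that would end up in the file, you need to wrap any field that contains a semicolon in quotes.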

line = [f"\"{item}\"" if ";" in item else item for item in line]
print(";".join(line))
# hello;world;"hello;world"
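
The manual quoting above does not escape quote characters that appear inside a field. As a hedged alternative, one could let `csv.writer` format the row into an in-memory buffer and print that. The helper name `format_row` is hypothetical:

```python
import csv
import io


def format_row(fields, delimiter=";"):
    """Return the text csv.writer would emit for one row, without the line ending."""
    buffer = io.StringIO()
    csv.writer(buffer, delimiter=delimiter, lineterminator="\n").writerow(fields)
    return buffer.getvalue().rstrip("\n")


print(format_row(["hello", "world", "hello;world"]))
# hello;world;"hello;world"
print(format_row(["a", 'say "hi"', "b;c"]))
# a;"say ""hi""";"b;c"
```

This way the printed string follows the same quoting rules as the file itself, including doubled quotes inside quoted fields.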

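Pandas (version >= 1.3.0) has an on_bad_lines parameter for read_csv, and since version 1.4.0 it also accepts a function that is called to handle each bad line: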

callable, function with signature (bad_line: list[str]) -> list[str] | None that will process a single bad line. bad_line is a list of strings split by the sep. If the function returns None, the bad line will be ignored. If the function returns a new list of strings with more elements than expected, a ParserWarning will be emitted while dropping extra elements. Only supported when engine="python"

So you can simply read the file:

df = pd.read_csv('sample.csv', sep=';', engine='python', on_bad_lines=lambda x: x[:2] + [';'.join(x[2:])])

Then save it in whatever format you like. Or, to produce the output defined in the question:

df['text'] = df['text'].str.replace(';', ',')
df.to_csv('output.csv', sep=';', index=False)  # index=False keeps the row index out of the file