Remove certain characters from a line while reading a file and save the result to a file
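I have a problem. I have a broken csv file. The last column is free text and my delimiter is ";". Unfortunately, some users put ";" in the free text, for example: This is a longer text and;ups that should not be. I now want to read the file line by line and, after the second ";", replace every further ";" with ",". I already print which lines of the csv file are broken. How can I read the file and replace the characters at the same time? Or should I save the line plus the output and do the replacement afterwards?
Unfortunately, I don't know how to solve this kind of problem.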
import pandas as pd
with open("sample.csv", encoding="UTF-8") as file:
    for i, line in enumerate(file):
        x = line.split(";")
        if(len(x) > 3):
            print(i, ": ", line)
            cleaned_x = (', '.join(x[2:]))
            # Add cleaned_x to x
            new_line = x[0] + ";" + x[1] + ";" + cleaned_x
            print(new_line)
df = pd.read_csv("file.csv", encoding="utf-8", sep=";")
What I have
customerId;name;text
1;Josey;I want to go at 05pm
2;Mike;Check this out --> öl
2;Frank;This is a longer text and;ups that should not be
2;Max;okay;
3;Josey;here is everythink good
What I want
customerId;name;text
1;Josey;I want to go at 05pm
2;Mike;Check this out --> öl
2;Frank;This is a longer text and,ups that should not be
2;Max;okay,
3;Josey;here is everythink good
You can save the lines in a list and create a new file.
new_sample = []
with open("sample.csv", encoding="UTF-8") as file:
    for i, line in enumerate(file):
        x = line.split(";")
        if len(x) > 3:
            print(i, ": ", line)
            # Merge everything after the second ";" into one field
            cleaned_x = ",".join(x[2:])
            new_line = x[0] + ";" + x[1] + ";" + cleaned_x
            print(new_line)
            new_sample.append(new_line)
        else:
            new_sample.append(line)
with open("new_sample.csv", "w", encoding="UTF-8") as new_file:  # Open in write mode.
    # Each entry is already a complete line of text (newline included), so write
    # it as-is; csv.writer.writerow would split a string into single characters.
    new_file.writelines(new_sample)
Define a custom function to read the csv file, then create a new dataframe from rows and cols:
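import pandas as pd

def read_csv(path):
    with open(path, encoding="UTF-8") as file:
        for line in file:
            # Split on the first two ";" only; the rest stays in the text field
            *v, t = line.strip().split(';', 2)
            yield [*v, t.replace(';', ',')]

# The first yielded row is the header, the remaining rows are the data
cols, *rows = read_csv('sample.csv')
df = pd.DataFrame(rows, columns=cols)
print(df)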
customerId name text
0 1 Josey I want to go at 05pm
1 2 Mike Check this out --> öl
2 2 Frank This is a longer text and,ups that should not be
3 2 Max okay,
4 3 Josey here is everythink good
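To see why the maxsplit argument does the work here, the following small sketch applies the same split to a few of the sample lines. The helper name split_row and its parameters are only illustrative and are not part of the answer above.

```python
def split_row(line, sep=";", n_cols=3):
    """Split a line into at most n_cols fields (hypothetical helper)."""
    # maxsplit = n_cols - 1 stops splitting after the first two separators,
    # so any further separators remain inside the last field
    *head, tail = line.rstrip("\n").split(sep, n_cols - 1)
    # replace the separators left over in the free-text field
    return [*head, tail.replace(sep, ",")]


print(split_row("2;Frank;This is a longer text and;ups that should not be\n"))
# ['2', 'Frank', 'This is a longer text and,ups that should not be']
print(split_row("2;Max;okay;\n"))
# ['2', 'Max', 'okay,']
print(split_row("1;Josey;I want to go at 05pm\n"))
# ['1', 'Josey', 'I want to go at 05pm']
```

Because the split never produces more than three pieces, well-formed and broken lines go through the same code path and no length check is needed.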
FYI, if you write the initial file with Python's csv library, it will handle an embedded ";" correctly:
import csv
# newline="" lets the csv module control line endings itself
with open("test.csv", "w", newline="") as f:
    writer = csv.writer(f, delimiter=";")
    writer.writerow(["hello", "world", "hello;world"])
# test.csv contains hello;world;"hello;world"
# which will be read as three fields using csv.reader
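The round trip described in the comments above can be checked without touching the disk. This is a minimal sketch that uses an in-memory io.StringIO buffer in place of test.csv; the buffer is an assumption made only to keep the example self-contained.

```python
import csv
import io

# Write one row whose last field contains the delimiter
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=";")
writer.writerow(["hello", "world", "hello;world"])

raw = buffer.getvalue()
print(repr(raw))  # 'hello;world;"hello;world"\r\n'

# Read the same text back: the quoted field comes out as a single field again
buffer.seek(0)
rows = list(csv.reader(buffer, delimiter=";"))
print(rows)  # [['hello', 'world', 'hello;world']]
```

The writer only quotes the field that needs it, and the reader removes the quotes again, which is why the embedded ";" survives.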
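Here is a way to solve the problem. I will write to a new file. It is possible to open a file in read/write mode, but that is more complicated, because you need to read a line, move the position in the file, and write the new data, all while making sure you don't overwrite the bytes of the next line... Using a new file and then renaming it is much easier.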
import csv
# newline="" is what the csv module expects for files it reads or writes
with open("input.csv", newline="") as in_file, open("output.csv", "w", newline="") as out_file:
    reader = csv.reader(in_file, delimiter=";")
    writer = csv.writer(out_file, delimiter=";")
    for line in reader:  # line is a list containing the fields
        if len(line) > 3:
            # Merge the surplus fields back into the third column
            line = line[:2] + [", ".join(line[2:])]
        writer.writerow(line)
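If you don't need to save the fixed file, you don't need to open "output.csv" or create the writer. Printing line after the correction shows the list of fields, e.g. ["hello", "world", "hello;world"].
If you want to print the string as it would end up in the file, you need to wrap the fields that contain a semicolon in quotes.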
line = [f"\"{item}\"" if ";" in item else item for item in line]
print(";".join(line))
# hello;world;"hello;world"
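The "write a new file, then rename it" idea from this answer could be sketched as follows. The function name fix_csv_in_place, its parameters, the "," used as the join character, and the "\n" line terminator are all assumptions chosen to reproduce the output asked for in the question; they are not taken from the answer itself.

```python
import csv
import os
import tempfile


def fix_csv_in_place(path, delimiter=";", n_cols=3):
    """Rewrite path so that no row has more than n_cols fields (hypothetical helper)."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file next to the original so that the final
    # os.replace stays on the same filesystem
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", newline="", encoding="utf-8") as out_file, \
                open(path, newline="", encoding="utf-8") as in_file:
            reader = csv.reader(in_file, delimiter=delimiter)
            writer = csv.writer(out_file, delimiter=delimiter, lineterminator="\n")
            for row in reader:
                if len(row) > n_cols:
                    # Merge the surplus fields back into the free-text column
                    row = row[:n_cols - 1] + [",".join(row[n_cols - 1:])]
                writer.writerow(row)
        # Swap the cleaned file in for the original
        os.replace(tmp_path, path)
    except BaseException:
        # Leave the original untouched and clean up if anything goes wrong
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Calling fix_csv_in_place("sample.csv") would then rewrite the file in one step from the caller's point of view, while the original stays intact until the cleaned copy is complete.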
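Pandas lets you pass a function to the on_bad_lines parameter that is called whenever a bad line is encountered (the parameter exists since version 1.3.0; passing a callable requires version 1.4.0 or later). From the documentation: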
callable, function with signature (bad_line: list[str]) -> list[str] |
None that will process a single bad line. bad_line is a list of
strings split by the sep. If the function returns None, the bad line
will be ignored. If the function returns a new list of strings with
more elements than expected, a ParserWarning will be emitted while
dropping extra elements. Only supported when engine="python"
So you can simply read the file like this:
df = pd.read_csv('sample.csv', sep=';', engine='python', on_bad_lines=lambda x: x[:2] + [';'.join(x[2:])])
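Then save it in whatever format you like. Or, to get the output defined in the question:
df['text'] = df['text'].str.replace(';', ',')
# index=False keeps the row index out of the file, matching the desired output
df.to_csv('output.csv', sep=';', index=False)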