How do I transform a non-CSV text file into a CSV using Python/Pandas?
I have a text file that looks like this:
Id Number: 12345678
Location: 1234561791234567090-8.9
Street: 999 Street AVE
Buyer: john doe
Id Number: 12345688
Location: 3582561791254567090-8.9
Street: 123 Street AVE
Buyer: Jane doe @ buyer % LLC
Id Number: 12345689
Location: 8542561791254567090-8.9
Street: 854 Street AVE
Buyer: Jake and Bob: Owner%LLC: Inc
I want the file to look like this:
Id Number | Location | Street | Buyer |
---|---|---|---|
12345678 | 1234561791234567090-8.9 | 999 Street AVE | john doe |
12345688 | 3582561791254567090-8.9 | 123 Street AVE | Jane doe @ buyer % LLC |
12345689 | 8542561791254567090-8.9 | 854 Street AVE | Jake and Bob: Owner%LLC: Inc |
I have tried the following:
# 1 Read text file and ignore bad lines (lines with extra colons thus reading as extra fields).
tr = pd.read_csv('C:\File Path\test.txt', sep=':', header=None, error_bad_lines=False)
# 2 Convert into a dataframe/pivot table.
ndf = pd.DataFrame(tr.pivot(index=None, columns=0, values=1))
# 3 Clean up the pivot table to remove NaNs and reset the index (line by line).
nf2 = ndf.apply(lambda x: x.dropna().reset_index(drop=True))
This is where the last line (#3) came from:
When I do the above and export to CSV, the headers are arranged like this:
(index) | Street | Buyer | Id Number | Location |
---|---|---|---|---|
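The data fills in fine, but at some point the Buyer field becomes inaccurate, while the remaining fields stay accurate throughout the whole DF.
My guess:
When I run part 1 of my script, I get the following error 507 times: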
b'Skipping line 500: expected 2 fields, saw 3\nSkipping line 728: expected 2 fields, saw 3\
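At the end of the new DF I am missing exactly 507 Buyer field entries. So I think that when my bad lines are dropped, the remaining values in that field shift up and my data gets out of step.
Pain point:
The Buyer field sometimes contains extra colons and other odd characters. So when I try to use a colon as the delimiter, I run into problems.
I am new to Python, and very new to using functions. I mostly use Pandas for basic data manipulation. So, in the words of the great Michael Scott: "Explain it to me like I'm five." Many thanks to anyone willing to help.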
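The guess above can be reproduced on a small scale without pandas. The sketch below is a hypothetical stand-in for the three steps in the question (the names `lines`, `columns` and `rows` are illustrative, not from the original script): it skips every line with more than one colon, collects the surviving values per column, and then lines the columns up by position, which is roughly what `dropna().reset_index(drop=True)` does to each column on its own.

```python
from collections import defaultdict

# Three made-up records; the second Buyer value contains extra colons.
lines = [
    "Id Number: 1", "Buyer: alice",
    "Id Number: 2", "Buyer: Bob: Owner: Inc",
    "Id Number: 3", "Buyer: carol",
]

columns = defaultdict(list)
for line in lines:
    parts = line.split(":")
    if len(parts) != 2:
        # Mimics skipping a "bad line" that has more than two fields.
        continue
    key, value = parts
    columns[key].append(value.strip())

# Each column was compacted on its own, so the Buyer column is one short
# and every buyer after the skipped line moves up by one row.
rows = list(zip(columns["Id Number"], columns["Buyer"]))
print(columns["Id Number"])  # ['1', '2', '3']
print(columns["Buyer"])      # ['alice', 'carol']
print(rows)                  # [('1', 'alice'), ('2', 'carol')]
```

Id 2 ends up paired with carol, who belongs to Id 3. That matches the symptom described: as many missing Buyer entries as skipped lines, with the other columns still correct.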
I would try reading the file line by line and splitting the key-value pairs into a list of dicts, which would look something like:
import pandas as pd

data = [
    {
        "Id Number": "12345678",
        "Location": "1234561791234567090-8.9",
        # ...
    },
    {
        "Id Number": ...
    }
]
# easy to create the dataframe from here
your_df = pd.DataFrame(data)
Here is a minimal example that demonstrates the basics:
cat split_test.txt
Id Number: 12345678
Location: 1234561791234567090-8.9
Street: 999 Street AVE
Buyer: john doe
Id Number: 12345688
Location: 3582561791254567090-8.9
Street: 123 Street AVE
Buyer: Jane doe @ buyer % LLC
Id Number: 12345689
Location: 8542561791254567090-8.9
Street: 854 Street AVE
Buyer: Jake and Bob: Owner%LLC: Inc
import csv

with open("split_test.txt", "r") as f:
    id_val = "Id Number"
    list_var = []
    for line in f:
        # Split on the first colon only, so colons inside a value are kept.
        key, value = line.strip().split(":", 1)
        value = value.strip()
        if key == id_val:
            # "Id Number" marks the start of a new record.
            d = {key: value}
            list_var.append(d)
        else:
            d[key] = value

list_var
[{'Id Number': '12345678',
  'Location': '1234561791234567090-8.9',
  'Street': '999 Street AVE',
  'Buyer': 'john doe'},
 {'Id Number': '12345688',
  'Location': '3582561791254567090-8.9',
  'Street': '123 Street AVE',
  'Buyer': 'Jane doe @ buyer % LLC'},
 {'Id Number': '12345689',
  'Location': '8542561791254567090-8.9',
  'Street': '854 Street AVE',
  'Buyer': 'Jake and Bob: Owner%LLC: Inc'}]
# newline="" lets the csv module control line endings itself.
with open("split_ex.csv", "w", newline="") as csv_file:
    field_names = list_var[0].keys()
    csv_writer = csv.DictWriter(csv_file, fieldnames=field_names)
    csv_writer.writeheader()
    for row in list_var:
        csv_writer.writerow(row)
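One thing worth checking is that values containing colons or commas survive the trip through the CSV file. The sketch below is self-contained and uses an in-memory `io.StringIO` buffer instead of a real file; the sample rows in `records` are made up for illustration.

```python
import csv
import io

# Hypothetical sample rows; one value has colons, another has a comma.
records = [
    {"Id Number": "12345689", "Buyer": "Jake and Bob: Owner%LLC: Inc"},
    {"Id Number": "12345690", "Buyer": "Doe, Jane"},
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["Id Number", "Buyer"])
writer.writeheader()
writer.writerows(records)

# Read the text back and compare it with what was written.
buffer.seek(0)
read_back = list(csv.DictReader(buffer))
print(read_back == records)  # True
```

Colons need no special treatment because the CSV delimiter is a comma, and the `csv` module quotes any value that itself contains a comma.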
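This is what I meant by reading the file and using split. It is very similar to the other answer. It is untested, and I can't remember whether the input line includes the EOL, so I strip that as well.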
import pandas as pd

with open('myfile.txt') as f:
    data = []    # holds database
    record = {}  # holds built up record
    for inputline in f:
        # maxsplit=1 keeps any extra colons inside the value
        key, value = inputline.strip().split(':', 1)
        if key == "Id Number":  # new record starting
            if len(record):
                data.append(record)  # write previous record
            record = {}
        record.update({key: value.strip()})
    if len(record):
        data.append(record)  # write out final record
df = pd.DataFrame(data)
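To finish the job, the DataFrame still has to be written out as CSV. A minimal sketch, assuming pandas is installed, with a made-up two-record `data` list in the same shape as the one built above and an in-memory buffer standing in for the real output file:

```python
import io

import pandas as pd

# Hypothetical records in the same shape as the parsed `data` list.
data = [
    {"Id Number": "12345678", "Buyer": "john doe"},
    {"Id Number": "12345689", "Buyer": "Jake and Bob: Owner%LLC: Inc"},
]

df = pd.DataFrame(data)

# index=False leaves out the row index, so no extra index column appears.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

With a real file, a path (for example a hypothetical `"output.csv"`) would be passed to `to_csv` in place of the buffer.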