Pre-process data file before pandas read_csv
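I'm working with data output from SAP, but it is neither CSV, since it doesn't quote strings that contain its delimiter, nor fixed width, since it has multi-byte characters. It is sort of "fixed width" character-wise.
To get it into pandas I currently read the file, get the delimiter positions, slice each line around the delimiters, and then save the result to a proper CSV file that I can read without trouble.
I see that pandas read_csv can take a file buffer. How can I pass my stream directly to it, without saving a CSV file? Should I make a generator? Can I get the csv.writer.writerow output without giving it a file handle?
Here is my code: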
import csv
import pandas as pd

# A raw string cannot end in a backslash, so the trailing separator is appended
caminho = r'C:\Users\user\Documents\SAP\Tests' + '\\'
arquivo = "ExpComp_01.txt"
tipo_dado = {"KEY_GUID":"object", "DEL_IND":"object", "HDR_GUID":"object", "PRICE":"object", "LEADTIME":"int16", "MANUFACTURER":"object", "LOAD_TIME":"object", "APPR_TIME":"object", "SEND_TIME":"object", "DESCRIPTION":"object"}


def desmembra(linha, limites):
    # This function receives each delimiter's index and cuts around it
    posicao = limites[0]
    for limite in limites[1:]:
        yield linha[posicao+1:limite]
        posicao = limite


def pre_processa(arquivo):
    # Translates SAP output into standard CSV
    with open(arquivo, "r", encoding="mbcs") as entrada, \
         open(arquivo[:-3] + "csv", "w", newline="", encoding="mbcs") as saida:
        # quoting must be passed by keyword; the second positional argument is the dialect
        escreve = csv.writer(saida, quoting=csv.QUOTE_MINIMAL, delimiter=";").writerow
        for line in entrada:
            # Find heading
            if line[0] == "|":
                delimitadores = [x for x, v in enumerate(line) if v == '|']
                if line[-2] != "|":
                    delimitadores.append(None)
                cabecalho_teste = line[:50]
                escreve([campo.strip() for campo in desmembra(line, delimitadores)])
                break
        # Continue from the line after the heading, skipping repeated headings
        for line in entrada:
            if line[0] == "|" and line[:50] != cabecalho_teste:
                escreve([campo.strip() for campo in desmembra(line, delimitadores)])


pre_processa(caminho + arquivo)
dados = pd.read_csv(caminho + arquivo[:-3] + "csv", sep=";",
                    header=0, encoding="mbcs", dtype=tipo_dado)
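Also, if you can share best practices:
I have weird datetime strings such as 20.120.813.132.432, which I can convert successfully by stripping the dots and then parsing:
dados["SEND_TIME"] = dados["SEND_TIME"].str.replace(".", "", regex=False)
dados["SEND_TIME"] = pd.to_datetime(dados["SEND_TIME"], format="%Y%m%d%H%M%S")
I can't write a single parser for it because I have dates stored in different string formats. Would it be faster to specify a converter during import, or to let pandas do it column-wise at the end?
I have a similar problem with a code such as 99999999, to which I have to add dots: 99.999.999. I don't know whether I should write a converter or wait until after the import and do a df.replace.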
EDIT -- sample data:
| KEY_GUID|DEL_IND| HDR_GUID|Prod_CD |DESCRIPTION | PRICE|LEADTIME|MANUFACTURER| LOAD_TIME|APPR_TIME | SEND_TIME|
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|000427507E64FB29E2006281548EB186| |4C1AD7E25DC50D61E10000000A19FF83|75123636|Vneráéíoaeot.sadot.m | 29,55 |30 | |20.120.813.132.432 |20120813132929|20.120.505.010.157 |
|000527507E64FB29E2006281548EB186| |4C1AD7E25DC50D61E10000000A19FF83|75123643|Tnerasodaeot|sadot.m | 122,91 |30 | |20.120.813.132.432 |20120813132929|20.120.505.010.141 |
|0005DB50112F9E69E10000000A1D2028| |384BB350BF56315DE20062700D627978|75123676|Dnerasodáeot.sadot.m |252.446,99 |3 |POLAND |20.121.226.175.640 |20121226183608|20.121.222.000.015 |
|000627507E64FB29E2006281548EB186| |4C1AD7E25DC50D61E10000000A19FF83|75123652|Pner|sodaeot.sadot.m | 657,49 |30 | |20.120.813.132.432 |20120813132929|20.120.505.010.128 |
|000727507E64FB29E2006281548EB186| |4C1AD7E25DC50D61E10000000A19FF83| |Rnerasodaeot.sadot.m | 523,63 |30 | |20.120.813.132.432 |20120813132929|20.120.707.010.119 |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| KEY_GUID|DEL_IND| HDR_GUID|Prod_CD |DESCRIPTION | PRICE|LEADTIME|MANUFACTURER| LOAD_TIME|APPR_TIME | SEND_TIME|
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |000827507E64FB29E2006281548EB186| |4C1AD7E25DC50D61E10000000A19FF83|75123603|Inerasodéeot.sadot.m | 2.073,63 |30 | |20.120.813.132.432 |20120813132929|20.120.505.010.127 |
|000927507E64FB29E2006281548EB186| |4C1AD7E25DC50D61E10000000A19FF83|75123662|Ane|asodaeot.sadot.m | 0,22 |30 | |20.120.813.132.432 |20120813132929|20.120.505.010.135 |
|000A27507E64FB29E2006281548EB186| |4C1AD7E25DC50D61E10000000A19FF83|75123626|Pneraíodaeot.sadot.m | 300,75 |30 | |20.120.813.132.432 |20120813132929|20.120.505.010.140 |
|000B27507E64FB29E2006281548EB186| |4C1AD7E25DC50D61E10000000A19FF83| |Aneraéodaeot.sadot.m | 1,19 |30 | |20.120.813.132.432 |20120813132929|20.120.505.010.131 |
|000C27507E64FB29E2006281548EB186| |4C1AD7E25DC50D61E10000000A19FF83|75123613|Cnerasodaeot.sadot.m | 30,90 |30 | |20.120.813.132.432 |20120813132929|20.120.505.010.144 |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
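I will be handling other tables with other fields. They all have this general form. I can only trust the delimiters in the header. I may also have repeated headers in the data. It looks like a matrix printout.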
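If you want to build a DataFrame without writing to a CSV file first, then you don't need pd.read_csv. While it is possible to use io.BytesIO or cString.StringIO (io.StringIO on Python 3) to write to an in-memory file-like object, it doesn't make sense to convert an iterable of values (like desmembra(line, delimitadores)) to a single string just to re-parse it with pd.read_csv.
Instead, it is more direct to use pd.DataFrame, since pd.DataFrame can accept an iterator of row data.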
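For completeness, the in-memory buffer route could look roughly like the sketch below, shown next to the direct pd.DataFrame route. It assumes Python 3 (io.StringIO rather than cString.StringIO), and the rows and column names are made up for illustration; they only stand in for the lists that desmembra would produce.

```python
import csv
import io

import pandas as pd

# Hypothetical rows, standing in for the lists produced by desmembra().
rows = [
    ["KEY", "DESCRIPTION", "PRICE"],
    ["0004", "text; with the delimiter", "29,55"],
    ["0005", "plain text", "122,91"],
]

# csv.writer only needs an object with a write() method, so an in-memory
# text buffer can take the place of a real file handle.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=";", quoting=csv.QUOTE_MINIMAL)
writer.writerows(rows)

# Rewind before reading, otherwise read_csv starts at the end of the buffer.
buffer.seek(0)
via_csv = pd.read_csv(buffer, sep=";", dtype=str)

# The direct route: hand the same rows to pd.DataFrame and skip the
# serialize-then-reparse round trip.
direct = pd.DataFrame(rows[1:], columns=rows[0])
```

Both frames hold the same strings; the buffer version simply does extra work to get there.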
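Manipulating values one by one in plain Python is usually not the fastest way. Generally, applying Pandas functions to whole columns is faster. Therefore, I would first parse arquivo into a DataFrame of strings, and then use Pandas functions to post-process the columns into the right dtypes and values.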
import pandas as pd
import os
import csv
import io

# os.path.join supplies the separator; a raw string cannot end in a backslash
caminho = r'C:\Users\u5en\Documents\SAP\Testes'
arquivo = os.path.join(caminho, "ExpComp_01.txt")
arquivo_csv = os.path.splitext(arquivo)[0] + '.csv'


def desmembra(linha, limites):
    # This function receives each delimiter's index and cuts around it
    return [linha[limites[i]+1:limites[i+1]].strip()
            for i in range(len(limites[:-1]))]


def pre_processa(arquivo, enc):
    # Translates SAP output into an iterator of lists of strings
    with io.open(arquivo, "r", encoding=enc) as entrada:
        for line in entrada:
            # Find heading
            if line[0] == "|":
                delimitadores = [x for x, v in enumerate(line) if v == '|']
                if line[-2] != "|":
                    delimitadores.append(None)
                cabecalho_teste = line[:50]
                yield desmembra(line, delimitadores)
                break
        # Continue after the heading, skipping separators and repeated headings
        for line in entrada:
            if line[0] == "|" and line[:50] != cabecalho_teste:
                yield desmembra(line, delimitadores)


def post_process(dados):
    dados['LEADTIME'] = dados['LEADTIME'].astype('int16')
    for col in ('SEND_TIME', 'LOAD_TIME', 'PRICE'):
        # regex=False so that '.' is a literal dot, not "any character"
        dados[col] = dados[col].str.replace('.', '', regex=False)
    for col in ('SEND_TIME', 'LOAD_TIME', 'APPR_TIME'):
        dados[col] = pd.to_datetime(dados[col], format="%Y%m%d%H%M%S")
    return dados


enc = 'mbcs'
saida = pre_processa(arquivo, enc)
header = next(saida)   # the first yielded row is the heading
dados = pd.DataFrame(saida, columns=header)
dados = post_process(dados)
print(dados)
yields
KEY_GUID DEL_IND HDR_GUID \
0 000427507E64FB29E2006281548EB186 4C1AD7E25DC50D61E10000000A19FF83
1 000527507E64FB29E2006281548EB186 4C1AD7E25DC50D61E10000000A19FF83
2 0005DB50112F9E69E10000000A1D2028 384BB350BF56315DE20062700D627978
3 000627507E64FB29E2006281548EB186 4C1AD7E25DC50D61E10000000A19FF83
4 000727507E64FB29E2006281548EB186 4C1AD7E25DC50D61E10000000A19FF83
5 000927507E64FB29E2006281548EB186 4C1AD7E25DC50D61E10000000A19FF83
6 000A27507E64FB29E2006281548EB186 4C1AD7E25DC50D61E10000000A19FF83
7 000B27507E64FB29E2006281548EB186 4C1AD7E25DC50D61E10000000A19FF83
8 000C27507E64FB29E2006281548EB186 4C1AD7E25DC50D61E10000000A19FF83
Prod_CD DESCRIPTION PRICE LEADTIME MANUFACTURER \
0 75123636 Vneráéíoaeot.sadot.m 29,55 30
1 75123643 Tnerasodaeot|sadot.m 122,91 30
2 75123676 Dnerasodáeot.sadot.m 252446,99 3 POLAND
3 75123652 Pner|sodaeot.sadot.m 657,49 30
4 Rnerasodaeot.sadot.m 523,63 30
5 75123662 Ane|asodaeot.sadot.m 0,22 30
6 75123626 Pneraíodaeot.sadot.m 300,75 30
7 Aneraéodaeot.sadot.m 1,19 30
8 75123613 Cnerasodaeot.sadot.m 30,90 30
LOAD_TIME APPR_TIME SEND_TIME
0 2012-08-13 13:24:32 2012-08-13 13:29:29 2012-05-05 01:01:57
1 2012-08-13 13:24:32 2012-08-13 13:29:29 2012-05-05 01:01:41
2 2012-12-26 17:56:40 2012-12-26 18:36:08 2012-12-22 00:00:15
3 2012-08-13 13:24:32 2012-08-13 13:29:29 2012-05-05 01:01:28
4 2012-08-13 13:24:32 2012-08-13 13:29:29 2012-07-07 01:01:19
5 2012-08-13 13:24:32 2012-08-13 13:29:29 2012-05-05 01:01:35
6 2012-08-13 13:24:32 2012-08-13 13:29:29 2012-05-05 01:01:40
7 2012-08-13 13:24:32 2012-08-13 13:29:29 2012-05-05 01:01:31
8 2012-08-13 13:24:32 2012-08-13 13:29:29 2012-05-05 01:01:44
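The same column-wise approach can plausibly cover the two values that are still strings at this point: PRICE, which uses a dot for thousands and a comma for decimals, and the 8-digit code that needs dots added. The sketch below runs on a small made-up frame; the PRICE strings are copied from the sample data, while the column name CODE is hypothetical, since the question does not say which column holds that code.

```python
import pandas as pd

# Made-up sample: PRICE values come from the question, CODE is hypothetical.
dados = pd.DataFrame({
    "PRICE": ["29,55", "252.446,99", "2.073,63"],
    "CODE": ["99999999", "12345678", "75123636"],
})

# Drop the thousands separator, then turn the decimal comma into a point.
dados["PRICE"] = (
    dados["PRICE"]
    .str.replace(".", "", regex=False)
    .str.replace(",", ".", regex=False)
    .astype(float)
)

# Insert dots after the 2nd and 5th digit: 99999999 -> 99.999.999.
# Values that are not exactly 8 digits are left unchanged.
dados["CODE"] = dados["CODE"].str.replace(
    r"^(\d{2})(\d{3})(\d{3})$", r"\1.\2.\3", regex=True
)
```

Both steps work on whole columns, so they follow the same pattern as post_process rather than a per-value converter.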