Create a SAS data step to import a CSV from a pandas dataframe in Python
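I am trying to create a string that I can copy and paste to import a dataframe into SAS. The opening and closing lines are static, and the middle lines need to adjust to the column names, lengths, and data types, so that the result looks like: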
data **infile** %let _EFIERR_ = 0; /* set the ERROR detection macro variable */
infile **filepath** delimiter = ',' MISSOVER DSD lrecl=32767 firstobs=2 ;
    informat A Best32. ;
    informat B Best32. ;
    informat C Best32. ;
    informat D Best32. ;
    informat E $11. ;
    format A Best12. ;
    format B Best12. ;
    format C Best12. ;
    format D Best12. ;
    format E $11. ;
        input A
        input B
        input C
        input D
        input E $
;
if _ERROR_ then call symputx('_EFIERR_',1); /* set ERROR detection macro variable */
run;
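My current code leaves out a column, and which column it leaves out changes with the input dataframe. With columns named A, B, C, and D, it left D out of the middle set of prints. After adding E, it left C out of the middle set of prints. With another dataset, it dropped the 4th of 5 columns from the first set of prints. I'm not sure what I am doing wrong. Here is what I have: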
import numpy as np
import pandas as pd

def sas_import_csv(df):
    '''Takes a dataframe and prepares a data step to import the csv file to SAS.
    '''
    # np.float was removed in NumPy 1.24; np.float64 matches the same float64 columns.
    value_fmts = [np.float64,np.int32,np.int64]
    opening = '''data **infile** %let _EFIERR_ = 0; /* set the ERROR detection macro variable */
infile **filepath** delimiter = ',' MISSOVER DSD lrecl=32767 firstobs=2 ;'''
    closing = ''';
if _ERROR_ then call symputx('_EFIERR_',1); /* set ERROR detection macro variable */
run;'''
    # Longest string representation in each column, used as the character width.
    measurer = np.vectorize(len)
    dfLen = measurer(df.values.astype(str)).max(axis=0)
    print(f'{opening}')
    # First set of prints: one informat line per column.
    for l,col in zip(dfLen,df.columns):
        if df[col].dtypes in value_fmts: infmt = 'Best32. ;'
        else: infmt = f'${l}. ;'
        print(f'\tinformat {col} {infmt}')
    # Middle set of prints: one format line per column.
    for l2,col2 in zip(dfLen,df.columns):
        if df[col2].dtypes in value_fmts: fmt = 'Best12. ;'
        else: fmt = f'${l2}. ;'
        print(f'\tformat {col2} {fmt}')
    # Last set of prints: one input line per column, with $ for character columns.
    for col3 in df.columns:
        if df[col3].dtypes in value_fmts: ct = ''
        else: ct = '$'
        print(f'\t\tinput {col3} {ct}')
    print(closing)

dates = pd.date_range(start='1/1/2018', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
df['E'] = "some string"
sas_import_csv(df)
This gives output in which column C is missing from the format section:
data **infile** %let _EFIERR_ = 0; /* set the ERROR detection macro variable */
infile **filepath** delimiter = ',' MISSOVER DSD lrecl=32767 firstobs=2 ;
    informat A Best32. ;
    informat B Best32. ;
    informat C Best32. ;
    informat D Best32. ;
    informat E $11. ;
    format A Best12. ;
    format B Best12. ;
    format D Best12. ;
    format E $11. ;
        input A
        input B
        input C
        input D
        input E $
;
if _ERROR_ then call symputx('_EFIERR_',1); /* set ERROR detection macro variable */
run;
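One way to narrow this down (a sketch, not a diagnosis) is to collect the generated lines in a list instead of printing them, and then check that every section holds one line per column. If the check passes, the lines were generated and the gap lies somewhere in how the output is displayed. The helper name `build_sas_sections` is hypothetical, and `is_numeric_dtype` is swapped in for the hand-written dtype list above.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype


def build_sas_sections(df):
    """Hypothetical helper: collect the generated lines per section instead of printing."""
    sections = {'informat': [], 'format': [], 'input': []}
    for col in df.columns:
        numeric = is_numeric_dtype(df[col])
        # Character width: longest string representation in this column only.
        width = int(df[col].astype(str).str.len().max())
        sections['informat'].append(
            f'informat {col} ' + ('Best32. ;' if numeric else f'${width}. ;'))
        sections['format'].append(
            f'format {col} ' + ('Best12. ;' if numeric else f'${width}. ;'))
        sections['input'].append(f'input {col} ' + ('' if numeric else '$'))
    return sections


dates = pd.date_range(start='1/1/2018', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
df['E'] = "some string"

sections = build_sas_sections(df)
for name, lines in sections.items():
    # Every section should contain exactly one line per column.
    assert len(lines) == len(df.columns), f'{name} is missing a column'
    print('\n'.join(lines))
```

Because the lines live in a list, a dropped column shows up as a failed assertion rather than something to spot by eye.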
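This doesn't answer the question of why the loop failed to print in one instance, but it is a better way to do what I was originally trying to do. Thanks to @Tom for the guidance.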
import numpy as np
from pandas.api.types import is_datetime64_any_dtype as is_datetime, is_object_dtype as is_object

def sas_import_csv(df,sas_date_fmt='yymmddn8.',filePath='',outName = 'X'):
    '''Takes a dataframe and prepares a data step to import the csv file to SAS.
    '''
    opening = f"%let infile = '{filePath}';\ndata {outName}; %let _EFIERR_ = 0; /* set the ERROR detection macro variable */ \ninfile &infile delimiter = ',' MISSOVER DSD TRUNCOVER lrecl=32767 firstobs=2 ;"
    inp = 'input '
    fmt = 'format '
    infmt = 'informat '
    closing = "if _ERROR_ then call symputx('_EFIERR_',1); /* set ERROR detection macro variable */\nrun;"
    # Longest string representation in each column, used as the character width.
    measurer = np.vectorize(len)
    dfLen = measurer(df.values.astype(str)).max(axis=0)
    for l,col in zip(dfLen,df.columns):
        # Character columns are read with an explicit width.
        if is_object(df[col]): inp = inp + f'{col} :${l}. '
        # Date columns also get a display format and an informat.
        elif is_datetime(df[col]):
            inp = inp + f'{col} '
            fmt = fmt + f'{col} {sas_date_fmt} '
            infmt = infmt + f'{col} yymmdd10. '
        # Everything else is read as a plain numeric variable.
        else: inp = inp + f'{col} '
    return f'{opening} {inp} ;\n{fmt} ;\n{infmt} ;\n{closing}'
Now, after running the code below, you can read the dataframe into SAS by copying and pasting the output of print(c):
import numpy as np
import pandas as pd
dates = pd.date_range(start='1/1/2018', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
df['E'] = "some string"
df = df.reset_index().rename(columns = {'index':'Date'})
f = r'C:\Users\user\example.csv'
c = sas_import_csv(df,filePath=f)
df.to_csv(f,index=False)
print(c)
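The `yymmdd10.` informat in the generated step expects dates written like `2018-01-01`. As a small sketch of how that can be pinned down on the pandas side, the example below passes `date_format='%Y-%m-%d'` to `to_csv`, so the date field keeps that layout even if the timestamps carry a time component. Writing to an in-memory `io.StringIO` buffer instead of a file is an assumption made here just to keep the example self-contained.

```python
import io

import numpy as np
import pandas as pd

dates = pd.date_range(start='1/1/2018', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
df['E'] = "some string"
df = df.reset_index().rename(columns={'index': 'Date'})

# Write the CSV to memory with an explicit date layout for the Date column.
buffer = io.StringIO()
df.to_csv(buffer, index=False, date_format='%Y-%m-%d')
csv_lines = buffer.getvalue().splitlines()

header = csv_lines[0]               # column names, skipped in SAS by firstobs=2
first_date_field = csv_lines[1][:10]  # first 10 characters of the first data row
print(header)
print(first_date_field)
```

The same `date_format` argument can be added to the `df.to_csv(f,index=False)` call above when writing the real file.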