Create SAS Data Step to import csv from pandas dataframe in python

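I am trying to create a string that I can copy and paste to import a dataframe into SAS. The opening and closing lines are static, while the middle lines need to adjust to the column names, the length of the values, and the data type of each column, so that the result looks like: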

data **infile** %let _EFIERR_ = 0; /* set the ERROR detection macro variable */
infile **filepath** delimiter = ',' MISSOVER DSD lrecl=32767 firstobs=2 ;
        informat A Best32. ;
        informat B Best32. ;
        informat C Best32. ;
        informat D Best32. ;
        informat E $11. ;
        format A Best12. ;
        format B Best12. ;
        format C Best12. ;
        format D Best12. ;
        format E $11. ;
                input A
                input B
                input C
                input D 
                input E $
;
if _ERROR_ then call symputx('_EFIERR_',1);  /* set ERROR detection macro variable */
run;

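My current code leaves out a column, and which column it leaves out changes with the input dataframe. With columns named A, B, C, and D, it leaves D out of the middle set of prints. After adding E, it leaves C out of the middle set of prints. With another dataset, it dropped the 4th of 5 columns from the first set of prints. I'm not sure what I'm doing wrong. Here is what I have: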

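import numpy as np
import pandas as pd

def sas_import_csv(df):
    '''Takes a dataframe and prepares a data step to import the csv file to SAS.
    '''
    # np.float was removed in NumPy 1.24; np.float64 matches the same columns
    value_fmts = [np.float64,np.int32,np.int64]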
    opening = '''data **infile** %let _EFIERR_ = 0; /* set the ERROR detection macro variable */ 
infile **filepath** delimiter = ',' MISSOVER DSD lrecl=32767 firstobs=2 ;'''
    closing = ''';
if _ERROR_ then call symputx('_EFIERR_',1);  /* set ERROR detection macro variable */
run;'''
    measurer = np.vectorize(len)
    dfLen = measurer(df.values.astype(str)).max(axis=0)
    print(f'{opening}')
    for l,col in zip(dfLen,df.columns):
        if df[col].dtypes in value_fmts: infmt = 'Best32. ;'
        else: infmt = f'${l}. ;'
        print(f'\tinformat {col} {infmt}')

    for l2,col2 in zip(dfLen,df.columns):
        if df[col2].dtypes in value_fmts: fmt = 'Best12. ;'
        else: fmt = f'${l2}. ;'
        print(f'\tformat {col2} {fmt}')

    for col3 in df.columns:
        if df[col3].dtypes in value_fmts: ct = ''
        else: ct = '$'
        print(f'\t\tinput {col3} {ct}')
    print(closing)

dates = pd.date_range(start='1/1/2018', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
df['E'] = "some string"
sas_import_csv(df)

This gives the following output, which is missing column C in the format section:

data **infile** %let _EFIERR_ = 0; /* set the ERROR detection macro variable */
infile **filepath** delimiter = ',' MISSOVER DSD lrecl=32767 firstobs=2 ;
        informat A Best32. ;
        informat B Best32. ;
        informat C Best32. ;
        informat D Best32. ;
        informat E $11. ;
        format A Best12. ;
        format B Best12. ;
        format D Best12. ;
        format E $11. ;
                input A
                input B
                input C
                input D
                input E $
;
if _ERROR_ then call symputx('_EFIERR_',1);  /* set ERROR detection macro variable */
run;
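
One way to rule out lost console output as the cause is to collect the lines in a list instead of printing them one by one, and then check that every column shows up in each section. The following is only a sketch under that assumption: `build_sas_lines` is a hypothetical helper, not part of the code above, and it uses `is_numeric_dtype` from pandas in place of the `value_fmts` list.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype


def build_sas_lines(df):
    """Return the informat/format/input lines as a list instead of printing them."""
    # Longest string representation per column, used as the character width.
    widths = df.astype(str).apply(lambda s: s.str.len().max())
    lines = []
    for keyword, numeric_fmt in (("informat", "Best32."), ("format", "Best12.")):
        for col in df.columns:
            fmt = numeric_fmt if is_numeric_dtype(df[col]) else f"${widths[col]}."
            lines.append(f"\t{keyword} {col} {fmt} ;")
    for col in df.columns:
        suffix = "" if is_numeric_dtype(df[col]) else " $"
        lines.append(f"\t\tinput {col}{suffix}")
    return lines


dates = pd.date_range(start="1/1/2018", periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
df["E"] = "some string"

lines = build_sas_lines(df)
# Every column should appear exactly once, in order, in each section.
for keyword in ("informat", "format", "input"):
    found = [ln.split()[1] for ln in lines if ln.split()[0] == keyword]
    assert found == list(df.columns), (keyword, found)
print("\n".join(lines))
```

If the assertions pass but a line is still missing on screen, the loop itself is producing every line and the problem lies in how the output is displayed.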

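This doesn't answer the question of why the loop failed to print in one instance, but it is a better way to do what I was originally trying to do. Thanks to @Tom for the guidance.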

import numpy as np
from pandas.api.types import is_datetime64_any_dtype as is_datetime, is_object_dtype as is_object

def sas_import_csv(df,sas_date_fmt='yymmddn8.',filePath='',outName = 'X'):
    '''Takes a dataframe and prepares a data step to import the csv file to SAS.
    '''
    opening = f"%let infile = '{filePath}';\ndata {outName}; %let _EFIERR_ = 0; /* set the ERROR detection macro variable */ \ninfile &infile delimiter = ',' MISSOVER DSD TRUNCOVER lrecl=32767 firstobs=2 ;"
    inp = 'input '
    fmt = 'format '
    infmt = 'informat '
    closing = "if _ERROR_ then call symputx('_EFIERR_',1);  /* set ERROR detection macro variable */\nrun;"
    # Longest string representation in each column, used as the character width
    measurer = np.vectorize(len)
    dfLen = measurer(df.values.astype(str)).max(axis=0)
    for l,col in zip(dfLen,df.columns):
        if is_object(df[col]):
            # Character column: attach a $w. informat in the input statement
            inp = inp + f'{col} :${l}. '
        elif is_datetime(df[col]):
            # Date column: read as yymmdd10. and display with sas_date_fmt
            inp = inp + f'{col} '
            fmt = fmt + f'{col} {sas_date_fmt} '
            infmt = infmt + f'{col} yymmdd10. '
        else:
            # Anything else is read as a plain numeric variable
            inp = inp + f'{col} '
    return f'{opening} {inp} ;\n{fmt} ;\n{infmt} ;\n{closing}'
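
To see which branch each column takes, the dtype checks can be run on their own. This is a minimal sketch, not part of the answer's function: `classify` is a hypothetical helper, and it additionally calls `is_string_dtype`, on the assumption that text columns may be stored with pandas' dedicated string dtype rather than `object`, in which case `is_object_dtype` alone would not flag them as character.

```python
import numpy as np
import pandas as pd
from pandas.api.types import (
    is_datetime64_any_dtype,
    is_object_dtype,
    is_string_dtype,
)


def classify(series):
    """Label a column the way the data step generator would need to treat it."""
    if is_datetime64_any_dtype(series):
        return "date"
    # Assumption: treat both object and dedicated string dtypes as character.
    if is_object_dtype(series) or is_string_dtype(series):
        return "character"
    return "numeric"


dates = pd.date_range(start="1/1/2018", periods=3)
df = pd.DataFrame(np.random.randn(3, 2), index=dates, columns=list("AB"))
df["E"] = "some string"
df = df.reset_index().rename(columns={"index": "Date"})

kinds = {col: classify(df[col]) for col in df.columns}
print(kinds)
```

Running a check like this on a real dataframe before generating the data step shows whether any column would fall through to the numeric branch unexpectedly.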

Now you can read your dataframe into SAS by running the code below and then copying and pasting the output of print(c) into SAS:

import numpy as np
import pandas as pd
dates = pd.date_range(start='1/1/2018', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
df['E'] = "some string"
df = df.reset_index().rename(columns = {'index':'Date'})
f = r'C:\Users\user\example.csv'
c = sas_import_csv(df,filePath=f)
df.to_csv(f,index=False)
print(c)