rxDataStep using lagged values

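In SAS, you can step through a dataset and use lagged values.

My approach is to use a function that performs the "lag", but this can produce wrong values at the start of a chunk. For example, if a chunk begins at row 200,000, it will assume a lagged value of NA when that value should come from row 199,999.

Is there a workaround?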

You are exactly right about the chunking problem. The workaround is to use .rxGet and .rxSet to pass values between chunks. Here is the function:

lagVar <- function(dataList) { 

    # .rxStartRow returns the overall row number of the first row in this
    # chunk. So - the first row of the first chunk is equal to one.
    # If this is the very first row, there's no previous value to use - so
    # it's just an NA.
    if(.rxStartRow == 1) {

        # Put the NA out front, then shift all the other values down one row.
        # newName is the desired name of the lagged variable, set using
        # transformObjects - see below
        dataList[[newName]] <- c(NA, dataList[[varToLag]][-.rxNumRows]) 

    } else {

        # If this isn't the very first chunk, we have to fetch the previous
        # value from the previous chunk using .rxGet, then shift all other
        # values down one row, just as before.
        dataList[[newName]] <- c(.rxGet("lastValue"),
                                 dataList[[varToLag]][-.rxNumRows]) 

    }

    # Finally, once this chunk is done processing, set its lastValue so that
    # the next chunk can use it.
    .rxSet("lastValue", dataList[[varToLag]][.rxNumRows])

    # Return dataList with the new variable
    dataList

}

And here is how to use it in rxDataStep:

# Get a sample dataset
xdfPath <- file.path(rxGetOption("sampleDataDir"), "DJIAdaily.xdf")

# Set a path to a temporary file
xdfLagged <- tempfile(fileext = ".xdf")

# Sort the dataset chronologically - otherwise, the lagging will be random.
rxSort(inData = xdfPath,
       outFile = xdfLagged,
       sortByVars = "Date")

# Finally, put the lagging function to use:
rxDataStep(inData = xdfLagged, 
           outFile = xdfLagged,
           transformObjects = list(
               varToLag = "Open", 
               newName = "previousOpen"), 
           transformFunc = lagVar,
           append = "cols",
           overwrite = TRUE)

# Check the results
rxDataStep(xdfLagged, 
           varsToKeep = c("Date", "Open", "previousOpen"),
           numRows = 10)
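
If you want to convince yourself that carrying the last value across chunks gives the same result as lagging the whole column at once, the idea can be sketched in base R without RevoScaleR. This is only an illustration under stated assumptions: lagChunk, the state environment, and the sample numbers are hypothetical stand-ins for the transform function and for .rxGet/.rxSet, not part of the RevoScaleR API.

```r
# Base-R sketch of the carry-over idea (no RevoScaleR needed).
# `state` and lagChunk() are hypothetical stand-ins for .rxGet/.rxSet
# and the transform function.
state <- new.env()

lagChunk <- function(x, startRow) {
    n <- length(x)
    # The first chunk has no previous value; later chunks fetch the carried one.
    first <- if (startRow == 1) NA else get("lastValue", envir = state)
    lagged <- c(first, x[-n])
    # Save this chunk's last value so the next chunk can use it.
    assign("lastValue", x[n], envir = state)
    lagged
}

# Made-up data, split into chunks of 3, 3, and 1 rows.
x <- c(10, 20, 30, 40, 50, 60, 70)
chunks <- split(x, rep(1:3, times = c(3, 3, 1)))
starts <- c(1, 4, 7)

# Process the chunks in order, then compare with a whole-vector lag.
chunkedLag <- unlist(Map(lagChunk, chunks, starts), use.names = FALSE)
wholeLag <- c(NA, x[-length(x)])
print(identical(chunkedLag, wholeLag))
```

Note that the sketch relies on the chunks being processed sequentially and in order, which is the same assumption the function above makes.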

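Here is another approach to lagging: a self-merge using an offset date. This simplifies the code considerably and can lag multiple variables at once. The downsides are that it takes 2-3 times longer to run than my transformFunc answer and that it requires a second copy of the dataset.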

# Get a sample dataset
sourcePath <- file.path(rxGetOption("sampleDataDir"), "DJIAdaily.xdf")

# Set up paths for two copies of it
xdfPath <- tempfile(fileext = ".xdf")
xdfPathShifted <- tempfile(fileext = ".xdf")


# Convert "Date" to be Date-classed
rxDataStep(inData = sourcePath,
           outFile = xdfPath,
           transforms = list(Date = as.Date(Date)),
           overwrite = TRUE
)


# Then make the second copy, but shift all the dates up 
# one (or however much you want to lag)
# Use varsToKeep to subset to just the date and 
# the variables you want to lag
rxDataStep(inData = xdfPath,
           outFile = xdfPathShifted,
           varsToKeep = c("Date", "Open", "Close"),
           transforms = list(Date = as.Date(Date) + 1),
           overwrite = TRUE
)

# Create an output XDF (or just overwrite xdfPath)
xdfLagged2 <- tempfile(fileext = ".xdf")

# Use that incremented date to merge variables back on.
# duplicateVarExt will automatically tag variables from the 
# second dataset as "Lagged".
# Note that there's no need to sort manually in this one - 
# rxMerge does it automatically.
rxMerge(inData1 = xdfPath,
        inData2 = xdfPathShifted,
        outFile = xdfLagged2,
        matchVars = "Date",
        type = "left",
        duplicateVarExt = c("", "Lagged")
)
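
One thing worth checking with the offset-date approach is what happens when dates are not contiguous. The base-R sketch below mimics the same self-merge with merge() on a tiny made-up data frame (the values and the OpenLagged column name are purely illustrative). Because the offset is one calendar day, a row only finds a match if the previous calendar day exists in the data, so a Monday row gets NA rather than Friday's value.

```r
# Base-R sketch of the offset-date self-merge; the data values are made up.
prices <- data.frame(
    Date = as.Date(c("2024-01-04", "2024-01-05", "2024-01-08")),  # Thu, Fri, Mon
    Open = c(100, 101, 102)
)

# Shifted copy: move each date forward one day and rename the value column.
shifted <- data.frame(Date = prices$Date + 1, OpenLagged = prices$Open)

# Left join on Date: each row picks up the value from the previous calendar day.
lagged <- merge(prices, shifted, by = "Date", all.x = TRUE)
print(lagged)
```

If you need "previous row" rather than "previous calendar day", the transformFunc answer is probably the better fit, since it lags by position after sorting.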