Read Excel xml file into R

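I am trying to use xml2 to read an Excel xml file, but I am having a hard time because the file I have looks nothing like the examples in the xml2 documentation. I want to read one of the worksheets from the workbook and use it as a data frame.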

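The snippet below shows the full structure, but it has only one cell per row, filled with a block of text, whereas the sheet I actually want to read has 50,000 rows of data.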

<?xml version='1.0'?>
<?mso-application progid='Excel.Sheet'?>
<s:Workbook xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:s="urn:schemas-microsoft-com:office:spreadsheet">
  <s:Worksheet s:Name="DBCitation">
    <s:Table>
      <s:Row>
        <s:Cell>
          <s:Data s:Type="String">The suggested citation for your download is below. See metadata folder and citationsyntax.xls for more explanation</s:Data>
        </s:Cell>
      </s:Row>
      <s:Row>
        <s:Cell>
          <s:Data s:Type="String" />
        </s:Cell>
      </s:Row>
      <s:Row>
        <s:Cell>
          <s:Data s:Type="String">Acosta-Martinez, Veronica ; Balkcom, Kipling; Caesar-TonThat, Thecan; Franzluebbers, Alan; Gollany, Hero; Jabro, Jalal; Jin, Virginia; Johnson, Jane; Liebig, Mark; Phillips, Rebecca; Sainju, Upendra; Sistani, Karamat; Skinner, R; Smith, Douglas; Stevens, William; Stott, Diane; Varvel, Gary; Venterea, Rodney; Acosta-Martinez, Veronica; Archer, David; Barbour, Nancy; Bucholtz, Dennis; Dell , Curtis ; Dillard, Anthony; Gross, Jason; Johnson, Holly; Knapp, Steven; Polumsky, Robert; Simmons, Jason; Upchurch, Dan; Waldron, Sarah; Weyers, Sharon; Wood, Charles; Zobeck, Ted; 2017; Daily Weather; Weather Station; Greenhouse Gas Flux Measurement; Supporting Research Measurement; All Cell Comments; All locations; ; 1929-2015; Database ver. og=gn08222 Fort Collins, CO: USDA-ARS REAP Database. File downloaded 1/30/2017 12:08:20 PM. PID:d4fa2478b1b144f58333e8a433e838b9</s:Data>
        </s:Cell>
      </s:Row>
    </s:Table>
  </s:Worksheet>
</s:Workbook>

You can use the function read.gnumeric.sheet from the gnumeric package.

Otherwise, with xml2 you can do the following:

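library(xml2)
library(magrittr)  # provides %>% and the `. %>% f` functional shorthand

readExcelXML <- function(filename, sheet) {
  doc <- read_xml(filename)
  ns <- xml_ns(doc)
  # select every row of the requested worksheet
  rows <- xml_find_all(doc, paste0(".//s:Worksheet[@s:Name='", sheet, "']/s:Table/s:Row"), ns = ns)
  # one character vector of cell texts per row
  values <- lapply(rows, . %>% xml_find_all(".//s:Cell/s:Data", ns = ns) %>% xml_text %>% unlist)

  # the first row holds the column names
  columnNames <- values[[1]]

  # bind the remaining rows into a data frame (every column is character at this point)
  dat <- do.call(rbind.data.frame, c(values[-1], stringsAsFactors = FALSE))
  names(dat) <- columnNames

  dat
}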
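To see what the XPath step returns before running it on a 50,000-row file, it can help to try it on a tiny document first. The sketch below is only an illustration: the two-column "Demo" sheet and its values are made up, and it binds the rows through a character matrix rather than rbind.data.frame, which is a small variation on the function above, not the same code.

```r
library(xml2)

# Hypothetical two-column sheet, invented for this illustration only.
demo_xml <- '<?xml version="1.0"?>
<s:Workbook xmlns:s="urn:schemas-microsoft-com:office:spreadsheet">
  <s:Worksheet s:Name="Demo">
    <s:Table>
      <s:Row>
        <s:Cell><s:Data s:Type="String">site</s:Data></s:Cell>
        <s:Cell><s:Data s:Type="String">temp</s:Data></s:Cell>
      </s:Row>
      <s:Row>
        <s:Cell><s:Data s:Type="String">A</s:Data></s:Cell>
        <s:Cell><s:Data s:Type="Number">1.5</s:Data></s:Cell>
      </s:Row>
      <s:Row>
        <s:Cell><s:Data s:Type="String">B</s:Data></s:Cell>
        <s:Cell><s:Data s:Type="Number">2.5</s:Data></s:Cell>
      </s:Row>
    </s:Table>
  </s:Worksheet>
</s:Workbook>'

doc <- read_xml(demo_xml)
ns <- xml_ns(doc)  # maps the prefix "s" to the spreadsheet namespace URI

# All rows of the worksheet named "Demo"
rows <- xml_find_all(doc, ".//s:Worksheet[@s:Name='Demo']/s:Table/s:Row", ns = ns)

# One character vector of cell texts per row
values <- lapply(rows, function(row) {
  xml_text(xml_find_all(row, "./s:Cell/s:Data", ns = ns))
})

# First row is the header; the rest become the body of the data frame
body <- do.call(rbind, values[-1])
dat <- as.data.frame(body, stringsAsFactors = FALSE)
names(dat) <- values[[1]]

print(dat)
```

Note that every column is still character here, including temp. This approach also assumes each row has the same number of cells; rows with missing cells would need extra handling.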

To get the correct column types from the XML, add the following to the function, after the names(dat) assignment and before the final dat line:

  # assign types from the file (automatically),
  # NB: the types are taken from the hard-coded 2nd row, i.e. the first data row
  types <- rows[[2]] %>% xml_find_all(".//s:Cell/s:Data") %>% xml_attrs %>% unlist %>% setNames(nm = names(dat))
  # map each s:Type value to a conversion function
  funcs <- c("Number" = as.numeric, "String" = as.character, "DateTime" = . %>% as.POSIXct(format = "%Y-%m-%dT%H:%M:%S."))

  for (iCol in names(dat)) {
    dat[[iCol]] <- funcs[[types[iCol]]](dat[[iCol]])
  }
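The conversion step is a lookup from the s:Type string to a conversion function, applied column by column. Here is a minimal standalone sketch of that idea in base R, without xml2 or magrittr. The data frame, the column names, and the UTC time zone are assumptions made for the example and do not come from the original file.

```r
# Hypothetical data: every column starts out as character,
# as it would after reading cell texts from the XML.
dat <- data.frame(
  site = c("A", "B"),
  temp = c("1.5", "2.5"),
  when = c("2017-01-30T12:08:20.000", "2017-01-31T00:00:00.000"),
  stringsAsFactors = FALSE
)

# Types as they might be read from the s:Type attributes of the first data row
types <- c(site = "String", temp = "Number", when = "DateTime")

# Lookup table from type name to conversion function.
# tz = "UTC" is an assumption; the file itself does not state a time zone.
funcs <- list(
  Number   = as.numeric,
  String   = as.character,
  DateTime = function(x) as.POSIXct(x, format = "%Y-%m-%dT%H:%M:%S", tz = "UTC")
)

# Convert each column with the function matching its type
for (col in names(dat)) {
  dat[[col]] <- funcs[[types[[col]]]](dat[[col]])
}

str(dat)
```

Because the types come from a single row, a column whose first data cell is empty or has a different type from the rest would be converted incorrectly; checking more than one row is safer for messy files.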