Get all decimal places using readxlsb and cellranger::cell_limits()

Question

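I'm using readxlsb together with cell_limits() from cellranger to import some messy data from a series of Excel binary workbooks (.xlsb). I'm struggling to get enough (ideally all) decimal places.

This can be illustrated with the dataset that ships with the readxlsb package. In the example file TestBook.xlsb, sheet Sheet3.1.1, cell E5 contains e^1 with a long run of decimal places (2,71828182845905), but only six decimal places are imported (2.718282).

In my real-life data I have text in many of the top rows, which converts the data to character, as in column.4 below (where E5 is located), while the raw data has ~16 decimal places. Is there a way to tweak the code below to get all decimal places without losing cellranger::cell_limits()?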

# install.packages(c("readxlsb", "tidyverse"), dependencies = TRUE) 
library(readxlsb); library(tidyverse)

as_tibble(
  read_xlsb(path = system.file("extdata", "TestBook.xlsb", package = "readxlsb"), 
            sheet = "Sheet3.1.1",range = cellranger::cell_limits())
)

# A tibble: 5 x 7
  Some       column.2 column.3 column.4   column.5 column.6 column.7
  <date>     <chr>    <chr>    <chr>      <chr>    <chr>       <dbl>
1 NA         "data"   ""       "2.718282" ""       ""           3.14
2 NA         ""       "in"     ""         ""       ""          NA   
3 2021-05-21 ""       ""       "a"        ""       ""          NA   
4 NA         ""       ""       ""         "third"  ""       43972   
5 NA         ""       ""       ""         ""       "sheet"     NA

Answer 1

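A simple solution may be to force the column types to double on import, i.e. col_types = c("double").

First, increase the number of significant digits shown when a tibble is printed: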

options(pillar.sigfig = 20)

Now you get cell E5 with all the digits that are in Excel:

library(tidyverse); library(readxlsb)
as_tibble(
  read_xlsb(path = system.file("extdata", "TestBook.xlsb", package = "readxlsb"), 
            sheet = "Sheet3.1.1", col_types = c("double"),
            range = cellranger::cell_limits())
)
# A tibble: 5 x 7
   Some column.2 column.3             column.4 column.5 column.6             column.7
  <dbl>    <dbl>    <dbl>                <dbl>    <dbl>    <dbl>                <dbl>
1    NA       NA       NA  2.718281828459045e0       NA       NA  3.141592653589793e0
2    NA       NA       NA NA                         NA       NA NA                  
3 44337       NA       NA NA                         NA       NA NA                  
4    NA       NA       NA NA                         NA       NA  4.3972           e4
5    NA       NA       NA NA
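
It may help to separate stored precision from displayed precision. The base R sketch below does not touch readxlsb at all; `x` is only a stand-in for cell E5, and `as_text` mimics the six-decimal string from the question. Under those assumptions it shows that a double keeps its digits and only printing shortens them, whereas a value that arrived as a truncated string cannot be recovered.

```r
# Stand-in for cell E5: e^1 stored as a double (assumption, no file is read)
x <- exp(1)

# Default printing shows 7 significant digits; the stored value is unchanged
print(x)

# Ask for more digits explicitly
full <- format(x, digits = 15)
print(full)

# Mimic the character import from the question: six decimals only
as_text <- "2.718282"

# Converting the string back to a number does not restore the lost digits
lost <- x - as.numeric(as_text)
print(lost != 0)
```

So pillar.sigfig only changes how the tibble is printed; the extra digits are available because the column was read as double in the first place.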

According to the readxlsb vignette, the default is to guess the column types from the underlying data:

When implying types from the underlying spreadsheet data, the resultant type is the regarded as the ‘least fragile’.

Effectively the order is logical – datetime – integer – double – string
If 99 rows are of type ‘integer’ and 1 row is of type ‘double’, then all cells are regarded as ‘double’ in that column.
If 99 rows are of type ‘date’ and 1 row is of type ‘string’, then all cells are promoted to ‘string’
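
The promotion rule quoted above can be sketched in a few lines of base R. This is only an illustration of the quoted order, not readxlsb's actual implementation; `type_order` and `promote()` are hypothetical names made up for this sketch.

```r
# Hypothetical illustration of the quoted rule; not readxlsb code
type_order <- c("logical", "datetime", "integer", "double", "string")

promote <- function(cell_types) {
  # The column takes whichever observed type sits furthest along the order
  type_order[max(match(cell_types, type_order))]
}

# 99 integer cells and 1 double cell
print(promote(c(rep("integer", 99), "double")))

# 99 datetime cells and 1 string cell
print(promote(c(rep("datetime", 99), "string")))

# The situation in column.4: one number plus the text "a"
print(promote(c("double", "string")))
```

Under this reading, a single text cell is enough to turn a numeric column into character, which is what happens to column.4 in the question.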

A possible tweak that bypasses the automatic column-type guessing. Caveat: everything ends up as character.

library(tidyverse)
library(readxlsb)

path <- system.file("extdata", "TestBook.xlsb", package = "readxlsb")

# read everything as character
test.char <- as_tibble(
  read_xlsb(path = path, sheet = "Sheet3.1.1", col_types = c("character"),
            range = cellranger::cell_limits())
)

# read everything as double
test.dbl <- as_tibble(
  read_xlsb(path = path, sheet = "Sheet3.1.1", col_types = c("double"),
            range = cellranger::cell_limits())
)

# helper: TRUE if x is a Date object, FALSE otherwise (e.g. for a try-error)
is.date <- function(x) inherits(x, "Date")

# combine character and double; has to be adjusted according to your real data
cbind(test.char %>%
        gather(key = key.character, value = character),
      test.dbl %>%
        gather(key = key.numeric, value = numeric)) %>%
  tibble() %>%
  rowwise() %>%
  # set the double to NA if the character value parses as a date
  mutate(numeric = case_when(is.date(try(as.Date(character), silent = TRUE)) ~ NA_real_,
                             TRUE ~ numeric)) %>%
  # where a double remains, use it instead of the truncated character value
  mutate(character = case_when(!is.na(numeric) ~ as.character(numeric),
                               TRUE ~ character)) %>%
  select(key.character, character) %>%
  pivot_wider(names_from = key.character, values_from = character) %>%
  unnest(cols = c(Some, column.2, column.3, column.4, column.5, column.6, column.7))
#> # A tibble: 5 x 7
#>   Some       column.2 column.3 column.4        column.5 column.6 column.7       
#>   <chr>      <chr>    <chr>    <chr>           <chr>    <chr>    <chr>          
#> 1 ""         "data"   ""       "2.71828182845~ ""       ""       "3.14159265358~
#> 2 ""         ""       "in"     ""              ""       ""       ""             
#> 3 "2019-08-~ ""       ""       "a"             ""       ""       ""             
#> 4 ""         ""       ""       ""              "third"  ""       "2018-08-25"   
#> 5 ""         ""       ""       ""              ""       "sheet"  ""

Created on 2021-05-24 by the reprex package (v2.0.0)
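
The core of the combine step can also be tried on toy vectors without any workbook. The sketch below is a simplification under stated assumptions: `chr` and `dbl` are made-up stand-ins for one column read as character and as double, the date handling from the answer is left out, and `sprintf("%.15g", ...)` is swapped in for `as.character()` so that the number of digits is explicit.

```r
# Made-up stand-ins for one column read twice (assumption, no file is read)
chr <- c("data", "2.718282", "", "a")  # as character: number is truncated
dbl <- c(NA, exp(1), NA, NA)           # as double: text cells become NA

# Where a double exists, use its 15-significant-digit representation;
# otherwise keep the original character value
merged <- ifelse(!is.na(dbl), sprintf("%.15g", dbl), chr)

print(merged)
```

The result is still a character vector, so the caveat above applies here as well: the numbers have to be converted back with as.numeric() before any arithmetic.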

