Plotting histograms fails with error "missing value where TRUE/FALSE needed"

Question

Update:

It turns out the problem was caused by differing column classes.

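Many thanks to @r2evans, who solved the problem by converting integer64 to numeric when reading the data. His fix works, but what is really worth studying is the reasoning he used to find the problem.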

I have deleted the data for confidentiality reasons.

Below is the original question.

I plotted histograms of all the numeric columns in my data table.

head(dt) %>%
  keep(is.numeric) %>% 
  gather() %>% na.omit() %>% 
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = "free") +
  geom_histogram()

I used head() because the data table is too large.

Then I got this error:

Error in if (length(unique(intervals)) > 1 & any(diff(scale(intervals)) < : missing value where TRUE/FALSE needed

So I ran

eg <- head(dt)
write.csv2(head(dt), "eg.csv")

and saved eg.csv on GitHub.

Then

eg <- fread("https://raw.githubusercontent.com/Deborah-Jia/Complete_Analysis_da2/main/eg.csv")

eg %>%
  keep(is.numeric) %>% 
  gather() %>% na.omit() %>% 
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = "free") +
  geom_histogram()

I got the correct histograms!

What happened when I saved the data and read it back in? Or is there a way to fix dt?

PS: dt was also created by saving a csv and reading it back with fread. When I use

eg <- head(dt, 10000)

save it on GitHub, and read it again, the same error occurs.

Is it because my dt is too long (3 million rows) and some of the rows are bad?

Answer 1

The symptom is that two of your fields appear to be invariant. After downloading the full data as dt:

dt <- fread("https://raw.githubusercontent.com/Deborah-Jia/Complete_Analysis_da2/main/eg1.csv")
dt %>%
  keep(is.numeric) %>% 
  gather() %>%
  na.omit() %>%
  group_by(key) %>%
  summarize(v = var(value))
# Warning: attributes are not identical across measure variables;
# they will be dropped
# # A tibble: 9 x 2
#   key                         v
#   <chr>                   <dbl>
# 1 area_size_high        1.00e18
# 2 area_size_low         3.64e10
# 3 lot_size_high         8.76e17
# 4 lot_size_low          5.60e 5
# 5 price_huf_high        0.         ### problem!
# 6 price_huf_low         0.     
# 7 total_room_count_high 3.23e17
# 8 total_room_count_low  1.46e 0
# 9 V1                    8.33e 6

(Many plots tend to break when the data are invariant.)

That is confusing, though, because head(dt) clearly shows differing values (on the right):

          V1         ds                            search_id property_type property_subtype price_huf_low price_huf_high
       <int>     <IDat>                               <char>        <char>           <char>         <i64>          <i64>
    1:     1 2021-02-15 ad2be212-0c25-4e3a-aabf-be089053beba         house             <NA>      45000000       69000000
    2:     2 2021-02-15 ab72ba19-d00f-49e2-8d0d-c6836f030758     apartment             <NA>             0       48000000
    3:     3 2021-02-06 24bbb050-2ecb-4078-a8dc-65e968f72f43     apartment             <NA>     150000000      200000000
    4:     4 2021-02-06 f7d87e6e-0f24-4d9e-ae82-2a448d6290bf     apartment             <NA>       2000000       29000000
    5:     5 2021-02-14 71ea3cc4-5326-4bbe-a2ff-20dbae0d9aa8     apartment             <NA>     200000000      400000000

(truncated).

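The key, however, is the i64 in the column headers: those columns hold 64-bit integers. The bit64 package stores integer64 values in a double vector plus a class attribute, so when gather() drops the attributes (see the warning above), the raw bits are read as ordinary doubles. Those are extremely small numbers, which is why the variance comes out as zero.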

sapply(dt, function(z) class(z)[1])
#                    V1                    ds             search_id         property_type      property_subtype 
#             "integer"               "IDate"           "character"           "character"           "character" 
#         price_huf_low        price_huf_high         area_size_low        area_size_high          lot_size_low 
#           "integer64"           "integer64"             "integer"             "integer"             "integer" 
#         lot_size_high  total_room_count_low total_room_count_high              district 
#             "integer"             "integer"             "integer"           "character"
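
To see why the variances above come out as zero, here is a minimal sketch. It assumes the bit64 package is installed, and the vector x is made up for illustration (it is not the original data). It mimics what happens to an integer64 column once its class attribute is dropped, as the gather() warning describes.

```r
library(bit64)

# Hypothetical prices, stored as 64-bit integers like the price_huf_* columns
x <- as.integer64(c(45000000, 0, 150000000))

# integer64 keeps its bits in a double vector plus a class attribute.
# Dropping the class leaves the raw bits, which R then reads as tiny doubles.
stripped <- unclass(x)
print(stripped)
print(var(stripped))   # effectively zero

# A proper conversion keeps the real values and a real variance
converted <- as.numeric(x)
print(converted)
print(var(converted))
```

The stripped values are not the prices at all, which is consistent with a plotting routine seeing a column that looks constant.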

You can fix this in one of two ways:

Fix it when reading the data in (recommended):

dt <- fread("https://raw.githubusercontent.com/Deborah-Jia/Complete_Analysis_da2/main/eg1.csv",
            integer64 = "numeric")

Fix the data already in your environment:

### data.table (since you used `fread`)
dt[, c("price_huf_low", "price_huf_high") := lapply(.SD, as.numeric),
   .SDcols = c("price_huf_low", "price_huf_high")]

### or dplyr
dt %>%
  mutate(across(starts_with("price"), as.numeric)) # %>% ... rest of your pipe
### if more than 'price_*' columns:
dt %>%
  mutate(across(where(~ inherits(., "integer64")), as.numeric)) # %>% ...

Either way, once those two columns are converted to numeric, you can plot them with your original code:

dt %>%
  keep(is.numeric) %>% 
  gather() %>% na.omit() %>% 
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = "free") +
  geom_histogram()
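
Both fixes can be checked without downloading the file. The following is a minimal sketch, assuming data.table and bit64 are installed; the inline CSV lines and the names csv_lines, a, b, and cols are hypothetical and only stand in for the real data.

```r
library(data.table)
library(bit64)

# Hypothetical CSV lines; the prices exceed 2^31 - 1, so fread
# detects them as 64-bit integers by default
csv_lines <- c("id,price",
               "1,45000000000",
               "2,69000000000",
               "3,150000000000")

# Default behaviour: the price column comes back as integer64
a <- fread(text = csv_lines)
print(class(a$price))

# Fix 1: ask fread for doubles at read time
b <- fread(text = csv_lines, integer64 = "numeric")
print(class(b$price))

# Fix 2: convert every integer64 column that is already in memory
cols <- names(a)[vapply(a, inherits, logical(1), what = "integer64")]
a[, (cols) := lapply(.SD, as.numeric), .SDcols = cols]
print(class(a$price))
```

Selecting columns by class rather than by name means the second fix still applies if other columns turn out to be integer64 as well.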
