How can I preserve missing cells in a downloaded data file?

Question

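I got a messy dataset from https://cdsarc.cds.unistra.fr/viz-bin/cat/J/MNRAS/495/1706#/browse, and I'm trying to clean it up using Python and R. The HTML file looks like this: [image not included]. But when I download the file, it contains extra whitespace as padding, and it is also blank wherever data are missing. Because padding and missing values are both just spaces, I can't use Python's .replace method to change the blanks to NA. After downloading the raw file, I used the following script to replace the whitespace with commas: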

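# Read every line of the raw catalogue file.
with open("./emerlin_vla_subaru/subaru.dat", "r") as f:
    lines = f.readlines()

# Split each line on whitespace and rejoin the pieces with commas.
with open("./emerlin_vla_subaru/subaru_fixed.dat", "w") as f:
    for line in lines:
        f.write(",".join(line.split()))
        f.write("\n")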

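However, this approach drops the missing cells and shifts the remaining data to the left to fill the gaps. I tried using R, but it doesn't recognise that there are blank cells in the middle of the data. Does anyone know how I can clean up the data, or where to find a version that has already been tidied?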
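
The left shift described above can be reproduced on a small made-up line. The sketch below is only an illustration: the sample row and the column boundaries in `COLSPECS` are hypothetical, not the real layout of `subaru.dat`, which would have to be taken from the catalogue's own documentation. It contrasts `str.split()` with slicing at fixed character positions, which keeps a blank field as an empty cell.

```python
# Hypothetical fixed-width layout: (start, end) character offsets per column.
# These boundaries are made up for illustration, not taken from subaru.dat.
COLSPECS = [(0, 11), (12, 22), (23, 30), (31, 38)]


def split_fixed_width(line, colspecs=COLSPECS, missing="NA"):
    """Slice a line at fixed offsets; blank fields become `missing`."""
    fields = []
    for start, end in colspecs:
        value = line[start:end].strip()
        fields.append(value if value else missing)
    return fields


# Made-up row whose third column holds only padding spaces (a missing value).
sample = "157.7550584 67.1234567 " + " " * 7 + " 24.0000"

naive = sample.split()             # whitespace split collapses the gap
fixed = split_fixed_width(sample)  # positional slicing keeps the gap

print(",".join(naive))
print(",".join(fixed))
```

With `split()` the row comes back with three fields, so the last value slides into the third column; the positional version returns four fields with `NA` in the blank slot.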

Answer 1

In R, you can install.packages("rvest") and use:

x <- (rvest::read_html("subaru.dat.gz") |> rvest::html_table())[[1L]]

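This reads the data into a data frame without loss. The only thing you need is RAM, because R is a RAM-intensive language and your HTML file is very large. Reading the data into memory takes about 5 minutes. The whole process used slightly more than 14 GiB of memory on my laptop.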
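
For readers who would rather stay in Python, as in the question, the same idea (read the HTML table cell by cell instead of splitting on whitespace) can be sketched with the standard library alone. This is not the method used above, just a rough stand-in for `html_table()`. The `TableRows` class and the sample markup are made up for illustration, and the sketch assumes every `<td>`, `<th>` and `<tr>` is explicitly closed. A gzip-compressed file would first need to be opened with something like `gzip.open`.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # cells of the row currently being read
        self._cell = None  # text pieces of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # An empty cell yields "", so its position in the row is kept.
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Made-up sample markup; the real table has many more columns.
html = """
<table>
  <tr><th>RAdeg</th><th>Bmag</th><th>ymag</th></tr>
  <tr><td>157.7550584</td><td>21.6219</td><td></td></tr>
  <tr><td>157.7241218</td><td></td><td>23.8763</td></tr>
</table>
"""

parser = TableRows()
parser.feed(html)
parser.close()
rows = parser.rows

for row in rows:
    print(row)
```

Every row keeps the same number of cells, with missing values as empty strings that can be mapped to NA afterwards.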

The output should look like this:

> x
# A tibble: 376,380 x 28
   `RAdeg DEdegdeg ~ `Bmag (e)mag`  `Vmag (e)mag`  `rmag (e)mag`  `imag (e)mag` `zmag (e)mag` `ymag (e)mag` `[3.6] (e)mag`
   <chr>             <chr>          <chr>          <chr>          <chr>         <chr>         <chr>         <chr>         
 1 157.7550584  67.~ 21.6219      ~ 24.0         ~ 23.3316      ~ 22.0066     ~ 21.115      ~ ""            "19.87723732 ~
 2 157.7448037  67.~ 25.0246      ~ 23.9475      ~ 22.9581      ~ 22.2232     ~ 22.003      ~ ""            ""            
 3 157.7565074  67.~ 24.7904      ~ 25.4817      ~ 24.5512      ~ 23.5114     ~ 24.6046     ~ ""            ""            
 4 157.7241218  67.~ 25.1506      ~ 24.0261      ~ 22.7778      ~ 21.7253     ~ 21.2324     ~ "23.87630082~ "20.19021606 ~
 5 157.7430948  67.~ 24.0397      ~ 23.6024      ~ 22.9016      ~ 22.2357     ~ 22.0235     ~ ""            ""            
 6 157.7508459  67.~ 25.3215      ~ 25.3467      ~ 24.385       ~ 24.7648     ~ 24.5042     ~ ""            ""            
 7 157.728751   67.~ 23.7913      ~ 23.5786      ~ 22.4722      ~ 22.05       ~ 21.7701     ~ ""            ""            
 8 157.7336379  67.~ 25.5835      ~ 23.5972      ~ 22.0607      ~ 20.7663     ~ 20.3327     ~ ""            "19.24077034 ~
 9 157.7610664  67.~ 25.2398      ~ 24.6624      ~ 24.3885      ~ 24.1111     ~ 23.2762     ~ ""            ""            
10 157.7563166  67.~ 23.1946      ~ 28.0006      ~ 32.30285645  ~ 24.701      ~ 23.2054     ~ ""            ""            
# ... with 376,370 more rows, and 20 more variables: [4.5] (e)mag <chr>, Id--- <dbl>, za--- <dbl>, chiza--- <dbl>,
#   (e) (E) <chr>, (e) (E) <chr>, Nfilt--- <dbl>, e1--- <dbl>, e2--- <dbl>, Radpix <dbl>, RadRatio--- <dbl>,
#   BulgeA--- <dbl>, DiscA--- <dbl>, BulgeIndex--- <dbl>, DiscIndex--- <dbl>, BulgeFlux--- <dbl>, DiscFlux--- <dbl>,
#   FluxRatio--- <dbl>, snr--- <dbl>, SourceId--- <chr>

Performance measurement

> system.time(x <- (rvest::read_html("subaru.dat.gz") |> rvest::html_table())[[1L]])
   user  system elapsed 
 288.75    2.72  291.62

python

r

data-cleaning