How can I preserve missing cells in a downloaded data file?
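I got a messy dataset from https://cdsarc.cds.unistra.fr/viz-bin/cat/J/MNRAS/495/1706#/browse, and I'm trying to clean it up using Python and R. The HTML file looks like this:
But when I download the file, it contains extra spaces as padding, as well as blank stretches where data is missing. This means I can't use Python's .replace method to change the spaces to NA. After downloading the raw file, I used the following script to replace the spaces with commas: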
with open("./emerlin_vla_subaru/subaru.dat", 'r') as f:
    a=f.readlines()
with open("./emerlin_vla_subaru/subaru_fixed.dat" ,"w+") as f:
    for i in range(len(a)):
        c=a[i].split()
        f.write(",".join(c))
        f.write("\n")
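But this method drops the missing cells and shifts the data to the left to fill the gaps. I tried using R, but it doesn't recognize that there are blank cells in the middle of the data. Does anyone know how I can clean up the data, or where to find a version that has already been tidied?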
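In R, you can run install.packages("rvest") and then use
x <- (rvest::read_html("subaru.dat.gz") |> rvest::html_table())[[1L]]
to load the data into a data frame without losing anything. The only thing you need is RAM, because R is a memory-hungry language and your HTML file is very large. Reading the data into memory takes about 5 minutes, and the whole process used slightly more than 14 GiB of memory on my laptop.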
The output should look like this:
> x
# A tibble: 376,380 x 28
`RAdeg DEdegdeg ~ `Bmag (e)mag` `Vmag (e)mag` `rmag (e)mag` `imag (e)mag` `zmag (e)mag` `ymag (e)mag` `[3.6] (e)mag`
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 157.7550584 67.~ 21.6219 ~ 24.0 ~ 23.3316 ~ 22.0066 ~ 21.115 ~ "" "19.87723732 ~
2 157.7448037 67.~ 25.0246 ~ 23.9475 ~ 22.9581 ~ 22.2232 ~ 22.003 ~ "" ""
3 157.7565074 67.~ 24.7904 ~ 25.4817 ~ 24.5512 ~ 23.5114 ~ 24.6046 ~ "" ""
4 157.7241218 67.~ 25.1506 ~ 24.0261 ~ 22.7778 ~ 21.7253 ~ 21.2324 ~ "23.87630082~ "20.19021606 ~
5 157.7430948 67.~ 24.0397 ~ 23.6024 ~ 22.9016 ~ 22.2357 ~ 22.0235 ~ "" ""
6 157.7508459 67.~ 25.3215 ~ 25.3467 ~ 24.385 ~ 24.7648 ~ 24.5042 ~ "" ""
7 157.728751 67.~ 23.7913 ~ 23.5786 ~ 22.4722 ~ 22.05 ~ 21.7701 ~ "" ""
8 157.7336379 67.~ 25.5835 ~ 23.5972 ~ 22.0607 ~ 20.7663 ~ 20.3327 ~ "" "19.24077034 ~
9 157.7610664 67.~ 25.2398 ~ 24.6624 ~ 24.3885 ~ 24.1111 ~ 23.2762 ~ "" ""
10 157.7563166 67.~ 23.1946 ~ 28.0006 ~ 32.30285645 ~ 24.701 ~ 23.2054 ~ "" ""
# ... with 376,370 more rows, and 20 more variables: [4.5] (e)mag <chr>, Id--- <dbl>, za--- <dbl>, chiza--- <dbl>,
# (e) (E) <chr>, (e) (E) <chr>, Nfilt--- <dbl>, e1--- <dbl>, e2--- <dbl>, Radpix <dbl>, RadRatio--- <dbl>,
# BulgeA--- <dbl>, DiscA--- <dbl>, BulgeIndex--- <dbl>, DiscIndex--- <dbl>, BulgeFlux--- <dbl>, DiscFlux--- <dbl>,
# FluxRatio--- <dbl>, snr--- <dbl>, SourceId--- <chr>
Performance measurement:
> system.time(x <- (rvest::read_html("subaru.dat.gz") |> rvest::html_table())[[1L]])
user system elapsed
288.75 2.72 291.62
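If you would rather stay in Python and work on the space-padded file from the question, one option is to slice each line by character position instead of splitting on whitespace, so a blank field stays in its column. The sketch below is only an illustration: the column offsets in COLUMNS and the two sample lines are made up, and the real offsets would have to be taken from the catalogue's own column description.

```python
# Hypothetical column layout: (name, start, end) as 0-based, end-exclusive
# character offsets. These values are invented for the demo; the real ones
# must come from the catalogue's column description.
COLUMNS = [("ra", 0, 11), ("dec", 12, 22), ("bmag", 23, 30), ("vmag", 31, 38)]


def parse_fixed_width(line, columns=COLUMNS, missing="NA"):
    """Slice one line by character position; blank slices become `missing`."""
    fields = []
    for _name, start, end in columns:
        cell = line[start:end].strip()
        fields.append(cell if cell else missing)
    return fields


# Two made-up lines in the hypothetical layout; the second has a blank bmag.
sample = [
    "157.7550584 67.1234567 21.6219 24.0000",
    "157.7448037 67.2345678 " + " " * 7 + " 23.9475",
]

rows = [parse_fixed_width(line) for line in sample]
for row in rows:
    print(",".join(row))

# For comparison, str.split() collapses the blank field and shifts vmag left.
print(sample[1].split())
```

The same idea is available in pandas through pandas.read_fwf with an explicit colspecs list, which may be more convenient for a file of this size.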