How do I get rid of the ¦ sign in my data?
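I really need your help. I scraped some data from Wikipedia and found this ¦ sign in it. At first I thought it was just | but apparently it is not.
Most of my cells look like this: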
table$Population
7004164110000000000¦16,411[7]
7007111260000000000¦11,126,000[13]
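I am trying to remove everything except 16,411, but first I need to understand how to convert it into something else.
Any help is appreciated. I am going crazy because gsub did not work when I tried it, and str_split_fixed did not work either...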
dput(tables$Population)
gives
c("7007301655000000000¦30,165,500[6]", "7007241833000000000¦24,183,300[8]", "7007217070000000000¦21,707,000[10]", "7007150292310000000¦15,029,231[11]")
You need to escape it with \\ (a single backslash before ¦ is not a valid escape in an R string):
test <- "7004164110000000000¦16,411"
gsub("\\¦", "", test)
[1] "700416411000000000016,411"
Edit: yes, it also works on a whole column:
> gsub("\\¦","",c("7007301655000000000¦30,165,500[6]", "7007241833000000000¦24,183,300[8]", "7007217070000000000¦21,707,000[10]", "7007150292310000000¦15,029,231[11]"))
[1] "700730165500000000030,165,500[6]" "700724183300000000024,183,300[8]"
[3] "700721707000000000021,707,000[10]" "700715029231000000015,029,231[11]"
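As a side note, here is a minimal sketch of an alternative that avoids escaping altogether. It assumes the separator really is the broken bar shown above (written here as "\u00a6" so the snippet does not depend on the file encoding); the variable names are only illustrative.

```r
# Two sample values from the question; "\u00a6" is the broken bar (¦).
x <- c("7007301655000000000\u00a630,165,500[6]",
       "7007241833000000000\u00a624,183,300[8]")

# fixed = TRUE makes gsub() treat the pattern as a literal string,
# so no regex escaping is involved at all.
cleaned <- gsub("\u00a6", "", x, fixed = TRUE)
print(cleaned)
```

If the character in your data is a different code point than the one your console displays, a literal match like this will not find it, which is where the non-ASCII approach further down helps.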
EDIT2: Following @hrbrmstr's suggestion for replacing the character, the following should work for you:
stringr::str_replace(c("7007301655000000000¦30,165,500[6]", "7007241833000000000¦24,183,300[8]", "7007217070000000000¦21,707,000[10]", "7007150292310000000¦15,029,231[11]"),
+ "[^[:ascii:]]+","")
[1] "700730165500000000030,165,500[6]" "700724183300000000024,183,300[8]"
[3] "700721707000000000021,707,000[10]" "700715029231000000015,029,231[11]"
Here is another way to parse the table into a data frame:
library(rvest)
pg <- read_html("https://en.wikipedia.org/wiki/List_of_cities_proper_by_population")
html_node(pg, "table.wikitable") %>%
  html_table() %>%
  tibble::as_tibble() %>% # replaces the deprecated dplyr::tbl_df()
  janitor::clean_names() %>%
  # the line below does what you originally asked for, but in a different way:
  # it splits each cell at the run of non-ASCII characters
  tidyr::separate(population, c("sortkey", "population"), sep = "[^[:ascii:]]+") %>%
  dplyr::mutate(
    population = gsub("\\[.*$", "", population) # drop trailing footnote markers such as [6]
  ) %>%
  readr::type_convert()
## # A tibble: 87 x 9
## rank city image sortkey population definition totalarea_km populationdensi… country
## <int> <chr> <lgl> <dbl> <dbl> <chr> <chr> <dbl> <chr>
## 1 1 Chongqing NA 7.01e18 30165500. Municipality 700482403000… 366. China
## 2 2 Shanghai NA 7.01e18 24183300. Municipality 700363405000… 3814. China
## 3 3 Beijing NA 7.01e18 21707000. Municipality 700416411000… 1267. China
## 4 4 Istanbul NA 7.01e18 15029231. Metropolitan municipality 700262029000… 24231. Turkey
## 5 5 Karachi NA 7.01e18 14910352. City[14] 700337800000… 3944. Pakist…
## 6 6 Dhaka NA 7.01e18 14399000. City 700233754000… 42659. Bangla…
## 7 7 Guangzhou NA 7.01e18 13081000. City (sub-provincial) 700374340000… 1760. China
## 8 8 Shenzhen NA 7.01e18 12528300. City (sub-provincial) 700319920000… 6889. China
## 9 9 Mumbai NA 7.01e18 12442373. City[21] 700243771000… 28426. India
## 10 10 Moscow NA 7.01e18 13200000. Federal city[24][25] 2 511[26] 5256. Russia
## # ... with 77 more rows
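If you want to try the splitting step without scraping the page, the following is a rough base R sketch of the same idea, not the pipeline above. It assumes the cells look like the dput() output in the question (with "\u00a6" standing in for the separator) and that perl = TRUE is acceptable for the byte-range pattern; `pop`, `parts` and `df` are made-up names.

```r
# Sample cells modelled on the dput() output in the question.
pop <- c("7007301655000000000\u00a630,165,500[6]",
         "7007241833000000000\u00a624,183,300[8]")

# Split each cell at the run of characters outside the ASCII range.
parts <- strsplit(pop, "[^\\x00-\\x7F]+", perl = TRUE)

# First piece is the sort key, second piece is the displayed value.
df <- data.frame(
  sortkey    = vapply(parts, `[`, character(1), 1),
  population = vapply(parts, `[`, character(1), 2),
  stringsAsFactors = FALSE
)

# Drop the footnote marker, then the thousands separators, then convert.
no_ref <- sub("\\[.*$", "", df$population)
df$population <- as.numeric(gsub(",", "", no_ref, fixed = TRUE))
print(df)
```

The sort key is kept as a character column here because a 19-digit integer cannot be represented exactly as a double.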
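Because of the underlying markup the table uses for its rows, the "population" cell ends up looking like this as an R raw vector (this is the first one; 30 is the ASCII digit 0, and the run of 30s gives a visual reference for where the marker sits):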
## [1] 37 30 30 37 33 30 31 36 35 35 30 30 30 30 30 30 30 30 30 e2 99 a0 33 30 2c 31 36 35 2c 35 30 30 5b 36 5d
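That looks more like an embedded Unicode character: the bytes e2 99 a0 are the UTF-8 encoding of U+2660 (♠). Since it is "not ASCII", we can use that to tidy up the data.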
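To check this on your own data, something along these lines should work. It is only a sketch: the cell is rebuilt by hand with "\u2660" in place of the separator (an assumption based on the bytes above), and `cell`, `bytes`, `non_ascii` and `cp` are illustrative names.

```r
# Hand-built copy of the first population cell, with U+2660 as the separator.
cell <- "7007301655000000000\u266030,165,500[6]"

# Raw UTF-8 bytes of the string.
bytes <- charToRaw(enc2utf8(cell))
print(bytes)

# Any byte above 0x7f is outside the ASCII range.
non_ascii <- bytes[as.integer(bytes) > 127L]
print(non_ascii)

# Code point(s) of the non-ASCII character(s) in the string.
cp <- setdiff(utf8ToInt(cell), 0:127)
print(sprintf("U+%04X", cp))
```

Running charToRaw() on one of your real cells is a quick way to see which character you actually have, regardless of how the console chooses to display it.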