Missing data when separating a column into multiple columns in R
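I scraped a table from a PDF, and all of its content ended up in a single element of a data frame. I managed to split it into separate columns, but R gets confused about which value belongs to which column. The first column, "State", should contain all the state names, but it is blank after separating. The second column, "State Drug Formulary", wrongly contains the state names. A lot of other information is missing as well. Is there any possible fix?
For simplicity, I renamed the column to "x".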
library(tabulizer)
library(pdftools)
library(rJava)
library(tidyverse)
url4 = "https://oppe.pharmacy.washington.edu/PracticumSite/forms/2019_Survey_of_Pharmacy_Law.pdf?-session=Students_Session%3A42F94F5D0a61a20754trv33D875D&fbclid=IwAR0qeK2tYmyI7T_8ict1Hnew9JxPkpt0bvajI3KL3IFDWg6JHNSSFWGlKY4"
out <- pdf_text(url4)
df=as.data.frame(out[[93]],header=F)
df = df %>%
rename(x = `out[[93]]`) %>%
mutate(x=strsplit(x, "\n")) %>%
unnest(x)
df=df[-c(1:2),]
df2=df %>% separate(x, c("State", "State Drug Formulary","Two-line Rx Format","Permissive or Mandatory*","How to Prevent Substitution","Cost Savings Pass-on","Patient Consent**"))
For what the table should look like, go to the source: it is page 82 of the original document.
I also tried this, which keeps the column names but drops the data:
df3 = df %>% separate(x, sep = " ", into = c("State", "State Drug Formulary","Two-line Rx Format","Permissive or Mandatory*","How to Prevent Substitution","Cost Savings Pass-on","Patient Consent**"))
Page 82 also contains other content, such as the heading "21. Drug Product Selection Laws".
It is better to remove that first:
dummy <- strsplit(df$`out[[93]]`, '\n\n')
This splits the page into four parts; the table you are looking for is the second element of the result.
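To see what the blank-line split does without downloading the PDF, here is a minimal sketch on a made-up page string. The text in `page` is a hypothetical stand-in, not the real content of the PDF page, and the variable names are only for illustration.

```r
# Toy stand-in for the text of one PDF page (NOT the real page content).
page <- paste0(
  "21. Drug Product Selection Laws\n\n",
  "  State      Formulary   Two-line\n",
  "  Alabama    None        Yes\n",
  "  Alaska     None        No\n\n",
  "Footnotes\n\n",
  "Legend"
)

# Split on blank lines. strsplit() returns a list with one entry per input
# string, so [[1]] extracts the character vector of blocks.
blocks <- strsplit(page, "\n\n")[[1]]

# On this toy page the table is the second block.
table_block <- blocks[2]

# One element per line of the table.
table_lines <- strsplit(table_block, "\n")[[1]]
table_lines
```

On a real page, it is worth printing `blocks` first to check which element actually holds the table, since that depends on the page layout.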
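# df is the one-column data frame built from out[[93]] (before any renaming)
df2 <- df %>%
  rename(x = `out[[93]]`) %>%
  # keep only the second blank-line-delimited block, which holds the table
  mutate(x = stringr::str_split(x, '\n\n', simplify = TRUE)[2]) %>%
  # one row per line of the table
  mutate(x = strsplit(x, '\n')) %>%
  unnest(x) %>%
  # drop the first three lines
  .[-c(1:3), ]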
Now df2 holds the contents of the table. So, split on runs of two or more spaces:
# the backslash must be doubled inside an R string: "\\s{2,}"
df2 %>% separate(x, c("a", "State", "State Drug Formulary", "Two-line Rx Format", "Permissive or Mandatory*", "How to Prevent Substitution", "Cost Savings Pass-on", "Patient Consent**"), sep = "\\s{2,}") %>%
  select(-a)
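This gives the result. 'a' is a dummy column: each line starts with whitespace, so separate() produces an empty value in front. Here is part of the result: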
State `State Drug Fo…` `Two-line Rx F…` `Permissive or…` `How to Preven…`
<chr> <chr> <chr> <chr> <chr>
1 Alabama None Yes P, BBB A
2 Alaska None No P B
3 Arizona None No P I
4 Arkans… None No P B
5 Califo… None No P EE
6 Colora… None No P J
7 Connec… None No P E, F
8 Delawa… None No P E
9 Distri… Positive No P B
10 Florida Negative L No M B
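The effect of the separator and of the dummy column can be checked on a couple of hand-written lines. This is only a sketch: the two strings in `toy` and the shortened column names are made up for illustration and are not taken from the PDF.

```r
library(tidyr)

# Toy lines imitating the layout of the extracted text: leading spaces,
# and columns separated by runs of two or more spaces (hypothetical data).
toy <- data.frame(
  x = c("   Alabama                None      Yes   P, BBB",
        "   District of Columbia   Positive  No    P"),
  stringsAsFactors = FALSE
)

result <- separate(
  toy, x,
  into = c("a", "State", "Formulary", "TwoLine", "Permissive"),
  sep = "\\s{2,}"   # two or more whitespace characters; backslash doubled
)

# Drop the empty piece created by the leading spaces.
result$a <- NULL
result
```

Because the separator needs at least two whitespace characters, values that contain a single space, such as "District of Columbia" or "P, BBB", stay in one column.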
Done in a single pipeline, starting from df:
df %>%
  rename(x = `out[[93]]`) %>%
  mutate(x = stringr::str_split(x, '\n\n', simplify = TRUE)[2]) %>%
  mutate(x = strsplit(x, '\n')) %>%
  unnest(x) %>%
  .[-c(1:3), ] %>%
  separate(x, c("a", "State", "State Drug Formulary", "Two-line Rx Format", "Permissive or Mandatory*", "How to Prevent Substitution", "Cost Savings Pass-on", "Patient Consent**"), sep = "\\s{2,}") %>%
  select(-a)
You can try this:
# name the column x directly; the `out[[93]]` column name only exists
# when the data frame is built from the object out
tibble(x = pdf_text(url4)[[93]]) %>%
  mutate(x = stringr::str_split(x, '\n\n', simplify = TRUE)[2]) %>%
  mutate(x = strsplit(x, '\n')) %>%
  unnest(x) %>%
  .[-c(1:3), ] %>%
  separate(x, c("a", "State", "State Drug Formulary", "Two-line Rx Format", "Permissive or Mandatory*", "How to Prevent Substitution", "Cost Savings Pass-on", "Patient Consent**"), sep = "\\s{2,}") %>%
  select(-a)