Missing data when separating a column into multiple columns in R

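I scraped a table from a PDF, and all of its content ended up in a single element of a data frame. I managed to split the content into separate columns, but R gets the columns mixed up. The first column, "State", should contain all the state names, but it is blank after separating. The second column, "State Drug Formulary", wrongly contains the state names instead. A lot of other information is missing as well. Is there any way to fix this?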

For simplicity, I renamed the column to "x".

library(tabulizer)
library(pdftools)
library(rJava)
library(tidyverse)
url4 = "https://oppe.pharmacy.washington.edu/PracticumSite/forms/2019_Survey_of_Pharmacy_Law.pdf?-session=Students_Session%3A42F94F5D0a61a20754trv33D875D&fbclid=IwAR0qeK2tYmyI7T_8ict1Hnew9JxPkpt0bvajI3KL3IFDWg6JHNSSFWGlKY4"

out <- pdf_text(url4)
df=as.data.frame(out[[93]],header=F)
df = df %>%
  rename(x = `out[[93]]`) %>% 
    mutate(x=strsplit(x, "\n")) %>%
    unnest(x)
df=df[-c(1:2),]
df2=df %>% separate(x, c("State", "State Drug Formulary","Two-line Rx Format","Permissive or Mandatory*","How to Prevent Substitution","Cost Savings Pass-on","Patient Consent**"))

What the table should look like: it is on page 82 of the original document, if you open the source.

I also tried this, which kept the column names but dropped the data:

df3 = df %>% separate(x, sep = " ", into = c("State", "State Drug Formulary","Two-line Rx Format","Permissive or Mandatory*","How to Prevent Substitution","Cost Savings Pass-on","Patient Consent**"))

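Page 82 also contains other content, such as the heading "21. Drug Product Selection Laws".

It is best to strip that out first. Starting from the original one-column df (before the rename):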

dummy <- strsplit(df$`out[[93]]`, '\n\n')

This splits the page into four parts; the table you are looking for is the second element of that list.
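As a rough illustration of that step, here is a minimal base-R sketch. The `page` string is a made-up stand-in for `out[[93]]` (it is not the real PDF text), and the names `page`, `parts` and `table_lines` are hypothetical; it only shows how splitting on blank lines isolates the table block.

```r
# Made-up stand-in for one page of pdf_text() output (not the real PDF text).
page <- paste0(
  "21. Drug Product Selection Laws\n\n",
  "   State      Formulary   Two-line\n",
  "   Alabama    None        Yes\n",
  "   Alaska     None        No\n\n",
  "* footnote text\n\n",
  "82"
)

# Blank lines ("\n\n") separate the blocks on the page.
parts <- strsplit(page, "\n\n", fixed = TRUE)[[1]]
length(parts)

# In this toy page the second block is the table; split it into one string per row.
table_lines <- strsplit(parts[2], "\n", fixed = TRUE)[[1]]
table_lines
```

On the real page the position of the table block may differ, so it is worth printing the list and checking which element holds the table before indexing into it.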

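df2 <- df %>%
  rename(x = `out[[93]]`) %>%
  # keep only the second blank-line-delimited block (the table)
  mutate(x = stringr::str_split(x, '\n\n', simplify = TRUE)[2]) %>%
  # one row per line of the table
  mutate(x = strsplit(x, '\n')) %>%
  unnest(x) %>%
  # drop the header lines
  .[-c(1:3), ]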

Now df2 holds the contents of the table. So, separating on runs of two or more spaces,

df2 %>% separate(x, c("a","State", "State Drug Formulary","Two-line Rx Format","Permissive or Mandatory*","How to Prevent Substitution","Cost Savings Pass-on","Patient Consent**"), sep = "\\s{2,}") %>%
  select(-a)

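gives the result. 'a' is a dummy column: each line starts with whitespace, so separate produces an empty value at the front. Here is part of the result.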

  State   `State Drug Fo…` `Two-line Rx F…` `Permissive or…` `How to Preven…`
   <chr>   <chr>            <chr>            <chr>            <chr>           
 1 Alabama None             Yes              P, BBB           A               
 2 Alaska  None             No               P                B               
 3 Arizona None             No               P                I               
 4 Arkans… None             No               P                B               
 5 Califo… None             No               P                EE              
 6 Colora… None             No               P                J               
 7 Connec… None             No               P                E, F            
 8 Delawa… None             No               P                E               
 9 Distri… Positive         No               P                B               
10 Florida Negative L       No               M                B   
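To see why the separator matters, here is a small base-R sketch on a made-up row (the `row` string and the names `by_space` and `by_gap` are hypothetical, shaped like the extracted lines rather than copied from the PDF). For context, `separate()` with no `sep` splits on any run of non-alphanumeric characters, and `sep = " "` splits on every single space, so in both cases the leading whitespace gives an empty first piece and the pieces no longer line up with the column names.

```r
# Made-up row in the same shape as the extracted lines: leading spaces,
# columns separated by runs of two or more spaces.
row <- "   District of Columbia   Positive     No     P     B"

# Splitting on every single space breaks the multi-word state name apart
# and yields several empty pieces.
by_space <- strsplit(row, " ", fixed = TRUE)[[1]]
length(by_space)

# Splitting on two or more whitespace characters keeps the name together;
# the leading spaces produce an empty first field (the dummy column "a").
by_gap <- strsplit(row, "\\s{2,}", perl = TRUE)[[1]]
by_gap
```

This relies on the assumption that no cell in the table contains two consecutive spaces; a cell that does would be split into extra pieces.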

Starting from df, the whole thing can be done in a single pipeline:

df %>%
  rename(x = `out[[93]]`) %>%
  mutate(x = stringr::str_split(x, '\n\n', simplify = TRUE)[2]) %>%
  mutate(x = strsplit(x, '\n')) %>%
  unnest(x) %>%
  .[-c(1:3),] %>%
  separate(x, c("a","State", "State Drug Formulary","Two-line Rx Format","Permissive or Mandatory*","How to Prevent Substitution","Cost Savings Pass-on","Patient Consent**"), sep = "\\s{2,}") %>%
  select(-a)

Or, starting straight from the PDF, you can try this:

as.data.frame(pdf_text(url4)[[93]]) %>%
  # the column is no longer named `out[[93]]` here, so rename it by position
  rename(x = 1) %>%
  mutate(x = stringr::str_split(x, '\n\n', simplify = TRUE)[2]) %>%
  mutate(x = strsplit(x, '\n')) %>%
  unnest(x) %>%
  .[-c(1:3),] %>%
  separate(x, c("a","State", "State Drug Formulary","Two-line Rx Format","Permissive or Mandatory*","How to Prevent Substitution","Cost Savings Pass-on","Patient Consent**"), sep = "\\s{2,}") %>%
  select(-a)