How do I combine some vector elements in the same vector using R?

Question

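I extracted a table from a PDF using pdftools in R. The columns of the table in the PDF contain multi-line text. To make things easier, I replaced every run of two or more whitespace characters with "|". The problem I'm running into is that, because of the multi-line cells and the way the table is formatted in the PDF, the data comes in out of order.

The data I extracted looks like this: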

    scale_definitions <- c("", "                                        to lack passion                        easily annoyed", 
"      Excitable", "                                        to lack a sense of urgency             emotionally volatile", 
"", "                                        naive                                  mistrustful", 
"      Skeptical", "                                        gullible                               cynical", 
"", "                                        overly confident                       too conservative", 
"      Cautious", "                                        to make risky decisions                risk averse", 
"", "                                        to avoid conflict                      aloof and remote", 
"      Reserved", "                                        too sensitive                          indifferent to others' feelings", 
"", "                                        unengaged                              uncooperative", 
"      Leisurely", "                                        self-absorbed                          stubborn", 
"", "                                        unduly modest                          arrogant", 
"      Bold", "                                        self-doubting                          entitled and self-promoting", 
"", "                                        over controlled                        charming and fun", 
"      Mischievous", "                                        inflexible                             careless about commitments", 
"", "                                        repressed                              dramatic", 
"      Colorful", "                                        apathetic                              noisy", 
"", "                                        too tactical                           impractical", 
"      Imaginative", "                                        to lack vision                         eccentric", 
"", "                                        careless about details                 perfectionistic", 
"      Diligent", "                                        easily distracted                      micromanaging", 
"", "                                        possibly insubordinate                 respectful and deferential", 
"      Dutiful", "                                        too independent                        eager to please"
)

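library(stringr)

# the backslash has to be doubled inside an R string literal
scale_definitions <- scale_definitions %>% str_replace_all("\\s{2,}", "|")

What is the best way to get this into a data frame?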

Answer 1

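Unfortunately a reprex would be complicated, so here is a description of how to get to a structured df:

I'm afraid you have to use pdftools::pdf_data() instead of pdftools::pdf_text().

This way you get a list with one df per page. In these dfs you get one row for each word on the page, together with its exact position (plus its extent, i.e. width and height, IIRC). With this you can write a parser that accomplishes your task. It will take some work, but it is the only way I know of to solve this kind of problem.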
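As a rough illustration of what such a parser could look like, here is a minimal sketch. It does not read a real PDF: `page` is a made-up stand-in for one element of the list returned by pdftools::pdf_data() (the real data frames also carry width, height and space columns), and the x cut points 100 and 350 are purely hypothetical column boundaries that would have to be read off the actual document. It also assumes that words on the same line share exactly the same y value, which may require rounding on real data.

```r
library(dplyr)

# Made-up stand-in for one page of pdftools::pdf_data() output
page <- tibble::tribble(
  ~x,  ~y,  ~text,
  150, 100, "to",
  165, 100, "lack",
  190, 100, "passion",
  400, 100, "easily",
  440, 100, "annoyed",
   30, 110, "Excitable",
  150, 120, "to",
  165, 120, "lack",
  190, 120, "a",
  200, 120, "sense",
  230, 120, "of",
  245, 120, "urgency",
  400, 120, "emotionally",
  460, 120, "volatile"
)

parsed <- page %>%
  # assign each word to a table column by its x position (hypothetical cut points)
  mutate(col = cut(x, breaks = c(0, 100, 350, Inf),
                   labels = c("scale", "low", "high"))) %>%
  # put words into reading order: top to bottom, then left to right
  arrange(y, x) %>%
  # glue the words of each cell back together
  group_by(y, col) %>%
  summarise(text = paste(text, collapse = " "), .groups = "drop") %>%
  # one row per text line, one column per table column
  tidyr::pivot_wider(names_from = col, values_from = text)
```

Each row of `parsed` then corresponds to one line of text on the page, and the scale name still has to be filled into the neighbouring lines (for example with tidyr::fill()).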

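Update:

I found a readr function that helps in your case, because we can assume that the columns have fixed widths in characters (nchar()). Note that it has to be applied to the original vector, before the runs of spaces are replaced with "|", since the column positions depend on that spacing: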

library(tidyverse)

scale_definitions %>%
    # readr >= 2.0 treats a multi-element character vector as file paths,
    # so collapse it into one string and mark it as literal data with I()
    paste(collapse = "\n") %>%
    I() %>%
    # parse into columns by width, and therefore implicitly by start position
    readr::read_fwf(readr::fwf_widths(c(39, 40, 40), c("col1", "col2", "col3"))) %>%
    # build a group ID from the row number (three non-empty rows per scale)
    dplyr::mutate(grp = (dplyr::row_number() - 1) %/% 3) %>%
    # group by that ID
    dplyr::group_by(grp) %>%
    # impute the missing values in col1 within each group
    tidyr::fill(col1, .direction = "downup") %>%
    # remove the grouping to prevent unwanted behaviour downstream
    dplyr::ungroup() %>%
    # remove the auxiliary variable
    dplyr::select(-grp) %>%
    # convert to long format (safer for removing NAs)
    tidyr::pivot_longer(-col1, names_to = "cols", values_to = "vals") %>%
    # remove NAs
    dplyr::filter(!is.na(vals))

# A tibble: 44 x 3
   col1      cols  vals
   <chr>     <chr> <chr>
 1 Excitable col2  to lack passion
 2 Excitable col3  easily annoyed
 3 Excitable col2  to lack a sense of urgency
 4 Excitable col3  emotionally volatile
 5 Skeptical col2  naive
 6 Skeptical col3  mistrustful
 7 Skeptical col2  gullible
 8 Skeptical col3  cynical
 9 Cautious  col2  overly confident
10 Cautious  col3  too conservative
# ... with 34 more rows
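If one row per scale is wanted, the multi-line cells can be glued back together from this long format. The following is only a sketch under a few assumptions: `long` re-types the first eight rows of the output printed above so that the snippet runs on its own, and the "; " separator is an arbitrary choice.

```r
library(dplyr)
library(tidyr)

# First eight rows of the long result, re-typed for a self-contained example
long <- tibble::tribble(
  ~col1,       ~cols,  ~vals,
  "Excitable", "col2", "to lack passion",
  "Excitable", "col3", "easily annoyed",
  "Excitable", "col2", "to lack a sense of urgency",
  "Excitable", "col3", "emotionally volatile",
  "Skeptical", "col2", "naive",
  "Skeptical", "col3", "mistrustful",
  "Skeptical", "col2", "gullible",
  "Skeptical", "col3", "cynical"
)

wide <- long %>%
  # collect the lines belonging to the same cell
  group_by(col1, cols) %>%
  # merge them into a single string per cell
  summarise(vals = paste(vals, collapse = "; "), .groups = "drop") %>%
  # back to one column per original table column
  pivot_wider(names_from = cols, values_from = vals)
```

Here `wide` has one row per scale, with the texts of col2 and col3 each merged into a single string.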
