How do I combine some vector elements in the same vector using R?
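I used pdftools in R to extract a table from a PDF. The table columns in the PDF contain multi-line text. I replaced every run of two or more whitespace characters with "|" to make this easier. The problem I'm running into is that, because of the multi-line text and the way the table is formatted in the PDF, the data comes out of order relative to the original table.
The data I extracted looks like this: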
scale_definitions <- c("", " to lack passion easily annoyed",
" Excitable", " to lack a sense of urgency emotionally volatile",
"", " naive mistrustful",
" Skeptical", " gullible cynical",
"", " overly confident too conservative",
" Cautious", " to make risky decisions risk averse",
"", " to avoid conflict aloof and remote",
" Reserved", " too sensitive indifferent to others' feelings",
"", " unengaged uncooperative",
" Leisurely", " self-absorbed stubborn",
"", " unduly modest arrogant",
" Bold", " self-doubting entitled and self-promoting",
"", " over controlled charming and fun",
" Mischievous", " inflexible careless about commitments",
"", " repressed dramatic",
" Colorful", " apathetic noisy",
"", " too tactical impractical",
" Imaginative", " to lack vision eccentric",
"", " careless about details perfectionistic",
" Diligent", " easily distracted micromanaging",
"", " possibly insubordinate respectful and deferential",
" Dutiful", " too independent eager to please"
)
scale_definitions <- scale_definitions %>% str_replace_all("\\s{2,}", "|")
What is the best way to get this into a data frame?
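Unfortunately a reprex would be complicated, so here is a description of how to get to a structured df:
I'm afraid you will have to use pdftools::pdf_data() instead of pdftools::pdf_text().
That way you get one df for each page, in a list. In these dfs you get one row for each word on the page along with its exact position (plus its extent, IIRC). With that you can write a parser to do the job. It will take some work, but it is the only way I know of to solve this kind of problem.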
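The parser idea can be sketched in base R. This is only an illustration under stated assumptions: the `words` data frame below is hand-made to imitate the `x`, `y` and `text` columns that pdftools::pdf_data() returns for a page, the coordinates are invented, and the column boundaries passed to cut() are hypothetical values you would have to read off your own PDF.

```r
# Hypothetical stand-in for one page of pdftools::pdf_data() output:
# one row per word with the position of its bounding box.
# All coordinates here are invented for illustration.
words <- data.frame(
  x    = c(150, 165, 190, 330, 365, 40,
           150, 165, 190, 200, 230, 245, 330, 400),
  y    = c(100, 100, 100, 100, 100, 106,
           112, 112, 112, 112, 112, 112, 112, 112),
  text = c("to", "lack", "passion", "easily", "annoyed", "Excitable",
           "to", "lack", "a", "sense", "of", "urgency",
           "emotionally", "volatile"),
  stringsAsFactors = FALSE
)

# Read words top to bottom, then left to right.
words <- words[order(words$y, words$x), ]

# Assign each word to a table column by its x position.
# The break points 140 and 320 are assumptions, not values from a real PDF.
words$col <- cut(words$x,
                 breaks = c(0, 140, 320, Inf),
                 labels = c("col1", "col2", "col3"))

# Glue together the words that share a visual line (same y) and column.
lines <- aggregate(words["text"],
                   by = list(y = words$y, col = words$col),
                   FUN = paste, collapse = " ")
lines <- lines[order(lines$col, lines$y), ]
print(lines)
```

Because every cell is rebuilt from word positions, this does not depend on how many spaces pdf_text() happened to put between the columns. In a real document, words on the same visual line may differ slightly in `y`, so some rounding or binning of `y` would probably be needed first.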
Update:
I found a readr function that helps in your case, since we can assume the columns have fixed widths (nchar()):
library(tidyverse)
scale_definitions %>%
  # parse into columns by length and therefore implicitly by start position
  readr::read_fwf(fwf_widths(c(39, 40, 40), c("col1", "col2", "col3"))) %>%
  # build group ID from row number
  dplyr::mutate(grp = (dplyr::row_number() - 1) %/% 3) %>%
  # form groups
  dplyr::group_by(grp) %>%
  # impute missing value in col1
  tidyr::fill(col1, .direction = "downup") %>%
  # remove grouping to prevent unwanted behaviour downstream
  dplyr::ungroup() %>%
  # remove auxiliary variable
  dplyr::select(-grp) %>%
  # convert to long format (makes it safer to remove NAs)
  tidyr::pivot_longer(-col1, names_to = "cols", values_to = "vals") %>%
  # remove NAs
  dplyr::filter(!is.na(vals))
# A tibble: 44 x 3
   col1      cols  vals
   <chr>     <chr> <chr>
 1 Excitable col2  to lack passion
 2 Excitable col3  easily annoyed
 3 Excitable col2  to lack a sense of urgency
 4 Excitable col3  emotionally volatile
 5 Skeptical col2  naive
 6 Skeptical col3  mistrustful
 7 Skeptical col2  gullible
 8 Skeptical col3  cynical
 9 Cautious  col2  overly confident
10 Cautious  col3  too conservative
# ... with 34 more rows
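If you want a single row per scale rather than the long format, the wrapped lines can be pasted back together. The following is a minimal base R sketch, not part of the pipeline above: `long` is a hand-typed copy of the first eight rows of the printed result, and the "; " separator is an arbitrary choice.

```r
# Hand-typed copy of the first eight rows of the long result shown above.
long <- data.frame(
  col1 = rep(c("Excitable", "Skeptical"), each = 4),
  cols = rep(c("col2", "col3"), times = 4),
  vals = c("to lack passion", "easily annoyed",
           "to lack a sense of urgency", "emotionally volatile",
           "naive", "mistrustful", "gullible", "cynical"),
  stringsAsFactors = FALSE
)

# For each scale (rows) and table column (columns), join the wrapped lines.
# The "; " separator is an assumption; use whatever suits your data.
m <- tapply(long$vals, list(long$col1, long$cols), paste, collapse = "; ")

# Turn the character matrix into a data frame with one row per scale.
wide <- data.frame(scale = rownames(m), m,
                   row.names = NULL, stringsAsFactors = FALSE)
print(wide)
```

This answers the question in the title most directly: the elements that belong to the same scale and column end up combined in one cell.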