Manipulating data from .txt files within R
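Problem description
Hello,
I am working on a data plan for my lab, which will begin a blinded clinical trial in January. Part of this task is setting up some data processing pipelines so that, once all the data are collected, we can run the code quickly.
One outcome measure we are using is a behavioral test. Someone developed a JavaScript program that scores it automatically; however, the output looks like 5 tables stacked on top of one another. With the help of some Stack Overflow users, I was able to develop a pipeline that restructures a single txt file into a data frame that can then be analyzed. The problem I now have is how to process all the files at once.
My idea is to load all the files into a list and then manipulate each element of the list with list.map or lapply. However, I have run into two problems, which I outline below.
First, here are the code and data for processing a single data frame.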
input <- c("Cognitive Screen", "Subtest/Section\t\t\tScore\tT-Score",
"1. Line Bisection\t\t9\t53", "2. Semantic Memory\t\t8\t51",
"3. Word Fluency\t\t\t1\t56*", "4. Recognition Memory\t\t40\t59",
"5. Gesture Object Use\t\t2\t68", "6. Arithmetic\t\t\t5\t49",
"Cognitive TOTAL\t\t\t65", "", "Language Battery", "Part 1: Language Comprehension",
"Spoken Language\t\t\tScore\tT-Score", "7. Spoken Words\t\t\t17\t45*",
"9. Spoken Sentences\t\t25\t53*", "11. Spoken Paragraphs\t\t4\t60",
"Spoken Language TOTAL\t\t46\t49*", "", "Written Language\t\tScore\tT-Score",
"8. Written Words\t\t14\t45*", "10. Written Sentences\t\t21\t48*",
"Written Language TOTAL\t\t35\t46*", "", "Part 2: Expressive Language",
"Repetition\t\t\tScore\tT-Score", "12. Words\t\t\t24\t55*", "13. Complex Words\t\t8\t52*",
"14. Nonwords\t\t\t10\t58", "15. Digit Strings\t\t8\t55", "16. Sentences\t\t\t12\t63",
"Repetition TOTAL\t\t62\t57*", "", "Spoken Language\t\t\tScore\tT-Score",
"17. Naming Objects\t\t30\t55*", "18. Naming Actions\t\t36\t63",
"3. Word Fluency\t\t\t12\t56*", "Naming TOTAL\t\t\t56\t57*",
"", "Spoken Picture Description\tScore\tT-Score", "19. Spoken Picture Description\t\t",
"", "Reading Aloud\t\t\tScore\tT-Score", "20. Words\t\t\t25\t50*",
"21. Complex Words\t\t8\t51*", "22. Function Words\t\t3\t62",
"23. Nonwords\t\t\t6\t51*", "Reading TOTAL\t\t\t42\t50*", "",
"Writing\t\t\t\tScore\tT-Score", "24. Writing: Copying\t\t26\t52",
"25. Writing Picture Names\t14\t53*", "26. Writing to Dictation\t28\t68",
"Writing TOTAL\t\t\t68\t58*", "", "Written Picture Description\tScore\tT-Score",
"27. Written Picture Description\t\t")
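Once the input file is created, here is the code I used to create the data frame (I know the data frame is all character; I will fix that later).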
library(tidyverse)
input <- read_lines('Example_data')
# do the match and keep only the second column
header <- as_tibble(str_match(input, "^(.*?)\\s+Score.*")[, 2, drop = FALSE])
colnames(header) <- 'title'
# add index to the list so we can match the scores that come after
header <- header %>%
  mutate(row = row_number()) %>%
  fill(title) # copy title down
# pull off the scores on the numbered rows
scores <- str_match(input, "^([0-9]+[. ]+)(.*?)\\s+([0-9]+)\\s+([0-9*]+)$")
scores <- as_tibble(scores) %>%
  mutate(row = row_number())
# keep only rows that are numbered and delete first column
scores <- scores[!is.na(scores[[1]]), -1]
# merge the header with the scores to give each section
data <- left_join(scores,
                  header,
                  by = 'row'
)
#create correct header in new dataframe
data2 <- data.frame(domain = as.vector(str_replace(data$title, "Subtest/Section", "cognition")),
                    subtest = data$V3,
                    score = data$V4,
                    t.score = data$V5)
head(data2)
OK, now for multiple data files. My plan is to put all the txt files in one folder and then build a list containing all of them, like this:
# library(rlist)
# setwd("C:/Users/Brahma/Desktop/CAT TEXT FILES/Data")
# temp = list.files(pattern = "*Example")
# myfiles = lapply(temp, readLines)
Reproducible example files:
myfiles <- list(c("Cognitive Screen", "Subtest/Section\t\t\tScore\tT-Score",
"1. Line Bisection\t\t9\t53", "2. Semantic Memory\t\t8\t51",
"3. Word Fluency\t\t\t1\t56*", "4. Recognition Memory\t\t40\t59",
"5. Gesture Object Use\t\t2\t68", "6. Arithmetic\t\t\t5\t49",
"Cognitive TOTAL\t\t\t65", "", "Language Battery", "Part 1: Language Comprehension",
"Spoken Language\t\t\tScore\tT-Score", "7. Spoken Words\t\t\t17\t45*",
"9. Spoken Sentences\t\t25\t53*", "11. Spoken Paragraphs\t\t4\t60",
"Spoken Language TOTAL\t\t46\t49*", "", "Written Language\t\tScore\tT-Score",
"8. Written Words\t\t14\t45*", "10. Written Sentences\t\t21\t48*",
"Written Language TOTAL\t\t35\t46*", "", "Part 2: Expressive Language",
"Repetition\t\t\tScore\tT-Score", "12. Words\t\t\t24\t55*", "13. Complex Words\t\t8\t52*",
"14. Nonwords\t\t\t10\t58", "15. Digit Strings\t\t8\t55", "16. Sentences\t\t\t12\t63",
"Repetition TOTAL\t\t62\t57*", "", "Spoken Language\t\t\tScore\tT-Score",
"17. Naming Objects\t\t30\t55*", "18. Naming Actions\t\t36\t63",
"3. Word Fluency\t\t\t12\t56*", "Naming TOTAL\t\t\t56\t57*",
"", "Spoken Picture Description\tScore\tT-Score", "19. Spoken Picture Description\t\t",
"", "Reading Aloud\t\t\tScore\tT-Score", "20. Words\t\t\t25\t50*",
"21. Complex Words\t\t8\t51*", "22. Function Words\t\t3\t62",
"23. Nonwords\t\t\t6\t51*", "Reading TOTAL\t\t\t42\t50*", "",
"Writing\t\t\t\tScore\tT-Score", "24. Writing: Copying\t\t26\t52",
"25. Writing Picture Names\t14\t53*", "26. Writing to Dictation\t28\t68",
"Writing TOTAL\t\t\t68\t58*", "", "Written Picture Description\tScore\tT-Score",
"27. Written Picture Description\t\t"), c("Cognitive Screen",
"Subtest/Section\t\t\tScore\tT-Score", "1. Line Bisection\t\t9\t53",
"2. Semantic Memory\t\t8\t51", "3. Word Fluency\t\t\t1\t56*",
"4. Recognition Memory\t\t40\t59", "5. Gesture Object Use\t\t2\t68",
"6. Arithmetic\t\t\t5\t49", "Cognitive TOTAL\t\t\t65", "", "Language Battery",
"Part 1: Language Comprehension", "Spoken Language\t\t\tScore\tT-Score",
"7. Spoken Words\t\t\t17\t45*", "9. Spoken Sentences\t\t25\t53*",
"11. Spoken Paragraphs\t\t4\t60", "Spoken Language TOTAL\t\t46\t49*",
"", "Written Language\t\tScore\tT-Score", "8. Written Words\t\t14\t45*",
"10. Written Sentences\t\t21\t48*", "Written Language TOTAL\t\t35\t46*",
"", "Part 2: Expressive Language", "Repetition\t\t\tScore\tT-Score",
"12. Words\t\t\t24\t55*", "13. Complex Words\t\t8\t52*", "14. Nonwords\t\t\t10\t58",
"15. Digit Strings\t\t8\t55", "16. Sentences\t\t\t12\t63", "Repetition TOTAL\t\t62\t57*",
"", "Spoken Language\t\t\tScore\tT-Score", "17. Naming Objects\t\t30\t55*",
"18. Naming Actions\t\t36\t63", "3. Word Fluency\t\t\t12\t56*",
"Naming TOTAL\t\t\t56\t57*", "", "Spoken Picture Description\tScore\tT-Score",
"19. Spoken Picture Description\t\t", "", "Reading Aloud\t\t\tScore\tT-Score",
"20. Words\t\t\t25\t50*", "21. Complex Words\t\t8\t51*", "22. Function Words\t\t3\t62",
"23. Nonwords\t\t\t6\t51*", "Reading TOTAL\t\t\t42\t50*", "",
"Writing\t\t\t\tScore\tT-Score", "24. Writing: Copying\t\t26\t52",
"25. Writing Picture Names\t14\t53*", "26. Writing to Dictation\t28\t68",
"Writing TOTAL\t\t\t68\t58*", "", "Written Picture Description\tScore\tT-Score",
"27. Written Picture Description\t\t"), c("Cognitive Screen",
"Subtest/Section\t\t\tScore\tT-Score", "1. Line Bisection\t\t9\t53",
"2. Semantic Memory\t\t8\t51", "3. Word Fluency\t\t\t1\t56*",
"4. Recognition Memory\t\t40\t59", "5. Gesture Object Use\t\t2\t68",
"6. Arithmetic\t\t\t5\t49", "Cognitive TOTAL\t\t\t65", "", "Language Battery",
"Part 1: Language Comprehension", "Spoken Language\t\t\tScore\tT-Score",
"7. Spoken Words\t\t\t17\t45*", "9. Spoken Sentences\t\t25\t53*",
"11. Spoken Paragraphs\t\t4\t60", "Spoken Language TOTAL\t\t46\t49*",
"", "Written Language\t\tScore\tT-Score", "8. Written Words\t\t14\t45*",
"10. Written Sentences\t\t21\t48*", "Written Language TOTAL\t\t35\t46*",
"", "Part 2: Expressive Language", "Repetition\t\t\tScore\tT-Score",
"12. Words\t\t\t24\t55*", "13. Complex Words\t\t8\t52*", "14. Nonwords\t\t\t10\t58",
"15. Digit Strings\t\t8\t55", "16. Sentences\t\t\t12\t63", "Repetition TOTAL\t\t62\t57*",
"", "Spoken Language\t\t\tScore\tT-Score", "17. Naming Objects\t\t30\t55*",
"18. Naming Actions\t\t36\t63", "3. Word Fluency\t\t\t12\t56*",
"Naming TOTAL\t\t\t56\t57*", "", "Spoken Picture Description\tScore\tT-Score",
"19. Spoken Picture Description\t\t", "", "Reading Aloud\t\t\tScore\tT-Score",
"20. Words\t\t\t25\t50*", "21. Complex Words\t\t8\t51*", "22. Function Words\t\t3\t62",
"23. Nonwords\t\t\t6\t51*", "Reading TOTAL\t\t\t42\t50*", "",
"Writing\t\t\t\tScore\tT-Score", "24. Writing: Copying\t\t26\t52",
"25. Writing Picture Names\t14\t53*", "26. Writing to Dictation\t28\t68",
"Writing TOTAL\t\t\t68\t58*", "", "Written Picture Description\tScore\tT-Score",
"27. Written Picture Description\t\t"))
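This is where the trouble starts
I tried using lapply and list.map from the rlist package. First, lapply does not seem to like piped functions, so I tried working step by step. I also tried writing a function for this step.
Create the tibbles. This works!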
list_header <- lapply(myfiles, as.tibble)
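Error incoming: trying to start the data manipulation
list_header2 <- lapply(list_header, str_match(list_header, "^(.*?)\\s+Score.*")[, 2, drop = FALSE])
This line of code gives the following error:
Error in match.fun(FUN) :
  'str_match(list_header, "^(.*?)\\s+Score.*")[, 2, drop = FALSE]' is not a function, character or symbol
In addition: Warning message:
In stri_match_first_regex(string, pattern, opts_regex = opts(pattern)) :
  argument is not an atomic vector; coercing
So I tried writing a function to put here:
drop_rows <- function(df) {
  new_df <- str_match_all(df[[1:3]]$value, "^(.*?)\\s+Score.*")
}
list_header2 <- lapply(list_header, drop_rows)
Now I get this error:
Error in match.fun(FUN) :
  'str_match(list_header, "^(.*?)\\s+Score.*")[, 2, drop = FALSE]' is not a function, character or symbol
In addition: Warning message:
In stri_match_first_regex(string, pattern, opts_regex = opts(pattern)) :
  argument is not an atomic vector; coercing
Summary:
The code provided works for loading a single txt file. However, when I try to run the code to batch-process multiple lists, I run into trouble. If anyone can offer some insight into how to fix this error, **I think** I will be able to finish the rest. That said, if you feel inclined to help implement the rest of the code, I won't object.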
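Rather than try to debug your code, I decided to look for a solution that works on your example data. The following seems to work on both a single vector and a list of vectors: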
library(tidyverse)
text_to_tibb <- function(char_vec){
  str_split(char_vec, "\t") %>%
    # drop empty fields, then turn each line into a one-row tibble
    map_dfr(~ .[nchar(.) > 0] %>% matrix(., nrow = 1) %>%
              as_tibble
    ) %>%
    filter(!is.na(V2), !str_detect(V1, "TOTAL")) %>%
    # header rows are the ones that do not start with an item number
    mutate(title = str_detect(V1, "^\\d+\\.", negate = TRUE),
           group = cumsum(title)
    ) %>%
    group_by(group) %>%
    mutate(domain = first(V1)) %>%
    filter(!title) %>%
    ungroup() %>%
    select(domain, V1, V2, V3, -title, -group) %>%
    mutate(V1 = str_remove(V1, "^\\d+\\. "),
           domain = str_replace(domain, "Subtest.*", "Cognition")) %>%
    rename(subtest = V1, score = V2, t_score = V3)
}
If you run it on your input variable, you should get a clean tibble:
text_to_tibb(input)
#### OUTPUT ####
# A tibble: 26 x 4
   domain           subtest            score t_score
   <chr>            <chr>              <chr> <chr>
 1 Cognition        Line Bisection     9     53
 2 Cognition        Semantic Memory    8     51
 3 Cognition        Word Fluency       1     56*
 4 Cognition        Recognition Memory 40    59
 5 Cognition        Gesture Object Use 2     68
 6 Cognition        Arithmetic         5     49
 7 Spoken Language  Spoken Words       17    45*
 8 Spoken Language  Spoken Sentences   25    53*
 9 Spoken Language  Spoken Paragraphs  4     60
10 Written Language Written Words      14    45*
# … with 16 more rows
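The columns in this result are still character, and some T-scores carry a trailing asterisk. Here is a minimal sketch of one way to convert them. It assumes the asterisk is only a marker worth keeping in its own column, since its meaning is not stated in the question. The `t_flag` column name and the small `res` tibble are made up for illustration:

```r
library(dplyr)
library(stringr)

# Small stand-in for the output of text_to_tibb(input) (illustrative rows only)
res <- tibble(
  domain  = c("Cognition", "Cognition", "Spoken Language"),
  subtest = c("Line Bisection", "Word Fluency", "Spoken Words"),
  score   = c("9", "1", "17"),
  t_score = c("53", "56*", "45*")
)

res_num <- res %>%
  mutate(
    # hypothetical flag column: TRUE where the T-score had an asterisk
    t_flag  = str_detect(t_score, fixed("*")),
    score   = as.integer(score),
    # strip the asterisk before converting to integer
    t_score = as.integer(str_remove(t_score, fixed("*")))
  )

res_num
```

Doing the conversion as a separate step keeps the parsing function unchanged, so the flag can be dropped or renamed later without touching it.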
It also works on the list of vectors you included above. Just use lapply or purrr::map:
map(myfiles, text_to_tibb)
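For reference, the match.fun(FUN) error in the question most likely comes from passing the result of a call to lapply(), which expects a function as its second argument. A minimal sketch of the difference, using a shortened, made-up `files` list in place of myfiles:

```r
library(stringr)

# Two short character vectors standing in for myfiles (illustrative data)
files <- list(
  c("Subtest/Section\t\t\tScore\tT-Score", "1. Line Bisection\t\t9\t53"),
  c("Repetition\t\t\tScore\tT-Score", "12. Words\t\t\t24\t55*")
)

# Fails: the second argument is a matrix (the result of a call), not a function
# lapply(files, str_match(files, "^(.*?)\\s+Score.*")[, 2, drop = FALSE])

# Works: wrap the step in an anonymous function that receives one element at a time
headers <- lapply(files, function(x) {
  str_match(x, "^(.*?)\\s+Score.*")[, 2, drop = FALSE]
})

headers[[1]]
```

The same reasoning applies to map(): text_to_tibb is passed by name, without parentheses, so it is called once per list element.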
If you think there may be inconsistencies in some of the tables, you may want to try safely:
safe_text_to_tibb <- safely(text_to_tibb)
map(myfiles, safe_text_to_tibb)
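safely() returns, for each element, a list with a result slot and an error slot, so the successes and failures still have to be separated afterwards. Here is a rough sketch of that step. It uses a toy `parse_one()` function (hypothetical, standing in for text_to_tibb) and made-up file names:

```r
library(purrr)
library(dplyr)

# Toy parser standing in for text_to_tibb(): fails on empty input
parse_one <- function(char_vec) {
  if (length(char_vec) == 0) stop("empty file")
  tibble(line = char_vec)
}

# Named list so each result can be traced back to its (made-up) file name
toy_files <- list(
  "p01.txt" = c("a", "b"),
  "p02.txt" = character(0),
  "p03.txt" = c("c")
)

out <- map(toy_files, safely(parse_one))

# Names of the files whose parsing raised an error
failed <- names(keep(out, ~ !is.null(.x$error)))

# Stack the successful results; .id records the source file
combined <- out %>%
  map("result") %>%
  compact() %>%
  bind_rows(.id = "file")
```

If myfiles is given names (for example the file names returned by list.files), the same pattern should yield one long data frame with a column identifying each participant's file.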