Reading a text file with an unusual delimiter
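I am using an algorithm to lemmatize a vector of texts. The output is a .txt file, stored as shown below.
The first column lists the original word, the second lists the candidate lemmas, followed by some grammatical categories. I want to read this into R but don't know how. I have tried various delimiters, but none seems to work.
Ideally, I would like a data frame in R like the one below, where I read only the first occurrence of each lemma:
Perhaps the best option is to read the data, keep only the first occurrence (i.e. da da adv), then do something like text-to-columns and keep only the first two columns.
Output from the lemmatization algorithm: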
"<da>"
"da" adv
"da" sbu
"da" subst fork
"<dette>"
"dette" det dem nøyt ent
"dette" pron nøyt ent pers 3
"dette" verb inf
"<er>"
"være" verb pres <aux1/perf_part>
"<den>"
"den" det dem fem ent
"den" det dem mask ent
"den" pron mask fem ent pers 3
Desired structure:
da da
dette dette
er være
den den
This is an interesting one: read.table can read the file just fine:
s <- '"<da>"
"da" adv
"da" sbu
"da" subst fork
"<dette>"
"dette" det dem nøyt ent
"dette" pron nøyt ent pers 3
"dette" verb inf
"<er>"
"være" verb pres <aux1/perf_part>
"<den>"
"den" det dem fem ent
"den" det dem mask ent
"den" pron mask fem ent pers 3
'
x <- read.table(sep='', text=s, colClasses=c('character','character'), flush=TRUE, fill=TRUE)
> x
V1 V2 V3
1 <da>
2 da adv
3 da sbu
4 da subst fork
5 <dette>
6 dette det dem
7 dette pron nøyt
8 dette verb inf
9 <er>
10 være verb pres
11 <den>
12 den det dem
13 den det dem
14 den pron mask
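A note on why only V1 to V3 appear above: as documented for read.table, the number of columns is determined from the first five lines of input unless col.names says otherwise. The sketch below uses count.fields to show the field count per line; the shortened sample `s2` and the name `n_fields` are made up for illustration and are not part of the answer.

```r
# Shortened sample of the lemmatizer output (illustrative only).
s2 <- '"<da>"
"da" adv
"da" sbu
"da" subst fork
"<dette>"
"dette" det dem nøyt ent
'

# count.fields() splits on whitespace when sep = "" and respects the quotes,
# so it reports how many fields each line holds.
con <- textConnection(s2)
n_fields <- count.fields(con, sep = "")
close(con)

n_fields
# The widest of the first five lines is what read.table sizes the table by.
max(n_fields[1:5])
```

Lines with more fields than that are where fill=TRUE and flush=TRUE matter in the call above. If the extra grammatical tags were needed, passing a longer col.names vector would be one way to widen the table.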
Using the packages dplyr and tidyr, we can unpack this into:
library(dplyr)
library(tidyr)
(y <- x %>% mutate(a=grepl('<', V1, fixed=TRUE), b=cumsum(a)) %>%
group_by(b) %>%
summarise(verbs=list(t(unique(V1)))) %>%
unnest(cols=c(verbs)))
# A tibble: 4 x 2
b verbs[,1] [,2]
<int> <chr> <chr>
1 1 <da> da
2 2 <dette> dette
3 3 <er> være
4 4 <den> den
result <- y$verbs
result[,1] <- gsub('(<|>)', '', result[,1])
[,1] [,2]
[1,] "da" "da"
[2,] "dette" "dette"
[3,] "er" "være"
[4,] "den" "den"
This worked for me after copy-pasting the text into a text file:
#Read the data
data <- readLines('temp.txt')
#index where new group starts. I have considered no whitespace at the beginning
# of the string as an indication for new group.
groups <- !startsWith(data, ' ')
#Since the first word is same in the entire group, we take first word
#from 2nd element as 1st element is group name
value <- tapply(data, cumsum(groups), function(x)
  sub('"(\\w+).*', '\\1', trimws(x[2])))
#Create dataframe with group name and value.
data.frame(groups = data[groups], value)
# groups value
#1 "<da>" da
#2 "<dette>" dette
#3 "<er>" være
#4 "<den>" den
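The approach above depends on the reading lines starting with a space. If the file is indented with tabs, or the indentation was lost, the headers can be detected by their `"<...>"` shape instead. A minimal base R sketch, assuming every header is immediately followed by at least one reading line; the sample vector `lines` and the column names `word` and `lemma` are hypothetical.

```r
# Sample lines as readLines() might return them (illustrative only).
lines <- c(
  '"<da>"',
  '\t"da" adv',
  '\t"da" sbu',
  '"<er>"',
  '\t"være" verb pres <aux1/perf_part>'
)

# Drop leading and trailing whitespace so indentation no longer matters.
lines <- trimws(lines)

# A header is a quoted token wrapped in angle brackets, e.g. "<da>".
is_header <- grepl('^"<.*>"$', lines)

# Assumption: the first reading of a word sits on the line after its header.
first_reading <- lines[which(is_header) + 1]

result <- data.frame(
  word  = gsub('^"<|>"$', "", lines[is_header]),          # strip "< and >"
  lemma = sub('^"([^"]*)".*$', "\\1", first_reading),     # first quoted token
  stringsAsFactors = FALSE
)
result
```

For a real file, `lines` would come from `readLines('temp.txt')` as in the answer above.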