Keeping the row number in a data frame column
I have a bunch of .txt files (articles) in a folder, and I use a for loop to read the text from all of them in R:
input_loc <- "C:/Users/User/Desktop/Folder"
files <- dir(input_loc, full.names = TRUE)
text <- c()
for (f in files) {
text <- c(text, paste(readLines(f), collapse = "\n"))
}
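As a side note, the loop above grows `text` on every iteration. A minimal sketch of an alternative, assuming the folder holds only the article files: `vapply()` returns one string per file, and naming the result by file name keeps track of which article is which. The temporary folder and the two demo files are hypothetical stand-ins for the real articles.

```r
# Hypothetical setup: two small files in a temporary folder stand in for the real articles
input_loc <- file.path(tempdir(), "articles_demo")
dir.create(input_loc, showWarnings = FALSE)
writeLines(c("First paragraph.", "", "Second paragraph."), file.path(input_loc, "a.txt"))
writeLines("Only paragraph.", file.path(input_loc, "b.txt"))

# Restrict the listing to .txt files; list.files() returns them in alphabetical order
files <- list.files(input_loc, pattern = "\\.txt$", full.names = TRUE)

# One collapsed string per file, without growing a vector inside a loop
text <- vapply(files, function(f) paste(readLines(f), collapse = "\n"), character(1))

# Label each element with its file name so the article can be identified later
names(text) <- basename(files)
```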
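From here, I tokenize the text into paragraphs, which gives me every paragraph of every article:
library(tokenizers)  # provides tokenize_paragraphs()
paragraphs <- tokenize_paragraphs(text)
sapply(paragraphs, length)
paragraphs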
Then I unlist the result and convert it to a data frame:
par_unlisted<-unlist(paragraphs)
par_unlisted
par_unlisted_df<-as.data.frame(par_unlisted)
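But doing this loses the per-article paragraph numbering. For example, if the first article has 6 paragraphs, the first paragraph of the second article is still labelled [1] before unlisting, but becomes [7] afterwards.
What I would like is a data frame with one column holding the paragraph number and another column, called "article", holding the article number.
Thanks in advance.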
Edit
This is roughly what I get when I print paragraphs:
> paragraphs
[[1]]
[1] "The Miami Dolphins have decided to use their non-exclusive franchise
tag on wide receiver Jarvis Landry."
[2] "The Dolphins tweeted the announcement Tuesday, the first day teams
could use their franchise or transition tags. The salary for wide receivers
getting the franchise tag this offseason is expected to be around .2
million, which will be quite the raise for Landry, who made 4,000 last
season."
[[2]]
[1] "Despite months of little-to-no movement on contract negotiations,
Jarvis Landry has often stated his desire to stay in Miami."
[2] "The Dolphins used their lone tool to wipe away negotation-driven stress
-- at least in the immediate future -- and ensure Landry won't be lured away
from Miami, placing the franchise tag on the receiver on Tuesday, the team
announced."
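I want to keep the paragraph number ([n]) as a column in the data frame, because once I unlist, the paragraphs are no longer separated by article and paragraph; I just get them in one sequence. In the example I just posted, I no longer have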
[[1]]
[1] ...
[2] ...
[[2]]
[1] ...
[2] ...
but instead I get
[1] ...
[2] ...
[3] ...
[4] ...
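The flattening described above can be reproduced on toy data. This is only an illustrative sketch with made-up strings: it shows that `unlist()` discards the nesting, while `lengths()`, taken before unlisting, still records how many paragraphs belong to each article.

```r
# Toy stand-in for the list returned by tokenize_paragraphs()
toy <- list(c("a1", "a2"), c("b1", "b2"))

# After unlisting, the four paragraphs are indexed 1..4 with no article boundary
flat <- unlist(toy)

# The per-article paragraph counts are still available from the original list
per_article <- lengths(toy)
```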
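Consider iterating over the paragraphs list to build a list of data frames that carry the required article and paragraph numbers, then row-binding all of the data frames at the end.
Input data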
paragraphs <- list(
c("The Miami Dolphins have decided to use their non-exclusive franchise tag on wide receiver Jarvis Landry.",
"The Dolphins tweeted the announcement Tuesday, the first day teams could use their franchise or transition tags. The salary for wide receivers
getting the franchise tag this offseason is expected to be around .2 million, which will be quite the raise for Landry, who made 4,000 last
season."),
c("Despite months of little-to-no movement on contract negotiations, Jarvis Landry has often stated his desire to stay in Miami.",
"The Dolphins used their lone tool to wipe away negotation-driven stress -- at least in the immediate future -- and ensure Landry won't be lured away
from Miami, placing the franchise tag on the receiver on Tuesday, the team announced."))
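Data frame construction
# Build one data frame per article; i is the article number
df_list <- lapply(seq_along(paragraphs), function(i)
setNames(data.frame(i, seq_along(paragraphs[[i]]), paragraphs[[i]],
stringsAsFactors = FALSE),  # keep the paragraph text as character
c("article_num", "paragraph_num", "paragraph"))
)
# Stack the per-article data frames into a single data frame
final_df <- do.call(rbind, df_list)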
Output
final_df
# article_num paragraph_num paragraph
# 1 1 1 The Miami Dolphins have decided to use their non-e...
# 2 1 2 The Dolphins tweeted the announcement Tuesday, the...
# 3 2 1 Despite months of little-to-no movement on contrac...
# 4 2 2 The Dolphins used their lone tool to wipe away neg...
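The same table can also be built without creating one data frame per article. The following is a sketch of a vectorised alternative in base R, not the approach used in the answer above: `lengths()` gives the paragraph count per article, `rep()` repeats each article number that many times, and `sequence()` restarts the paragraph counter for every article. The toy strings and the name `final_df2` are assumptions for illustration.

```r
# Toy input standing in for the output of tokenize_paragraphs()
paragraphs <- list(
  c("Article 1, paragraph 1.", "Article 1, paragraph 2."),
  c("Article 2, paragraph 1.", "Article 2, paragraph 2.", "Article 2, paragraph 3.")
)

# Number of paragraphs in each article
n_par <- lengths(paragraphs)

final_df2 <- data.frame(
  article_num   = rep(seq_along(paragraphs), times = n_par),  # 1 1 2 2 2
  paragraph_num = sequence(n_par),                            # 1 2 1 2 3
  paragraph     = unlist(paragraphs),
  stringsAsFactors = FALSE
)
```

For a large number of articles this avoids the repeated `rbind` work, at the cost of being a little less explicit about what happens per article.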