Reading a file with one column where the rows hold the variable names
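I'm trying to do some sentiment analysis, but unfortunately I'm stuck at the very start: I can't even import the file.
The data is here: http://snap.stanford.edu/data/web-FineFoods.html
It is a 353MB .txt file that looks like this: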
product/productId: B001E4KFG0
review/userId: A3SGXH7AUHU8GW
review/profileName: delmartian
review/helpfulness: 1/1
review/score: 5.0
review/time: 1303862400
review/summary: Good Quality Dog Food
review/text: I have bought several of the Vitality canned dog food products and have
found them all to be of good quality. The product looks more like a stew than a
processed meat and it smells better. My Labrador is finicky and she appreciates this
product better than most.
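Every attempt I've made puts this data into a single column, and I'm not sure how to restructure it so that I can process it with tidytext.
I'd be happy if each line shown here ended up as its own column with a header.
Any pointers would be appreciated.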
Here is one approach using dplyr and tidyr:
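# assuming your data is in a file called reviews.txt
library(dplyr)
library(tidyr)

reviews <- readLines("reviews.txt")

df <- tibble(chars = trimws(reviews)) %>%
  # a new field starts on every line that begins with "product/..." or "review/...";
  # matching any ":" would wrongly split review text that contains a colon
  mutate(variable_num = cumsum(grepl("^(product|review)/[A-Za-z]+:", chars))) %>%
  # glue wrapped continuation lines back onto the field they belong to
  group_by(variable_num) %>%
  summarise(chars = paste0(chars, collapse = " ")) %>%
  separate(chars, into = c("variable", "value"), sep = ": ", extra = "merge") %>%
  select(-variable_num) %>%
  mutate(
    variable = sub(".*/", "", variable),          # drop the "product/" or "review/" prefix
    record_num = cumsum(variable == "productId")  # each record starts at productId
  ) %>%
  # one row per record, one column per field
  spread(variable, value, convert = TRUE)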
> df
  record_num helpfulness productId  profileName score summary  text        time   userId
       <int> <chr>       <chr>      <chr>       <dbl> <chr>    <chr>       <int>  <chr>
1          1 1/1         B001E4KFG0 delmartian      5 Good Qu~ "I have bo~ 1.30e9 A3SGX~
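To check how this kind of pipeline copes with wrapped review text, here is a small self-contained sketch that runs on an in-memory character vector instead of the 353MB file. The first record is a subset of the sample in the question; the second record (`X000000000` and its text) is made up purely for illustration. The sketch assumes every field line starts with `product/` or `review/`, so that a colon inside the review text does not start a new field, and the names `sample_lines`, `parsed` and `field_id` are hypothetical.

```r
library(dplyr)
library(tidyr)

# Record 1 is taken from the question; record 2 is invented for illustration.
# Note the wrapped text lines, the blank separator line, and the colon inside
# the second review text.
sample_lines <- c(
  "product/productId: B001E4KFG0",
  "review/score: 5.0",
  "review/text: I have bought several of the Vitality canned dog food products and have",
  "found them all to be of good quality.",
  "",
  "product/productId: X000000000",
  "review/score: 2.0",
  "review/text: Not what I expected: the label was wrong",
  "and the box was damaged."
)

parsed <- tibble(chars = trimws(sample_lines)) %>%
  filter(chars != "") %>%                       # drop blank separator lines
  # assumption: a field starts only on lines beginning with product/ or review/
  mutate(field_id = cumsum(grepl("^(product|review)/[A-Za-z]+:", chars))) %>%
  group_by(field_id) %>%
  summarise(chars = paste(chars, collapse = " ")) %>%  # rejoin wrapped lines
  separate(chars, into = c("variable", "value"), sep = ": ", extra = "merge") %>%
  mutate(
    variable = sub(".*/", "", variable),          # "review/score" becomes "score"
    record_num = cumsum(variable == "productId")  # a new record at each productId
  ) %>%
  select(-field_id) %>%
  spread(variable, value, convert = TRUE)         # one row per record

print(parsed)
```

The result should have one row per record with columns `record_num`, `productId`, `score` and `text`. From there the `text` column can be handed to tidytext, for example with `unnest_tokens()`, to get one word per row.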