How to read data from a structured text file
This dataset contains product metadata from Amazon.
The data looks like this:
Id: 0
ASIN: 0771044445
discontinued product

Id: 1
ASIN: 0827229534
title: Patterns of Preaching: A Sermon Sampler
group: Book
salesrank: 396585
similar: 5 0804215715 156101074X 0687023955 0687074231 082721619X
categories: 2

Id: 2
ASIN: 0738700797
title: Candlemas: Feast of Flames
group: Book
salesrank: 168596
similar: 5 0738700827 1567184960 1567182836 0738700525 0738700940
categories: 2
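How can I import this txt.gz file and extract only the information related to "Id:" and "group:"? In addition, if a block (the text between two blank lines) contains "discontinued product", I do not need any information from that block.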
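Here is my solution for reading your text. The file temp.txt contains the text you shared. You can replace the read_lines("temp.txt") line with appropriate code to access the text file from S3.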
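If the file is available locally as a .txt.gz, it may not need to be unpacked first: readr decompresses paths ending in .gz automatically. The following is a minimal sketch under that assumption. The temporary file and the `sample_lines` vector are made up for illustration and stand in for the real download.

```r
library(readr)

# Hypothetical sample: two records separated by a blank line.
sample_lines <- c("Id:   0", "ASIN: 0771044445", "  discontinued product", "",
                  "Id:   1", "ASIN: 0827229534", "  group: Book")

# Write a gzip-compressed copy to a temporary path (stand-in for the real file).
gz_path <- tempfile(fileext = ".txt.gz")
con <- gzfile(gz_path, "w")
writeLines(sample_lines, con)
close(con)

# read_lines() sees the .gz extension and decompresses while reading.
text <- read_lines(gz_path)

# Strip leading/trailing whitespace, then drop the blank separator lines.
text <- trimws(text)
text <- text[text != ""]
```

Trimming each line before parsing is a precaution in case the real file indents its fields, which the sample in the question does not show.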
library(dplyr)
library(readr)
library(tidyr)

text <- read_lines("temp.txt")
# drop the blank lines that separate the blocks
df <- tibble(text = text[text != ""])

df %>%
  # split each line at the first colon into two columns
  separate(col = text, into = c("variable", "value"), sep = ":",
           extra = "merge", fill = "right") %>%
  mutate(
    # remove extra whitespace from the value column
    value = trimws(value),
    # give the "discontinued product" marker a value for later use
    value = if_else(variable == "discontinued product", "TRUE", value),
    # create an index_book column based on the Id rows
    index_book = cumsum(variable == "Id")) %>%
  # use index_book to assign the Id to every row in the same block
  group_by(index_book) %>%
  mutate(book_id = as.numeric(value[variable == "Id"])) %>%
  ungroup() %>%
  # remove the index_book column and the Id rows
  select(-index_book) %>%
  filter(variable != "Id") %>%
  # convert the data to wide format
  pivot_wider(id_cols = book_id, names_from = variable, values_from = value) %>%
  # drop discontinued products
  filter(is.na(`discontinued product`))
This is the output of the code.
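Since the question only asks for "Id" and "group", the result can be narrowed further. The sketch below is a variation on the approach above, not the same pipeline: it drops discontinued blocks before pivoting instead of filtering on a `discontinued product` column afterwards, so it also runs when no block is discontinued and that column would never be created. The `lines_raw` vector is an inline, shortened copy of the sample data used as a stand-in for the file, and the `block` column name is an assumption of this example.

```r
library(dplyr)
library(tidyr)

# Inline, shortened copy of the sample records (stand-in for the file contents).
lines_raw <- c(
  "Id: 0", "ASIN: 0771044445", "discontinued product",
  "Id: 1", "ASIN: 0827229534",
  "title: Patterns of Preaching: A Sermon Sampler", "group: Book",
  "Id: 2", "ASIN: 0738700797",
  "title: Candlemas: Feast of Flames", "group: Book"
)

result <- tibble(text = lines_raw) %>%
  # Split at the first colon; lines without a colon get NA as value.
  separate(col = text, into = c("variable", "value"), sep = ":",
           extra = "merge", fill = "right") %>%
  mutate(variable = trimws(variable),
         value = trimws(value),
         # Every "Id" row starts a new block.
         block = cumsum(variable == "Id")) %>%
  group_by(block) %>%
  # Drop all rows of a block that has a "discontinued product" line.
  filter(!any(variable == "discontinued product")) %>%
  ungroup() %>%
  # Keep only the two fields the question asks for.
  filter(variable %in% c("Id", "group")) %>%
  pivot_wider(id_cols = block, names_from = variable, values_from = value) %>%
  mutate(Id = as.integer(Id)) %>%
  select(Id, group)
```

With this sample, `result` holds one row per remaining product, with the block for Id 0 left out.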