How to read data from a structured text file

Question

This dataset contains product metadata from Amazon.

The data looks like this:

Id:   0
ASIN: 0771044445
  discontinued product
        
Id:   1
ASIN: 0827229534
 title: Patterns of Preaching: A Sermon Sampler
 group: Book
 salesrank: 396585
 similar: 5  0804215715  156101074X  0687023955  0687074231  082721619X
 categories: 2
                       
Id:   2
ASIN: 0738700797
 title: Candlemas: Feast of Flames
 group: Book
 salesrank: 168596
 similar: 5  0738700827  1567184960  1567182836  0738700525  0738700940
 categories: 2

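How can I import this txt.gz file and extract only the information for "Id:" and "group:"? However, if a block (the lines between two blank lines) contains "discontinued product", I don't need any information from that block at all.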

Answer 1

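Here is my solution for reading your text. In the code, temp.txt contains the text you shared. You can replace that line with appropriate code to read the text file from S3.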

library(dplyr)
library(readr)
library(tidyr)

# read_lines() also decompresses .gz files transparently;
# trim each line because the field lines are indented
text <- trimws(read_lines("temp.txt"))
# drop blank lines (including whitespace-only ones)
df <- tibble(text = text[text != ""])

df %>%
  # split each line at the first colon into variable and value
  separate(col = text, into = c("variable", "value"), sep = ":",
           extra = "merge", fill = "right") %>%
  mutate(
    # remove surrounding whitespace from the value column
    value = trimws(value),
    # flag discontinued products (these lines have no colon, so value is NA)
    value = if_else(variable == "discontinued product", "TRUE", value),
    # every Id row starts a new block, so the running count indexes the blocks
    index_book = cumsum(variable == "Id")) %>%
  # use index_book to copy the Id to all rows of the same block
  group_by(index_book) %>%
  mutate(book_id = as.numeric(value[variable == "Id"])) %>%
  ungroup() %>%
  # remove the index_book column and the Id rows
  select(-index_book) %>%
  filter(variable != "Id") %>%
  # convert the data into wide format
  pivot_wider(id_cols = book_id, names_from = variable, values_from = value) %>%
  # drop discontinued products
  filter(is.na(`discontinued product`)) %>%
  # keep only the columns asked for in the question
  select(book_id, group)

The result has one row per product that is not discontinued, with its Id in the book_id column.
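
Since the question defines a record as the lines between two blank lines, the same filter can also be applied block by block in base R, reading the .gz file through a gzfile() connection. This is only a minimal sketch under assumptions: the file name products.txt.gz and the helpers parse_blocks() and get_field() are made up for illustration, and a shortened copy of the sample is written to a temporary file first so that the snippet runs on its own.

```r
# Hypothetical sample file, written gzip-compressed so the sketch is self-contained
sample_lines <- c(
  "Id:   0", "ASIN: 0771044445", "  discontinued product", "        ",
  "Id:   1", "ASIN: 0827229534",
  " title: Patterns of Preaching: A Sermon Sampler", " group: Book", "",
  "Id:   2", "ASIN: 0738700797",
  " title: Candlemas: Feast of Flames", " group: Book"
)
path <- file.path(tempdir(), "products.txt.gz")
out <- gzfile(path, "w")
writeLines(sample_lines, out)
close(out)

# Illustrative helper: value of the first "<field>: ..." line in a block, or NA
get_field <- function(block, field) {
  pattern <- paste0("^", field, ":")
  hit <- grep(pattern, block, value = TRUE)
  if (length(hit) == 0) return(NA_character_)
  trimws(sub(pattern, "", hit[1]))
}

# Illustrative helper: one row (Id, group) per block that is not discontinued
parse_blocks <- function(path) {
  con <- gzfile(path, "r")
  on.exit(close(con))
  lines <- trimws(readLines(con))

  # blank (or whitespace-only) lines separate the blocks
  is_blank <- lines == ""
  block_id <- cumsum(is_blank)[!is_blank]
  blocks <- split(lines[!is_blank], block_id)

  # drop every block that contains the "discontinued product" marker
  discontinued <- vapply(blocks,
                         function(b) any(b == "discontinued product"),
                         logical(1))
  blocks <- blocks[!discontinued]

  data.frame(
    Id = as.integer(unname(vapply(blocks, get_field, character(1), field = "Id"))),
    group = unname(vapply(blocks, get_field, character(1), field = "group")),
    stringsAsFactors = FALSE
  )
}

result <- parse_blocks(path)
print(result)
```

For the sample this leaves Ids 1 and 2, both in group Book. Splitting on blank lines follows the question's definition of a block; if the real file does not separate records that way, starting a new block at each "Id:" line, as the dplyr code does, would be the alternative.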

filtering

metadata

r

amazon-s3

data-extraction