Parse and import metadata into R

I have a file containing Amazon product metadata, structured as follows:

Id:   0
ASIN: 0771044445
  discontinued product

Id:   1
ASIN: 0827229534
  title: Patterns of Preaching: A Sermon Sampler
  group: Book
  salesrank: 396585
  similar: 5  0804215715  156101074X  0687023955  0687074231  082721619X
  categories: 2
   |Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Christianity[12290]|Clergy[12360]|Preaching[12368]
   |Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Christianity[12290]|Clergy[12360]|Sermons[12370]
  reviews: total: 2  downloaded: 2  avg rating: 5
    2000-7-28  cutomer: A2JW67OY8U6HHK  rating: 5  votes:  10  helpful:   9
    2003-12-14  cutomer: A2VE83MZF98ITY  rating: 5  votes:   6  helpful:   5

Id:   2
ASIN: 0738700797
  title: Candlemas: Feast of Flames
  group: Book
  salesrank: 168596
  similar: 5  0738700827  1567184960  1567182836  0738700525  0738700940
  categories: 2
   |Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Earth-Based Religions[12472]|Wicca[12484]
   |Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Earth-Based Religions[12472]|Witchcraft[12486]
  reviews: total: 12  downloaded: 12  avg rating: 4.5
    2001-12-16  cutomer: A11NCO6YTE4BTJ  rating: 5  votes:   5  helpful:   4
    2002-1-7  cutomer:  A9CQ3PLRNIR83  rating: 4  votes:   5  helpful:   5
    2002-1-24  cutomer: A13SG9ACZ9O5IM  rating: 5  votes:   8  helpful:   8
    2002-1-28  cutomer: A1BDAI6VEYMAZA  rating: 5  votes:   4  helpful:   4
    2002-2-6  cutomer: A2P6KAWXJ16234  rating: 4  votes:  16  helpful:  16
    2002-2-14  cutomer:  AMACWC3M7PQFR  rating: 4  votes:   5  helpful:   5
    2002-3-23  cutomer: A3GO7UV9XX14D8  rating: 4  votes:   6  helpful:   6
    2002-5-23  cutomer: A1GIL64QK68WKL  rating: 5  votes:   8  helpful:   8
    2003-2-25  cutomer:  AEOBOF2ONQJWV  rating: 5  votes:   8  helpful:   5
    2003-11-25  cutomer: A3IGHTES8ME05L  rating: 5  votes:   5  helpful:   5
    2004-2-11  cutomer: A1CP26N8RHYVVO  rating: 1  votes:  13  helpful:   9
    2005-2-7  cutomer:  ANEIANH0WAT9D  rating: 5  votes:   1  helpful:   1

I found a CSV that contains exactly the data I want, laid out as follows:

"id","title","group","salesrank","review_cnt","downloads","rating"
"1","Patterns of Preaching: A Sermon Sampler","Book",396585,2,2,5
"2","Candlemas: Feast of Flames","Book",168596,12,12,4.5

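Although I already have the file I need, I would like to know how to generate it myself, preferably in R, so that the data is imported as a data frame.

Thanks.

You could try: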

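fields <- c("Id", "title", "group", "salesrank", "total", "downloaded")
text <- readLines("mydata.txt")  # the file containing the Amazon metadata
# Keep only the lines that carry the fields of interest
a <- grep('(Id|title|reviews|salesrank|group):', text, value = TRUE)
# Drop the "reviews:" and "avg" labels so that "total:" starts the line
b <- gsub('reviews:|avg', '', a)
# Move "downloaded" and "rating" onto their own lines; "\\1" re-inserts the matched label
d <- trimws(gsub("(downloaded|rating)", "\n\\1", b))
# Start a new record at each "Id:" line and read each record as "key: value" (DCF) pairs
e <- do.call(rbind.data.frame, tapply(d, cumsum(grepl('^Id:', d)), function(x) 
  read.dcf(textConnection(x), fields = fields)))
type.convert(e, as.is = TRUE)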

  Id                                   title group salesrank total downloaded
1  0                                    <NA>  <NA>        NA    NA         NA
2  1 Patterns of Preaching: A Sermon Sampler  Book    396585     2          2
3  2              Candlemas: Feast of Flames  Book    168596    12         12
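If you also want the rating column and the column names used in the target CSV, the following is a minimal base-R sketch rather than a definitive solution. It assumes every record starts with an "Id:" line and that the summary line always has the form "reviews: total: N  downloaded: N  avg rating: X". The helper `grab` and the inlined sample `text` are hypothetical and only there so the sketch runs on its own; in practice you would use text <- readLines("mydata.txt").

```r
# Inlined sample so the sketch runs without a file (assumption: same layout as the real data)
text <- c(
  "Id:   0",
  "ASIN: 0771044445",
  "  discontinued product",
  "",
  "Id:   1",
  "ASIN: 0827229534",
  "  title: Patterns of Preaching: A Sermon Sampler",
  "  group: Book",
  "  salesrank: 396585",
  "  reviews: total: 2  downloaded: 2  avg rating: 5",
  "    2000-7-28  cutomer: A2JW67OY8U6HHK  rating: 5  votes:  10  helpful:   9",
  "",
  "Id:   2",
  "ASIN: 0738700797",
  "  title: Candlemas: Feast of Flames",
  "  group: Book",
  "  salesrank: 168596",
  "  reviews: total: 12  downloaded: 12  avg rating: 4.5"
)

# Hypothetical helper: first capture group of `pattern` within one record, or NA if absent
grab <- function(lines, pattern) {
  hit <- grep(pattern, lines, value = TRUE, perl = TRUE)
  if (length(hit) == 0) NA_character_ else sub(pattern, "\\1", hit[1], perl = TRUE)
}

# A new record begins at every line that starts with "Id:"
rec <- cumsum(grepl("^Id:", text))

# Build one single-row data frame per record
rows <- lapply(split(text, rec), function(x) {
  data.frame(
    id         = grab(x, "^Id:\\s*(\\d+)\\s*$"),
    title      = grab(x, "^\\s*title:\\s*(.*\\S)\\s*$"),
    group      = grab(x, "^\\s*group:\\s*(.*\\S)\\s*$"),
    salesrank  = grab(x, "^\\s*salesrank:\\s*(-?\\d+)\\s*$"),
    review_cnt = grab(x, "^\\s*reviews: total:\\s*(\\d+).*$"),
    downloads  = grab(x, "^\\s*reviews:.*downloaded:\\s*(\\d+).*$"),
    rating     = grab(x, "^\\s*reviews:.*avg rating:\\s*([0-9.]+).*$"),
    stringsAsFactors = FALSE
  )
})

# Combine, convert the character columns to numbers where possible,
# and drop records without a title (e.g. discontinued products)
out <- type.convert(do.call(rbind, rows), as.is = TRUE)
out <- out[!is.na(out$title), ]

# Prints CSV to the console; pass a file name as the second argument to save it instead
write.csv(out, row.names = FALSE)
```

Matching each field with its own anchored pattern keeps the per-review lines (which also contain "rating:") out of the result, since only lines beginning with "reviews:" are consulted for the review summary.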