Parse and import metadata into R
I have a file containing Amazon product metadata, structured as follows:
Id: 0
ASIN: 0771044445
discontinued product
Id: 1
ASIN: 0827229534
title: Patterns of Preaching: A Sermon Sampler
group: Book
salesrank: 396585
similar: 5 0804215715 156101074X 0687023955 0687074231 082721619X
categories: 2
|Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Christianity[12290]|Clergy[12360]|Preaching[12368]
|Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Christianity[12290]|Clergy[12360]|Sermons[12370]
reviews: total: 2 downloaded: 2 avg rating: 5
2000-7-28 cutomer: A2JW67OY8U6HHK rating: 5 votes: 10 helpful: 9
2003-12-14 cutomer: A2VE83MZF98ITY rating: 5 votes: 6 helpful: 5
Id: 2
ASIN: 0738700797
title: Candlemas: Feast of Flames
group: Book
salesrank: 168596
similar: 5 0738700827 1567184960 1567182836 0738700525 0738700940
categories: 2
|Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Earth-Based Religions[12472]|Wicca[12484]
|Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Earth-Based Religions[12472]|Witchcraft[12486]
reviews: total: 12 downloaded: 12 avg rating: 4.5
2001-12-16 cutomer: A11NCO6YTE4BTJ rating: 5 votes: 5 helpful: 4
2002-1-7 cutomer: A9CQ3PLRNIR83 rating: 4 votes: 5 helpful: 5
2002-1-24 cutomer: A13SG9ACZ9O5IM rating: 5 votes: 8 helpful: 8
2002-1-28 cutomer: A1BDAI6VEYMAZA rating: 5 votes: 4 helpful: 4
2002-2-6 cutomer: A2P6KAWXJ16234 rating: 4 votes: 16 helpful: 16
2002-2-14 cutomer: AMACWC3M7PQFR rating: 4 votes: 5 helpful: 5
2002-3-23 cutomer: A3GO7UV9XX14D8 rating: 4 votes: 6 helpful: 6
2002-5-23 cutomer: A1GIL64QK68WKL rating: 5 votes: 8 helpful: 8
2003-2-25 cutomer: AEOBOF2ONQJWV rating: 5 votes: 8 helpful: 5
2003-11-25 cutomer: A3IGHTES8ME05L rating: 5 votes: 5 helpful: 5
2004-2-11 cutomer: A1CP26N8RHYVVO rating: 1 votes: 13 helpful: 9
2005-2-7 cutomer: ANEIANH0WAT9D rating: 5 votes: 1 helpful: 1
I found a CSV that contains exactly the data I want, laid out as follows:
"id","title","group","salesrank","review_cnt","downloads","rating"
"1","Patterns of Preaching: A Sermon Sampler","Book",396585,2,2,5
"2","Candlemas: Feast of Flames","Book",168596,12,12,4.5
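Although I already have the file I need, I would like to know how to generate it myself, preferably in R, so that the data is imported as a data frame.
Thanks.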
You could try:
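fields <- c("Id", "title", "group", "salesrank", "total", "downloaded")
text <- readLines("mydata.txt")  # the file containing the Amazon metadata
# Keep only the lines that carry the fields of interest
a <- grep('(Id|title|reviews|salesrank|group):', text, value = TRUE)
# Drop the "reviews:" and "avg" labels so only "key: value" pairs remain
b <- gsub('reviews:|avg', '', a)
# Put "downloaded" and "rating" on their own lines ("\\1" is the backreference; "\1" would be a control character)
d <- trimws(gsub("(downloaded|rating)", "\n\\1", b))
# Group lines by record (a new record starts at each "Id:" line) and parse each group as DCF
e <- do.call(rbind.data.frame, tapply(d, cumsum(grepl('^Id:', d)), function(x)
  read.dcf(textConnection(x), fields = fields)))
type.convert(e, as.is = TRUE)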
  Id                                   title group salesrank total downloaded
1  0                                    <NA>  <NA>        NA    NA         NA
2  1 Patterns of Preaching: A Sermon Sampler  Book    396585     2          2
3  2              Candlemas: Feast of Flames  Book    168596    12         12
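The output above stops at `downloaded`, while the CSV in the question also has a `rating` column and no row for the discontinued product. As a rough sketch of one way to get closer to that layout, the records can be split at each `Id:` line and the three numbers on the `reviews:` line captured with a regular expression. The helper names `get_field` and `parse_record` are made up for this example, the column names are copied from the CSV header in the question, and the sample lines are inlined only so the snippet runs without a file; with the real data you would start from `readLines("mydata.txt")` instead.

```r
# Sample lines in the same layout as the question (inlined for illustration).
text <- c(
  "Id: 0",
  "ASIN: 0771044445",
  "discontinued product",
  "Id: 1",
  "ASIN: 0827229534",
  "title: Patterns of Preaching: A Sermon Sampler",
  "group: Book",
  "salesrank: 396585",
  "reviews: total: 2 downloaded: 2 avg rating: 5",
  "Id: 2",
  "ASIN: 0738700797",
  "title: Candlemas: Feast of Flames",
  "group: Book",
  "salesrank: 168596",
  "reviews: total: 12 downloaded: 12 avg rating: 4.5"
)
text <- trimws(text)  # the real file may indent its lines

# Hypothetical helper: value of the first "key: value" line in a record, or NA.
get_field <- function(rec, key) {
  pat <- paste0("^", key, ":")
  hit <- grep(pat, rec, value = TRUE)
  if (length(hit) == 0) NA_character_ else trimws(sub(pat, "", hit[1]))
}

# Hypothetical helper: turn the lines of one record into a one-row data frame.
parse_record <- function(rec) {
  nums <- rep(NA_character_, 3)
  reviews <- get_field(rec, "reviews")
  if (!is.na(reviews)) {
    pat <- "total: *([0-9]+) +downloaded: *([0-9]+) +avg rating: *([0-9.]+)"
    m <- regmatches(reviews, regexec(pat, reviews))[[1]]
    if (length(m) == 4) nums <- m[2:4]  # m[1] is the full match
  }
  data.frame(
    id         = as.integer(get_field(rec, "Id")),
    title      = get_field(rec, "title"),
    group      = get_field(rec, "group"),
    salesrank  = as.integer(get_field(rec, "salesrank")),
    review_cnt = as.integer(nums[1]),
    downloads  = as.integer(nums[2]),
    rating     = as.numeric(nums[3]),
    stringsAsFactors = FALSE
  )
}

# A new record starts at every "Id:" line.
records <- split(text, cumsum(grepl("^Id:", text)))
parsed <- do.call(rbind, lapply(records, parse_record))

# Drop records without a title (here, the discontinued product).
parsed <- parsed[!is.na(parsed$title), ]
rownames(parsed) <- NULL
print(parsed)
```

If this does what you need, `write.csv(parsed, "out.csv", row.names = FALSE)` would write the result to disk (the file name is a placeholder). Anchoring each pattern with `^` means a colon inside a title, as in both sample titles, is not mistaken for a field separator.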