R: read data from a space-delimited txt file with quoted text

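I am trying to load a dataset into RStudio. The dataset itself is space-delimited, but it also contains spaces inside quoted text, as in a csv file. Here is the head of the data: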

DOC_ID  LABEL   RATING  VERIFIED_PURCHASE   PRODUCT_CATEGORY    PRODUCT_ID  PRODUCT_TITLE   REVIEW_TITLE    REVIEW_TEXT
1   __label1__  4   N   PC  B00008NG7N  "Targus PAUK10U Ultra Mini USB Keypad, Black"   useful  "When least you think so, this product will save the day. Just keep it around just in case you need it for something."
2   __label1__  4   Y   Wireless    B00LH0Y3NM  Note 3 Battery : Stalion Strength Replacement 3200mAh Li-Ion Battery for Samsung Galaxy Note 3 [24-Month Warranty] with NFC Chip + Google Wallet Capable    New era for batteries   Lithium batteries are something new introduced in the market there average developing cost is relatively high but Stallion doesn't compromise on quality and provides us with the best at a low cost.<br />There are so many in built technical assistants that act like a sensor in their particular forté. The battery keeps my phone charged up and it works at every voltage and a high voltage is never risked.
3   __label1__  3   N   Baby    B000I5UZ1Q  "Fisher-Price Papasan Cradle Swing, Starlight"  doesn't swing very well.    "I purchased this swing for my baby. She is 6 months now and has pretty much out grown it. It is very loud and doesn't swing very well. It is beautiful though. I love the colors and it has a lot of settings, but I don't think it was worth the money."
4   __label1__  4   N   Office Products B003822IRA  Casio MS-80B Standard Function Desktop Calculator   Great computing!    I was looking for an inexpensive desk calcolatur and here it is. It works and does everything I need. Only issue is that it tilts slightly to one side so when I hit any keys it rocks a little bit. Not a big deal.
5   __label1__  4   N   Beauty  B00PWSAXAM  Shine Whitening - Zero Peroxide Teeth Whitening System - No Sensitivity Only use twice a week   "I only use it twice a week and the results are great. I have used other teeth whitening solutions and most of them, for the same results I would have to use it at least three times a week. Will keep using this because of the potency of the solution and also the technique of the trays, it keeps everything in my teeth, in my mouth."
6   __label1__  3   N   Health & Personal Care  B00686HNUK  Tobacco Pipe Stand - Fold-away Portable - Light Weight - For Single Pipe    not sure    I'm not sure what this is supposed to be but I would recommend that you do a little more research into the culture of using pipes if you plan on giving this as a gift or using it yourself.
7   __label1__  4   N   Toys    B00NUG865W  ESPN 2-Piece Table Tennis   PING PONG TABLE GREAT FOR YOUTHS AND FAMILY "Pleased with ping pong table. 11 year old and 13 year old having a blast, plus lots of family entertainment too. Plus better than kids sitting on video games all day. A friend put it together. I do believe that was a challenge, but nothing they could not handle"
8   __label1__  4   Y   Beauty  B00QUL8VX6  "Abundant Health 25% Vitamin C Serum with Vitamin E and Hyaluronic Acid for Youthful Looking Skin, 1 fl. oz."   Great vitamin C serum   "Great vitamin C serum... I really like the oil feeling, not too sticky. I used it last week on some of my recent bug bites and it helps heal the skin faster than normal."
9   __label1__  4   N   Health & Personal Care  B004YHKVCM  PODS Spring Meadow HE Turbo Laundry Detergent Pacs 77-load Tub  wonderful detergent.    "I've used tide pods laundry detergent for many years,its such a great detergent to use having a nice scent and leaver the cloths smelling fresh."

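The problem is that the file looks tab-delimited but actually is not. For example, in the row with DOC_ID = 1 there are only two spaces between useful and the quoted review text that follows it. As a result, passing sep = "\t" to read.table throws an error stating that line 1 did not have 10 elements, and that message is itself wrong for some reason, because the number of elements should be 9. These are the arguments I am passing (without the original path):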

read.table(file = "path", sep ="\t", header = TRUE, strip.white = TRUE)
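Before picking a delimiter, it can help to check whether the raw file really contains tab characters, since a pasted head may show tabs as runs of spaces. The following is a minimal Python sketch of that check, not part of the original attempt: the helper name field_count_histogram and the three sample lines are hypothetical, made up for illustration only.

```python
from collections import Counter


def field_count_histogram(lines, sep="\t"):
    """Count how many non-blank lines split into each number of fields."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\r\n")
        if line:  # ignore blank lines
            counts[len(line.split(sep))] += 1
    return counts


# Hypothetical sample: two tab-separated lines and one that only uses spaces.
sample = [
    "DOC_ID\tLABEL\tRATING\n",
    "1\t__label1__\t4\n",
    "2  __label1__  4\n",
]
print(field_count_histogram(sample))
```

If every line of the real file reports the same field count, the file is consistently tab-delimited and the parsing problem lies elsewhere, for example in how quote characters are handled. With a real file, the lines would come from something like open(path, encoding="utf-8"), where the path and the encoding are assumptions.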

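Relying on the quotes is not a good strategy either, because some rows do not quote their text. The delimiter would therefore have to be something like a double space, which combined with strip.white should work correctly, but read.table only accepts single-byte delimiters.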
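The double-space idea can be sketched outside read.table with a regular expression. This is a rough Python illustration under the assumption that fields really are separated by two or more spaces and never contain such a run themselves. The helper name split_on_space_runs is hypothetical, and the sample row is the DOC_ID = 1 row with its review text shortened.

```python
import re


def split_on_space_runs(line):
    """Split on runs of two or more spaces, then trim each field and
    drop one pair of surrounding double quotes if present."""
    fields = re.split(r" {2,}", line.strip())
    cleaned = []
    for field in fields:
        field = field.strip()
        if len(field) >= 2 and field.startswith('"') and field.endswith('"'):
            field = field[1:-1]
        cleaned.append(field)
    return cleaned


# The DOC_ID = 1 row from the head, with the review text abbreviated.
row = ('1   __label1__  4   N   PC  B00008NG7N  '
       '"Targus PAUK10U Ultra Mini USB Keypad, Black"   useful  '
       '"When least you think so."')
print(split_on_space_runs(row))
```

This approach is fragile: two adjacent fields separated by a single space would be merged, and a review containing a double space would be split in two, so it is only worth trying if the file turns out not to contain real tab characters.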

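So the question is: how would you parse this kind of corpus, in R or with any other third-party software, so that it can be properly converted to a csv or at least a tab-delimited file?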

Parsing the data with Python's pandas.read_csv(filename, sep='\t', header = 0, ...) appears to have worked, and from that point anything can be done with it. Closing this.
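For the conversion to csv itself, the same idea can be sketched without pandas. The example below uses Python's standard csv module rather than the pandas call above, and assumes the file is tab-separated with the double quote as its only quote character. The function name tsv_to_csv is hypothetical, and the in-memory sample reuses a few columns from the head.

```python
import csv
import io


def tsv_to_csv(src, dst):
    """Copy tab-separated rows from src to dst as comma-separated rows.

    Only the double quote is treated as a quote character, so apostrophes
    such as the one in "doesn't" stay ordinary text. Returns the row count.
    """
    reader = csv.reader(src, delimiter="\t", quotechar='"')
    writer = csv.writer(dst, quoting=csv.QUOTE_MINIMAL)
    n_rows = 0
    for row in reader:
        writer.writerow(row)
        n_rows += 1
    return n_rows


# In-memory sample built from three columns of the head shown earlier.
sample = io.StringIO(
    'DOC_ID\tPRODUCT_TITLE\tREVIEW_TITLE\n'
    '1\t"Targus PAUK10U Ultra Mini USB Keypad, Black"\tuseful\n'
    '3\t"Fisher-Price Papasan Cradle Swing, Starlight"\t'
    "doesn't swing very well.\n"
)
out = io.StringIO()
print(tsv_to_csv(sample, out))
print(out.getvalue())
```

With real files, both handles would be opened with newline="" as the csv module documentation recommends, for example open("reviews.txt", newline="", encoding="utf-8"), where the file name and the encoding are placeholders.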