Pandas randomly inserting non-existent delimiters

I'm really scratching my head over this one; it makes no sense to me. I'm using pandas in a very simple way to read a TSV file. Here is the minimal code:

import pandas as pd

source = pd.read_csv("neimanmarcus.csv", sep="\t")
images = source["image_link"]

Every row in this file has exactly 53 tabs. For some reason, pandas thinks that about 2% of the rows have exactly 72 tabs. This causes the following error:

pandas.parser.CParserError: Error tokenizing data. C error: Expected 54 fields in line x, saw 73

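That said, on manual inspection I can't find any difference in the affected rows. Skipping rows would be quite problematic in this case, so I've been trying to fix it, but I'm stuck. Sorry if this is a bit silly, but here are examples of a "correct" row and an "incorrect" row.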

Correct:

sku157001669    Tango Dancer-Print A-Line Dress, Size: 4, TANGO - Carolina Herrera  Carolina Herrera Tango Dancer-Print A-Line Dress Details Carolina Herrera tango dancer-print woven dress. Approx. measurements: 35.5"L center back to hem, 35.5"L center front to hem. V'd jewel neckline. Cap sleeves. Self-tie belt at natural waist; ties at left. Inverted center pleat at A-line skirt. Straight hem. Fit and flare silhouette. Hidden back zip. Cotton/spandex; dry clean. Made in Italy. Model's measurements: Height 5'10"/177cm, bust 34"/86cm, waist 26"/66cm, hips 35.5"/90cm, dress size US 2. Designer About Carolina Herrera: The empress of classically refined looks for both day and evening, Carolina Herrera launched her eponymous line in 1980 after encouragement from her friend, legendary Vogue editor Diana Vreeland. Over the years she has collected a number of fashion's highest accolades as well as a star-studded client list. With both a global focus and adoration for the sum of all things beautiful, Carolina Herrera has been hailed as "Fashion's First Lady." Size: 4. Color: TANGO. Age Group: Adult. Material: 97% COTTON, 3% ELASTANE. Apparel & Accessories > Clothing > Dresses  Women's Apparel > Mid-Length > Daytime Dresses > Mid    1390.00 USD 1390.00 USD     http://www.neimanmarcus.com/en-us/Carolina-Herrera-Tango-Dancer-Print-A-Line-Dress/prod177890243/p.prod     http://images.neimanmarcus.com/product_assets/B/2/W/Y/K/NMB2WYK_mz.jpg  http://images.neimanmarcus.com/product_assets/B/2/W/Y/K/NMB2WYK_az.jpg  Carolina Herrera    07667702164817  prod177890243       new in stock        prod177890243   TANGO   97% COTTON, 3% ELASTANE     4           female  Adult       US::Ground:0.00 USD                                                                                             

Incorrect:

sku158601482    Sleeveless Faux-Wrap Jersey Dress, Women's, Size: 2X, BLACK - Eileen Fisher Eileen Fisher Sleeveless Faux-Wrap Jersey Dress, Women's Details Eileen Fisher jersey dress in your choice of color. Round neckline; sleeveless. Faux-wrap style. Shift silhouette. Viscose/spandex; machine wash. Made in USA of imported materials. Model's measurements: Height 5'10.5"/179cm, bust 32"/81cm, waist 24"/61cm, hips 35.5"/90cm, dress size US 2/4. Necklace not included. Designer Please note: Apparel may be available in more sizes: Shop Eileen Fisher Petite Shop Eileen Fisher Women's About Eileen Fisher: Former interior and graphic designer Eileen Fisher launched her self-named collection in 1984. The acclaimed designer made her mark with clean lines, simple shapes, and a timeless, functional style. Size: 2X. Color: BLACK. Age Group: Adult. Material: " 92% Viscose/8% Spandex F4VF-D3502 / D2502X: Body: 92% Viscose, 8% Spandex Hem: 80% Recycled Polyester, 20% Lycra? F4VF-S1496: Body: 92% Viscose, 8% Spandex Hem Panel: 80% Recycled Polyester, 20% Lycra?. Apparel & Accessories > Clothing > Dresses  Women's Apparel > Women's > Special Sizes > Mid 198.00 USD  198.00 USD      http://www.neimanmarcus.com/en-us/Eileen-Fisher-Sleeveless-Faux-Wrap-Jersey-Dress-Women-s/prod179830418/p.prod      http://images.neimanmarcus.com/product_assets/T/A/6/X/8/NMTA6X8_mz.jpg  http://images.neimanmarcus.com/product_assets/T/A/6/X/8/NMTA6X8_az.jpg  Eileen Fisher   00713259663697  prod179830418       new in stock        prod179830418   BLACK   " 92% Viscose/8% Spandex F4VF-D3502 / D2502X: Body: 92% Viscose, 8% Spandex Hem: 80% Recycled Polyester, 20% Lycra? F4VF-S1496: Body: 92% Viscose, 8% Spandex Hem Panel: 80% Recycled Polyester, 20 Graphic 2X          female  Adult       US::Ground:0.00 USD                                 

In this case, simply calling line.split('\t') works fine, while pandas seems to break for some reason.
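The manual check described here can be sketched roughly as follows: count the tabs in each raw line, with no quote handling at all, and tally how many lines have each count. The helper name `tab_count_histogram` and the three-column sample lines are made up for illustration; they are not taken from the real file.

```python
from collections import Counter


def tab_count_histogram(lines):
    """Map each distinct tab count to the number of lines that have it."""
    return Counter(line.rstrip("\n").count("\t") for line in lines)


# Hypothetical stand-in for the real file: three columns, so two tabs per line.
sample_lines = [
    "sku\tmaterial\tsize\n",
    "sku1\t97% COTTON\t4\n",
    'sku2\t" 92% Viscose\t2X\n',
]

histogram = tab_count_histogram(sample_lines)
print(histogram)
```

For the real file, the same function could be fed an open file handle (for example `open("neimanmarcus.csv")`, with whatever encoding the file actually uses). If every line reports 53 tabs, the raw data is consistent and the discrepancy comes from how the parser interprets it.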

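Your data contains unmatched quote characters (it looks like " is used to denote inches in things like Height 5'10.5"). This makes the parser think there is a quoted field, but because the quotes are not paired, the data gets mangled.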
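As a rough illustration of that failure mode, the sketch below uses Python's built-in csv module rather than pandas, on the assumption that both apply a similar default quoting convention. The three sample lines are invented; the second one has a value that begins with a quote character, much like the `" 92% Viscose...` value in the incorrect row above.

```python
import csv

# Hypothetical three-column sample; the second line has a field that
# starts with an unmatched double quote.
lines = [
    "sku1\t97% COTTON\t4\n",
    'sku2\t" 92% Viscose\t2X\n',
    "sku3\t100% SILK\t6\n",
]

# With default quoting, a quote at the start of a field opens a quoted
# field, and tabs and newlines inside it are treated as ordinary text.
rows = list(csv.reader(lines, delimiter="\t"))
field_counts = [len(row) for row in rows]
print(field_counts)
print(rows[-1])
```

In this sketch the second and third lines are merged into a single record, so the field counts no longer match the three columns that a plain split on tabs would give.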

Try passing quoting=csv.QUOTE_NONE as an additional argument to read_csv. (You'll need to import csv first. Alternatively, you can pass quoting=3 directly.)
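A minimal sketch of this fix, using a small in-memory TSV instead of the real file. The column names and values are made up for illustration; only `sep="\t"` and `quoting=csv.QUOTE_NONE` come from the discussion above.

```python
import csv
import io

import pandas as pd

# Hypothetical three-column sample with an unmatched quote in one field.
tsv = (
    "sku\tmaterial\tsize\n"
    "sku1\t97% COTTON\t4\n"
    'sku2\t" 92% Viscose\t2X\n'
    "sku3\t100% SILK\t6\n"
)

# QUOTE_NONE tells the parser to treat the quote character as ordinary
# text, so only tabs separate fields.
df = pd.read_csv(io.StringIO(tsv), sep="\t", quoting=csv.QUOTE_NONE)
print(df.shape)
print(df.loc[1, "material"])
```

With quoting disabled, the quote character stays in the data as a literal, so any cleanup of stray quotes would have to happen after loading.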