Parsing comma-separated values with line breaks in C
I have a CSV data file containing the following data:
H1,H2,H3
a,"b
c
d",e
When I open the CSV file in Excel, it displays a sheet with the column headers H1, H2, H3 and the column values a for H1, the multi-line value
b
c
d
for H2, and e for H3.
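I need to parse this file with a C program and extract the values in the same way.
However, my code snippet below does not work, because one column has a multi-line value: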
char buff[200];
char tokens[10][30];
fgets(buff, 200, stdin);
char *ptok = buff; // for iterating
char *pch;
int i = 0;
while ((pch = strchr(ptok, ',')) != NULL) {
  *pch = 0;
  strcpy(tokens[i++], ptok);
  ptok = pch+1;
}
strcpy(tokens[i++], ptok);
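How can I modify this snippet to handle multi-line column values?
Please don't be bothered by the hard-coded sizes of the string buffers; this is test code for a POC.
Rather than using any third-party library, I would like to do it the hard way, from first principles.
Please help.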
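The main difficulty in parsing "well-formed" CSV in C is precisely the handling of variable-length strings and arrays, which you are avoiding by using fixed-length strings and arrays. (The other complication is handling CSV that is not well-formed.)
Without those complications, the parsing is really quite simple:
(untested)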
/* Appends a non-quoted field to s and returns the delimiter */
int readSimpleField(struct String* s) {
  for (;;) {
    int ch = getc(stdin);
    if (ch == ',' || ch == '\n' || ch == EOF) return ch;
    stringAppend(s, ch);
  }
}
/* Appends a quoted field to s and returns the delimiter.
 * Assumes the open quote has already been read.
 * If the field is not terminated, returns ERROR, which
 * should be a value different from any character or EOF.
 * The delimiter returned is the character after the closing quote
 * (or EOF), which may not be a valid delimiter. Caller should check.
 */
int readQuotedField(struct String* s) {
  for (;;) {
    int ch = getc(stdin);
    if (ch == EOF) return ERROR;
    if (ch == '"') {
      /* A doubled quote is a literal quote; anything else ends the field */
      ch = getc(stdin);
      if (ch != '"') return ch;
    }
    /* Line breaks inside the quotes are ordinary field characters */
    stringAppend(s, ch);
  }
}
/* Reads a single field into s and returns the following delimiter,
 * which might be invalid.
 */
int readField(struct String* s) {
  stringClear(s);
  int ch = getc(stdin);
  if (ch == '"') return readQuotedField(s);
  /* An immediate delimiter means the field is empty */
  if (ch == ',' || ch == '\n' || ch == EOF) return ch;
  stringAppend(s, ch);
  return readSimpleField(s);
}
/* Reads a single row into row and returns the following delimiter,
 * which might be invalid.
 */
int readRow(struct Row* row) {
  struct String field = {0};
  rowClear(row);
  /* Make sure there is at least one field */
  int ch = getc(stdin);
  if (ch != '\n' && ch != EOF) {
    ungetc(ch, stdin);
    do {
      ch = readField(&field);
      rowAppend(row, &field);
    } while (ch == ',');
  }
  return ch;
}
/* Reads an entire CSV file into table.
 * Returns true if the parse was successful.
 * If an error is encountered, returns false. If the end-of-file
 * indicator is set, the error was an unterminated quoted field;
 * otherwise, the next character read will be the one which
 * triggered the error.
 */
bool readCSV(struct Table* table) {
  tableClear(table);
  struct Row row = {0};
  for (;;) {
    /* Stop cleanly at end of input, including after a final newline */
    int ch = getc(stdin);
    if (ch == EOF) return true;
    ungetc(ch, stdin);
    ch = readRow(&row);
    tableAppend(table, &row);
    /* Anything other than a newline ends the parse */
    if (ch != '\n') return ch == EOF;
  }
}
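The above is "from first principles"; it doesn't even use the standard C library string functions. But it takes some effort to understand and verify. Personally, I would use (f)lex and maybe even yacc/bison (although that is a bit of overkill) to simplify the code and make the expected syntax more obvious. But handling variable-length structures in C will still need to be the first step.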
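Since the code above never defines struct String, stringAppend or stringClear, here is a minimal sketch of what that first step might look like. Everything in it is an assumption made for illustration: the field names data, len and cap, the doubling growth policy, the int return value of stringAppend, and the extra stringFree helper do not appear in the code above. It relies on realloc and free, so it is not strictly free of the standard library.

```c
#include <stdlib.h>

/* Hypothetical growable string; the layout is an assumption for this sketch. */
struct String {
  char  *data;  /* NUL-terminated buffer, or NULL if nothing is allocated */
  size_t len;   /* number of characters stored, not counting the NUL */
  size_t cap;   /* size of the allocated buffer in bytes */
};

/* Empties the string but keeps the buffer so it can be reused. */
void stringClear(struct String *s) {
  s->len = 0;
  if (s->data) s->data[0] = '\0';
}

/* Appends one character, doubling the buffer when it is full.
 * Returns 0 on success, or -1 if memory could not be allocated
 * (in which case the string is left unchanged).
 */
int stringAppend(struct String *s, int ch) {
  if (s->len + 2 > s->cap) {  /* need room for ch plus the NUL */
    size_t newcap = s->cap ? s->cap * 2 : 16;
    char *p = realloc(s->data, newcap);
    if (p == NULL) return -1;
    s->data = p;
    s->cap = newcap;
  }
  s->data[s->len++] = (char)ch;
  s->data[s->len] = '\0';
  return 0;
}

/* Releases the buffer and returns the string to its initial state. */
void stringFree(struct String *s) {
  free(s->data);
  s->data = NULL;
  s->len = 0;
  s->cap = 0;
}
```

A string declared as struct String s = {0}; starts out empty with no allocation, which matches the way readRow initialises its field. With this layout, rowAppend would have to copy the characters out of the field (or take ownership of the buffer), because readField clears and reuses the same String for every field. A struct Row holding an array of strings and a struct Table holding an array of rows could grow in the same doubling fashion.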