How can I deal with parsing bad CSV data?
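I know the data should be correct. I have no control over the data, and my boss is just going to tell me that I need to find a way to deal with someone else's mistake. So please don't tell me that bad data isn't my problem, because it is.
Anyway, this is what I'm looking at: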
"Words","email@email.com","","4253","57574","FirstName","","LastName, MD","","","576JFJD","","1971","","Words","Address","SUITE "A"","City","State","Zip","Phone","",""
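The data has been sanitized for confidentiality reasons.
As you can see, the data contains quotes, and some of those quoted fields contain commas, so I can't just remove the quotes. But the "SUITE "A"" is throwing off the parser. There are too many quotes. >.<
I'm using TextFieldParser from the Microsoft.VisualBasic.FileIO namespace with these settings: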
parser.HasFieldsEnclosedInQuotes = true;
parser.SetDelimiters(",");
parser.TextFieldType = FieldType.Delimited;
The error is:
MalformedLineException: Line 9871 cannot be parsed using the current delimiters.
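I'd like to clean the data somehow to work around this, but I'm not sure how. Or maybe there's a way to skip just this line? Although I doubt my superiors would approve of me simply skipping data that we might need.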
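I've had to do this before.
The first step is to parse the data using string.Split(',').
The next step is to merge the pieces that belong together.
What I basically did was:
- Create a new list to hold the combined strings
- If a piece starts with a quote, push it onto the new list
- If it doesn't start with a quote, append it to the last string in the list (putting back the comma that the split removed)
- Bonus: throw an exception when a string ends with a quote but the next one doesn't start with a quote
Depending on the rules about what can actually appear in your data, you may need to adjust your code to account for it.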
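The steps above can be sketched roughly as follows. This is only an illustration, not the answerer's actual code: the class and method names are made up, and it assumes every field in the file is wrapped in double quotes, as in the sample line from the question.

```csharp
using System;
using System.Collections.Generic;

public static class SplitAndMergeSketch
{
    // Splits on every comma, then glues back together the pieces
    // that were part of the same quoted field.
    public static List<string> SplitAndMerge(string line)
    {
        var merged = new List<string>();
        foreach (string piece in line.Split(','))
        {
            if (piece.StartsWith("\"") || merged.Count == 0)
            {
                // A piece that opens with a quote starts a new field.
                merged.Add(piece);
                continue;
            }

            string previous = merged[merged.Count - 1];
            if (previous.Length > 1 && previous.EndsWith("\""))
            {
                // The "bonus" rule: the previous field looks closed,
                // yet this piece does not open a new quoted field.
                throw new FormatException("Unexpected unquoted text: " + piece);
            }

            // Otherwise the comma was data, so put it back.
            merged[merged.Count - 1] = previous + "," + piece;
        }

        // Strip the outer quotes; stray inner quotes are kept as data.
        for (int i = 0; i < merged.Count; i++)
        {
            string f = merged[i];
            if (f.Length >= 2 && f.StartsWith("\"") && f.EndsWith("\""))
            {
                merged[i] = f.Substring(1, f.Length - 2);
            }
        }
        return merged;
    }
}
```

For the fragment "LastName, MD","","SUITE "A"","City" this yields four fields, with the comma kept inside the first and the stray quotes kept inside the third. A field whose data contains a comma immediately followed by a quote would still be split in the wrong place, so checking the resulting field count against the expected number of columns is a sensible safeguard.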
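At the core of the CSV file format, each line is a row, and the cells in that row are separated by commas. In your case, the format also includes the (very unfortunate) rule that a comma inside a pair of quotes doesn't count as a separator but is part of the data. I say very unfortunate because a misplaced quote affects the entire rest of the line, and since the quote character in standard ASCII doesn't distinguish opening from closing, there's really not much you can do to recover without knowing the original intent.
That said, the way to recover is to log the bad line so that someone who does know the original intent (whoever supplied the data) can look at the file and correct the error: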
if (parse_line(line, &data)) {
    // save the data
} else {
    // log the error (stderr is already a FILE*, so no & is needed)
    fprintf(stderr, "Bad line: %s\n", line);
}
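And since your quotes aren't escaping newlines, the damage is confined to that one line: after running into this error you can simply carry on with the next line.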
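With the TextFieldParser settings from the question, the same log-and-continue idea might look like the sketch below. The class name, the method name and the idea of writing bad lines to a TextWriter are assumptions made for illustration; ErrorLine and ErrorLineNumber are the parser's own properties for the line behind the most recent MalformedLineException.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Microsoft.VisualBasic.FileIO;

public static class LenientReaderSketch
{
    // Reads every well-formed row and writes each malformed line to
    // badLineLog so that someone can repair it by hand later.
    public static List<string[]> ReadRows(TextReader input, TextWriter badLineLog)
    {
        var rows = new List<string[]>();
        using (var parser = new TextFieldParser(input))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(",");
            parser.HasFieldsEnclosedInQuotes = true;

            while (!parser.EndOfData)
            {
                try
                {
                    rows.Add(parser.ReadFields());
                }
                catch (MalformedLineException)
                {
                    // The bad line has been consumed, so the loop
                    // carries on with the line that follows it.
                    badLineLog.WriteLine("Bad line {0}: {1}",
                        parser.ErrorLineNumber, parser.ErrorLine);
                }
            }
        }
        return rows;
    }
}
```

Calling ReadRows(File.OpenText(path), Console.Error) would keep the good rows and print the bad ones, which can then be sent back to whoever supplied the data or fed through a cleanup step and retried.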
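Addendum: if your company has a choice (i.e. your data is serialized by a company tool), don't use CSV. Use something like XML or JSON, which has a more clearly defined parsing mechanism.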
I'm not familiar with TextFieldParser. With CsvHelper, however, you can add a custom handler for invalid data:
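var config = new CsvConfiguration();
config.IgnoreReadingExceptions = true;
config.ReadingExceptionCallback += (e, row) =>
{
    // you can add some custom patching here if possible
    // or, save the line numbers and add/edit them manually later.
};
// CsvReader takes a TextReader, so open the file as text rather than as a raw stream
using (var file = File.OpenText(".csv"))
using (var reader = new CsvReader(file, config))
{
    // GetRecords is lazy; enumerate it (ToList needs System.Linq) so the file is actually read
    var records = reader.GetRecords<YourDtoClass>().ToList();
}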
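The only thing I'd add to what everyone else has said (since we've all been there) is to try to correct each new problem you encounter in code. There are some decent REGEX strings out there https://www.google.com/?ion=1&espv=2#q=c-sharp+regex+csv+clean or you can use String.Replace (String.Replace("\"\"\"","").Replace("\" \"","").Replace("\",","\"") etc.). Eventually, as you discover and find ways to correct more and more errors, your manual recovery rate will drop significantly (most of your bad data probably comes from similar errors). Cheers!
PS - a bit of an idea (it's been a while, and the logic may need some tweaking since I wrote it from memory), but you'll get the gist: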
public string[] parseCSVWithQuotes(string csvLine, int expectedNumberOfDataPoints)
{
    string ret = "";
    string thisChar = "";
    string lastChar = "";
    // true while inside single quotes, where characters are treated literally;
    // a line starts outside of any quotes
    bool needleDown = false;
    for (int i = 0; i < csvLine.Length; i++)
    {
        thisChar = csvLine.Substring(i, 1);
        if (thisChar == "'" && lastChar != "'")
        {
            needleDown = !needleDown;
        }
        if (thisChar == "," && lastChar != ",")
        {
            // a literal comma becomes a pipe so it doesn't cause another break on split;
            // a comma outside the single quotes stays a comma because the break is intended
            ret += needleDown ? "|" : ",";
            lastChar = thisChar;
            continue; // already handled, so don't append the comma a second time below
        }
        if (!needleDown && (thisChar == "\"" || thisChar == "*"))
        {
            // repeat for any undesired character or use RegEx
            // do not add -- this eliminates any undesired characters outside single quotes
        }
        else if ((lastChar == "'" || lastChar == "\"" || lastChar == ",") && thisChar == lastChar)
        {
            // do not add -- this eliminates doubled characters
        }
        else
        {
            // this character is not an undesired character, is not a double, is valid
            ret += thisChar;
            lastChar = thisChar;
        }
    }
    // we've cleaned as best we can
    string[] parts = ret.Split(',');
    if (parts.Length == expectedNumberOfDataPoints)
    {
        for (int i = 0; i < parts.Length; i++)
        {
            // go back and replace the temporary pipe with the literal comma AFTER split
            parts[i] = parts[i].Replace("|", ",");
        }
        return parts;
    }
    else
    {
        // save ret to bad CSV log
        return null;
    }
}
If you just want to get rid of the stray " marks in the CSV, you can use the following regex to find them and replace them with ':
String sourcestring = "source string to match with pattern";
String matchpattern = @"(?<!^|,)""(?!(,|$))";
String replacementpattern = @"'";
Console.WriteLine(Regex.Replace(sourcestring,matchpattern,replacementpattern,RegexOptions.Multiline));
Explanation:
@"(?<!^|,)""(?!(,|$))";
will find any " that is not preceded by the start of the string or a , and that is not followed by the end of the string or a ,
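To see what the pattern does to the data from the question, here is a small sketch that applies it one line at a time. The class and method names are invented for the example. Working per line with the line ending trimmed is a deliberate choice: with RegexOptions.Multiline, .NET's $ matches before \n but not before \r\n, so on a file with Windows line endings the closing quote of the last field on each line could be treated as stray.

```csharp
using System.Text.RegularExpressions;

public static class StrayQuoteSketch
{
    // Same pattern as above: a quote that is neither next to the start
    // of the line or a comma, nor followed by a comma or the line end.
    private static readonly Regex StrayQuote =
        new Regex(@"(?<!^|,)""(?!(,|$))");

    // Replaces stray double quotes in a single CSV line with '.
    public static string ReplaceStrayQuotes(string line)
    {
        return StrayQuote.Replace(line.TrimEnd('\r', '\n'), "'");
    }
}
```

For the fragment "Address","SUITE "A"","City" this returns "Address","SUITE 'A'","City", which TextFieldParser can then read. A stray quote that happens to sit directly next to a comma is indistinguishable from a real field boundary, so the pattern leaves it alone and the line will still fail to parse.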
I had to do this once too. My approach was to walk through a line and keep track of what I was reading.
Basically, I wrote my own scanner that cuts tokens out of the input line, which gave me full control over my faulty .csv data.
This is what I did:
For each character on a line of input:
1. When outside of a string and meeting a comma => all of the previous string (which can be empty) is a valid token.
2. When outside of a string and meeting anything but a comma or a quote => now you have a real problem, unquoted text => handle as you see fit.
3. When outside of a string and meeting a quote => found the start of a string.
4. When inside of a string and meeting a comma => accept the comma as part of the string.
5. When inside of a string and meeting a quote => trouble starts here, mark this point.
6. Continue, and when meeting a comma (skipping whitespace if desired), close the string, 'unread' the comma and continue. (That will bring you to point 1.)
7. Or continue, and when meeting a quote => obviously, what was read must be part of the string; add it to the string, 'unread' the quote and continue. (That will bring you to point 5.)
8. Or continue and find whitespace, then End Of Line ('\n') => the last quote must be the closing quote. Accept the string as a value.
9. Or continue and find non-whitespace, then End Of Line => now you have a real problem: you have the start of a string but it is not closed => handle the error as you see fit.
If the number of fields in your .csv file is fixed, you can count the commas you recognize as field separators, and when you see the end of the line you know whether you have run into another problem.
With the stream of strings received from the input line you can build a 'clean' .csv line. That way you can build a buffer that accepts and cleans the input, which you can then use in your existing code.
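A compact sketch of such a scanner is shown below. It is one possible reading of the rules above rather than the answerer's code: the names are invented, unquoted text is simply accepted as field data, and an unclosed string raises a FormatException. Both of those are choices the rules leave to you.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class TolerantCsvScanner
{
    // Cuts one input line into fields, treating a quote inside a string
    // as data unless a comma or the end of the line follows it.
    public static List<string> SplitLine(string line)
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool inString = false;

        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (!inString)
            {
                if (c == ',')
                {
                    // Rule 1: everything read so far is a token.
                    fields.Add(current.ToString());
                    current.Clear();
                }
                else if (c == '"' && current.Length == 0)
                {
                    inString = true;   // rule 3: start of a string
                }
                else
                {
                    current.Append(c); // rule 2: unquoted text, kept as is
                }
            }
            else if (c == '"')
            {
                // Rule 5: look ahead, skipping spaces, to classify the quote.
                int j = i + 1;
                while (j < line.Length && line[j] == ' ') j++;
                if (j == line.Length || line[j] == ',')
                {
                    inString = false;  // rules 6 and 8: closing quote
                    i = j - 1;         // 'unread' the comma for the next pass
                }
                else
                {
                    current.Append(c); // rule 7: the quote is data
                }
            }
            else
            {
                current.Append(c);     // rule 4: commas inside a string are data
            }
        }

        if (inString)
        {
            // Rule 9: the string was opened but never closed.
            throw new FormatException("Unclosed string in line: " + line);
        }
        fields.Add(current.ToString());
        return fields;
    }

    // Rebuilds a 'clean' line: every field quoted, inner quotes doubled.
    public static string ToCleanLine(IEnumerable<string> fields)
    {
        return string.Join(",",
            fields.Select(f => "\"" + f.Replace("\"", "\"\"") + "\""));
    }
}
```

For the fragment "LastName, MD","","SUITE "A"","City" SplitLine returns four fields, and ToCleanLine turns them into "LastName, MD","","SUITE ""A""","City", which the existing TextFieldParser code can read. Input that already escapes quotes by doubling them is not unescaped by this sketch, and data that contains a quote directly followed by a comma will still be cut in the wrong place, which is where comparing the field count against the expected number of columns helps.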