Records are not ingested correctly through MLCP when special characters are found in the CSV

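I am using MLCP to ingest data into MarkLogic, but many records are skipped because of invalid characters in the file.

Is there any way to ignore the invalid characters and ingest all the records present in the CSV without skipping any?

Here is the error message from the log: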

WARN Skipped record: abc.csv at line 1414, reason: invalid char between encapsulated token and delimiter

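It would help if you provided a sample of the record that throws the exception. However, the most common cause is that you have , as the delimiter and a value contains quotes that do not encapsulate the entire value.

For example: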

"foo","bar" Y,"foo"

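In this case, "bar" Y is invalid, because the characters after the closing quote are neither a delimiter nor part of an encapsulated value. You can fix this by encapsulating the whole value and escaping each inner quote with a second quote:

"foo","""bar"" Y","foo"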
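If the CSV is generated by your own code, one way to avoid hand-escaping is to let a CSV library do the quoting. The following is a minimal sketch using Python's standard `csv` module, not anything specific to MLCP; the helper name `to_csv_line` is hypothetical and only for illustration.

```python
import csv
import io


def to_csv_line(fields):
    """Serialize one record, letting the csv module quote and escape fields."""
    buf = io.StringIO()
    # QUOTE_ALL wraps every field in double quotes; embedded double quotes
    # are doubled because doublequote=True is the default.
    writer = csv.writer(buf, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow(fields)
    return buf.getvalue().rstrip("\n")


line = to_csv_line(["foo", '"bar" Y', "foo"])
print(line)  # "foo","""bar"" Y","foo"
```

Reading that line back with `csv.reader` yields the original three values, including the quotes around bar.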

https://www.marklogic.com/blog/delimited_text_mlcp

What does the exception mean?

"Invalid char between encapsulated token and delimiter" means that there are invalid characters between an encapsulator and a delimiter. What is an encapsulator? Put simply, it is the character used to wrap a CSV field or column that may contain special characters, such as line breaks. In most cases, people use double quotes as the encapsulator.
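To make the role of the encapsulator concrete, here is a small sketch in Python (standard `csv` module only, with made-up sample data) showing a field that contains both a line break and a comma. Because the field is wrapped in double quotes, it is still read as a single value.

```python
import csv
import io

# Hypothetical sample data: the second field of record 1 spans two physical
# lines and contains a comma, so it is wrapped in double quotes.
data = 'id,comment\n1,"line one\nline two, with a comma"\n'

rows = list(csv.reader(io.StringIO(data)))
for row in rows:
    print(row)
```

The parser returns two rows, and the second row has exactly two fields even though the raw text spans three physical lines.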

How to work around the exception?

The best way to get around this exception is to avoid having malformed CSV data in the first place. If that is not possible, you can escape the double quotes in the field if you really want them to be part of the string. But remember, you must escape double quotes using another double quote in CSV!
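Before re-running the import, it can help to locate the malformed lines so they can be fixed at the source. The sketch below uses the strict mode of Python's `csv` module as a rough stand-in for MLCP's parser; the two parsers are not guaranteed to agree on every edge case. The function name `find_malformed_lines` is hypothetical, and the check assumes one record per physical line, so a file with encapsulated line breaks would produce false positives.

```python
import csv


def find_malformed_lines(lines):
    """Return (line_number, error_message) pairs for lines that fail strict parsing."""
    problems = []
    for number, line in enumerate(lines, start=1):
        try:
            # strict=True makes the reader raise csv.Error on input such as
            # text between a closing quote and the next delimiter.
            next(csv.reader([line], strict=True))
        except csv.Error as exc:
            problems.append((number, str(exc)))
    return problems


sample = [
    '"foo","bar","foo"',
    '"foo","bar" Y,"foo"',      # malformed: text after the closing quote
    '"foo","""bar"" Y","foo"',  # valid: inner quotes are doubled
]
problems = find_malformed_lines(sample)
for number, message in problems:
    print(f"line {number}: {message}")
```

For a real file, you could pass an open file object instead of the sample list, for example `find_malformed_lines(open("abc.csv", newline=""))`, and compare the reported line numbers with the ones in the MLCP log.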