USQL Escape Quotes
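I am new to Azure Data Lake Analytics. I am trying to load a CSV whose fields are double-quoted, and some random rows also have quotes inside a column.
For example: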
ID, BookName
1, "Life of Pi"
2, "Story about "Mr X""
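When I try to load it, it fails on the second record and throws an error message.
1. Is there a way to fix this in the CSV file itself? Unfortunately we cannot pull a fresh extract from the source, because these are log files.
2. Is it possible to make ADLA ignore the bad rows and carry on processing the remaining records?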
Execution failed with error '1_SV1_Extract Error :
'{"diagnosticCode":195887146,"severity":"Error","component":"RUNTIME","source":"User","errorId":"E_RUNTIME_USER_EXTRACT_ROW_ERROR","message":"Error
occurred while extracting row after processing 9045 record(s) in the
vertex' input split. Column index: 9, column name:
'instancename'.","description":"","resolution":"","helpLink":"","details":"","internalDiagnostics":"","innerError":{"diagnosticCode":195887144,"severity":"Error","component":"RUNTIME","source":"User","errorId":"E_RUNTIME_USER_EXTRACT_EXTRACT_INVALID_CHARACTER_AFTER_QUOTED_FIELD","message":"Invalid
character following the ending quote character in a quoted
field.","description":"Invalid character is detected following the
ending quote character in a quoted field. A column delimiter, row
delimiter or EOF is expected.\nThis error can occur if double-quotes
within the field are not correctly escaped as two
double-quotes.","resolution":"Column should be fully surrounded with
double-quotes and double-quotes within the field escaped as two
double-quotes."
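According to the error message, if you are importing a quoted CSV in which some columns contain quotes, those quotes need to be escaped as two double-quotes. In your specific example, your second row needs to be: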
..."Life after death and ""good death"" models - a qualitative study",...
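So one option is to fix up the original file when it is output. If you can't do that, you can import each record as a single column, use RegEx to fix up the quotes, and output the file again, e.g.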
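// Import each record as a single column, then use RegEx to clean it.
// '|' is used as the delimiter on the assumption that it never occurs in the data,
// so the whole line lands in oneCol; quoting is off so quotes are read as plain text.
@input =
    EXTRACT oneCol string
    FROM "/input/input132.csv"
    USING Extractors.Text('|', quoting : false);

// Fix up the quotes using RegEx: double any quote that is not next to a comma.
// $1 and $2 put back the characters captured on either side of the quote.
// Assumes no space after the commas and that an embedded quote never sits
// directly next to a field's opening or closing quote.
@output =
    SELECT Regex.Replace(oneCol, "([^,])\"([^,])", "$1\"\"$2") AS cleanCol
    FROM @input;

// Write the cleaned lines back out unquoted, so the doubled quotes are kept as-is.
OUTPUT @output
TO "/output/output.csv"
USING Outputters.Csv(quoting : false);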
The file will now import successfully.
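As a rough sketch of that re-import step, the cleaned file could then be read with quoting switched on. This assumes the two-column layout from the question's example (ID, BookName), a header row on the first line, and no space after the comma delimiter; the output path "/output/books.csv" is a hypothetical name, not something from the original post.

```
// Sketch only: re-read the cleaned file produced by the script above.
// Column names and types are assumed from the question's example.
@books =
    EXTRACT ID int,
            BookName string
    FROM "/output/output.csv"
    USING Extractors.Csv(quoting : true, skipFirstNRows : 1);

// Hypothetical destination; quoting is on so embedded quotes are escaped again on output.
OUTPUT @books
TO "/output/books.csv"
USING Outputters.Csv(outputHeader : true, quoting : true);
```

With quoting enabled, the extractor treats each doubled quote inside a quoted field as one literal quote, which is the escaping the error message asks for.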