How to escape only the delimiter and not the newline character in CSV

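I receive ordinary comma-separated CSV files whose data contains newline characters.

Input data

I want to convert the input data so that it is:

  1. Pipe (|) delimited
  2. Free of any quote escaping (" or ')
  3. Escaped with a caret (^) wherever a pipe (|) appears inside the data

My files may also contain multi-line data (that is, a single record whose data spans several lines).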
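The three requirements above can be sketched as a small formatting function. This is only a sketch, not a definitive implementation: `format_record` is a hypothetical helper name, and it assumes that the caret itself never needs escaping and that every record ends with CRLF.

```python
def format_record(fields, delimiter="|", escape="^", terminator="\r\n"):
    """Join already-parsed fields into one output record.

    Only the delimiter is escaped. Quotes, CR and LF inside a field are
    written through unchanged (assumption: the caret is never escaped).
    """
    escaped = [field.replace(delimiter, escape + delimiter) for field in fields]
    return delimiter.join(escaped) + terminator


# Made-up sample values: a pipe, a pair of double quotes, and an embedded LF.
record = format_record(["Test Vendor | 3", 'a "quoted" word', "two\nlines"])
print(repr(record))
```

Because the function works on fields that have already been parsed, the quote handling of the input file is left entirely to whatever reads it.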

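Expected output data

I am able to generate an output file.

As you can see in the image, the caret (^) correctly escapes every pipe (|) in the data, but it also escapes the newline characters on lines 5 and 6, which I do not want.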
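The behaviour described above can be reproduced in a few lines. This is a minimal sketch, assuming CPython's `csv` module with its default line terminator of `\r\n`: under `QUOTE_NONE` the writer appears to escape any character that is part of the line terminator, which would explain why an embedded LF picks up a caret. The sample values are made up for illustration.

```python
import csv
import io

# Write one row into an in-memory buffer with the same dialect options
# as the question's script (pipe delimiter, no quoting, caret escape).
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="|", quoting=csv.QUOTE_NONE, escapechar="^")
writer.writerow(["a|b", "line1\nline2"])

output = buffer.getvalue()
# The pipe is escaped as wanted, but the embedded \n is also preceded by ^.
print(repr(output))
```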

Note: all carriage returns (\r, CR) and line feeds (\n, LF) should stay exactly as they appear in the image.

import csv
import sys

inputPath = sys.argv[1]
outputPath = sys.argv[2]
with open(inputPath, newline='', encoding="utf-8") as inputFile:  # newline='' keeps \r\n inside fields untouched
    with open(outputPath, 'w', newline='', encoding="utf-8") as outputFile:
        reader = csv.DictReader(inputFile, delimiter=',')
        writer = csv.DictWriter(
            outputFile, reader.fieldnames, delimiter='|', quoting=csv.QUOTE_NONE,
            escapechar='^', doublequote=False,
            quotechar=None)  # None rather than "": Python 3.11+ rejects an empty quotechar
        writer.writeheader()
        writer.writerows(reader)

print("Formatting complete.")

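The code above is written in Python, so help in Python would be ideal. Answers in other programming languages are also welcome.

There are more than 8 million records.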
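With this many records, one option is to stream the file record by record so that memory use stays flat regardless of file size. The sketch below is an assumption-laden illustration, not a tested solution for the real data: `convert` is a hypothetical name, it lets `csv.reader` parse the quoted input and then writes each record by hand instead of using `csv.writer`, and it assumes UTF-8 files and CRLF record endings.

```python
import csv


def convert(input_path, output_path):
    """Stream a comma-separated file into the pipe-separated layout.

    Returns the number of records written (the header counts as one).
    """
    count = 0
    # newline='' on both files stops Python from translating \r and \n,
    # so line breaks inside fields pass through exactly as they were read.
    with open(input_path, newline="", encoding="utf-8") as src, \
         open(output_path, "w", newline="", encoding="utf-8") as dst:
        for row in csv.reader(src):
            # Escape only the pipe; nothing else in the field is altered.
            dst.write("|".join(field.replace("|", "^|") for field in row))
            dst.write("\r\n")  # assumed record terminator
            count += 1
    return count
```

It could be called as `convert(sys.argv[1], sys.argv[2])` in place of the reader and writer pair in the script above.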

Some sample data is below:

"VENDOR ID","VENDOR NAME","ORGANIZATION NUMBER","ADDRESS 1","CITY","COUNTRY","ZIP","PRIMARY PHONE","FAX","EMAIL","LMS RECORD CREATED DATE","LMS RECORD MODIFY DATE","DELETE FLAG","LMS RECORD ID"
"a0E6D000001Fag8UAC","Test 'Vendor' 1","","This Vendor contains a single (') quote.","","","","","","test@test.com","2020-4-1 06:32:29","2020-4-1 06:34:43","false",""
"a0E6D000001FagDUAS","Test ""Vendor"" 2","","This Vendor contains a double("") quote.","","","","","","test@test.com","2020-4-1 06:33:38","2020-4-1 06:35:18","false",""
"a0E6D000001FagIUAS","Test Vendor | 3","","This Vendor contains a Pipe (|).","","","","","","test@test.com","2020-4-1 06:38:45","2020-4-1 06:38:45","false",""
"a0E6D000001FagNUAS","Test Vendor 4","","This Vendor contains a
carriage return, i.e 
data in new line.","","","","","","test@test.com","2020-4-1 06:43:08","2020-4-1 06:43:08","false",""

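Note: if you copy the data above, make sure that lines 5 and 6 end with LF only (i.e. a line feed, \n), as shown in the image; otherwise, please try to reproduce those 2 lines, because they are what this question is all about: those two lines must not be specially escaped, as highlighted in the image below.

The code above is the end result of everything I found online. I even tried the pandas library, and its final output is the same.

The code below is just another way of getting the expected output, but the problem remains, because this script takes forever when run on 9 million records: after more than 12 hours it still had not finished, and I eventually had to kill the process.

Batch wrapper around the JScript code (run through cscript):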

0</* :
    @echo off

        cscript /nologo /E:jscript "%~f0" %*

    exit /b %errorlevel% */0;

        var ARGS = WScript.Arguments;

        if (ARGS.Length < 3 ) {
            WScript.Echo("Wrong arguments");
            WScript.Echo(WScript.ScriptName + " path_to_file search replace [search replace[search replace [...]]]");
            WScript.Echo(WScript.ScriptName + " e?path_to_file search replace [search replace[search replace [...]]]");
            WScript.Echo("if filename starts with \"e?\" search and replace string will be evaluated for special characters ")
            WScript.Quit(1);
        }

        if (ARGS.Item(0).toLowerCase() == "-help" || ARGS.Item(0).toLowerCase() == "-h") {
            WScript.Echo(WScript.ScriptName + " path_to_file search replace [search replace[search replace [...]]]");
            WScript.Echo(WScript.ScriptName + " e?path_to_file search replace [search replace[search replace [...]]]");
            WScript.Echo("if filename starts with \"e?\" search and replace string will be evaluated for special characters ")
            WScript.Quit(0);
        }



        if (ARGS.Length % 2 !== 1 ) {
            WScript.Echo("Wrong arguments");
            WScript.Quit(2);
        }

        var jsEscapes = {
          'n': '\n',
          'r': '\r',
          't': '\t',
          'f': '\f',
          'v': '\v',
          'b': '\b'
        };


        //string evaluation
        //

        function decodeJsEscape(_, hex0, hex1, octal, other) {
          var hex = hex0 || hex1;
          if (hex) { return String.fromCharCode(parseInt(hex, 16)); }
          if (octal) { return String.fromCharCode(parseInt(octal, 8)); }
          return jsEscapes[other] || other;
        }

        function decodeJsString(s) {
          return s.replace(
              // Matches an escape sequence with UTF-16 in group 1, single byte hex in group 2,
              // octal in group 3, and arbitrary other single-character escapes in group 4.
              /\\(?:u([0-9A-Fa-f]{4})|x([0-9A-Fa-f]{2})|([0-3][0-7]{0,2}|[4-7][0-7]?)|(.))/g,
              decodeJsEscape);
        }

        function convertToPipe(find, replace, str) {        
          return str.replace(new RegExp('\\|','g'),"^|");  // '\\|' matches a literal pipe; '\|' would collapse to the bare alternation '|'
        }

        function removeStartingQuote(find, replace, str) {      
          return str.replace(new RegExp('^"', 'g'), '');
        }

        function removeEndQuote(find, replace, str) {       
          return str.replace(new RegExp('"\r\n$', 'g'), '\r\n');
        }

        function removeLeadingAndTrailingQuotes(find, replace, str) {       
          return str.replace(new RegExp('"\r\n"', 'g'), '\r\n');
        }

        function replaceDelimiter(find, replace, str) {     
          return str.replace(new RegExp('","', 'g'), '|');
        }

        function convertSFDCDoubleQuotes(find, replace, str) {      
          return str.replace(new RegExp('""', 'g'), '"');
        }


      function getContent(file) {
            // :: http://www.dostips.com/forum/viewtopic.php?f=3&t=3855&start=15&p=28898  ::
            var ado = WScript.CreateObject("ADODB.Stream");
            ado.Type = 2;  // adTypeText = 2

            ado.CharSet = "iso-8859-1";  // code page with minimum adjustments for input
            ado.Open();
            ado.LoadFromFile(file);

            var adjustment = "\u20AC\u0081\u201A\u0192\u201E\u2026\u2020\u2021" +
                             "\u02C6\u2030\u0160\u2039\u0152\u008D\u017D\u008F" +
                             "\u0090\u2018\u2019\u201C\u201D\u2022\u2013\u2014" +
                             "\u02DC\u2122\u0161\u203A\u0153\u009D\u017E\u0178" ;


            var fs = new ActiveXObject("Scripting.FileSystemObject");
            var size = (fs.getFile(file)).size;

            var lnkBytes = ado.ReadText(size);
            ado.Close();
            var chars=lnkBytes.split('');
            for (var indx=0;indx<size;indx++) {
                if ( chars[indx].charCodeAt(0) > 255 ) {
                   chars[indx] = String.fromCharCode(128 + adjustment.indexOf(chars[indx]));
                }
            }
            return chars.join("");
       }

       function writeContent(file,content) {
            var ado = WScript.CreateObject("ADODB.Stream");
            ado.Type = 2;  // adTypeText = 2
            ado.CharSet = "iso-8859-1";  // right code page for output (no adjustments)
            //ado.Mode=2;
            ado.Open();

            ado.WriteText(content);
            ado.SaveToFile(file, 2);
            ado.Close();    
       }

        if (typeof String.prototype.startsWith != 'function') {
          // see below for better implementation!
          String.prototype.startsWith = function (str){
            return this.indexOf(str) === 0;
          };
        }


        var evaluate=false;
        var filename=ARGS.Item(0);
        if(filename.toLowerCase().startsWith("e?")) {
            filename=filename.substring(2,filename.length);
            evaluate=true;
        }
        var content=getContent(filename);
        var newContent=content;
        var find="";
        var replace="";

        for (var i=1;i<ARGS.Length-1;i=i+2){
            find=ARGS.Item(i);
            replace=ARGS.Item(i+1);
            if(evaluate){
                find=decodeJsString(find);
                replace=decodeJsString(replace);
            }
            newContent=convertToPipe(find,replace,newContent);
            newContent=removeStartingQuote(find,replace,newContent);        
            newContent=removeEndQuote(find,replace,newContent);
            newContent=removeLeadingAndTrailingQuotes(find,replace,newContent);
            newContent=replaceDelimiter(find,replace,newContent);       
            newContent=convertSFDCDoubleQuotes(find,replace,newContent);        
        }

        writeContent(filename,newContent);

Steps to run:

> replace.bat <file_name or full_path_to_file> "." "."

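This batch file was made so that it can operate on any file according to our requirements.

I compiled and built it from a lot of Google searching. It is still a work in progress, because I have hardcoded my regular expressions in the file. You can modify the functions I wrote to suit your needs, or even make your own by copying the other functions and calling them at the end.

Another option for what I am trying to achieve, implemented as a Windows PowerShell script: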

((Get-Content -path $args[0] -Raw) -replace '\|', '^|') | Set-Content -NoNewline -Force -Path $args[0]
((Get-Content -path $args[0] -Raw) -replace '^"', '') | Set-Content -NoNewline -Force -Path $args[0]
((Get-Content -path $args[0] -Raw) -replace "`"\r\n$", "") | Set-Content -NoNewline -Force -Path $args[0]
((Get-Content -path $args[0] -Raw) -replace '"\r\n"', "`r`n") | Set-Content -NoNewline -Force -Path $args[0]
((Get-Content -path $args[0] -Raw) -replace '","', '|') | Set-Content -NoNewline -Force -Path $args[0]
((Get-Content -path $args[0] -Raw) -replace '""', '"' ) | Set-Content -Path $args[0]

How to run:

  1. Using PowerShell

    replace.ps1 '< path_to_file >'

  2. Using a batch script

    C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe -ExecutionPolicy ByPass -command "& '< path_to_ps_script >\replace.ps1' '< path_to_csv_file >.csv'"

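Note: requires PowerShell v5.0 or later.

This can process 1 million records in about a minute.

I found that we have to split the huge CSV file into multiple files of 1 million records each and then process them separately.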
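Splitting by raw line count would cut multi-line records in half, so a split of this kind has to count parsed records rather than lines. The following is a rough sketch under stated assumptions: `split_csv` and the `part_{}.csv` naming are hypothetical, the input is assumed to be UTF-8 with a header row, and every part is rewritten fully quoted so it can be parsed again afterwards.

```python
import csv
import itertools


def split_csv(input_path, output_pattern, records_per_file=1_000_000):
    """Split a CSV into numbered parts, repeating the header in each part.

    output_pattern must contain '{}' for the part number.
    Returns the list of paths that were written.
    """
    paths = []
    with open(input_path, newline="", encoding="utf-8") as src:
        reader = csv.reader(src)
        header = next(reader, None)
        if header is None:  # empty input: nothing to split
            return paths
        for part in itertools.count(1):
            chunk = itertools.islice(reader, records_per_file)
            first = next(chunk, None)
            if first is None:  # no records left
                break
            path = output_pattern.format(part)
            with open(path, "w", newline="", encoding="utf-8") as dst:
                writer = csv.writer(dst, quoting=csv.QUOTE_ALL)
                writer.writerow(header)
                writer.writerow(first)
                writer.writerows(chunk)  # streams the rest of this part
            paths.append(path)
    return paths
```

For example, `split_csv("vendors.csv", "vendors_part_{}.csv")` would write `vendors_part_1.csv`, `vendors_part_2.csv`, and so on; both file names are placeholders.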

Please correct me if anything here is wrong, or if there is a better alternative.