Line feeds in unwanted places cause problems when reading data

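My data has more than 50,000 observations. The problem is that line feeds (LF) are scattered throughout, which makes importing it into statistical software such as Stata a nightmare. I tried many different options in Stata and finally gave up on it. Now, after spending half a day in Notepad++, I have found that manually deleting the "LF" wherever a record has been pushed onto a new line solves the problem.

I tried using 'Replace All' to replace 'LF' with nothing (an empty field), but that collapses all the data onto a single line, which software such as Excel and Stata interprets as one long list of variables.

I am hoping someone has run into this problem and found a solution. If anyone could share such a solution, that would be great!

The messy data looks like this in Notepad++: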

"id"|"sex"|"choice"LF
"1aef2"|"M"|"burger",.LF
"pizza"LF
"pasta"LF
"B2qwX"|"F"|"salad".LF
"keto diet",LF
""LF

I want clean data like this:

"id"|"sex"|"choice"LF
"1aef2"|"M"|"burger","pizza""pasta"LF
"B2qwX"|"F"|"salad""keto diet",""LF

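Please help!

Assuming you have Excel, Word, or another Office product on your computer, use the method below (I already had VBA code written for this; a Python version follows at the end).

Add a module by right-clicking the VBA project > Insert > Module.

Go to Tools > References and check Microsoft Scripting Runtime.

Then paste the code below into the empty code window: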

Option Explicit

Sub CorrectFileFormat()

    Dim InFile As FileSystemObject
    Set InFile = New FileSystemObject
    
    Dim OutFile As FileSystemObject
    Set OutFile = New FileSystemObject
    
    Dim InFilePath As String
    InFilePath = "c:\test.txt"
    
    Dim OutFilePath As String
    OutFilePath = "c:\test2.txt"
    
    Dim InStream As TextStream
    Set InStream = InFile.OpenTextFile(InFilePath, ForReading, False)
    
    Dim OutStream As TextStream
    Set OutStream = OutFile.OpenTextFile(OutFilePath, ForWriting, True)

    Dim Line As String
    Dim OutputLine As String
    OutputLine = ""
    
    ' Long rather than Integer: an Integer overflows past 32,767 lines
    Dim LineNo As Long
    LineNo = 1
    
    While Not InStream.AtEndOfStream
        Line = InStream.ReadLine
        
        If Len(Trim(Line)) = 0 Then
            ' empty line: skip it (no GoTo, so a trailing blank line
            ' cannot trigger a read past the end of the file)
        ElseIf InStr(Line, """id""|") = 1 Then
            ' write the header line to the output file as-is
            OutStream.WriteLine Line
        ElseIf InStr(Line, "|") > 0 Then
            ' a line containing a separator starts a new record.
            ' If the buffer already holds a record, write it to the
            ' file first, then overwrite the buffer with this line
            If Len(Trim(OutputLine)) > 0 Then
                OutStream.WriteLine OutputLine
            End If
            OutputLine = Line
        Else
            ' continuation line: append it to the buffered record
            OutputLine = OutputLine & Line
        End If
        
        LineNo = LineNo + 1
    Wend
    
    ' write whatever's in outputline to the output file
    OutStream.WriteLine (OutputLine)
    
    InStream.Close
    OutStream.Close
    
    MsgBox "Done"
    
End Sub

Place the cursor anywhere inside the subroutine and click the Run button.

You will see a message box saying Done.

Original text file

"id"|"sex"|"choice"
"1aef2"|"M"|"burger"
"pizza"
"pasta"
"B2qwX"|"F"|"salad"
"keto diet",
""

New text file

"id"|"sex"|"choice"
"1aef2"|"M"|"burger""pizza""pasta"
"B2qwX"|"F"|"salad""keto diet",""

You can adjust the code as needed.

Python example

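inpath = r'c:\test.txt'
outpath = r'c:\test2.txt'

# context managers close both files even if an error occurs
with open(inpath, 'r') as infile, open(outpath, 'w') as outfile:
    outstring = ''  # buffer holding the record currently being assembled

    for line in infile:
        # skip empty lines
        if len(line.strip()) == 0:
            continue

        # write the header line to the output file as-is
        if line.startswith('"id"|'):
            outfile.write(line)
            continue

        if '|' in line:
            # a line with a separator starts a new record:
            # flush the previous record before buffering this one
            if len(outstring) > 0:
                outfile.write(outstring + '\n')
            outstring = line.strip()
        else:
            # continuation line: append it to the buffered record
            outstring += line.strip()

    # write the last buffered record, if there is one
    if outstring:
        outfile.write(outstring + '\n')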

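Adjust this code to suit your needs.

In Notepad++, the following removes a line break whenever the next line does not contain a pipe character.

  • Ctrl+H
  • Find what: \n(?!.*\|)
  • Replace with: LEAVE EMPTY
  • check Wrap around
  • check Regular expression
  • UNCHECK . matches newline
  • Replace all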
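
Before importing the result into Stata or Excel, it may be worth confirming that every line now has the expected number of fields. The sketch below is one way to do that; the function name find_malformed_lines and its parameters are made up for illustration, and it assumes the pipe never appears inside a quoted value.

```python
def find_malformed_lines(lines, expected_fields=3, delimiter='|'):
    """Return (line_number, line) pairs whose field count is unexpected.

    Assumption: the delimiter never occurs inside a quoted value.
    """
    bad = []
    for number, line in enumerate(lines, start=1):
        stripped = line.rstrip('\n')
        if not stripped:
            continue  # ignore blank lines
        # N fields are separated by N - 1 delimiters
        if stripped.count(delimiter) + 1 != expected_fields:
            bad.append((number, stripped))
    return bad


# Cleaned sample from above, plus a deliberately broken variant.
cleaned = [
    '"id"|"sex"|"choice"\n',
    '"1aef2"|"M"|"burger""pizza""pasta"\n',
    '"B2qwX"|"F"|"salad""keto diet",""\n',
]
messy = cleaned[:2] + ['"pizza"\n']

print(find_malformed_lines(cleaned))
print(find_malformed_lines(messy))
```

The function accepts any iterable of lines, so an open file object for the output file can be passed in directly instead of a list.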

Explanation:

\n          # linefeed
(?!         # negative lookahead, make sure we don't have, after:
    .*          # 0 or more any character but newline
    \|          # a pipe character
)           # end lookahead
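
The same pattern can also be tried outside Notepad++. The following is a minimal sketch using Python's re module, assuming the file uses LF line endings as in the question; join_broken_lines is a hypothetical helper name, and the sample string is the data from the question.

```python
import re

# Same pattern as in the Notepad++ steps: a line feed that is NOT followed
# by a line containing a pipe. Without re.DOTALL, "." stops at newlines,
# which mirrors leaving ". matches newline" unchecked.
PATTERN = re.compile(r'\n(?!.*\|)')


def join_broken_lines(text):
    """Remove every line feed whose following line has no pipe character."""
    return PATTERN.sub('', text)


# Sample taken from the question (LF line endings assumed).
sample = (
    '"id"|"sex"|"choice"\n'
    '"1aef2"|"M"|"burger"\n'
    '"pizza"\n'
    '"pasta"\n'
    '"B2qwX"|"F"|"salad"\n'
    '"keto diet",\n'
    '""\n'
)

cleaned = join_broken_lines(sample)
print(cleaned)
```

Note that the final line feed of the text is removed as well, because nothing containing a pipe follows it. If the file has Windows (CRLF) line endings, the pattern would need to account for the carriage return too.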
