Line feeds in unwanted places cause problems when reading data
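My data has more than 50,000 observations. The problem is that line feeds (LF) are all over the place, which makes importing the file into statistical software such as Stata a nightmare. I tried many different options in Stata and eventually gave up on it. Now, after spending half a day in Notepad++, I have found that manually deleting the "LF" wherever data has been pushed onto a new line fixes the problem.
I tried using 'Replace All' to replace 'LF' with nothing (an empty field), but that collapsed all the data onto a single line, which software such as Excel and Stata interprets as one long list of variables.
I am hoping someone has run into this problem and found a solution. If anyone could share it, that would be great!
In Notepad++ the messy data looks like this: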
"id"|"sex"|"choice"LF
"1aef2"|"M"|"burger",.LF
"pizza"LF
"pasta"LF
"B2qwX"|"F"|"salad".LF
"keto diet",LF
""LF
I want clean data like this:
"id"|"sex"|"choice"LF
"1aef2"|"M"|"burger","pizza""pasta"LF
"B2qwX"|"F"|"salad""keto diet",""LF
Please help!
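Assuming you have Excel, Word, or another Office product on your computer, use the method below (I already had code written for this; a Python version follows at the end).
- Open Excel
- Add the Developer tab by following this video: https://youtu.be/_oZGdg1aiEQ?t=26
- Go to the Developer tab and click Visual Basic
- Add a module by right-clicking VBAProject > Insert > Module
- Go to Tools > References and check Microsoft Scripting Runtime
Then paste the code below into the blank code pane: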
Option Explicit

Sub CorrectFileFormat()
    ' Requires a reference to Microsoft Scripting Runtime
    Dim Fso As FileSystemObject
    Set Fso = New FileSystemObject

    Dim InFilePath As String
    InFilePath = "c:\test.txt"
    Dim OutFilePath As String
    OutFilePath = "c:\test2.txt"

    Dim InStream As TextStream
    Set InStream = Fso.OpenTextFile(InFilePath, ForReading, False)
    Dim OutStream As TextStream
    Set OutStream = Fso.OpenTextFile(OutFilePath, ForWriting, True)

    Dim Line As String
    Dim OutputLine As String
    OutputLine = ""

    ' The end-of-stream test runs before every read, so a trailing
    ' empty line cannot trigger a read past the end of the file
    Do While Not InStream.AtEndOfStream
        Line = InStream.ReadLine
        If Len(Trim(Line)) = 0 Then
            ' empty line: skip it
        ElseIf InStr(Line, """id""|") = 1 Then
            ' write the header line to the output file as-is
            OutStream.WriteLine Line
        ElseIf InStr(Line, "|") > 0 Then
            ' a line containing a separator starts a new record:
            ' write the buffered record first, then overwrite the buffer
            If Len(Trim(OutputLine)) > 0 Then
                OutStream.WriteLine OutputLine
            End If
            OutputLine = Line
        Else
            ' continuation line: append it to the buffered record
            OutputLine = OutputLine & Line
        End If
    Loop

    ' write whatever is left in the buffer to the output file
    If Len(OutputLine) > 0 Then OutStream.WriteLine OutputLine
    InStream.Close
    OutStream.Close
    MsgBox "Done"
End Sub
Place the cursor anywhere inside the subroutine and click the Run button.
You will see a message box saying Done.
Original text file:
"id"|"sex"|"choice"
"1aef2"|"M"|"burger"
"pizza"
"pasta"
"B2qwX"|"F"|"salad"
"keto diet",
""
New text file:
"id"|"sex"|"choice"
"1aef2"|"M"|"burger""pizza""pasta"
"B2qwX"|"F"|"salad""keto diet",""
You can adjust the code to suit your needs.
Python example:
inpath = r'c:\test.txt'
outpath = r'c:\test2.txt'

outstring = ''
with open(inpath, 'r') as infile, open(outpath, 'w') as outfile:
    for line in infile:
        line = line.strip()
        # skip empty lines
        if not line:
            continue
        # write the header line as-is
        if line.startswith('"id"|'):
            outfile.write(line + '\n')
            continue
        if '|' in line:
            # a line with a separator starts a new record;
            # write out the previous record first
            if outstring:
                outfile.write(outstring + '\n')
            outstring = line
        else:
            # continuation line: append it to the current record
            outstring += line
    # write the last buffered record
    if outstring:
        outfile.write(outstring + '\n')
Adjust this code to suit your needs.
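Before importing the cleaned file into Stata or Excel, it may be worth checking that every record now has the same number of separators as the header. The sketch below is not part of the original answer: the function name `find_bad_lines` is hypothetical, the first line is assumed to be the header, and the pipe character is assumed never to appear inside a quoted value.

```python
def find_bad_lines(lines, sep='|'):
    """Return (line_number, text) pairs whose separator count differs from the header's."""
    bad = []
    expected = None
    for number, line in enumerate(lines, start=1):
        text = line.rstrip('\n')
        if expected is None:
            # assumption: the first line is the header
            expected = text.count(sep)
            continue
        if text.count(sep) != expected:
            bad.append((number, text))
    return bad


# the cleaned sample from above
cleaned = [
    '"id"|"sex"|"choice"\n',
    '"1aef2"|"M"|"burger""pizza""pasta"\n',
    '"B2qwX"|"F"|"salad""keto diet",""\n',
]
print(find_bad_lines(cleaned))  # an empty list means every record matches the header
```

To check a real file, an open file object can be passed in place of the list, for example `with open(outpath) as f: print(find_bad_lines(f))`.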
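This removes a line break whenever the line that follows it contains no pipe character.
- Ctrl+H
- Find what: \n(?!.*\|)
- Replace with: LEAVE EMPTY
- Check Wrap around
- Check Regular expression
- Uncheck . matches newline
- Replace all
Explanation: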
\n      # linefeed
(?!     # negative lookahead: make sure that what follows is not
  .*    # 0 or more of any character except newline
  \|    # a pipe character
)       # end lookahead
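The same pattern can also be tried outside Notepad++. As a minimal sketch (not from the original answer), Python's `re` module supports the same negative lookahead, and by default `.` does not match a newline, which corresponds to leaving ". matches newline" unchecked. The sample string is the example data from this thread, the variable names are illustrative, and LF-only line endings are assumed.

```python
import re

messy = (
    '"id"|"sex"|"choice"\n'
    '"1aef2"|"M"|"burger"\n'
    '"pizza"\n'
    '"pasta"\n'
    '"B2qwX"|"F"|"salad"\n'
    '"keto diet",\n'
    '""\n'
)

# remove a linefeed only when the line that follows it has no pipe character
cleaned = re.sub(r'\n(?!.*\|)', '', messy)
print(cleaned)
```

Note that the final linefeed of the file is removed as well, because nothing containing a pipe follows it.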