How to complete this Python script to manipulate data in tab delimited file?
I have a list of part numbers and serial numbers in a tab-delimited file, and I need to join each pair with a hyphen to make an asset number.
Here is the input:
Part Number Serial Number
PART1 SERIAL1
,PART2 SERIAL2
, PART3 SERIAL3
This is my desired output:
Part Number Serial Number Asset Number
PART1 SERIAL1 PART1-SERIAL1
,PART2 SERIAL2 PART2-SERIAL2
, PART3 SERIAL3 PART3-SERIAL3
I tried the following code:
import csv
input_list = []
with open('Assets.txt', mode='r') as input:
    for row in input:
        field = row.strip().split('\t') #Remove new lines and split at tabs
        for x, i in enumerate(field):
            if i[0] == (','): #If the start of a field starts with a comma
                field[x][0] = ('') #Replace that first character with nothing
                field[x].lstrip() #Strip any whitespace
        print(field)
This code produced the following actual output:
['Part Number', 'Serial Number']
['PART1', 'SERIAL1']
['",PART2"', 'SERIAL2']
['", PART3"', 'SERIAL3']
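My first problem is that my code to remove the commas and spaces at the start of each field does not work.
My second problem is that quotes have been added around the fields that contain spaces.
My third problem is that I don't know how to add another item (the asset number) to the list so that I can join the fields.
Can anyone help me solve these problems?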
import pandas as pd
data = {'Part Number': ['PART1',', PART2',', PART3'],
'Serial Number': ['Serial1','Serial2','Serial3']}
df = pd.DataFrame(data)
df.loc[:,'AssetNumber'] = df.loc[:,'Part Number'].apply(lambda x: str(x).strip().replace(',','')) + '-' + df.loc[:,'Serial Number'].apply(lambda x: str(x).strip().replace(',',''))
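This does what you want.
If you are working from a CSV file, load it with:
df = pd.read_csv('filepathasstring',sep='\t')
If you run into problems, check this out:
Then you can save it as tab-delimited by calling:
df.to_csv('filepathasstring', sep='\t', index=False)  # index=False omits the row-index column
If you don't have pandas yet, here is how to get it: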
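You can strip the commas unconditionally; it does no harm when none are present, so the check if i[0] == ",": is no longer needed. You also stripped a string, but the result was never stored back in the list. Both points are fixed here: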
input_list = []
with open('Assets.txt', mode='r') as text_file:
    for row in text_file:
        field = row.strip('\n').split('\t') # Remove new lines and split at tabs.
        for n, word in enumerate(field):
            field[n] = word.lstrip(", ") # Strip any number of whitespaces and commas.
        print(field)
Output:
['Part Number', 'Serial Number']
['PART1', 'SERIAL1']
['PART2', 'SERIAL2']
['PART3', 'SERIAL3']
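Two details of the original attempt are worth spelling out: Python strings are immutable, so an item assignment such as field[x][0] = '' raises a TypeError if it is ever reached, and lstrip() returns a new string instead of changing the one it is called on. A minimal illustration, where the variable names are chosen only for this sketch:

```python
word = ", PART3"

# lstrip() returns a new string; the original is left untouched.
stripped = word.lstrip(", ")
print(repr(word))
print(repr(stripped))

# Strings do not support item assignment, so this raises TypeError.
assignment_error = None
try:
    word[0] = ""
except TypeError as error:
    assignment_error = error
    print("TypeError:", error)
```

This is why the fixed loop assigns the stripped value back with field[n] = word.lstrip(", ").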
So now we can add Asset_number = field[0] + '-' + field[1] inside the loop, which gives you the PARTx-SERIALx value you want.
With a small modification to get the desired output:
input_list = []
with open('Assets.txt', mode='r') as text_file:
    for m, row in enumerate(text_file):
        field = row.strip('\n').split('\t') # Remove new lines and split at tabs.
        for n, word in enumerate(field):
            field[n] = word.lstrip(", ") # Strip any number of whitespaces and commas.
        if m == 0: # Special case for the header.
            text_to_print = field[0] + '\t' + field[1] + '\t' + 'Asset Number'
        else:
            Asset_number = field[0] + '-' + field[1]
            text_to_print = field[0] + '\t' + field[1] + '\t' + Asset_number
        print(text_to_print)
And the printed output is:
Part Number Serial Number Asset Number
PART1 SERIAL1 PART1-SERIAL1
PART2 SERIAL2 PART2-SERIAL2
PART3 SERIAL3 PART3-SERIAL3
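For some reason it does not look right here, but the strings are still correct and the tabs are where they should be, so writing them to a new file instead of printing them should be no problem.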
'Part Number\tSerial Number\tAsset Number'
'PART1\tSERIAL1\tPART1-SERIAL1'
'PART2\tSERIAL2\tPART2-SERIAL2'
'PART3\tSERIAL3\tPART3-SERIAL3'
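The asker's actual output shows literal quote characters around the problem fields ('",PART2"'), which suggests the file was exported with quoted fields; lstrip(", ") on its own would not remove those quotes. One possible approach, sketched under that assumption, is to let the standard csv module parse the tab-delimited file, since csv.reader drops the surrounding quotes. The function name add_asset_numbers, the output file name Assets_out.txt, and the sample data written to a temporary directory are all made up for illustration.

```python
import csv
import os
import tempfile


def add_asset_numbers(in_path, out_path):
    """Read a tab-delimited file, clean each field, and append an asset number."""
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.reader(src, delimiter="\t")  # removes surrounding quotes
        writer = csv.writer(dst, delimiter="\t", lineterminator="\n")

        header = next(reader)
        writer.writerow(header + ["Asset Number"])

        for row in reader:
            if not row:  # skip blank lines
                continue
            # Drop leading commas and spaces from every field.
            fields = [field.lstrip(", ") for field in row]
            writer.writerow(fields + ["-".join(fields[:2])])


# Demo on a hypothetical sample that mimics the quoted fields in the question.
sample = (
    "Part Number\tSerial Number\n"
    "PART1\tSERIAL1\n"
    '",PART2"\tSERIAL2\n'
    '", PART3"\tSERIAL3\n'
)

with tempfile.TemporaryDirectory() as tmp:
    in_path = os.path.join(tmp, "Assets.txt")
    out_path = os.path.join(tmp, "Assets_out.txt")
    with open(in_path, "w", newline="") as f:
        f.write(sample)
    add_asset_numbers(in_path, out_path)
    with open(out_path, newline="") as f:
        output_text = f.read()

print(output_text)
```

Writing through csv.writer rather than concatenating '\t' by hand also means a field that happens to contain a tab or a quote would be quoted correctly on output.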
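You can try the code below. It splits each line wherever two or more whitespace characters occur in a row and writes the result as fixed-width columns.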
input.txt
Part Number Serial Number
PART1 SERIAL1
,PART2 SERIAL2
, PART3 SERIAL3
split_text_add_combine.py
import re
def split_and_combine(in_path, out_path, new_column_name):
    format_string = "{0:20s}{1:20s}{2:20s}"
    new_lines = [] # To store new lines
    # Reading input file to process
    with open(in_path) as f:
        lines = f.readlines()
    for index, line in enumerate(lines):
        line = line.strip()
        arr = re.split(r"\s{2,}", line)
        if index == 0:
            # Important to split words in case if words have more than single space
            new_line = format_string.format(arr[0], arr[1], new_column_name) + '\n'
        else:
            # arr = line.split()
            comma_removed_string = (arr[0] + "-" + arr[1]).lstrip(",").lstrip()
            new_line = format_string.format(arr[0], arr[1], comma_removed_string) + '\n'
        new_lines.append(new_line)
    print(new_lines)
    # Writing new lines to: output.txt
    with open(out_path, "w") as f:
        f.writelines(new_lines)
if __name__ == "__main__":
    in_path = "input.txt"
    out_path = "output.txt"
    new_column_name = "Asset Number"
    split_and_combine(in_path, out_path, new_column_name)
output.txt
Part Number Serial Number Asset Number
PART1 SERIAL1 PART1-SERIAL1
,PART2 SERIAL2 PART2-SERIAL2
, PART3 SERIAL3 PART3-SERIAL3
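One caveat with this approach: re.split(r"\s{2,}", line) only splits where two or more whitespace characters occur in a row, so a line whose columns are separated by a single tab would not be split at all. A small check of that behaviour, using sample strings made up for illustration:

```python
import re

single_tab = "PART1\tSERIAL1"
padded = "PART1               SERIAL1"

# A lone tab is a single whitespace character, so no split happens.
tab_result = re.split(r"\s{2,}", single_tab)
# Runs of spaces (fixed-width padding) are split as intended.
padded_result = re.split(r"\s{2,}", padded)
# Splitting on the tab directly handles genuinely tab-delimited data.
explicit_result = single_tab.split("\t")

print(tab_result)
print(padded_result)
print(explicit_result)
```

If the input really is tab-delimited, splitting on '\t' is probably the safer choice.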