How to separate specific strings from a text and add them as column names?

Question

This is an example similar to the data I have, but with far fewer rows.

Suppose I have a txt file like this:

'''
Useless information 1
Useless information 2
Useless information 3
Measurement:
Len. (cm)   :length of the object
Hei. (cm)   :height of the object
Tp.         :type of the object
~A DATA
10  5   2
8   7   2
5   6   1
9   9   1
'''

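I want the values below "~A DATA" as a DataFrame. I have already managed to get the DataFrame without column names (although it is a bit messy, since my code has a few nonsense lines), as you can see: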

import pandas as pd

with open(r'C:\Users\Lucas\Desktop\...\text.txt') as file:
    for line in file:
        if line.startswith('~A'):
           measures = line.split()[len(line):]
           break

    df = pd.read_csv(file, names=measures, sep='~A', engine='python')

newdf = df[0].str.split(expand = True)

print(newdf)
    0  1  2
0  10  5  2
1   8  7  2
2   5  6  1
3   9  9  1

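Now I want to put 'Len', 'Hei' and 'Tp' from the text on the DataFrame as column names. Just these measurement codes (without the strings that follow them). How can I get a df like this?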

    Len  Hei  Tp
  0  10   5   2
  1   8   7   2
  2   5   6   1
  3   9   9   1

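One solution would be to take every line below the string 'Measurement' (or starting from the line 'Len...') up to the line above the string '~A' (or ending at the line 'Tp'), and then split each of those lines. But I don't know how to do that.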

Answer 1

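Solution 1: If you want to grab the column names from the text file itself, you need to know on which line the column-name information starts. Then read the file line by line and process only the lines that you know hold the column names.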

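To answer your specific question, suppose the variable line holds one of those strings, for example line = "Len. (cm)   :length of the object". You can do a regex-based split that splits on every character that is not a letter or a digit.

import re
line = "Len. (cm)   :length of the object"
splited_line = re.split(r"[^a-zA-Z0-9]", line)  # add to the character class any other characters you do not want to split on
print(splited_line)

This results in

['Len', '', '', 'cm', '', '', '', '', 'length', 'of', 'the', 'object']

The empty strings appear because consecutive separators (such as ". (" or the run of spaces) produce empty pieces between them. To get the column name, take the first element of the list, splited_line[0].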
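
If the measurement codes always sit at the very start of the line, as they do in the sample file, an alternative sketch is to match only the leading run of letters and digits instead of splitting the whole line. The helper name leading_code is hypothetical and used only for illustration.

```python
import re

def leading_code(line):
    """Return the leading run of letters/digits in a line, or None if there is none."""
    match = re.match(r"[A-Za-z0-9]+", line)
    return match.group(0) if match else None

# Header lines copied from the sample file in the question
for line in ["Len. (cm)   :length of the object",
             "Hei. (cm)   :height of the object",
             "Tp.         :type of the object"]:
    print(leading_code(line))
```

This avoids the empty strings that the split produces, at the cost of assuming the code is the first thing on the line.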

Solution 2: If you already know the column names, you can do this:

df.columns = ['Len','Hei','Tp']
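
As a small illustration, assuming a DataFrame shaped like the newdf from the question (the values below are copied from the sample file), assigning to df.columns replaces the default 0, 1, 2 labels. The list needs exactly one name per column; a list of the wrong length makes pandas raise a ValueError.

```python
import pandas as pd

# Same values as the sample data under "~A DATA"
df = pd.DataFrame([[10, 5, 2], [8, 7, 2], [5, 6, 1], [9, 9, 1]])

df.columns = ['Len', 'Hei', 'Tp']  # one name per column, in order
print(df)
```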

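Here is the complete solution you are looking for. It collects the column names from the file into column_names, which you can then assign with df.columns = column_names: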

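import re

column_names = []
flag = False  # True while we are between "Measurement:" and "~A"
with open('text.txt', encoding='utf-8') as f:
    for line in f:
        # Split on anything that is not a letter, a digit or '~'
        splited_line = re.split(r"[^a-zA-Z0-9~]", line)
        if splited_line[0] == "Measurement":
            flag = True   # column descriptions start on the next line
            continue
        elif splited_line[0] == "~A":
            flag = False  # data section reached, stop collecting names
        if flag:
            column_names.append(splited_line[0])

print(column_names)  # for the sample file: ['Len', 'Hei', 'Tp']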
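
To go from the column names to the finished DataFrame in one pass, one possible sketch is to keep reading the same file handle after the '~A' line and hand the remaining text to pandas. The function name read_measurements is hypothetical, and the code assumes the layout shown in the question: a 'Measurement:' line, one description line per column, then a '~A' line followed by whitespace-separated values. The sample text is fed through io.StringIO so that the example runs without a file on disk.

```python
import io
import re
import pandas as pd

# Sample content copied from the question
SAMPLE = """Useless information 1
Useless information 2
Useless information 3
Measurement:
Len. (cm)   :length of the object
Hei. (cm)   :height of the object
Tp.         :type of the object
~A DATA
10  5   2
8   7   2
5   6   1
9   9   1
"""

def read_measurements(handle):
    """Collect the names between 'Measurement:' and '~A', then parse the rows after '~A'."""
    column_names = []
    in_header = False
    for line in handle:
        first = re.split(r"[^a-zA-Z0-9~]", line)[0]
        if first == "Measurement":
            in_header = True
        elif first == "~A":
            break  # everything after this line is data
        elif in_header:
            column_names.append(first)
    # The handle is now positioned just after the '~A' line
    remaining = handle.read()
    return pd.read_csv(io.StringIO(remaining), sep=r"\s+", names=column_names)

df = read_measurements(io.StringIO(SAMPLE))
print(df)
```

With a real file, the same function could be called as read_measurements(open_file) inside a with open(...) block.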
