Creating a data frame with blank spaces

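I am having trouble creating a data frame from an ASCII file that contains blank fields.

The raw data is formatted as shown below.

Creating the CSV file with the separator \s+ works. However, I want the blank fields to become NaN. In my actual script, the blanks are simply ignored.

I have already tried replacing the blanks, but that did not work.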

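I need these NaN values because I want to merge every second row into the row above it. To do this, I split the data frame in two, rename the columns of the second data frame, and then merge the two frames together. For this to work, both data frames must have the same layout.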
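The pairing step described above can be sketched on a toy frame. This is only an illustration under assumptions: the column names `a` and `b`, the `_2` suffix, and the sample values are placeholders, and `pd.concat` is used here in place of a merge.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: even positions hold the main rows, odd positions
# hold the continuation rows that belong to the row above.
df = pd.DataFrame({
    "a": [10, np.nan, 11, np.nan],
    "b": [70.99, 22.92, 71.10, 47.75],
})

main = df.iloc[0::2].reset_index(drop=True)    # rows 0, 2, ...
extra = df.iloc[1::2].reset_index(drop=True)   # rows 1, 3, ...
extra = extra.add_suffix("_2")                 # avoid duplicate column names

# Side-by-side concatenation aligns the two halves on their reset index.
paired = pd.concat([main, extra], axis=1)
print(paired)
```

This only pairs rows correctly if blank fields were kept as NaN when the file was read, so that both halves have the same columns in the same positions.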

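The print lines are only for reference in my console and will be removed in the final version.

Raw data

I formatted the data as code to preserve the original layout.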

 10 N0496 Position         70.990      0.600     71.123      0.268      ***---
                142.10     22.920                22.936
 11 N0497 Position         71.100      0.600     71.421      0.650      |--->>
                142.11     47.750                47.802      0.050
 12 N0498 Position         40.820      0.600     40.827      0.151      **----
                142.12     41.410                41.335
101 N0501 Durchm.           2.000      0.500      2.004      0.004 --****-----
                 140.1                -0.090
102 N0502 Durchm.           2.000      0.500      2.000      0.000 --****-----
                 140.2                -0.090
103 N0503 Durchm.           2.000      0.500      1.930     -0.070 ******-----
                 140.3                -0.090
104 N0504 Durchm.           2.000      0.500      1.903     -0.097 <<---+-----
                 140.4                -0.090                -0.007

Code:

import os
import pandas as pd
import numpy as np

input = "C:\\Users\\user\\Desktop\\Messprotokolle\\" # Input files (backslashes must be escaped)
output = "C:\\Users\\user\\Desktop\\CSVFiles1\\" # Output files

# Select only .asc files
os.chdir(input)
asc_files = os.listdir('.')
for asc_file in (asc_files):
    if asc_file.endswith(".asc"): # Only for .asc
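            asc_df = pd.read_csv(asc_file, sep = r'\s+',
             names = ['measurement_point', 'specified_value2', 'measurement_value2', 'D', 'E', 'F', 'G', 'H'])
            # Has no effect: replace() returns a new frame that is discarded here,
            # and after splitting on \s+ no cell contains whitespace anyway
            asc_df.replace(r'\s+', np.nan, regex=True)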
            #print(asc_df)
            asc_df.to_csv(output + asc_file + '.csv')
#formatting_ASC
os.chdir(output)
csv_files = os.listdir('.')
for csv_file in (csv_files):
        if csv_file.endswith(".asc.csv"):
            df = pd.read_csv(csv_file)
            #print (df)
            #keep_col = ['measurement_point', 'specified_value2', 'measurement_value1', 'D', 'E', 'F', 'G']
            new_df = df # [keep_col]
            #print (new_df)
            # Removing rows whose first field is one of these unwanted strings
            unwanted = ['**Teil', '**', '**T', '**KS-Oben', '**KS-Unten', '**N',
                        '**ME1', '**ME2/3', '**ME5', '**ME8', '**Punkte',
                        '**Punkte unten (1,3,5,6,7,9,11,13,16,18,19.5,',
                        '**21,23,45,27,29,34,36,38,41,43,44,46,48)', 'XXXXX']
            new_df = new_df[~new_df['measurement_point'].isin(unwanted)]
            new_df = new_df.reset_index(drop=True)
            print(new_df)
            new_df.to_csv(output + csv_file)
            dict2 = {'measurement_point': 'index', 'specified_value2': 'programm_line', 'measurement_value2': 'type', 'D': 'specified_value1', 'E': 'tolerance_value1_upper', 'F': 'measurement_value1', 'G': 'deviation_value1'}
            df1 = new_df[new_df.index % 2 == 1] # Splitting the original frame into two: odd rows
            # rename() returns a new frame, which avoids modifying a slice in place
            df2 = new_df[new_df.index % 2 == 0].rename(columns=dict2) # Even rows
            print (df1)
            print (df2)
            right = df1.reset_index(drop=True)
            left = df2.reset_index(drop=True)
            #print(right)
            #print(left)
            merge_df = pd.merge(left, right, left_index=True, right_index=True) # Join row by row on the reset index
            merge_df.index = merge_df.index + 1
            #print (merge_df)
            keep_col1 = ['measurement_point', 'specified_value2', 'measurement_value2', 'type', 'specified_value1', 'tolerance_value1_upper', 'measurement_value1', 'deviation_value1',]
            final_df = merge_df[keep_col1]
            #final_df.to_csv(output + csv_file)

Output format

66,68,10,N0496,Position,70.990,0.600,71.123,0.268,***---
67,69,142.10,22.920,22.936,,,,,
68,70,11,N0497,Position,71.100,0.600,71.421,0.650,|--->>
69,71,142.11,47.750,47.802,0.050,,,,
70,72,12,N0498,Position,40.820,0.600,40.827,0.151,**----
71,73,142.12,41.410,41.335,,,,,
72,74,101,N0501,Durchm.,2.000,0.500,2.004,0.004,--****-----
73,75,140.1,-0.090,,,,,,
74,76,102,N0502,Durchm.,2.000,0.500,2.000,0.000,--****-----
75,77,140.2,-0.090,,,,,,
76,78,103,N0503,Durchm.,2.000,0.500,1.930,-0.070,******-----
77,79,140.3,-0.090,,,,,,
78,80,104,N0504,Durchm.,2.000,0.500,1.903,-0.097,<<---+-----
79,81,140.4,-0.090,-0.007,,,,,

Desired output format

66,68,10,N0496,Position,70.990,0.600,71.123,0.268,***---
67,69,,,142.10,22.920,,22.936,,
68,70,11,N0497,Position,71.100,0.600,71.421,0.650,|--->>
69,71,,,142.11,47.750,47.802,,0.050,,
70,72,12,N0498,Position,40.820,0.600,40.827,0.151,**----
71,73,,,142.12,41.410,,41.335,,
72,74,101,N0501,Durchm.,2.000,0.500,2.004,0.004,--****-----
73,75,,,140.1,-0.090,,,,
74,76,102,N0502,Durchm.,2.000,0.500,2.000,0.000,--****-----
75,77,,,140.2,-0.090,,,,
76,78,103,N0503,Durchm.,2.000,0.500,1.930,-0.070,******-----
77,79,,,140.3,-0.090,,,,
78,80,104,N0504,Durchm.,2.000,0.500,1.903,-0.097,<<---+-----
79,81,,,140.4,-0.090,-0.007,,,

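I know this is a very specific problem, but I cannot solve it on my own.

When I use a single space (' ') as the separator, I get the following output: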

40,,,8,N0481,Durchm.,,,,,,,,,,,3.75,,,,,,0.0,,,,,,3.6860000000000004,,,,,-0.064,-----***---,,,,,,,
41,,,,,,,,,,,,,,,,,,139.8,,,,,,,,,,,,,,,,-0.200,,,,,,,
42,,,9,N0482,Durchm.,,,,,,,,,,,3.75,,,,,,0.0,,,,,,3.668,,,,,-0.082,-----**----,,,,,,,
43,,,,,,,,,,,,,,,,,,139.9,,,,,,,,,,,,,,,,-0.200,,,,,,,
44,,10,N0483,Durchm.,,,,,,,,,,,3.75,,,,,,0.0,,,,,,3.6860000000000004,,,,,-0.064,-----***---,,,,,,,,
45,,,,,,,,,,,,,,,,,139.1,,,,,,,,,,,,,,,,-0.200,,,,,,,,
46,,11,N0484,Durchm.,,,,,,,,,,,3.75,,,,,,0.0,,,,,,3.66,,,,,-0.090,-----**----,,,,,,,,
47,,,,,,,,,,,,,,,,,139.11,,,,,,,,,,,,,,,,-0.200,,,,,,,,

Because of the leading number, I cannot name the columns.

Attempt to replace the blanks with an empty string (''):

asc_df.replace(r'\s+', '', regex=True)
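A small sketch of why this attempt appears to change nothing, assuming two shortened sample lines (the column names `a` to `e` are placeholders). It illustrates two points: `replace` returns a new frame instead of modifying the existing one, and once the file has been split on `\s+` the blank fields are already gone, so the values of a continuation line slide to the left.

```python
import io
import pandas as pd

# Two shortened sample lines; the second one starts with blank fields.
text = (
    " 10 N0496 Position   70.990   0.600\n"
    "                     142.10\n"
)

df = pd.read_csv(io.StringIO(text), sep=r"\s+", header=None,
                 names=["a", "b", "c", "d", "e"])

before = df.copy()
df.replace(r"\s+", "", regex=True)   # result is not assigned, df is untouched
print(df.equals(before))

# The value 142.10 ends up in the first column, not under the numeric fields.
print(df)
```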

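I solved the problem myself. I had wrongly assumed that the original file was separated by \s+. The file is actually a fixed-width file, and reading it with pd.read_fwf gave me the correct format.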
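A minimal sketch of that approach, using the first four lines of the raw data above. The column names are placeholders chosen for illustration, and the example relies on the default column inference of `pd.read_fwf`; for a real file, passing explicit `colspecs` or `widths` may be more robust.

```python
import io
import pandas as pd

# First four lines of the raw data, kept exactly as fixed-width text.
raw = (
    " 10 N0496 Position         70.990      0.600     71.123      0.268      ***---\n"
    "                142.10     22.920                22.936\n"
    " 11 N0497 Position         71.100      0.600     71.421      0.650      |--->>\n"
    "                142.11     47.750                47.802      0.050\n"
)

# Placeholder column names (not taken from the original script).
names = ["no", "program_line", "type", "specified",
         "tolerance", "measured", "deviation", "bar"]

# read_fwf infers the column boundaries from the text and turns
# blank fields into NaN instead of shifting values to the left.
df = pd.read_fwf(io.StringIO(raw), header=None, names=names)
print(df)
```

With the blanks preserved as NaN, every second row keeps its values in the right columns, so the rows can then be split and paired as described in the question.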