drop_duplicates in pandas for a large data set

Question

I'm new to pandas, so apologies if this is naive.

I have two data frames. One is out.hdf:

999999  2014    1   2   15  19  45.19   14.095  -91.528 69.7    4.5 0.0 0.0 0.0 603879074
999999  2014    1   2   23  53  57.58   16.128  -97.815 23.2    4.8 0.0 0.0 0.0 603879292
999999  2014    1   9   12  27  10.98   13.265  -89.835 55.0    4.5 0.0 0.0 0.0 603947030
999999  2014    1   9   20  57  44.88   23.273  -80.778 15.0    5.1 0.0 0.0 0.0 603947340

The other is out.res (the first column is the station name):

061Z    56.72   0.0 P   603879074
061Z    29.92   0.0 P   603879074
0614    46.24   0.0 P   603879292
109C    87.51   0.0 P   603947030
113A    66.93   0.0 P   603947030
113A    26.93   0.0 P   603947030
121A    31.49   0.0 P   603947340

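The last column in both data frames is the ID. I want to create a new data frame that puts rows with the same ID from both data frames together, like this: first read a line from hdf, then place the rows from res that have the same ID below it, without keeping the ID from res.

New data frame: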

"999999 2014    1   2   15  19  45.19   14.095  -91.528 69.7    4.5 0.0 0.0 0.0 603879074"
061Z    56.72   0.0 P
061Z    29.92   0.0 P
"999999 2014    1   2   23  53  57.58   16.128  -97.815 23.2    4.8 0.0 0.0 0.0 603879292"
0614    46.24   0.0 P
"999999 2014    1   9   12  27  10.98   13.265  -89.835 55.0    4.5 0.0 0.0 0.0 603947030"
109C    87.51   0.0 P
113A    66.93   0.0 P
113A    26.93   0.0 P
"999999 2014    1   9   20  57  44.88   23.273  -80.778 15.0    5.1 0.0 0.0 0.0 603947340"
121A    31.49   0.0 P

My code is:

import csv
import pandas as pd
import numpy as np

path= './'
hdf = pd.read_csv(path + 'out.hdf', delimiter = '\t', header = None)
res = pd.read_csv(path + 'out.res', delimiter = '\t', header = None)


# Create input in the format of ph2dt-jp/ph
with open('./new_df', 'w', encoding='UTF8') as f:
    writer = csv.writer(f, delimiter='\t')

    i = 0
    with open('./out.hdf', 'r') as hdf_file:
        for hdf_line in hdf_file:
            # Write the whole event line from out.hdf as a single field
            liney = hdf_line.strip()
            writer.writerow([liney])
            print(liney)
            j = 0
            # Re-read out.res only to step through its rows one by one
            with open('./out.res', 'r') as res_file:
                for _ in res_file:
                    # Keep the res rows whose ID matches this event, without the ID column
                    if res.iloc[j, 4] == hdf.iloc[i, 14]:
                        strng = res.iloc[j, [0, 1, 2, 3]]
                        print(strng)
                        writer.writerow(np.array(strng))
                    j += 1
            i += 1
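
The nested loops above rescan all of out.res for every line of out.hdf. One possible alternative, offered only as a sketch, is to group the res rows by ID once and look them up per event. The in-memory sample text, the shortened three-column hdf rows, and the names `res_by_id` and `lines` are assumptions made for illustration; they are not part of the original post.

```python
import io
import pandas as pd

# Small in-memory stand-ins for out.hdf and out.res (tab-separated, no header).
# The hdf rows are shortened to three columns for readability (assumption).
hdf_text = "999999\t2014\t603879074\n999999\t2014\t603879292\n"
res_text = (
    "061Z\t56.72\t0.0\tP\t603879074\n"
    "061Z\t29.92\t0.0\tP\t603879074\n"
    "0614\t46.24\t0.0\tP\t603879292\n"
)

# dtype=str keeps every field exactly as written in the file.
hdf = pd.read_csv(io.StringIO(hdf_text), sep="\t", header=None, dtype=str)
res = pd.read_csv(io.StringIO(res_text), sep="\t", header=None, dtype=str)

hdf_id = hdf.columns[-1]  # the last column holds the event ID
res_id = res.columns[-1]

# Group the res rows by ID once and drop the ID column from each group.
res_by_id = {
    key: grp.drop(columns=res_id)
    for key, grp in res.groupby(res_id, sort=False)
}

lines = []
for _, event in hdf.iterrows():
    lines.append("\t".join(event))  # event line, ID included
    matches = res_by_id.get(event[hdf_id])
    if matches is not None:
        for _, row in matches.iterrows():
            lines.append("\t".join(row))  # station lines, ID dropped

print("\n".join(lines))
```

Unlike the `csv.writer` version, this sketch does not put quotes around the event lines. In the original code the quotes appear because the whole line is written as one field that contains the tab delimiter.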

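The goal is to keep only unique stations in the third data frame. Before creating the third data frame, I used these commands to keep the unique stations: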

res.drop_duplicates([0], keep = 'last', inplace = True)

and

res.groupby([0], as_index = False).last()

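Both worked well. The problem is that for large data sets with thousands of rows, using these commands causes some rows of the res file to be omitted from the third data frame. Could you tell me what I should do to get the same result for a large data set? I'm going crazy; thank you in advance for your time and help.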

Answer 1

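I found the problem, and I hope this helps others in the future. In the large data set, the duplicate stations are repeated many times, but not always consecutively. drop_duplicates() keeps only one row per station across the whole data frame. However, I only wanted to remove consecutive duplicate stations, not all of them. I did this using shift: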

unique_stations = res.loc[res[0].shift() != res[0]]
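
To make the difference concrete, here is a minimal sketch on made-up data, in which station 061Z appears in two separate runs. The toy values and the names `global_unique`, `first_of_run` and `last_of_run` are illustrative assumptions. Note that comparing against `shift()` keeps the first row of each run, whereas the commands in the question used the last one; comparing against `shift(-1)` keeps the last row instead.

```python
import pandas as pd

# Toy stand-in for out.res: station 061Z shows up in two separate runs.
res = pd.DataFrame({
    0: ["061Z", "061Z", "0614", "061Z", "113A", "113A"],
    1: [56.72, 29.92, 46.24, 87.51, 66.93, 26.93],
    4: [603879074, 603879074, 603879292, 603947030, 603947030, 603947030],
})

# drop_duplicates looks at the whole column: one row per station name,
# so the first run of 061Z disappears entirely.
global_unique = res.drop_duplicates(subset=[0], keep="last")

# Comparing each row with the previous one only collapses consecutive
# repeats and keeps the first row of each run.
first_of_run = res.loc[res[0].shift() != res[0]]

# Comparing each row with the next one keeps the last row of each run.
last_of_run = res.loc[res[0].shift(-1) != res[0]]

print(global_unique[0].tolist())  # ['0614', '061Z', '113A']
print(first_of_run[0].tolist())   # ['061Z', '0614', '061Z', '113A']
print(last_of_run[1].tolist())    # [29.92, 46.24, 87.51, 26.93]
```

One caveat with this approach: if the same station happens to be the last row of one event and the first row of the next, comparing station names alone would merge the two rows, so the ID column may need to be part of the comparison as well.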

Tags: python, dataframe, pandas, drop-duplicates