查找数据框中重复列之间的日期和价格差异。将信息附加到数据框中的行

Find date and price difference between duplicate columns in a data frame. Append information to rows in data frame

需要查找比 7 大得多的整个数据框的重复列(按名称)之间的日期差异价格差异行。 如果可以将重复项全部添加到 同一行 ,那将是更可取的,如下例所示。

示例代码:

import pandas as pd
data = {'CarMake':['Toyota', 'Ford', 'Nissan', 'Hyundai','Toyota', 'Ford', 'Nissan', 'Hyundai'],
        'DateSold':['1-2-18','1-2-18','1-3-18','1-3-18','1-2-20','1-2-20','1-3-20','1-3-20'],
        'Price':['20000','21000','22000','23000','15000','16000','17000','18000']}         
df = pd.DataFrame(data)

df['Price']=df['Price'].astype(str).astype(float)
df['DateSold']=pd.to_datetime(df['DateSold'])

查看数据类型:

df.dtypes

预期输出:

   CarMake   DateSold    Price
0   Toyota 2018-01-02  20000.0
1     Ford 2018-01-02  21000.0
2   Nissan 2018-01-03  22000.0
3  Hyundai 2018-01-03  23000.0
4   Toyota 2020-01-02  15000.0
5     Ford 2020-01-02  16000.0
6   Nissan 2020-01-03  17000.0
7  Hyundai 2020-01-03  18000.0

期望的输出: 日期最好以 [月] 为单位。

   CarMake DateSold DateSold2  Price Price2 PriceDifference DateDifference
0   Toyota   1-2-18    1-2-20  20000  15000           -5000             24
1     Ford   1-2-18    1-2-20  21000  16000           -5000             24
2   Nissan   1-3-18    1-3-20  22000  17000           -5000             24
3  Hyundai   1-3-18    1-3-20  23000  18000           -5000             24

您需要将每一行与具有相同 CarMake 值的每一行进行比较,然后将比较添加到新的 table。最有效的方法是首先按 CarMake 对列表进行排序,然后遍历所有行。一旦下一行中的 CarMake 字段与当前行中的 CarMake 字段不匹配,您就知道您已经找到了该品牌的所有汽车,并且可以更改要比较的行。这比比较所有行要快得多。

下面是一个示例,它执行此操作,并检查单例并将它们输入新的 table 中,没有比较数据。

# set up example: added a singleton and triple-duplicate example and varied dates more
import pandas as pd
data = {'CarMake':['Toyota', 'Ford', 'Nissan', 'Hyundai','Toyota', 'Ford', 'Nissan', 'Toyota'],
        'DateSold':['1-2-18','1-2-18','5-3-17','1-3-18','1-2-20','1-2-20','1-3-20','6-3-20'],
        'Price':['20000','21000','22000','23000','15000','16000','17000','18000']}         
df = pd.DataFrame(data)

df['Price']=df['Price'].astype(str).astype(float)
df['DateSold']=pd.to_datetime(df['DateSold'])

# sort our data
df.sort_values('CarMake', inplace=True)
# make list for building the new table
new_data = []

# loop through all rows once
for one in range(len(df)):
    one_row = df.iloc[one]
    one_make = one_row['CarMake']

    # check for singleton (non-duplicate row)
    # if the previous make or next make match, there is a duplicate,
    # otherwise it is a singleton
    prev_make = df.iloc[one - 1]['CarMake'] if one > 0 else None
    next_make = df.iloc[one + 1]['CarMake'] if one < len(df) - 1 else None
    if one_make != prev_make and one_make != next_make:
        # found a singleton
        new_data.append({
            'CarMake': one_make,
            'DateSold': one_row['DateSold'],
            'DateSold2': None,
            'Price': one_row['Price'],
            'Price2': None,
            'PriceDifference': None,
            'DateDifference': None,
        })
        continue
    
    # there is at least one duplicate, find them all
    for two in range(one + 1, len(df)):
        two_row = df.iloc[two]
        two_make = two_row['CarMake']

        if one_make == two_make:
            # found a duplicate
            new_data.append({
                'CarMake': one_make,
                'DateSold': one_row['DateSold'],
                'DateSold2': two_row['DateSold'],
                'Price': one_row['Price'],
                'Price2': two_row['Price'],
                'PriceDifference': abs(one_row['Price'] - two_row['Price']),
                'DateDifference': abs(
                    (one_row['DateSold'].year  - two_row['DateSold'].year) * 12 + 
                    (one_row['DateSold'].month - two_row['DateSold'].month)
                ),
            })
        else:
            break # no more matches, move `one` forward

new_df = pd.DataFrame(new_data)
print(new_df)

输出:

   CarMake   DateSold  DateSold2    Price   Price2  PriceDifference  DateDifference
0     Ford 2018-01-02 2020-01-02  21000.0  16000.0           5000.0            24.0
1  Hyundai 2018-01-03        NaT  23000.0      NaN              NaN             NaN
2   Nissan 2017-05-03 2020-01-03  22000.0  17000.0           5000.0            32.0
3   Toyota 2018-01-02 2020-01-02  20000.0  15000.0           5000.0            24.0
4   Toyota 2018-01-02 2020-06-03  20000.0  18000.0           2000.0            29.0
5   Toyota 2020-01-02 2020-06-03  15000.0  18000.0           3000.0             5.0