Find date and price differences between duplicate entries in a data frame and append them to the rows
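I need to find the date difference and the price difference between duplicate entries (matched by name) across an entire data frame that is much larger than the small example below.
It would be preferable if the duplicates could all be added to the same row, as in the desired output further down.
Sample code: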
import pandas as pd
data = {'CarMake':['Toyota', 'Ford', 'Nissan', 'Hyundai','Toyota', 'Ford', 'Nissan', 'Hyundai'],
'DateSold':['1-2-18','1-2-18','1-3-18','1-3-18','1-2-20','1-2-20','1-3-20','1-3-20'],
'Price':['20000','21000','22000','23000','15000','16000','17000','18000']}
df = pd.DataFrame(data)
df['Price']=df['Price'].astype(str).astype(float)
df['DateSold']=pd.to_datetime(df['DateSold'])
Check the data types:
df.dtypes
The data frame itself looks like this:
   CarMake   DateSold    Price
0   Toyota 2018-01-02  20000.0
1     Ford 2018-01-02  21000.0
2   Nissan 2018-01-03  22000.0
3  Hyundai 2018-01-03  23000.0
4   Toyota 2020-01-02  15000.0
5     Ford 2020-01-02  16000.0
6   Nissan 2020-01-03  17000.0
7  Hyundai 2020-01-03  18000.0
Desired output:
The date difference should preferably be in months.
   CarMake DateSold DateSold2  Price  Price2  PriceDifference  DateDifference
0   Toyota   1-2-18    1-2-20  20000   15000            -5000              24
1     Ford   1-2-18    1-2-20  21000   16000            -5000              24
2   Nissan   1-3-18    1-3-20  22000   17000            -5000              24
3  Hyundai   1-3-18    1-3-20  23000   18000            -5000              24
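You need to compare each row with every other row that has the same CarMake value, then add each comparison to a new table. A simple way to avoid comparing every pair of rows is to sort the data by CarMake first and then iterate over the rows. As soon as the CarMake field in the next row no longer matches the CarMake field in the current row, you know you have found every car of that make and can move on to the next row to compare. This is much faster than comparing every row against every other row.
Below is an example that does this. It also checks for singletons and enters them into the new table with no comparison data.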
# set up example: added a singleton and triple-duplicate example and varied dates more
import pandas as pd
data = {'CarMake':['Toyota', 'Ford', 'Nissan', 'Hyundai','Toyota', 'Ford', 'Nissan', 'Toyota'],
        'DateSold':['1-2-18','1-2-18','5-3-17','1-3-18','1-2-20','1-2-20','1-3-20','6-3-20'],
        'Price':['20000','21000','22000','23000','15000','16000','17000','18000']}
df = pd.DataFrame(data)
df['Price']=df['Price'].astype(str).astype(float)
df['DateSold']=pd.to_datetime(df['DateSold'])
# sort our data by make, then by date, so rows of one make are adjacent
# and stay in chronological order regardless of the sort algorithm
df.sort_values(['CarMake', 'DateSold'], inplace=True)
# make list for building the new table
new_data = []
# loop through all rows once
for one in range(len(df)):
    one_row = df.iloc[one]
    one_make = one_row['CarMake']
    # check for singleton (non-duplicate row)
    # if the previous make or next make match, there is a duplicate,
    # otherwise it is a singleton
    prev_make = df.iloc[one - 1]['CarMake'] if one > 0 else None
    next_make = df.iloc[one + 1]['CarMake'] if one < len(df) - 1 else None
    if one_make != prev_make and one_make != next_make:
        # found a singleton
        new_data.append({
            'CarMake': one_make,
            'DateSold': one_row['DateSold'],
            'DateSold2': None,
            'Price': one_row['Price'],
            'Price2': None,
            'PriceDifference': None,
            'DateDifference': None,
        })
        continue
    # there is at least one duplicate, find them all
    for two in range(one + 1, len(df)):
        two_row = df.iloc[two]
        two_make = two_row['CarMake']
        if one_make == two_make:
            # found a duplicate
            new_data.append({
                'CarMake': one_make,
                'DateSold': one_row['DateSold'],
                'DateSold2': two_row['DateSold'],
                'Price': one_row['Price'],
                'Price2': two_row['Price'],
                'PriceDifference': abs(one_row['Price'] - two_row['Price']),
                # difference in whole calendar months
                'DateDifference': abs(
                    (one_row['DateSold'].year - two_row['DateSold'].year) * 12 +
                    (one_row['DateSold'].month - two_row['DateSold'].month)
                ),
            })
        else:
            break  # no more matches, move `one` forward
new_df = pd.DataFrame(new_data)
print(new_df)
Output:
   CarMake   DateSold  DateSold2    Price   Price2  PriceDifference  DateDifference
0     Ford 2018-01-02 2020-01-02  21000.0  16000.0           5000.0            24.0
1  Hyundai 2018-01-03        NaT  23000.0      NaN              NaN             NaN
2   Nissan 2017-05-03 2020-01-03  22000.0  17000.0           5000.0            32.0
3   Toyota 2018-01-02 2020-01-02  20000.0  15000.0           5000.0            24.0
4   Toyota 2018-01-02 2020-06-03  20000.0  18000.0           2000.0            29.0
5   Toyota 2020-01-02 2020-06-03  15000.0  18000.0           3000.0             5.0
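For a much larger data frame, the same pairing can probably be done without an explicit Python loop. What follows is only a sketch of one alternative, not the method used above: it joins the frame to itself on CarMake with `DataFrame.merge` and keeps each pair once. The helper column `seq` and the variable names `pairs`, `singles` and `result` are made up for illustration, the date format `%m-%d-%y` is an assumption about the sample strings, and the price difference is left signed (later price minus earlier price) to match the -5000 in the question rather than the absolute value used above.

```python
import pandas as pd

data = {
    'CarMake': ['Toyota', 'Ford', 'Nissan', 'Hyundai', 'Toyota', 'Ford', 'Nissan', 'Toyota'],
    'DateSold': ['1-2-18', '1-2-18', '5-3-17', '1-3-18', '1-2-20', '1-2-20', '1-3-20', '6-3-20'],
    'Price': [20000, 21000, 22000, 23000, 15000, 16000, 17000, 18000],
}
df = pd.DataFrame(data)
df['Price'] = df['Price'].astype(float)
# Assumption: the sample dates are month-day-year with a two-digit year.
df['DateSold'] = pd.to_datetime(df['DateSold'], format='%m-%d-%y')

# Order sales chronologically within each make and number them 0, 1, 2, ...
df = df.sort_values(['CarMake', 'DateSold']).reset_index(drop=True)
df['seq'] = df.groupby('CarMake').cumcount()  # hypothetical helper column

# Self-join on CarMake: every sale is paired with every sale of the same make.
pairs = df.merge(df, on='CarMake', suffixes=('', '2'))
# Keep each pair once, with the earlier sale on the left.
pairs = pairs[pairs['seq'] < pairs['seq2']].copy()

# Signed price change and difference in whole calendar months.
pairs['PriceDifference'] = pairs['Price2'] - pairs['Price']
pairs['DateDifference'] = (
    (pairs['DateSold2'].dt.year - pairs['DateSold'].dt.year) * 12
    + (pairs['DateSold2'].dt.month - pairs['DateSold'].dt.month)
)

# Makes that occur only once have no pair; keep them with empty comparison columns.
singles = df[~df['CarMake'].duplicated(keep=False)]

columns = ['CarMake', 'DateSold', 'DateSold2', 'Price', 'Price2',
           'PriceDifference', 'DateDifference']
result = (
    pd.concat([pairs, singles], ignore_index=True)[columns]
    .sort_values(['CarMake', 'DateSold', 'DateSold2'])
    .reset_index(drop=True)
)
print(result)
```

On the sample data this should give the same six rows as the loop, except that PriceDifference keeps its sign. One caveat: the self-join creates a row for every combination of sales within a make before filtering, so a make with thousands of sales could use a lot of memory, and the sorted loop may be the safer choice there.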