Pandas cumulative count by checking whether two lists share a common value
Suppose I have a table like this:
|---------------------|------------------|
| time | list of string |
|---------------------|------------------|
| 2019-06-18 09:05:00 | ['A', 'B', 'C']|
|---------------------|------------------|
| 2019-06-19 09:05:00 | ['A', 'C'] |
|---------------------|------------------|
| 2019-06-19 09:05:00 | ['D'] |
|---------------------|------------------|
| 2019-06-20 09:05:00 | ['C'] |
|---------------------|------------------|
| 2019-06-20 09:05:00 | ['A', 'B', 'C']|
|---------------------|------------------|
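For each row, I want to know how many rows with an earlier timestamp share at least one value with the current row's list of strings.
The slow version of the code would look like this:
    results = []
    for i in range(len(df)):
        current_t = df['time'].iloc[i]
        current_string = df['list of string'].iloc[i]
        # keep only the rows strictly earlier than the current timestamp
        df_before_t = df[df['time'] < current_t]
        cumm_count = 0
        for row in df_before_t['list of string']:
            # a non-empty intersection means at least one shared value
            if set(current_string) & set(row):
                cumm_count += 1
        results.append(cumm_count)
So the resulting table would be: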
|---------------------|------------------|---------------------|
| time | list of string | result |
|---------------------|------------------|---------------------|
| 2019-06-18 09:05:00 | ['A', 'B', 'C']| 0 |
|---------------------|------------------|---------------------|
| 2019-06-19 09:05:00 | ['A', 'C'] | 1 |
|---------------------|------------------|---------------------|
| 2019-06-19 09:05:00 | ['D'] | 0 |
|---------------------|------------------|---------------------|
| 2019-06-20 09:05:00 | ['C'] | 2 |
|---------------------|------------------|---------------------|
| 2019-06-20 09:05:00 | ['A', 'B', 'C']| 2 |
|---------------------|------------------|---------------------|
My actual dataset is fairly large, so I would appreciate help processing it quickly. Thanks a lot!
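For anyone who wants to reproduce this, here is a minimal sketch that builds the sample data and wraps the slow loop in a function. The function name count_prior_overlaps is illustrative only, and the third row is assumed to be ['D'], as in the expected result table.

```python
import pandas as pd

# Sample data; the third row is ['D'], matching the expected result table.
df = pd.DataFrame({
    "time": pd.to_datetime([
        "2019-06-18 09:05:00",
        "2019-06-19 09:05:00",
        "2019-06-19 09:05:00",
        "2019-06-20 09:05:00",
        "2019-06-20 09:05:00",
    ]),
    "list of string": [["A", "B", "C"], ["A", "C"], ["D"], ["C"], ["A", "B", "C"]],
})


def count_prior_overlaps(frame):
    """Baseline pairwise count; the name is hypothetical, not a pandas API."""
    results = []
    for i in range(len(frame)):
        current_t = frame["time"].iloc[i]
        current = set(frame["list of string"].iloc[i])
        # Rows with a strictly earlier timestamp than the current row.
        earlier = frame.loc[frame["time"] < current_t, "list of string"]
        # Count the earlier rows whose values intersect the current row's.
        results.append(sum(1 for row in earlier if current & set(row)))
    return results


baseline = count_prior_overlaps(df)
print(baseline)
```

Rows that share the same timestamp do not count each other, because the comparison is strict.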
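One approach is to convert the lists to sets, then use a list comprehension over the 'list of string' column, comparing each row only against the rows whose time is earlier than the current row's time:
    s = df['list of string'].map(set)   # one set per row
    t = pd.to_datetime(df.time)
    # for row i, count the earlier rows (t < t[i]) whose set intersects x
    df['result'] = [sum(len(x & y) != 0 for y in s[t.iloc[i] > t])
                    for i, x in enumerate(s)]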
Out[283]:
time list of string result
0 2019-06-18 09:05:00 [A, B, C] 0
1 2019-06-19 09:05:00 [A, C] 1
2 2019-06-19 09:05:00 [D] 0
3 2019-06-20 09:05:00 [C] 2
4 2019-06-20 09:05:00 [A, B, C] 2
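The list comprehension above still compares every row with every earlier row. If that is too slow, one possible alternative (not part of the answer above, just a sketch) is to keep an inverted index from each value to the rows already seen, and take the union of those row sets for the current row. The function name count_prior_overlaps_indexed and its arguments are hypothetical, and the sample data again assumes the third row is ['D'].

```python
from collections import defaultdict
from datetime import datetime
from itertools import groupby


def count_prior_overlaps_indexed(times, lists):
    """For each row, count rows with a strictly earlier time sharing a value."""
    order = sorted(range(len(times)), key=lambda i: times[i])
    index = defaultdict(set)  # value -> positions of rows already processed
    result = [0] * len(times)
    # Handle rows in batches of equal timestamp so ties never count each other.
    for _, batch in groupby(order, key=lambda i: times[i]):
        batch = list(batch)
        for i in batch:
            matches = set()
            for value in set(lists[i]):
                matches |= index.get(value, set())
            result[i] = len(matches)  # distinct earlier rows with any overlap
        # Only now make this batch visible to rows with later timestamps.
        for i in batch:
            for value in lists[i]:
                index[value].add(i)
    return result


times = [datetime.fromisoformat(ts) for ts in [
    "2019-06-18 09:05:00",
    "2019-06-19 09:05:00",
    "2019-06-19 09:05:00",
    "2019-06-20 09:05:00",
    "2019-06-20 09:05:00",
]]
lists = [["A", "B", "C"], ["A", "C"], ["D"], ["C"], ["A", "B", "C"]]
indexed = count_prior_overlaps_indexed(times, lists)
print(indexed)
```

With a DataFrame, the inputs could be passed as pd.to_datetime(df['time']).tolist() and df['list of string'].tolist(). This only helps when each value appears in a small share of the rows; if one value occurs in nearly every row, the set unions grow with the number of rows and the cost is again roughly quadratic.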