Sorting rows by the number of list elements the row contains
Take the following table as an example:

index | column_1 | column_2 |
---|---|---|
0 | bli bli | d e |
1 | bla bla | a b c d e |
2 | ble ble | a b c |
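Given token_list = ['c', 'e'], I want to sort the table by the number of tokens from token_list that each row contains in column_2.

After sorting the table, I should get the following:

index | column_1 | column_2 | score_tmp |
---|---|---|---|
1 | bla bla | a b c d e | 2 |
0 | bli bli | d e | 1 |
2 | ble ble | a b c | 1 |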
So far I have come up with the approach below, but it takes a long time. How can I make it faster? Thanks in advance.

    df['score_tmp'] = df[['column_2']].apply(
        lambda x: len([True for token in token_list if
                       token in str(x['column_2'])]), axis=1)
    results = df.sort_values('score_tmp', ascending=False).loc[df['score_tmp'] == len(token_list)].reset_index(inplace=False).to_dict('records')

You can split column_2 on whitespace, convert each row into a set, and then use apply with set intersection, followed by sort_values:

    In [200]: df['matches'] = df.column_2.str.split().apply(lambda x: set(x) & set(token_list)).str.len()

    In [204]: df.sort_values('matches', ascending=False).drop(columns='matches')
    Out[204]:
       index column_1   column_2
    1      1  bla bla  a b c d e
    0      0  bli bli        d e
    2      2  ble ble      a b c

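
For readers who want to run the set-intersection idea end to end, here is a minimal self-contained sketch. It rebuilds the sample table from the question, builds the token set once instead of once per row, and keeps the score as score_tmp so the result matches the expected table. The name token_set and the use of kind="stable" are illustrative choices, not part of the original answer, and the sketch assumes column_2 has no missing values.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "column_1": ["bli bli", "bla bla", "ble ble"],
    "column_2": ["d e", "a b c d e", "a b c"],
})
token_list = ["c", "e"]
token_set = set(token_list)  # built once, not once per row

# Number of distinct tokens from token_list found in each row of column_2.
matches = df["column_2"].str.split().apply(
    lambda words: len(token_set.intersection(words))
)

# kind="stable" keeps the original row order among rows with equal scores.
ranked = df.assign(score_tmp=matches).sort_values(
    "score_tmp", ascending=False, kind="stable"
)
print(ranked)
```

Unlike the substring test in the question, this compares whole whitespace-separated tokens, so a token such as c would not match inside a longer word.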
Timings:

    In [208]: def f1():
         ...:     df['score_tmp'] = df[['column_2']].apply(lambda x: len([True for token in token_list if token in str(x['column_2'])]), axis=1)
         ...:     results = df.sort_values('score_tmp', ascending=False).loc[df['score_tmp'] == len(token_list)].reset_index(inplace=False).to_dict('records')
         ...:

    In [209]: def f2():
         ...:     df['matches'] = df.column_2.str.split().apply(lambda x: set(x) & set(token_list)).str.len()
         ...:     df.sort_values('matches', ascending=False).drop(columns='matches')
         ...:

    In [210]: %timeit f1() # solution provided in question
    2.36 ms ± 55.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    In [211]: %timeit f2() # my solution
    1.22 ms ± 14.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

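
The timings above come from the three-row sample, where fixed overhead can dominate. A rough way to compare the two scoring steps on something larger is sketched below. The synthetic data, the 10,000-row size, and the function names score_substring and score_set are all assumptions made for illustration, and no particular result is implied.

```python
import random
import timeit

import pandas as pd

# Hypothetical synthetic data: each row holds 1 to 5 distinct single-letter tokens.
random.seed(0)
letters = list("abcdefgh")
rows = [" ".join(random.sample(letters, random.randint(1, 5))) for _ in range(10_000)]
big = pd.DataFrame({"column_2": rows})

token_list = ["c", "e"]
token_set = set(token_list)


def score_substring(frame):
    # Scoring step from the question: substring test per token, row by row.
    return frame[["column_2"]].apply(
        lambda x: len([True for token in token_list if token in str(x["column_2"])]),
        axis=1,
    )


def score_set(frame):
    # Scoring step from the answer: whole-token set intersection.
    return frame["column_2"].str.split().apply(
        lambda words: len(token_set.intersection(words))
    )


for fn in (score_substring, score_set):
    seconds = timeit.timeit(lambda: fn(big), number=3)
    print(f"{fn.__name__}: {seconds / 3:.4f} s per call")
```

The two functions agree on this data only because every token is a single letter; with longer words, the substring test and the whole-token test can give different scores.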
Here is another approach, using str.count():

    df.sort_values('column_2',
                   key=lambda x: x.str.count('|'.join(token_list)),
                   ascending=False)

By using the key parameter of sort_values() (available since pandas 1.1.0), we do not need to create a temporary column for sorting.
Output:

       index column_1   column_2
    1      1  bla bla  a b c d e
    0      0  bli bli        d e
    2      2  ble ble      a b c

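
One caveat with the str.count() approach: the joined string is treated as a regular expression and matches substrings, so tokens containing regex metacharacters, or tokens that occur inside longer words, can change the counts. The sketch below is one possible hardening, not part of the original answer. It escapes each token with re.escape and uses lookarounds so that only whole whitespace-delimited tokens match; the pattern variable and the kind="stable" argument are illustrative additions.

```python
import re

import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "column_1": ["bli bli", "bla bla", "ble ble"],
    "column_2": ["d e", "a b c d e", "a b c"],
})
token_list = ["c", "e"]

# Escape each token, then require whitespace (or string start/end) on both sides.
alternatives = "|".join(re.escape(token) for token in token_list)
pattern = rf"(?<!\S)(?:{alternatives})(?!\S)"

# The key function receives the whole column and returns the values to sort by.
ranked = df.sort_values(
    "column_2",
    key=lambda col: col.str.count(pattern),
    ascending=False,
    kind="stable",  # keep the original order among ties
)
print(ranked)
```

Note that str.count() counts occurrences, so a token repeated within a row is counted each time, whereas the set-intersection approach counts each distinct token once.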