Python Pandas aggregating whitespace-separated values in text field
I have data like this:
0 A\nA\nA
1 na\nB|D|E|F|G|H\nB|D|E|F|G|H
2 B\nB|C\nB
3 na\nna\nna
I want to aggregate the values in each row, keeping the one with the highest count:
0 A
1 B|D|E|F|G|H
2 B
3 na
I think I should first split the column on '\n', so I am using
df = pd.DataFrame([ x.split('\n') for x in df.tolist()])
which gives me:
    0            1            2
0   A            A            A
1  na  B|D|E|F|G|H  B|D|E|F|G|H
2   B          B|C            B
3  na           na           na
How can I combine the adjacent columns to get the desired output?
Thanks.
pd.DataFrame.mode, applied with axis=1, gives the expected output:
import pandas as pd
df = pd.read_clipboard()  # reads the split table copied from the question
df.mode(axis=1)
Returns:
0
0 A
1 B|D|E|F|G|H
2 B
3 na
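For readers without the table on their clipboard, here is a minimal self-contained sketch of the same idea. It assumes the original data is a Series of newline-joined strings, as in the question; the names `s`, `wide`, `modes` and `result` are illustrative only.

```python
import pandas as pd

# Sample data from the question, rebuilt as a Series of newline-joined strings.
s = pd.Series(['A\nA\nA',
               'na\nB|D|E|F|G|H\nB|D|E|F|G|H',
               'B\nB|C\nB',
               'na\nna\nna'])

# Split each string into its own columns, then take the row-wise mode.
wide = s.str.split('\n', expand=True)
modes = wide.mode(axis=1)

# mode() can return more than one column when a row has tied values;
# column 0 holds the first of them.
result = modes[0]
print(result)
```

Selecting column 0 is a design choice for this sketch: if ties matter in your data, inspect the extra columns of `modes` instead of discarding them.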
You can use Counter with most_common:
from collections import Counter
df = pd.DataFrame([Counter(x.split('\n')).most_common(1)[0][0] for x in df.tolist()])
print (df)
0
0 A
1 B|D|E|F|G|H
2 B
3 na
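The one-liner above can also be written as a small helper that keeps the original index and returns a Series rather than a one-column DataFrame. This is only a sketch: `most_frequent` is a hypothetical helper name, not something from pandas or the answer.

```python
from collections import Counter

import pandas as pd


def most_frequent(text, sep='\n'):
    """Return the most frequent token in `text` after splitting on `sep`."""
    # most_common(1) returns a list with one (token, count) pair.
    return Counter(text.split(sep)).most_common(1)[0][0]


s = pd.Series(['A\nA\nA',
               'na\nB|D|E|F|G|H\nB|D|E|F|G|H',
               'B\nB|C\nB',
               'na\nna\nna'])

# Apply the helper to every element; the index of `s` is preserved.
result = s.map(most_frequent)
print(result)
```

When several tokens share the highest count, Counter orders them by first occurrence (Python 3.7+), so the helper returns whichever tied token appears first in the string.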
Another solution, with str.split and apply value_counts:
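# x.value_counts() is used here because the top-level pd.value_counts is deprecated in recent pandas
df = df.str.split('\n', expand=True).apply(lambda x: x.value_counts().index[0], axis=1)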
print (df)
0 A
1 B|D|E|F|G|H
2 B
3 na
dtype: object
Timings:
In [238]: %timeit (pd.DataFrame([Counter(x.split('\n')).most_common(1)[0][0] for x in df.tolist()]))
1000 loops, best of 3: 197 µs per loop
In [239]: %timeit (df.str.split('\n', expand=True).apply(lambda x: pd.value_counts(x).index[0],axis=1))
100 loops, best of 3: 2.33 ms per loop
In [241]: %timeit (pd.DataFrame([ x.split('\n') for x in df.tolist()]).mode(1))
100 loops, best of 3: 2.32 ms per loop
Larger DataFrame:
#len (df) = 40k
from collections import Counter
df = pd.Series(['A\nA\nA','na\nB|D|E|F|G|H\nB|D|E|F|G|H','B\nB|c\nB','na\nna\nna'])
#print (df)
df = pd.concat([df]*10000).reset_index(drop=True)
In [331]: %timeit (pd.DataFrame([Counter(x.split('\n')).most_common(1)[0][0] for x in df.tolist()]))
1 loop, best of 3: 257 ms per loop
In [332]: %timeit (df.apply(lambda x: Counter(x.split('\n')).most_common()[0][:][0]))
1 loop, best of 3: 282 ms per loop
In [333]: %timeit (pd.DataFrame([ x.split('\n') for x in df.tolist()]).mode(1))
1 loop, best of 3: 9.18 s per loop
In [334]: %timeit (df.str.split('\n', expand=True).apply(lambda x: pd.value_counts(x).index[0],axis=1))
1 loop, best of 3: 15.7 s per loop
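The timings above come from IPython's %timeit. To repeat the comparison in a plain script, something like the following sketch with the standard timeit module could be used. The function names and `n_copies` are assumptions made for illustration, and the data size is kept smaller than 40k rows so that the script finishes quickly; absolute numbers will differ by machine and pandas version.

```python
import timeit
from collections import Counter

import pandas as pd

base = pd.Series(['A\nA\nA',
                  'na\nB|D|E|F|G|H\nB|D|E|F|G|H',
                  'B\nB|c\nB',
                  'na\nna\nna'])

n_copies = 1000  # assumption: 4,000 rows; raise to 10000 to match the 40k test
big = pd.concat([base] * n_copies).reset_index(drop=True)


def with_counter(s):
    # Count tokens per string in pure Python and keep the most common one.
    return pd.DataFrame([Counter(x.split('\n')).most_common(1)[0][0]
                         for x in s.tolist()])


def with_mode(s):
    # Build a wide DataFrame first, then take the row-wise mode.
    return pd.DataFrame([x.split('\n') for x in s.tolist()]).mode(axis=1)


for fn in (with_counter, with_mode):
    seconds = timeit.timeit(lambda: fn(big), number=1)
    print(f"{fn.__name__}: {seconds:.3f} s")
```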