Python equivalent of bash sort lexicographical and numerical

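So I've been working on a Python script that combines some information into "bed" format. This means I'm working with features on a genome: my first column is the scaffold name (string), the second is the start position on that scaffold (integer), the third is the stop position (integer), and the other columns contain other information not relevant to my question. My problem is that my output is unsorted.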

Now I know I can sort my file using this bash command:

$ sort -k1,1 -k2,2n -k3,3n infile > outfile

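But in the interest of efficiency I'd like to know if there is a way to do this within Python. So far I've only seen list-based sorts that handle either lexicographical or numerical ordering, not a combination of the two. So, do you have any ideas?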

Snippet of my data (I want to sort by columns 1, 2 and 3, in that order):

Scf_3R  8599253 8621866 FBgn0000014 FBgn0191744 -0.097558026153
Scf_3R  8497493 8503049 FBgn0000015 FBgn0025043 0.437973284047
Scf_3L  16209309    16236428    FBgn0000017 FBgn0184183 -1.19105585707
Scf_2L  10630469    10632308    FBgn0000018 FBgn0193617 0.073153454539
Scf_3R  12087670    12124207    FBgn0000024 FBgn0022516 -0.023946795475
Scf_X   14395665    14422243    FBgn0000028 FBgn0187465 0.00300558969397
Scf_3R  25163062    25165316    FBgn0000032 FBgn0189058 0.530118698187
Scf_3R  19757441    19808894    FBgn0000036 FBgn0189822 -0.282508464261

Load the data, sort it with sorted, and write it to a new file.

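import csv

# File names taken from the sort command in the question
in_filename = 'infile'
out_filename = 'outfile'

# Load data: one list of whitespace-separated fields per line
lists = list()
with open(in_filename, 'r') as f:
    for line in f:
        lists.append(line.rstrip().split())

# Sort data: column 1 as text, columns 2 and 3 as integers
results = sorted(lists, key=lambda x: (x[0], int(x[1]), int(x[2])))

# Write to a new file, tab-separated; csv defaults to '\r\n' line endings
with open(out_filename, 'w') as f:
    writer = csv.writer(f, delimiter='\t', lineterminator='\n')
    writer.writerows(results)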
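
To see why the key has to mix text and integers, here is a small sketch on three rows from the question, held in memory (the rows list and the variable names are made up for illustration). Compared as strings, 12087670 sorts before 8599253; converting the positions with int gives the order that sort -k2,2n would produce.

```python
# Three rows from the question's sample, already split into fields.
rows = [
    ['Scf_3R', '8599253', '8621866'],
    ['Scf_3R', '12087670', '12124207'],
    ['Scf_2L', '10630469', '10632308'],
]

# Purely lexicographic: '12087670' < '8599253' because '1' < '8'.
as_text = sorted(rows)

# Mixed key: column 1 as text, columns 2 and 3 as integers.
mixed = sorted(rows, key=lambda x: (x[0], int(x[1]), int(x[2])))

print([r[1] for r in as_text])  # ['10630469', '12087670', '8599253']
print([r[1] for r in mixed])    # ['10630469', '8599253', '12087670']
```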

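For the specific sort pattern, @sparkandshine's solution seems short and to the point. Also, the one provided by @j-f-sebastian looks very good to me: concise, and with important hints/links on internationalization and out-of-memory sorting strategies.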

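Maybe the following, more explicit show case gives the OP, or people with similar tasks, some additional useful information to adapt to their needs. See the comments in the mostly pep8-conforming code: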

#! /usr/bin/env python
"""Provide a show case for hierarchical sort, that offers flexible
hierarchical lexical, numeric column sort mixes at runtime.
Hopefully this draft solution offers ideas for helping migrate
the sort shell level operation into a pythonic solution - YMMV."""
from __future__ import print_function

from functools import partial  # We use this to tailor the key function


def text_in_lines_gen(text_in_lines):
    """Mock generator simulating a line source for the data."""
    for line in text_in_lines.split('\n'):
        if line:
            yield line.split()


def sort_hier_gen(iterable_lines, hier_sort_spec):
    """Given iterator of text lines, sort all lines based on
    sort specification in hier_sort_spec.

    Every entry in hier_sort_spec is expected to be a pair with first value
    integer for index in columns of text blocks lines and second entry
    type of sorting in ('int', 'float') numeric or any other for text
    (lexical) ordering regime."""

    num_codes = ('int', 'float')
    converter_map = dict(zip(num_codes, (int, float)))

    # Extract facts from sort spec, prepare processing:
    key_ordered = tuple(k for k, _ in hier_sort_spec)

    # Prepare key function: Step 1 ...
    def _key_in(selected, r):
        """Inject the indexing into the key at sort time
        via partial application, as key function in sort
        has single argument only."""
        return tuple(r[k] for k in selected)

    _key = partial(_key_in, key_ordered)  # ... step 2

    convert_these_by = {}
    for k, t in hier_sort_spec:
        if t in num_codes:
            convert_these_by[k] = converter_map[t]

    if not convert_these_by:  # early out
        for row in sorted(iterable_lines, key=_key):
            yield row
    else:
        def flow_converter(row_iter, converter_map):
            """Row based converter - Don't block the flow ;-)."""
            for row in row_iter:
                for k, convert in converter_map.items():
                    row[k] = convert(row[k])
                yield row

        for row in sorted(flow_converter(iterable_lines,
                          convert_these_by), key=_key):
            yield row


def main():
    """Drive the hierarchical text-int-int sort."""

    data_1 = """Scf_3R  8599253 8621866 FBgn0000014 FBgn0191744 -0.097558026153
    Scf_3R  8497493 8503049 FBgn0000015 FBgn0025043 0.437973284047
    Scf_3L  16209309    16236428    FBgn0000017 FBgn0184183 -1.19105585707
    Scf_2L  10630469    10632308    FBgn0000018 FBgn0193617 0.073153454539
    Scf_3R  12087670    12124207    FBgn0000024 FBgn0022516 -0.023946795475
    Scf_X   14395665    14422243    FBgn0000028 FBgn0187465 0.00300558969397
    Scf_3R  25163062    25165316    FBgn0000032 FBgn0189058 0.530118698187
    Scf_3R  19757441    19808894    FBgn0000036 FBgn0189822 -0.282508464261"""

    bar = []
    x = 0
    for a in range(3, 0, -1):
        for b in range(3, 0, -1):
            for c in range(3, 0, -1):
                x += 1
                bar.append('a_%d %d %0.1f %d' % (a, b, c * 1.1, x))
    data_2 = '\n'.join(bar)

    hier_sort_spec = ((0, 't'), (1, 'int'), (2, 'int'))
    print("# Test data set 1 and sort spec={0}:".format(hier_sort_spec))
    for sorted_row in sort_hier_gen(text_in_lines_gen(data_1), hier_sort_spec):
        print(sorted_row)

    hier_sort_spec = ((0, 't'), (1, None), (2, False))
    print("# Test data set 1 and sort spec={0}:".format(hier_sort_spec))
    for sorted_row in sort_hier_gen(text_in_lines_gen(data_1), hier_sort_spec):
        print(sorted_row)

    hier_sort_spec = ((0, 't'), (2, 'float'), (1, 'int'))
    print("# Test data set 2 and sort spec={0}:".format(hier_sort_spec))
    for sorted_row in sort_hier_gen(text_in_lines_gen(data_2), hier_sort_spec):
        print(sorted_row)

if __name__ == '__main__':
    main()

On my machine the three test cases (including the question's sample data) yield:

First:

# Test data set 1 and sort spec=((0, 't'), (1, 'int'), (2, 'int')):
['Scf_2L', 10630469, 10632308, 'FBgn0000018', 'FBgn0193617', '0.073153454539']
['Scf_3L', 16209309, 16236428, 'FBgn0000017', 'FBgn0184183', '-1.19105585707']
['Scf_3R', 8497493, 8503049, 'FBgn0000015', 'FBgn0025043', '0.437973284047']
['Scf_3R', 8599253, 8621866, 'FBgn0000014', 'FBgn0191744', '-0.097558026153']
['Scf_3R', 12087670, 12124207, 'FBgn0000024', 'FBgn0022516', '-0.023946795475']
['Scf_3R', 19757441, 19808894, 'FBgn0000036', 'FBgn0189822', '-0.282508464261']
['Scf_3R', 25163062, 25165316, 'FBgn0000032', 'FBgn0189058', '0.530118698187']
['Scf_X', 14395665, 14422243, 'FBgn0000028', 'FBgn0187465', '0.00300558969397']

Second:

# Test data set 1 and sort spec=((0, 't'), (1, None), (2, False)):
['Scf_2L', '10630469', '10632308', 'FBgn0000018', 'FBgn0193617', '0.073153454539']
['Scf_3L', '16209309', '16236428', 'FBgn0000017', 'FBgn0184183', '-1.19105585707']
['Scf_3R', '12087670', '12124207', 'FBgn0000024', 'FBgn0022516', '-0.023946795475']
['Scf_3R', '19757441', '19808894', 'FBgn0000036', 'FBgn0189822', '-0.282508464261']
['Scf_3R', '25163062', '25165316', 'FBgn0000032', 'FBgn0189058', '0.530118698187']
['Scf_3R', '8497493', '8503049', 'FBgn0000015', 'FBgn0025043', '0.437973284047']
['Scf_3R', '8599253', '8621866', 'FBgn0000014', 'FBgn0191744', '-0.097558026153']
['Scf_X', '14395665', '14422243', 'FBgn0000028', 'FBgn0187465', '0.00300558969397']

Third:

# Test data set 2 and sort spec=((0, 't'), (2, 'float'), (1, 'int')):
['a_1', 1, 1.1, '27']
['a_1', 2, 1.1, '24']
['a_1', 3, 1.1, '21']
['a_1', 1, 2.2, '26']
['a_1', 2, 2.2, '23']
['a_1', 3, 2.2, '20']
['a_1', 1, 3.3, '25']
['a_1', 2, 3.3, '22']
['a_1', 3, 3.3, '19']
['a_2', 1, 1.1, '18']
['a_2', 2, 1.1, '15']
['a_2', 3, 1.1, '12']
['a_2', 1, 2.2, '17']
['a_2', 2, 2.2, '14']
['a_2', 3, 2.2, '11']
['a_2', 1, 3.3, '16']
['a_2', 2, 3.3, '13']
['a_2', 3, 3.3, '10']
['a_3', 1, 1.1, '9']
['a_3', 2, 1.1, '6']
['a_3', 3, 1.1, '3']
['a_3', 1, 2.2, '8']
['a_3', 2, 2.2, '5']
['a_3', 3, 2.2, '2']
['a_3', 1, 3.3, '7']
['a_3', 2, 3.3, '4']
['a_3', 3, 3.3, '1']

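Updated to mainly use generators, so that only one copy of the data is "around": the global sort needs that copy in memory anyway, but no further copies are required ;-)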

Also added functools.partial, as it is the quickest way that comes to my mind to adapt the key function to flexible sort orders.
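
As a rough illustration of that idea, separate from the script above: functools.partial fixes the column selection up front, so sorted still receives a key function that takes a single row. The names key_for_columns, rows, by_name_then_number and by_float are hypothetical.

```python
from functools import partial


def key_for_columns(columns, row):
    """Build the sort key for one row from the selected column indexes."""
    return tuple(row[c] for c in columns)


rows = [['b', 2, 1.5], ['a', 2, 0.5], ['a', 1, 9.0]]

# Fix the first argument; the resulting callable takes only the row.
by_name_then_number = partial(key_for_columns, (0, 1))
by_float = partial(key_for_columns, (2,))

print(sorted(rows, key=by_name_then_number))
print(sorted(rows, key=by_float))
```

Changing the sort hierarchy then only means passing a different tuple of column indexes to partial.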

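The last update removed the remaining non-generator copy in the case where conversions are performed, by defining a local generator function that does the row-based conversion. HTH.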

To sort by your own sort criteria, just pass the corresponding key function:

with open('infile', 'rb') as file:
    lines = file.readlines()

def sort_key(line):
    fields = line.split()
    try:
        return fields[0], int(fields[1]), int(fields[2])
    except (IndexError, ValueError):
        return ()  # sort invalid lines together

lines.sort(key=sort_key)

with open('outfile', 'wb') as file:
    file.writelines(lines)

It assumes that there is a newline at the end of the input file (append one if necessary).
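
If the last line may lack a newline, one way to handle it is to patch it before sorting, so that it does not get glued to the line that follows it in the output. A minimal sketch, assuming lines is the list returned by readlines() in binary mode as above; the helper name ensure_trailing_newline is made up here.

```python
def ensure_trailing_newline(lines):
    """Append a newline byte to the last line if missing (in place)."""
    if lines and not lines[-1].endswith(b'\n'):
        lines[-1] += b'\n'
    return lines


lines = [b'Scf_3R\t8599253\t8621866\n', b'Scf_2L\t10630469\t10632308']
ensure_trailing_newline(lines)
print(lines[-1].endswith(b'\n'))  # True
```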

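The code sorts the text data by byte values, which is fine if the first column is ASCII. If that is not the case, open the files in text mode (use io.open() on Python 2) so that the sort is by Unicode code point values. The result of the sort command in the shell may depend on the locale. You can use the PyICU collator in Python.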
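
PyICU is a third-party package, so as a stand-in the sketch below uses the standard library's locale.strxfrm to collate the first column in text mode. This is a swapped-in technique, not what is recommended above, and the locale name passed to setlocale is an assumption: 'C' is always available but only gives plain code point order for ASCII text, while a name such as 'en_US.UTF-8' works only if that locale is installed. The function name locale_sort_key is made up.

```python
import locale

# Assumption: 'C' exists everywhere; replace it with a locale installed
# on your system (for example 'en_US.UTF-8') to get its collation rules.
locale.setlocale(locale.LC_COLLATE, 'C')


def locale_sort_key(line):
    """Key: column 1 collated by the current locale, columns 2-3 as ints."""
    fields = line.split()
    return locale.strxfrm(fields[0]), int(fields[1]), int(fields[2])


lines = [
    'Scf_X 14395665 14422243\n',
    'Scf_3R 8599253 8621866\n',
    'Scf_3R 8497493 8503049\n',
]
lines.sort(key=locale_sort_key)
print(''.join(lines), end='')
```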

If you need to sort a file that does not fit in memory, see Sorting text file by using Python
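
The general idea behind such out-of-memory sorts can be sketched as follows: sort fixed-size chunks in memory, spill each one to a temporary file, then merge the sorted chunks lazily with heapq.merge. This is only an illustration under assumptions, not the code from the linked question: the function name external_sort and the chunk_size parameter are made up, it needs Python 3.5+ for the key argument of heapq.merge, and it assumes every line is valid and newline-terminated.

```python
import heapq
import itertools
import tempfile


def sort_key(line):
    """Column 1 as text, columns 2 and 3 as integers."""
    fields = line.split()
    return fields[0], int(fields[1]), int(fields[2])


def external_sort(in_path, out_path, chunk_size=100000):
    """Sort in_path into out_path, sorting about chunk_size lines at a time."""
    chunks = []
    try:
        with open(in_path) as infile:
            while True:
                # Read the next chunk_size lines and sort them in memory.
                lines = list(itertools.islice(infile, chunk_size))
                if not lines:
                    break
                lines.sort(key=sort_key)
                # Spill the sorted chunk to a temporary file and rewind it.
                tmp = tempfile.TemporaryFile(mode='w+')
                tmp.writelines(lines)
                tmp.seek(0)
                chunks.append(tmp)
        with open(out_path, 'w') as outfile:
            # heapq.merge pulls lines lazily from the already sorted chunks.
            outfile.writelines(heapq.merge(*chunks, key=sort_key))
    finally:
        for tmp in chunks:
            tmp.close()
```

A call would look like external_sort('infile', 'outfile'). With very many chunks the number of simultaneously open temporary files can become the limiting factor, so chunk_size should be as large as memory comfortably allows.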