如何编写 python 脚本来查找两个文件之间的交集并追加原始文件

How to write a python script to find the intersection between two files and append original file

我是编程新手,但有以下问题需要解决。我希望找到两个文件的交集。

例如,文件 1:

> scaffold1          0        206        transcript_loc.00001      exon 
> scaffold1         262      749       transcript_loc.00001      exon   
> scaffold1         1391    1549     transcript_loc.00001      exon

文件 2:

scaffold1        517     540     Simple_repeat   
scaffold1        1063    1162    LTR/Gypsy        
scaffold1        1400     1498   LTR 

我想调用两个文件坐标相交的地方,然后将交点附加到File1。

结果文件:

 scaffold1          0        206        transcript_loc.00001      exon    
 scaffold1         262      749       transcript_loc.00001      exon          517     540     Simple_repeat   
 scaffold1         1391    1549     transcript_loc.00001      exon          1400     1498   LTR 

我们非常欢迎任何关于从哪里开始的建议。我可以使用 BedTools Intersect 找到交点,但我真的很想知道如何使用 python 来做到这一点。

谢谢。

因为您没有指定任何额外的约束条件(例如,需要就地进行交叉、任何关于效率或最佳速度的要求、'beautiful algorithm' 的存在等),您可以(并且应该!)使用最愚蠢的策略:

  1. 逐行读取第一个文件.

  2. 对于每一行,获取坐标。例如,对于 file1.

    中的第一行,这将是 (0, 206)
  3. 在继续下一行之前,请检查所有 file2 以查找具有所需间隔内的坐标的第一行。

    例如,如果您想要的间隔是 (0, 206),那么在 file2 之间有坐标 - 如果您想要的间隔是 (262, 749),然后是坐标为 (517, 540) 的第二行 比赛。

  4. 如果 file2 中有相应的行, 将其添加 file1 中的现有行。

  5. 返回第 2 步,直到完成。


代码

path_to_file1 = '<insert path of file1 here>'
path_to_file2 = '<insert path of file2 here>'
path_to_final_output_file = '<insert desired path here>'

# To make the code simpler, I first create a single string that I
# will append each successive line to. At the end, I will just dump
# the whole thing into a new file and call it a day. 

final_output_contents = ''

with open(path_to_file1, 'r') as file1:
     
    for line in file1:

    # split '> scaffold1          0        206        transcript_ loc.00001      exon '
    # into ['>', 'scaffold1', '0', '206', 'transcript_loc.00001', 'exon']
    # then take the third and fourth elements respectively to get 
    # your interval

        # `.split()` will break the string up by whitespace, [2:4]
        # is Python's slice notation, basically a clever way to select
        # elements in an array. This will return 0, 206 for the first 
        # line in the example.

        interval_start, interval_end = line.split()[2:4]

        # now read through the entirety of `file2`. Do the same
        # as before: split each line, get the coordinates, and this time
        # check if they meet the condition that they are in the interval

        with open(path_to_file2, 'r') as file2:
            for other_line in file2:

                first_coordinate, second_coordinate = other_line.split()[1:3]

                # `int()` converts its arguments into integers - we need it
                # because 
                if (int(interval_start) < int(first_coordinate) and 
                    int(interval_end) > int(second_coordinate)):

                    # this is just a fancy way of adding `line` and 
                    # the last three elements of `other_line` together
                    line += '\t'.join(other_line.split()[-3:])

                    # break the `for other_line in file2` loop! 
                    break

        # add this line to your final file, making sure to add
        # a newline character '\n' so your text editor will insert
        # an appropriate line break.

        final_output_contents += line + '\n'

# now dump the contents to your file
with open(path_to_final_output_file, 'w') as final_output_file:
    final_output_file.write(final_output_contents)

如果两个文件都按照您的示例进行排序,那么您可以在智能地逐行阅读一个或两个文件时找到交集。这对您的进程的内存消耗更友好,并且不需要为 file1 中的每一行循环 file2。这对于像你的例子这样的小文件并不重要,但当这些文件很大时会帮助你。

请注意,我只考虑了与 DNA 链相匹配的起始和终止列。如果其他字段也需要匹配此代码将不起作用(但您可以调整它)。

在现实生活中使用它之前,需要进行更多检查(例如在源文件中的行)。

代码:

def read_and_parse_line(fh):
    # It gets the next line,
    #   strip any whitespace from the end,
    #   split it on whitespace into fields.
    return fh.next().rstrip().split()

def append_to_result(result, line):
    (scaffold1, start1, end1, transcript1, type1, start2, end2, comment2) = line
    # Making start and end real integers so you can work with them as such.
    result.append((scaffold1, int(start1), int(end1), transcript1, type1, int(start2), int(end2), comment2))

# These will be our result lines.
result = []
# With makes sure the files are properly closed afterwards.
with open('file1') as file1, open('file2') as file2:
    # Getting the iterator on the files.
    file1 = iter(file1)
    file2 = iter(file2)
    try:
        (scaffold1, start1, end1, transcript1, type1) = read_and_parse_line(file1)
        read_second = True
        while True:
            if read_second:
                try:
                    (scaffold2, start2, end2, comment2) = read_and_parse_line(file2)
                except StopIteration:
                    print("File 2 ended.")
                    # Nothing more in the second file.
                    # Outputting the original line
                    result.append((scaffold1, start1, end1, transcript1, type1, 0, 0, None))
                    # Reading in a new line from the first file.
                    (scaffold1, start1, end1, transcript1, type1) = read_and_parse_line(file1)
            if int(start1) <= int(start2):
                # Second file record starts after the first file record.
                if int(end1) >= int(end2):
                    # And the second file record ends before the first file record.
                    # = MATCH!!!
                    print("We've found a match: [%s-[%s-%s]-%s]" % (start1, start2, end2, end1))
                    # Write the combined line.
                    result.append((scaffold1, start1, end1, transcript1, type1, start2, end2, comment2))
                    ## READ BOTH
                    (scaffold1, start1, end1, transcript1, type1) = read_and_parse_line(file1)
                    read_second = True
                else:
                    # The second file record ends AFTER the first file record.
                    print("No match in second file: [%s-%s]" % (start1, end1))
                    # So writing out the unmodified line.
                    result.append((scaffold1, start1, end1, transcript1, type1, 0, 0, None))
                    ## READ 1
                    (scaffold1, start1, end1, transcript1, type1) = read_and_parse_line(file1)
                    read_second = False
            else:
                # This is thrown away in your example.
                print("No match in the first file for: [%s-%s]" % (start2, end2))
                # Trying the next line in the second file.
                read_second = True
    except StopIteration:
        print("File 1 ended.")
# Printing a clean line after all the logging above.
print("")

# Printing the result to screen.
# I guess you want to do more stuff with the data and not just write it to file.
for line in result:
    # The 8th element being equal to None indicates there was no match.
    if line[7] == None:
        print("%s\t%s\t%s\t%s\t%s" % line[:5])
    else:    
        print("%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s" % line)