Python: How to split txt file data into csv according to user input

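I'm reading in a .txt file that looks roughly like this: TTACGATATACGA etc., but contains thousands of characters. Right now I can read in a file and output it as csv, based on user input that determines the number of characters per column and the number of columns, but it writes a new file each time.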

Ideally, I'd like each file to be formatted like this:

User inputs 4 and 3.

Output: TCAG, TGCT, TACG,

My current output looks like this:

TCAGTGCTTACG

I've tried looking at string splitting, but I can't seem to get it to work.

Here's what I've written so far, apologies if it's bad:

import os  # needed for the os.path calls further down

#user input for parameters
user_input_character = int(input("Enter how many characters you'd like per column"))
user_input_column = int(input("Enter how many columns you'd like"))
character_per_column = user_input_character
columns_per_entry = user_input_column
characters_to_read = int((character_per_column * columns_per_entry))
print("Total characters: " + str(characters_to_read))

#counts used to set letters to be taken into intake
index_start = 0
index_finish = characters_to_read
count =1

#open the file to be read
test_file = open("dna.txt", "r")

#read the file once, drop whitespace and blank lines, and take note of its size for index purposes
read_file = ''.join(test_file.read().split())
file_size = len(read_file)
print(file_size)
i = 1
index = 0
#use loop to make more than one file output
while(index < 50):

#print count used to measure progress for testing
    print('the count is', count)
    count += 1
    index += characters_to_read
    print('index: ',index)

#intake only uses letters from index count per file
    intake = read_file[index_start:index_finish]
    print(intake)

    index_start += characters_to_read
    index_finish +=characters_to_read

#output a csv file with the letters from intake as an individually numbered file
    output_name = "Output%i.csv" % i
    text_file_output = open(output_name, 'w')
    i += 1
    text_file_output.write(intake)
    text_file_output.close()
#define path to print to console for file saving
    path = os.path.abspath(output_name)
    directory = os.path.dirname(path)
    print(path)

test_file.close()

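Here's a simple way to split your DNA data into rows, where each row consists of a given number of columns and each column is a chunk of a given size. It assumes the DNA data is in a single string with no whitespace characters (spaces, tabs, newlines, etc.).

To test this code, I create some fake data using the random module.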

from random import seed, choice
seed(42)

# Make some random DNA data
num = 66
data = ''.join([choice('ACGT') for _ in range(num)])
print(data, '\n')

# Split the data into chunks, columns and rows
chunksize, cols = 4, 3

row = []
for i in range(0, len(data), chunksize):
    chunk = data[i:i+chunksize]
    row.append(chunk)
    if len(row) == cols:
        print(' '.join(row))
        row = []
if row:
    print(' '.join(row))

Output

AAGCCCAATAAACCACTCTGACTGGCCGAATAGGGATATAGGCAACGACATGTGCGGCGACCCTTG

AAGC CCAA TAAA
CCAC TCTG ACTG
GCCG AATA GGGA
TATA GGCA ACGA
CATG TGCG GCGA
CCCT TG

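On my old 2GHz 32-bit machine, running Python 3.6.0, this code can process and save to disk about 100000 characters per second (that includes the time taken to generate the random data).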
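
The same chunk-then-group idea can also be wrapped in a small function, which may be handy when the chunk size and column count come from user input. This is only a sketch: the name `split_into_rows` is made up for illustration, and like the code above it assumes the data is a single string with no whitespace.

```python
def split_into_rows(data, chunksize, cols):
    """Return a list of rows; each row is a list of up to `cols` chunks,
    and each chunk is up to `chunksize` characters long."""
    # First cut the string into fixed-size chunks
    chunks = [data[i:i + chunksize] for i in range(0, len(data), chunksize)]
    # Then group the chunks into rows of `cols` chunks each
    return [chunks[i:i + cols] for i in range(0, len(chunks), cols)]


rows = split_into_rows("TCAGTGCTTACG", 4, 3)
print(rows)  # [['TCAG', 'TGCT', 'TACG']]
```

As in the loop version, the last chunk and the last row are simply shorter when the data length is not a multiple of `chunksize * cols`.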


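Here's a version of the above code that handles spaces and blank lines in the input data. It reads the input data from a file and writes the output to a CSV file.

First, here's the code I used to create some fake test data, which I saved to "dnatest.txt".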

from random import seed, choice, randrange
seed(123)

# Make some random DNA data containing spaces
pool = 'ACGT' * 5 + ' '
for _ in range(15):
    # Choose a random line length
    size = randrange(50, 70)
    data = ''.join([choice(pool) for _ in range(size)])
    print(data)
    # Randomly add a blank line
    if randrange(5) < 2:
        print()

Here's the file it creates:

dnatest.txt

AGCATCACCGGCCAGCGTCACGTAGAGGTCGAAACCGTATCCGATGT AGG

 ACC TTACTAC CGTACGGCAGGAGGAGGG TATTACAC CT TCTCACGAGCAAGGAATA
ATTGATGGCACAGC AAGATCCGCTA  CCGATTG CAACCA CATACGAT CGACCAGATGG
ACAGAACAGATCTTGGGAATGGAACAGGAGAGAGTGTGGGCCACATTAAAGTGATAAT ATTT
TCTGTCGTGGGGCACCAAACCATGCTAATGCACGACTGGGT GAGGGTTGAGAGCCTACTATCCTCAG
TCGATCGAGATGACCCTCCTATCGCAACAGCTGTCAGTGTCCAGAG ACGTCGC CA
TAGGTCTGGAAAC GCACTCCCCTC GGAATAGTCTACACGAGTCCATTATGTC
GATCTGACTATGGGGACCATAACGGCTATGCGACCATGGACTGGTTCGAG

GATTCCCGTTCTACAT CACCTT ACCTCTGATAA CGACTGGTTCGA GGGTCTC CC

AAA CGTCTATTATGTCATAACGTAACTCTGC CGTAGTTTGATCAAACGTACAGCCACCAC

TGAAGC CGCCTCGAACCGCGTCCGACCCTGGGGAGCCTGGGGCCCAGCA
CCTTAGC ACTGCGA AGCTACACCCCACGAGTAATTTG T CTATCGT CCG
GCCTCGTTTCCTTGTGAAATTAT ATGGT C AGTCTTCAATCAA CACCTA CTAATAA
 GTGCTAGC CCGGGGATCTTGTCCTGGTCCA GGTC AT AATCCGTGCTCAAATTACATGGCTT
TTAGTAATGAGTTCGGGC  GCGCCCTCAAAGTTGGTCTAGAAGCGCGCAGTTTTCCTTAGGT

Here's the code that processes that data:

# Input & output file names
iname = 'dnatest.txt'
oname = 'dnatest.csv'

# Read the data and eliminate all whitespace
with open(iname) as f:
    data = ''.join(f.read().split())

# Split the data into chunks, columns and rows
chunksize, cols = 4, 3

with open(oname, 'w') as f:
    row = []
    for i in range(0, len(data), chunksize):
        chunk = data[i:i+chunksize]
        row.append(chunk)
        if len(row) == cols:
            f.write(', '.join(row) + '\n')
            row = []
    if row:
        f.write(', '.join(row) + '\n')

Here's the file it creates:

dnatest.csv

AGCA, TCAC, CGGC
CAGC, GTCA, CGTA
GAGG, TCGA, AACC
GTAT, CCGA, TGTA
GGAC, CTTA, CTAC
CGTA, CGGC, AGGA
GGAG, GGTA, TTAC
ACCT, TCTC, ACGA
GCAA, GGAA, TAAT
TGAT, GGCA, CAGC
AAGA, TCCG, CTAC
CGAT, TGCA, ACCA
CATA, CGAT, CGAC
CAGA, TGGA, CAGA
ACAG, ATCT, TGGG
AATG, GAAC, AGGA
GAGA, GTGT, GGGC
CACA, TTAA, AGTG
ATAA, TATT, TTCT
GTCG, TGGG, GCAC
CAAA, CCAT, GCTA
ATGC, ACGA, CTGG
GTGA, GGGT, TGAG
AGCC, TACT, ATCC
TCAG, TCGA, TCGA
GATG, ACCC, TCCT
ATCG, CAAC, AGCT
GTCA, GTGT, CCAG
AGAC, GTCG, CCAT
AGGT, CTGG, AAAC
GCAC, TCCC, CTCG
GAAT, AGTC, TACA
CGAG, TCCA, TTAT
GTCG, ATCT, GACT
ATGG, GGAC, CATA
ACGG, CTAT, GCGA
CCAT, GGAC, TGGT
TCGA, GGAT, TCCC
GTTC, TACA, TCAC
CTTA, CCTC, TGAT
AACG, ACTG, GTTC
GAGG, GTCT, CCCA
AACG, TCTA, TTAT
GTCA, TAAC, GTAA
CTCT, GCCG, TAGT
TTGA, TCAA, ACGT
ACAG, CCAC, CACT
GAAG, CCGC, CTCG
AACC, GCGT, CCGA
CCCT, GGGG, AGCC
TGGG, GCCC, AGCA
CCTT, AGCA, CTGC
GAAG, CTAC, ACCC
CACG, AGTA, ATTT
GTCT, ATCG, TCCG
GCCT, CGTT, TCCT
TGTG, AAAT, TATA
TGGT, CAGT, CTTC
AATC, AACA, CCTA
CTAA, TAAG, TGCT
AGCC, CGGG, GATC
TTGT, CCTG, GTCC
AGGT, CATA, ATCC
GTGC, TCAA, ATTA
CATG, GCTT, TTAG
TAAT, GAGT, TCGG
GCGC, GCCC, TCAA
AGTT, GGTC, TAGA
AGCG, CGCA, GTTT
TCCT, TAGG, T
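
If you would rather let the standard library handle the CSV formatting, a variant built on the `csv` module might look like the sketch below. The function name `write_dna_csv` and its parameters are assumptions made for illustration. Note that `csv.writer` separates fields with a bare comma by default, so the rows come out as `AGCA,TCAC,CGGC` rather than `AGCA, TCAC, CGGC`.

```python
import csv


def write_dna_csv(iname, oname, chunksize, cols):
    """Read DNA text from `iname`, drop all whitespace, and write it to
    `oname` as CSV rows of `cols` fields of `chunksize` characters each."""
    with open(iname) as f:
        data = ''.join(f.read().split())

    # Cut the cleaned string into fixed-size chunks
    chunks = [data[i:i + chunksize] for i in range(0, len(data), chunksize)]

    # newline='' lets the csv module control the line endings itself
    with open(oname, 'w', newline='') as f:
        writer = csv.writer(f)
        for i in range(0, len(chunks), cols):
            writer.writerow(chunks[i:i + cols])
```

To drive it from user input as in the question, the two sizes could be read with `int(input(...))` and passed in, for example `write_dna_csv('dnatest.txt', 'dnatest.csv', 4, 3)` for 4 characters per column and 3 columns.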