Python csv Module Error: index out of range
I have a CSV file from which I want to extract columns, but only from certain rows. It looks like this:
gene_id, ENSDARG00000104632, gene_version, 2, gene_name, RERG
gene_id, ENSDARG00000104632, gene_version, 2, transcript_id, ENSDART00000166186
gene_id, ENSDARG00000104632, gene_version, 2, transcript_id, ENSDART00000166186
gene_id, ENSDARG00000104632, gene_version, 2, transcript_id, ENSDART00000166186
gene_id, ENSDARG00000104632, gene_version, 2, transcript_id, ENSDART00000166186
Basically I want the 2nd and 6th columns, but only from the rows that have "gene_name" in the 5th column. So I want to extract:
ENSDARG00000104632, RERG
(and thousands of rows from there on)
This is what I wrote:
import csv
with open('filename.csv', 'rb') as infh:
    reader = csv.reader(infh)
    for row in reader:
        if row[4] == 'gene_name':
            print row[1, 5]
However, it gives me this error:
  File "./gene_name_grabber.sh", line 10, in <module>
    if row[4] == 'gene_name':
IndexError: list index out of range
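I understand that this error means I have asked for an index larger than the number of indices in the row... but there are clearly more than 4 indices in every row. Any help, please?
Thanks!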
I want the 2nd and 6th column, but only from the rows which have "gene_name" in the 5th column.
I love Python. But this is most naturally expressed as
awk '$5 ~ /gene_name/ {print $2, $6}'
Let's get back to Python. This is not what you meant to write:
print row[1, 5]
Write print(row[1], row[5]) instead.
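To illustrate why row[1, 5] fails: a list index must be an integer or a slice, so indexing with the tuple (1, 5) raises a TypeError. The following is a minimal sketch, assuming a sample row copied from the question. The names pair, same_pair and pick_columns are only illustrative; operator.itemgetter is a standard-library way to pick several positions in one call.

```python
from operator import itemgetter

# A sample row as csv.reader might return it (values taken from the question).
row = ['gene_id', 'ENSDARG00000104632', 'gene_version', '2', 'gene_name', 'RERG']

# Indexing a list with a tuple is not allowed and raises TypeError.
try:
    row[1, 5]
except TypeError as exc:
    error_message = str(exc)

# Index each position separately...
pair = (row[1], row[5])

# ...or let itemgetter fetch several positions at once (it returns a tuple).
pick_columns = itemgetter(1, 5)
same_pair = pick_columns(row)

print('%s, %s' % pair)
```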
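Some of your rows have only a few columns. So you will want to wrap, e.g., the dereference of row[4] or row[5] in an if statement that verifies the row is long enough:
if len(row) > 5:
    ...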
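Put together, the length check might look like the sketch below. It assumes the comma-plus-space layout shown in the question and reads from an in-memory sample rather than a real file; extract_gene_names is a hypothetical helper name. Passing skipinitialspace=True to csv.reader drops the blank after each comma, so the comparison with 'gene_name' is exact.

```python
import csv
import io

def extract_gene_names(lines):
    """Return (gene_id, gene_name) pairs from rows with 'gene_name' in the 5th column."""
    results = []
    # skipinitialspace=True ignores the blank that follows each comma.
    reader = csv.reader(lines, skipinitialspace=True)
    for row in reader:
        # Skip blank or truncated rows before touching row[4] or row[5].
        if len(row) > 5 and row[4] == 'gene_name':
            results.append((row[1], row[5]))
    return results

# In-memory sample: one gene_name row, one transcript row, a blank line, a short row.
sample = io.StringIO(
    "gene_id, ENSDARG00000104632, gene_version, 2, gene_name, RERG\n"
    "gene_id, ENSDARG00000104632, gene_version, 2, transcript_id, ENSDART00000166186\n"
    "\n"
    "gene_id, ENSDARG00000104632\n"
)
pairs = extract_gene_names(sample)
for gene_id, gene_name in pairs:
    print('%s, %s' % (gene_id, gene_name))
```

For a real file, the same function could be given the object returned by open('filename.csv', newline='') in Python 3.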
Evidently, some rows do not contain enough columns. Try this:
import csv
with open('input.csv', 'r') as f:
    reader = csv.reader(f)
    for row in reader:
        try:
            if 'gene_name' in row[4]:
                print('%s, %s' % (row[1].strip(), row[5].strip()))
        except IndexError:
            continue
...output:
ENSDARG00000104632, RERG
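Silently skipping rows hides where the short rows are. If you would rather find them, something like the following sketch could report their line numbers. It assumes that a complete row has six fields; find_short_rows and min_columns are hypothetical names. reader.line_num is the csv reader's count of source lines read so far.

```python
import csv
import io

def find_short_rows(lines, min_columns=6):
    """Return (line_number, row) for every row with fewer than min_columns fields."""
    reader = csv.reader(lines, skipinitialspace=True)
    short_rows = []
    for row in reader:
        if len(row) < min_columns:
            # line_num counts the lines consumed from the source so far.
            short_rows.append((reader.line_num, row))
    return short_rows

# In-memory sample: a complete row, a blank line, and a truncated row.
sample = io.StringIO(
    "gene_id, ENSDARG00000104632, gene_version, 2, gene_name, RERG\n"
    "\n"
    "gene_id, ENSDARG00000104632, gene_version\n"
)
problems = find_short_rows(sample)
for line_number, row in problems:
    print(line_number, row)
```

A blank line comes back from csv.reader as an empty list, which is a common cause of this IndexError.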
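As Antimony pointed out, it sounds like your data occasionally has missing values, which csv cannot easily handle out of the box. I suggest using a library like pandas, which has a read_csv function and can handle missing values. Take this data as an example: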
gene_id, ENSDARG00000104632, gene_version, 2, gene_name, RERG
gene_id, ENSDARG00000104632, gene_version, 2, transcript_id, ENSDART00000166186
gene_id, ENSDARG00000104632, gene_version, 2, transcript_id, ENSDART00000166186
gene_id, ENSDARG00000104632, gene_version, 2, transcript_id, ENSDART00000166186
gene_id, ENSDARG00000104632, gene_version, 2, transcript_id, ENSDART00000166186
gene_id, ENSDARG00000104632, gene_version, 2, transcript_id, ENSDART00000166186
gene_id, ENSDARG00000104632, gene_version, 2, transcript_id, ENSDART00000166186
gene_id, ENSDARG00000104632, gene_version, 2, transcript_id,
gene_id, ENSDARG00000104632, gene_version, , transcript_id,
gene_id, ENSDARG00000104632, gene_version, 2, transcript_id, ENSDART00000166186
It can be read like this:
import pandas as pd
# Use the 2nd, 5th and 6th columns - i.e. column indices 1, 4 and 5 respectively
# And, we set the 'not available' data - i.e. `na_values` as 'N/A'.
data = pd.read_csv('test.dat', na_values='N/A', header=None, skipinitialspace=True, usecols=[1,4,5])
# now select only the rows whose column 4 is not 'gene_name':
d = data.loc[data[4] != 'gene_name']
# and, now we only select columns with index 1 and 5:
selected_data = d[[1, 5]]
This yields:
                    1                   5
1  ENSDARG00000104632  ENSDART00000166186
2  ENSDARG00000104632  ENSDART00000166186
3  ENSDARG00000104632  ENSDART00000166186
4  ENSDARG00000104632  ENSDART00000166186
5  ENSDARG00000104632  ENSDART00000166186
6  ENSDARG00000104632  ENSDART00000166186
7  ENSDARG00000104632                 NaN
8  ENSDARG00000104632                 NaN
9  ENSDARG00000104632  ENSDART00000166186
as desired.
However, if data is missing (as in this example), you can simply drop those rows, e.g.:
selected_data.dropna()
Output:
                    1                   5
1  ENSDARG00000104632  ENSDART00000166186
2  ENSDARG00000104632  ENSDART00000166186
3  ENSDARG00000104632  ENSDART00000166186
4  ENSDARG00000104632  ENSDART00000166186
5  ENSDARG00000104632  ENSDART00000166186
6  ENSDARG00000104632  ENSDART00000166186
9  ENSDARG00000104632  ENSDART00000166186
(However, this may not be what you want.)
References
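If the goal is the one stated in the question (columns 2 and 6 from the gene_name rows only), the same read_csv approach could be sketched with an equality filter instead. This is a minimal example that assumes pandas is installed and reads an in-memory sample rather than test.dat; gene_rows and pairs are illustrative names.

```python
import io
import pandas as pd

# In-memory sample in the same layout as the data above; the last row lacks column 6.
sample = io.StringIO(
    "gene_id, ENSDARG00000104632, gene_version, 2, gene_name, RERG\n"
    "gene_id, ENSDARG00000104632, gene_version, 2, transcript_id, ENSDART00000166186\n"
    "gene_id, ENSDARG00000104632, gene_version, 2, transcript_id,\n"
)

# header=None labels the columns by position; usecols keeps columns 1, 4 and 5.
data = pd.read_csv(sample, header=None, skipinitialspace=True, usecols=[1, 4, 5])

# Keep the rows whose column 4 equals 'gene_name', then take columns 1 and 5.
gene_rows = data.loc[data[4] == 'gene_name', [1, 5]]

# Convert to plain tuples for printing or further processing.
pairs = list(gene_rows.itertuples(index=False, name=None))
print(pairs)
```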
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html