python remove repeated values in set
I have a set that looks like this:
my_set = {
[
{
"sample_id": "read1",
"seg_1": None,
"lukM-F": "D",
"23s_SA": None,
"see": None,
"sed": "ND"
},
{
"sample_id": "read2",
"seg_1": None,
"lukM-F": "ND",
"23s_SA": None,
"see": "D",
"sed": "ND"
},
{
"sample_id": "read3",
"seg_1": None,
"lukM-F": "D",
"23s_SA": None,
"see": "ND",
"sed": "None"
}
]
}
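I want to remove the keys whose value is None across the entire collection. For example, if None is the value of the key "seg_1" in every sample_id (read1 AND read2 AND read3), then remove that key entirely. If "seg_1" is None in only one sample, say read1, and is not None in the other two sample_ids, then keep "seg_1" and its values. So I want to end up with the following: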
my_set = {
[
{
"sample_id": "read1",
"lukM-F": "D",
"see": None,
"sed": "ND"
},
{
"sample_id": "read2",
"lukM-F": "ND",
"see": "D",
"sed": "ND"
},
{
"sample_id": "read3",
"lukM-F": "D",
"see": "ND",
"sed": "None"
}
]
}
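Note that seg_1 and 23s_SA have now been removed, because their value is None in every sample_id.
I spent a long time trying to do this without success. I ended up converting the set to a dict, then to lists, and then iterating over all the lists to remove every item that is None in all of them.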
number_of_samples = len(my_set)
each_sample_list = [[] for i in range(0, number_of_samples)]
n = 0
for data_in_dict in my_set:
    for k,val in data_in_dict.items():
        each_sample_list[n].append([k,val])
    if n == number_of_samples:
        break
    else:
        print each_sample_list[n]
        n += 1
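I thought about using itertools izip to iterate over multiple lists, but I'm not sure whether that would work. Any help is much appreciated.
Thanks
You can build a counter and then remove all the keys that qualify: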
import collections
import itertools
source = [
{
"sample_id": "read1",
"seg_1": None,
"lukM-F": "D",
"23s_SA": None,
"see": None,
"sed": "ND"
},
{
"sample_id": "read2",
"seg_1": None,
"lukM-F": "ND",
"23s_SA": None,
"see": "D",
"sed": "ND"
},
{
"sample_id": "read3",
"seg_1": None,
"lukM-F": "D",
"23s_SA": None,
"see": "ND",
"sed": "None"
}
]
size = len(source)
# for python2 you should use iteritems() method
iterators_chain = itertools.chain(*[x.items() for x in source])
counter = collections.Counter(iterators_chain)
for (key, val), count in counter.items():
    if count == size and val is None:
        for x in source:
            x.pop(key)
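The counter approach above could be wrapped in a small function, roughly as sketched here. The name drop_all_none_keys and the shortened sample data are illustrative only, and the sketch assumes every value is hashable, since Counter uses the (key, value) pairs as dictionary keys.

```python
import collections
import itertools


def drop_all_none_keys(records):
    """Remove, in place, every key whose value is None in all records."""
    size = len(records)
    # Count identical (key, value) pairs across all dicts.
    # Assumption: all values are hashable (strings and None here).
    counter = collections.Counter(
        itertools.chain.from_iterable(d.items() for d in records)
    )
    for (key, val), count in counter.items():
        # A (key, None) pair seen once per record means the key is None everywhere.
        if val is None and count == size:
            for d in records:
                d.pop(key, None)
    return records


# Hypothetical, shortened sample data.
sample = [
    {"sample_id": "read1", "seg_1": None, "see": None},
    {"sample_id": "read2", "seg_1": None, "see": "D"},
]
result = drop_all_none_keys(sample)
```

Here "seg_1" is dropped because it is None in both records, while "see" survives because read2 has a real value for it.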
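Your my_set is not a valid set, because set items must be hashable, and lists are not hashable. But anyway...
Here's a way to do it that doesn't require any imports. It uses sets to determine which keys to keep.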
my_stuff = [
{
"sample_id": "read1",
"seg_1": None,
"lukM-F": "D",
"23s_SA": None,
"see": None,
"sed": "ND"
},
{
"sample_id": "read2",
"seg_1": None,
"lukM-F": "ND",
"23s_SA": None,
"see": "D",
"sed": "ND"
},
{
"sample_id": "read3",
"seg_1": None,
"lukM-F": "D",
"23s_SA": None,
"see": "ND",
"sed": None
}
]
allkeys = set(k for d in my_stuff for k in d)
goodkeys = set(k for k in allkeys if any(d.get(k) for d in my_stuff))
badkeys = allkeys - goodkeys
for d in my_stuff:
    for k in badkeys:
        del d[k]
for d in my_stuff:
    print(d)
Output
{'lukM-F': 'D', 'see': None, 'sed': 'ND', 'sample_id': 'read1'}
{'lukM-F': 'ND', 'see': 'D', 'sed': 'ND', 'sample_id': 'read2'}
{'lukM-F': 'D', 'see': 'ND', 'sed': None, 'sample_id': 'read3'}
The set(...) constructs for allkeys and goodkeys can be replaced with set comprehensions in modern versions of Python, but I'm on Python 2.6.6 on this ancient machine.
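On Python 2.7 or 3.x, the same idea might look like the sketch below with set comprehensions. It also makes one deliberate change from the code above: it tests "is not None" instead of truthiness, so falsy values such as 0 or an empty string still count as real data. The "count" field is a hypothetical addition to show that difference.

```python
# Hypothetical data: "count" holds a falsy but meaningful value.
my_stuff = [
    {"sample_id": "read1", "seg_1": None, "count": 0},
    {"sample_id": "read2", "seg_1": None, "count": 0},
]

# Set comprehensions (Python 2.7+ / 3.x) instead of set(generator).
allkeys = {k for d in my_stuff for k in d}
# A key is good if at least one dict holds a value other than None for it.
goodkeys = {k for k in allkeys if any(d.get(k) is not None for d in my_stuff)}
badkeys = allkeys - goodkeys

for d in my_stuff:
    for k in badkeys:
        # pop with a default tolerates dicts that lack the key.
        d.pop(k, None)
```

With the truthiness test, "count" would have been removed as well, because 0 is falsy.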
Another way to build the allkeys set is
allkeys = set()
for d in my_stuff:
    allkeys.update(d.keys())
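Although it's more code, it runs faster, because .update processes a dict's whole key collection at C speed, whereas the other way has to loop over the keys at Python speed. Of course, if you can guarantee that the set of keys is always the same in every dict in the list, this can be optimized further.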
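As a quick check that these ways of collecting keys agree, here is a small sketch. The set().union(*my_stuff) line is an extra variant that is not in the answer above; it relies on the fact that iterating a dict yields its keys. The sample data is shortened and hypothetical.

```python
# Hypothetical data with differing key sets.
my_stuff = [
    {"sample_id": "read1", "seg_1": None},
    {"sample_id": "read2", "see": "D"},
]

# Generator-based construction: visits each key in a Python-level loop.
keys_from_generator = set(k for d in my_stuff for k in d)

# update-based construction: each dict's keys are added in a single call.
keys_from_update = set()
for d in my_stuff:
    keys_from_update.update(d)

# union accepts any number of iterables, so all dicts can be passed at once.
keys_from_union = set().union(*my_stuff)
```

All three names end up bound to the same set of keys; which one is fastest on real data is something to measure rather than assume.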
Taking advantage of the fact that a key has to be None in all the dicts inside the list:
bkeys = [k for k, v in next(iter(my_stuff), {}).items() if v is None]
bkeys = [k for k in bkeys if all(d[k] is None for d in my_stuff)]
my_stuff = [{k: v for k, v in d.items() if k not in bkeys} for d in my_stuff]
Printout of the new my_stuff:
{'see': None, 'sed': 'ND', 'lukM-F': 'D', 'sample_id': 'read1'}
{'see': 'D', 'sed': 'ND', 'lukM-F': 'ND', 'sample_id': 'read2'}
{'see': 'ND', 'sed': None, 'lukM-F': 'D', 'sample_id': 'read3'}
Without dict comprehensions, just change the last line to:
my_stuff = [dict(((k, v) for k, v in d.items() if k not in bkeys)) for d in my_stuff]
Edited to use only the None keys of the first item (if it exists).
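For reuse, the three lines above might be packaged as a function that returns new dicts and leaves the input untouched. This is only a sketch: the name without_all_none_keys is made up, and it uses d.get(k) so that a key missing from some dict is treated like None, which is an assumption beyond what the answer states.

```python
def without_all_none_keys(records):
    """Return new dicts without the keys that are None in every record."""
    # Only keys that are None in the first record can be None everywhere.
    first = next(iter(records), {})
    candidates = [k for k, v in first.items() if v is None]
    # Assumption: a key missing from a record counts as None (d.get).
    bkeys = {k for k in candidates if all(d.get(k) is None for d in records)}
    return [{k: v for k, v in d.items() if k not in bkeys} for d in records]


# Hypothetical, shortened sample data.
sample = [
    {"sample_id": "read1", "seg_1": None, "see": None},
    {"sample_id": "read2", "seg_1": None, "see": "D"},
]
cleaned = without_all_none_keys(sample)
```

Because the function builds new dicts, sample still contains "seg_1" afterwards, and an empty input list simply yields an empty list.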