Building a map for large datasets
I have a very large file with two columns, about 10 GB in size:
A B
1 2
3 7
1 5
6 5
9 8
Basically, I want to build a map-like structure from this file, like so:
{{1 -> 2,5},{3->7}, {6->5}, {9->8}}
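In Python terms, that would be a dict mapping each key to the set of its unique values (my own paraphrase of the notation above):

the_map = {1: {2, 5}, 3: {7}, 6: {5}, 9: {8}}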
The goal is to write a function that computes the percentage of unique values affected when keys are removed. For example, in the sample above, removing key 1 affects 2/4 of the values. Removing both 1 and 6 still affects only 2/4 of the values, since value 5 is shared between them. The problem is that this map structure would take too much memory. Is there a more efficient alternative? I think you need a map to keep track of duplicates, and you need to know which keys have already been removed so you don't double-count. Here is my initial code:
with open("C:/Users/XX/Desktop/Train.tsv") as f:
counter = 0
for line in f:
#split line into key and value
#add key into set
#if set does not contain key
#create new key
#add list for this key
#append value to this list
#else
#append value to already existing list for that key
This is the error message I got after running Alexander's code. I don't know what KeyError: '293' means:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-22-73145e080824> in <module>()
7 for line in f:
8 key, value = line.split()
----> 9 if value not in dd[key]:
10 dd[key].append(value)
11 counter = counter+1
KeyError: '293'
You can use a defaultdict for this, set up to automatically assign an empty list to each new key. (The KeyError: '293' in your traceback means dd was still a plain dict when that cell ran: a plain dict raises KeyError on a missing key such as '293', whereas a defaultdict silently creates the entry.)
from collections import defaultdict

filename = "C:/Users/XX/Desktop/Train.tsv"
dd = defaultdict(list)

with open(filename) as f:
    for line in f:
        key, value = line.split(',')  # Assuming comma delimited.
        if value not in dd[key]:      # If you only want to retain unique values.
            dd[key].append(value)
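One caveat: value not in dd[key] is a linear scan over a list, which gets slow on a 10 GB file; a set per key makes that check O(1). Once dd is built, the percentage from the question could be computed along these lines (a minimal sketch; affected_fraction and unique_values are illustrative names, not part of the original answer):

unique_values = set()
for values in dd.values():
    unique_values.update(values)

def affected_fraction(keys_to_delete):
    # Union of the values held by the keys being removed.
    affected = set()
    for key in keys_to_delete:
        affected.update(dd.get(key, []))
    return len(affected) / len(unique_values)

print(affected_fraction(['1']))       # e.g. 0.5 for the sample data
print(affected_fraction(['1', '6']))  # still 0.5: value '5' is shared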
Something like this?
#!python3
from collections import defaultdict

AB_map = defaultdict(set)  # key -> set of its unique values
Values = set()             # all unique values seen

with open('train.tsv') as infile:
    headers = next(infile)  # skip the header row ("A B")
    for line in infile:
        if not line.strip():
            continue
        a, b = map(int, line.split())
        AB_map[a].add(b)
        Values.add(b)

print("# of keys:", len(AB_map))
print("# of values:", len(Values))

def impact_of_deletion(keylist):
    values_impacted = set()
    for key in keylist:
        values_impacted.update(AB_map[key])
    return values_impacted

for hyp in ((1,), (1, 6)):
    print("Deleting", hyp, "would impact:", len(impact_of_deletion(hyp)))