Insert and categorize a NumPy array into a Django-modelled database EAV schema
I have a Pandas pivot table of the form:
income_category  age_category  income       age
High             Middle aged   123,564.235  23.456
Medium           Old           18,324.356   65.432
I have a hierarchy of categories, with matching label values, in a self-referencing table called dimension, i.e.:
dimension_id  label           parent_dimension_id
1             Age categories
2             Young           1
3             Middle aged     1
4             Old             1
...and similarly for income
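I'm really struggling to pick one row at a time and randomly access the cells in that row.
I have the parent category's dimension_id (in the code below it is already in cat_id_age). So I want to iterate through the Numpy array, get the matching category dimension_id for each row, and insert it into a value table along with its corresponding value. However, I don't know how to do this Pythonically or Djangonically. (There are only a few categories, so I think the dictionary approach below for looking up the dimension_id is best.) In my iterative mind, the process is: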
    # populate a dictionary to find dimension_ids
    age_dims = Dimension.objects.filter(parent_id=cat_id_age).values('label', 'id')
    for row in Numpy_array:
        dim_id = Dimension.get(row.age_category)
        # Or is the dict approach incorrect? I'm trying to do: SELECT dimension_id FROM dimension WHERE parent_dimension_id=cat_id_age AND label=row.age_category
        # Djangonically? dim = Dimension.objects.get(parent_id=cat_id_age, label=row.age_category)
        # Then insert the categorized value, i.e., INSERT INTO float_value (value, dimension_id) VALUES (row.age, dimension_id)
        float_val = FloatValue(value=row.age, dimension_id=dim_id)
        float_val.save()
...then repeat for income_category and income.
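On the dictionary question: `.values('label', 'id')` yields a sequence of dicts, not a mapping from label to ID, so one more step is needed before `.get(label)` can work. A minimal sketch in plain Python, where the hard-coded `rows` list is a hypothetical stand-in for what the queryset would return and `lookup_dimension_id` is an illustrative helper name:

```python
# Hypothetical stand-in for the result of
# Dimension.objects.filter(parent_id=cat_id_age).values('label', 'id')
rows = [
    {"label": "Young", "id": 2},
    {"label": "Middle aged", "id": 3},
    {"label": "Old", "id": 4},
]

# Map each category label to its dimension ID
age_dims = {row["label"]: row["id"] for row in rows}


def lookup_dimension_id(label):
    """Return the dimension ID for a label, or None if the label is unknown."""
    return age_dims.get(label)


print(lookup_dimension_id("Middle aged"))
```

With a real queryset, `dict(queryset.values_list('label', 'id'))` should build the same mapping in one call, since `values_list` yields (label, id) tuples.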
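However, I'm struggling with iterating like this. That may be my only problem, but I've included the rest to convey what I'm trying to do, as there often seems to be a Python paradigm for this kind of thing (e.g., cursor.executemany("""insert values(?, ?, ?)""", map(tuple, numpy_arr[x:].tolist()))?).
Any pointers are greatly appreciated. (I am using Django 1.7 and Python 3.4.)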
Anzel answered the iteration problem: use the Pandas to_csv() function. My dictionary syntax was also wrong. My final solution was therefore:
    # populate dictionaries to find dimension_ids for category labels
    parent_dimension_age = Dimension.objects.get(name='Age')
    parent_dimension_income = Dimension.objects.get(name='Income')
    dims_age = dict([ (d.name, d.id) for d in Dimension.objects.filter(parent_id=parent_dimension_age.id) ])
    dims_income = dict([ (d.name, d.id) for d in Dimension.objects.filter(parent_id=parent_dimension_income.id) ])
    # dataset, attrib_age and attrib_income are assumed to have been fetched earlier (not shown)
    # Retrieves a row at a time as a tab-delimited string
    for line in pandas_pivottable.to_csv(header=False, index=True, sep='\t').split('\n'):
        if line:
            # row[0] = income category, row[1] = age category, row[2] = age, row[3] = income
            row = line.split('\t')
            entity = Entity(name='data pivot row', dataset_id=dataset.id)
            entity.save()
            # dims_age.get(row[1]) gets the ID for the category whose name matches the contents of row[1]
            age_val = FloatValue(value=row[2], entity_id=entity.id, attribute_id=attrib_age.id, dimension_id=dims_age.get(row[1]))
            age_val.save()
            income_val = FloatValue(value=row[3], entity_id=entity.id, attribute_id=attrib_income.id, dimension_id=dims_income.get(row[0]))
            income_val.save()
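The loop above splits each line by hand. As a rough sketch, the same row-by-row pass can go through the standard `csv` module, which also copes with quoted fields. Everything here is a made-up stand-in: `tsv_text` plays the role of the `to_csv()` output, the two dictionaries hold invented IDs, and `categorize` is an illustrative name. The column order follows the comment in the solution code.

```python
import csv
import io

# Hypothetical stand-in for
# pandas_pivottable.to_csv(header=False, index=True, sep='\t')
tsv_text = "High\tMiddle aged\t23.456\t123564.235\nMedium\tOld\t65.432\t18324.356\n"

# Hypothetical label -> dimension ID lookups (IDs are invented)
dims_age = {"Young": 2, "Middle aged": 3, "Old": 4}
dims_income = {"Low": 6, "Medium": 7, "High": 8}


def categorize(text):
    """Yield (age, age_dimension_id, income, income_dimension_id) per row."""
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row:  # csv.reader returns [] for blank lines
            continue
        income_cat, age_cat, age, income = row
        yield (float(age), dims_age.get(age_cat),
               float(income), dims_income.get(income_cat))


records = list(categorize(tsv_text))
for record in records:
    print(record)
```

Each tuple carries what one pair of FloatValue rows needs, so the model construction could stay as in the solution above.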
For more information on the entity-attribute-value (EAV) schema, see the Wikipedia page (and if you are considering it, see the Django-EAV extension). In the next iteration of this project, however, I will be replacing it with PostgreSQL's new JSONB type. This promises to make the data more legible and to perform equally well or better.