Calculating the sum of a combination of columns in pandas, row-wise, with an output file named after that combination
I'm looking for a way to generate CSV files for specific combinations of columns in a data frame.
My data looks like this (except with about 200 more rows):
+-------------------------------+-----+----------+---------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+---------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+
| Species | OGT | Domain | A | C | D | E | F | G | H | I | K | L | M | N | P | Q | R | S | T | V | W | Y |
+-------------------------------+-----+----------+---------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+---------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+
| Aeropyrum pernix | 95 | Archaea | 9.7659115711 | 0.6720465616 | 4.3895390781 | 7.6501943794 | 2.9344881615 | 8.8666657183 | 1.5011817208 | 5.6901432494 | 4.1428307243 | 11.0604191603 | 2.21143353 | 1.9387130928 | 5.1038552753 | 1.6855017182 | 7.7664358772 | 6.266067034 | 4.2052190807 | 9.2692433532 | 1.318690698 | 3.5614200159 |
| Argobacterium fabrum | 26 | Bacteria | 11.5698896021 | 0.7985475923 | 5.5884500155 | 5.8165463343 | 4.0512504104 | 8.2643271309 | 2.0116736244 | 5.7962804605 | 3.8931525401 | 9.9250463349 | 2.5980609708 | 2.9846761128 | 4.7828063605 | 3.1262365491 | 6.5684282943 | 5.9454781844 | 5.3740045968 | 7.3382308193 | 1.2519739683 | 2.3149400984 |
| Anaeromyxobacter dehalogenans | 27 | Bacteria | 16.0337898849 | 0.8860252895 | 5.1368827707 | 6.1864992608 | 2.9730203513 | 9.3167603253 | 1.9360386851 | 2.940143349 | 2.3473650439 | 10.898494736 | 1.6343905351 | 1.5247123262 | 6.3580285706 | 2.4715303021 | 9.2639057482 | 4.1890063803 | 4.3992339725 | 8.3885969061 | 1.2890166336 | 1.8265589289 |
| Aquifex aeolicus | 85 | Bacteria | 5.8730327277 | 0.795341216 | 4.3287799008 | 9.6746388172 | 5.1386954322 | 6.7148035486 | 1.5438364179 | 7.3358775924 | 9.4641440609 | 10.5736658776 | 1.9263080969 | 3.6183861236 | 4.0518679067 | 2.0493569604 | 4.9229955632 | 4.7976564501 | 4.2005259246 | 7.9169763709 | 0.9292167138 | 4.1438942987 |
| Archaeoglobus fulgidus | 83 | Archaea | 7.8742687687 | 1.1695110027 | 4.9165979364 | 8.9548767369 | 4.568636662 | 7.2640358917 | 1.4998752909 | 7.2472039919 | 6.8957233203 | 9.4826333048 | 2.6014466253 | 3.206476915 | 3.8419576418 | 1.7789787933 | 5.7572748236 | 5.4763351139 | 4.1490633048 | 8.6330814159 | 1.0325605451 | 3.6494619148 |
+-------------------------------+-----+----------+---------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+---------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+
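What I'd like is a way to generate a CSV that contains Species, OGT, and then the row-wise sum of some combination of the other columns, say A, C, E and G.
So the output would look like this (these sums are just made up):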
ACEG.csv
Species OGT Sum of percentage
------------------------------- ----- -------------------
Aeropyrum pernix 95 23.4353
Anaeromyxobacter dehalogenans 26 20.3232
Argobacterium fabrum 27 14.2312
Aquifex aeolicus 85 15.0403
Archaeoglobus fulgidus 83 34.0532
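The point is that I can then do this for every one of the 10 million combinations of the columns (A-Y), but I figure that part is a simple for loop. I initially tried to do this in R, but on reflection, pandas in Python is probably the better choice.
Something like this?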
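def subset_to_csv(cols):
    # cols is a string of single-letter column names, e.g. 'ACEG'
    df['Sum of percentage'] = your_data[list(cols)].sum(axis=1)
    df.to_csv(cols + '.csv')

# .copy() avoids pandas' SettingWithCopyWarning when the sum column is added
df = your_data[['Species', 'OGT']].copy()
for c in your_list_of_combinations:
    subset_to_csv(c)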
where cols is a string containing the columns you want to subset, e.g. 'ABC'.
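A self-contained sketch of the same idea, in case it helps to try it end to end. The three-row table is made up and stands in for your_data, and passing the data and an output directory as parameters is an assumption made here for illustration, not part of the snippet above.

```python
import os
import tempfile
import pandas as pd

# Made-up stand-in for the real table (values are illustrative only).
your_data = pd.DataFrame({
    "Species": ["sp1", "sp2", "sp3"],
    "OGT": [95, 26, 27],
    "A": [1.0, 2.0, 3.0],
    "C": [0.5, 0.5, 0.5],
    "E": [2.0, 1.0, 0.0],
})

def subset_to_csv(data, cols, out_dir):
    """Write Species, OGT and the row-wise sum of `cols` to <cols>.csv."""
    out = data[["Species", "OGT"]].copy()
    out["Sum of percentage"] = data[list(cols)].sum(axis=1)
    path = os.path.join(out_dir, cols + ".csv")
    out.to_csv(path, index=False)
    return path

# Write one file into a temporary directory and read it back.
out_dir = tempfile.mkdtemp()
path = subset_to_csv(your_data, "ACE", out_dir)
result = pd.read_csv(path)
print(result)
```

Building a fresh frame inside the function, rather than mutating a shared df, keeps each file independent of the previous iteration.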
Here is something you could try:
from itertools import product
from string import ascii_uppercase
import pandas as pd

# df is assumed to be the full table, already loaded,
# with one column per letter used below
combinations = [''.join(i) for i in product(ascii_uppercase, repeat=4)]
for combination in combinations:
    new_df = df[['Species', 'OGT']].copy()
    # row-wise sum of the selected columns
    new_df['Sum of percentage'] = df[list(combination)].sum(axis=1)
    new_df.to_csv(combination + '.csv')
====
Edit, following Yakym Pirozhenko's comment: combinations should be built with itertools.combinations, to avoid repeats such as 'AAAA':
import itertools
combinations = [''.join(i) for i in itertools.combinations(ascii_uppercase, 4)]
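A possible variation, sketched under the assumption that only the 20 single-letter columns shown in the question's table should be combined (ascii_uppercase also contains letters such as B or J that are not columns there). The helper name column_combinations is hypothetical.

```python
from itertools import combinations

# The 20 single-letter columns from the question's table.
amino_cols = list("ACDEFGHIKLMNPQRSTVWY")

def column_combinations(cols, sizes):
    """Yield each combination as a string such as 'ACEG', for every size in `sizes`."""
    for r in sizes:
        for combo in combinations(cols, r):
            yield "".join(combo)

# All 4-column combinations, with no repeated letters.
four_letter = list(column_combinations(amino_cols, [4]))
print(len(four_letter), four_letter[:3])

# Count every non-empty subset without storing the names.
total = sum(1 for _ in column_combinations(amino_cols, range(1, len(amino_cols) + 1)))
print(total)
```

For these 20 columns there are 4845 four-column combinations and 2**20 - 1 = 1048575 non-empty subsets in total, so a generator is likely preferable to building the full list up front.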
Not an answer to the original question, but it may be useful in the discussion.
The goal is to find the combination of columns whose sum has the maximal correlation with OGT. This is easy because covariance is bilinear: cov(OGT, A+B) = cov(OGT, A) + cov(OGT, B).
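The identity can be checked numerically. The following is a minimal sketch on random data; the arrays ogt, a and b and the helper cov are made up for illustration.

```python
import numpy as np

# Random stand-ins for OGT and two factor columns.
rng = np.random.default_rng(0)
ogt = rng.normal(size=50)
a = rng.normal(size=50)
b = rng.normal(size=50)

def cov(x, y):
    """Sample covariance of two 1-D arrays."""
    return np.cov(x, y)[0, 1]

# Covariance with the sum versus the sum of covariances.
lhs = cov(ogt, a + b)
rhs = cov(ogt, a) + cov(ogt, b)
print(np.isclose(lhs, rhs))
```

The two sides agree up to floating-point rounding, which is why the covariance of OGT with any column sum can be built from per-column covariances.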
I rely on three simplifying assumptions:
1. The factors A, B, C, etc. are independent.
2. All species are weighted equally.
3. Each factor has variance 1.
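The idea:
1. Normalize all columns to have unit variance (this is assumption 3).
2. Compute the covariance of OGT with each column.
3. Sort the factors A, B, C, ... in order of decreasing covariance. The best combination will appear as a prefix of this ordering.
4. Which prefix should we choose? The one with the largest ratio of summed covariance to the standard deviation of the sum. Because of the normalization in step 1, the standard deviation of the sum over a prefix of size n is just sqrt(n). All that remains is to find the index of the maximum in a sequence, which is easy.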
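To spell out why a prefix is enough, here is a short derivation under the assumptions listed above. The notation is introduced here for illustration: X_i are the standardized columns and S is a candidate set of n columns.

```latex
\operatorname{corr}\Big(\mathrm{OGT}, \sum_{i \in S} X_i\Big)
  = \frac{\sum_{i \in S} \operatorname{cov}(\mathrm{OGT}, X_i)}
         {\sigma_{\mathrm{OGT}} \sqrt{\operatorname{Var}\big(\sum_{i \in S} X_i\big)}}
  = \frac{\sum_{i \in S} \operatorname{cov}(\mathrm{OGT}, X_i)}
         {\sigma_{\mathrm{OGT}} \sqrt{n}}
```

The second equality uses independence and unit variance, which give a variance of n for the sum. For a fixed n the denominator is constant, so the numerator is maximized by taking the n columns with the largest covariances, which is exactly a prefix of the sorted order.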
This is probably a bit faster than checking every possible combination.
import pandas as pd
import numpy as np
# set up fake data
import string
df = pd.DataFrame(np.random.rand(3, 26), columns=list(string.ascii_uppercase))
df["species"] = ["dog", "cat", "human"]
df["OGT"] = np.random.randint(0, 100, 3)
df = df.set_index("species")
# actual work
alpha_cols = list(string.ascii_uppercase)
# normalize standard deviations of each column
df = df[alpha_cols + ["OGT"]].div(df.std(0), axis=1)
# compute correlations (= covariances) of OGT with each column
corrs = df.corrwith(df.OGT).sort_values(ascending=False)
del corrs["OGT"]
# sort covariances in order from the greatest to the smallest
# compute cumulative sums
# divide by standard deviation of a group (i.e. sqrt(n) at index n-1)
cutoff = (corrs.cumsum() / np.sqrt(np.arange(corrs.shape[0]) + 1)).idxmax()
answer = sorted(corrs.loc[:cutoff].index.values)
print(answer)
# e.g.
# ['B', 'I', 'K', 'O', 'Q', 'S', 'U', 'V', 'Y']