Pandas, for each groupby group, enumerate over column of strings and convert to counter dictionary
I am trying to automatically build networkx graphs for any input pandas dataframe.
The dataframe looks like this:
FeatureID BC chrom pos ftm_call
1_1_1 GCTATT 12 25398138 NRAS_3
1_1_1 GCCTAT 12 25398160 NRAS_3
1_1_1 GCCTAT 12 25398073 NRAS_3
1_1_1 GATCCT 12 25398128 NRAS_3
1_1_1 GATCCT 12 25398107 NRAS_3
Here is the algorithm I need to put together:
- group by FeatureID
- for each FeatureID, select the graph whose "name" attribute matches the group's ftm_call
- for each row in the group, enumerate over the BC string, starting at the position given in the pos column
- for each letter in BC, check whether that letter has already been recorded at that position in the graph; if not, add it with a weight of 1, and if it is already there, add 1 to its weight (see the plain-Python sketch after this list)
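For concreteness, here is a minimal plain-Python sketch of just the counting rule I am after (the (chrom, pos, letter) keys and the counts dictionary are only illustrative, not part of my actual pipeline):

from collections import Counter

counts = Counter()

def count_letters(chrom, start_pos, bc):
    # walk the barcode letter by letter, offset from the start position
    for offset, letter in enumerate(bc):
        counts[(chrom, start_pos + offset, letter)] += 1

count_letters(12, 25398138, "GCTATT")
count_letters(12, 25398138, "GCTTTT")
# counts[(12, 25398138, "G")] is now 2, counts[(12, 25398141, "A")] is 1, etc.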
Here is what I have so far:
import pandas as pd
import numpy as np
import networkx as nx
from collections import defaultdict

# read in test basecalls
hamming_df = pd.read_csv("./test_data.txt", sep="\t")
hamming_df = hamming_df[["FeatureID", "BC", "chrom", "pos"]]

# initiate graphs
G = nx.DiGraph(name="G")
KRAS = nx.DiGraph(name="KRAS")
NRAS_3 = nx.DiGraph(name="NRAS_3")

# list of reference graphs
ref_graph_list = [G, KRAS, NRAS_3]

def add_basecalls(row):
    basecall = row.BC.astype(str)
    target = row.name[1]
    pos = row["pos"]
    chrom = row["chrom"]

    # initialize counter dictionary
    d = defaultdict()

    # select graph that matches ftm call
    graph = [f for f in ref_graph_list if f.graph["name"] == target]

stuff = hamming_df.groupby(["FeatureID", "ftm_call"])
stuff.apply(add_basecalls)
But this does not pull each barcode out as a string I can enumerate over; it pulls them out as a Series, and that is where I am stuck.
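If it helps to see what I mean, this is roughly what I observe inside the function (a toy reproduction, not my real data):

import pandas as pd

df = pd.DataFrame({"FeatureID": ["1_1_1", "1_1_1"],
                   "ftm_call": ["NRAS_3", "NRAS_3"],
                   "BC": ["GCTATT", "GCCTAT"]})

def add_basecalls(grp):
    # grp is the whole group (a DataFrame), so grp.BC is a Series of
    # barcodes rather than a single string I can loop over
    print(type(grp.BC))  # <class 'pandas.core.series.Series'>

df.groupby(["FeatureID", "ftm_call"]).apply(add_basecalls)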
The desired output is a graph containing the following information, shown here for the first BC "GCTATT" with made-up counts:
FeatureID chrom pos Nucleotide Weight
1_1_1 12 25398138 G 10
1_1_1 12 25398138 C 22
1_1_1 12 25398139 T 12
1_1_1 12 25398140 A 15
1_1_1 12 25398141 T 18
1_1_1 12 25398142 T 22
Thanks in advance!
You probably need an additional apply with axis=1 to parse the rows within each group:
import pandas as pd
import numpy as np
import networkx as nx
from collections import defaultdict

# initiate graphs
GRAPHS = {"G": nx.DiGraph(name="G"),
          "KRAS": nx.DiGraph(name="KRAS"),
          "NRAS_3": nx.DiGraph(name="NRAS_3"),  # notice that test_data.txt has "NRAS_3" not "KRAS_3"
          }

WEIGHT_DICT = defaultdict()

def update_weight_for_row(row, target_graph):
    pos = row["pos"]
    chrom = row["chrom"]
    for letter in row.BC:
        print(letter)
        # now you have access to letters in BC per row
        # and can update graph weights as desired

def add_basecalls(grp):
    # select graph that matches ftm_call
    target = grp.name[1]
    target_graph = GRAPHS[target]
    grp.apply(lambda row: update_weight_for_row(row, target_graph), axis=1)

# read in test basecalls
hamming_df = pd.read_csv("./test_data.txt", sep="\t")
hamming_df2 = hamming_df[["FeatureID", "BC", "chrom", "pos"]]  # Why is this line needed?

stuff = hamming_df.groupby(["FeatureID", "ftm_call"])
stuff.apply(lambda grp: add_basecalls(grp))
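If you then want the per-position counts described in the question, one possible way to fill in update_weight_for_row is to treat each (chrom, pos, letter) tuple as a node key (this keying scheme is just one illustrative choice, not something prescribed by networkx):

def update_weight_for_row(row, target_graph):
    chrom = row["chrom"]
    start = row["pos"]
    for offset, letter in enumerate(row.BC):
        node = (chrom, start + offset, letter)  # hypothetical node key
        if target_graph.has_node(node):
            # letter already recorded at this position: bump its weight by 1
            target_graph.nodes[node]["weight"] += 1
        else:
            # first occurrence of this letter at this position
            target_graph.add_node(node, weight=1)

The weights can then be read back with, for example, target_graph.nodes(data="weight") to build the FeatureID / chrom / pos / Nucleotide / Weight table shown above.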