Pandas, for each groupby group, enumerate over column of strings and convert to counter dictionary

Question

I am trying to automatically build networkx graphs from an arbitrary input pandas DataFrame.

The DataFrame looks like this:

  FeatureID       BC         chrom       pos        ftm_call
  1_1_1           GCTATT     12          25398138   NRAS_3
  1_1_1           GCCTAT     12          25398160   NRAS_3
  1_1_1           GCCTAT     12          25398073   NRAS_3
  1_1_1           GATCCT     12          25398128   NRAS_3
  1_1_1           GATCCT     12          25398107   NRAS_3

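Here is the algorithm I need to put together:

Group by FeatureID
For each FeatureID, select the graph whose "name" attribute matches ftm_call
For each row in the group, enumerate the BC column, with the starting position equal to the value in the pos column
For each letter in BC, check whether that letter has already been seen at that position in the graph. If not, add it with a weight of 1. If it already exists, increment its weight by 1

This is what I have so far: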

import pandas as pd
import numpy as np
import networkx as nx
from collections import defaultdict

# read in test basecalls
hamming_df = pd.read_csv("./test_data.txt", sep="\t")
hamming_df = hamming_df[["FeatureID", "BC", "chrom", "pos"]]

# initiate graphs 
G = nx.DiGraph(name="G")
KRAS = nx.DiGraph(name="KRAS")
NRAS_3 = nx.DiGraph(name="NRAS_3")

# list of reference graphs
ref_graph_list = [G, KRAS, NRAS_3]

def add_basecalls(row):
    basecall = row.BC.astype(str)
    target = row.name[1]
    pos = row["pos"]
    chrom = row["chrom"]

    # initialize counter dictionary
    d = defaultdict()

    # select graph that matches ftm call
    graph = [f for f in ref_graph_list if f.graph["name"] == target]

stuff = hamming_df.groupby(["FeatureID", "ftm_call"])  
stuff.apply(add_basecalls)

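But this does not pull the barcodes out as strings that I can enumerate. It pulls them out as a Series instead, and I am stuck.

The desired output is a graph containing the following information, shown here for the first BC "GCTATT" with made-up counts: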

FeatureID    chrom    pos         Nucleotide    Weight
1_1_1        12       25398138       G            10
1_1_1        12       25398139       C            22
1_1_1        12       25398140       T            12
1_1_1        12       25398141       A            15
1_1_1        12       25398142       T            18
1_1_1        12       25398143       T            22

Thanks in advance!

Answer 1

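groupby(...).apply passes each whole group to your function as a DataFrame, so row.BC is a Series there rather than a string. You probably need an additional apply with axis=1 to walk over the rows of each group: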

import pandas as pd
import numpy as np
import networkx as nx
from collections import defaultdict

# initiate graphs
GRAPHS = {"G": nx.DiGraph(name="G"),
          "KRAS": nx.DiGraph(name="KRAS"),
          "NRAS_3": nx.DiGraph(name="NRAS_3"), # notice that test_data.txt has "NRAS_3" not "KRAS_3"
     }

# defaultdict(int) makes missing keys start at 0, so weights can be incremented directly
WEIGHT_DICT = defaultdict(int)

def update_weight_for_row(row, target_graph):
    pos = row["pos"]
    chrom = row["chrom"]
    for letter in row.BC:
        print(letter)
        # now you have access to letters in BC per row
        # and can update graph weights as desired

def add_basecalls(grp):
    # select graph that matches ftm_call
    target = grp.name[1]
    target_graph = GRAPHS[target]
    grp.apply(lambda row: update_weight_for_row(row, target_graph), axis=1)

# read in test basecalls
hamming_df = pd.read_csv("./test_data.txt", sep="\t")
hamming_df2 = hamming_df[["FeatureID", "BC", "chrom", "pos"]]  # Why is this line needed?
stuff = hamming_df.groupby(["FeatureID", "ftm_call"])  
stuff.apply(lambda grp: add_basecalls(grp))
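
To make the "update weights as desired" step concrete, here is a minimal sketch of the counting itself. It is not a drop-in replacement for the code above: it assumes a small in-memory DataFrame copied from the sample rows in the question instead of test_data.txt, it keeps the weights in a collections.Counter per (FeatureID, ftm_call) group rather than in the networkx graphs, and the helper name count_basecalls is made up for illustration. It also loops with itertuples instead of a nested apply, which is a different technique that yields each row's BC as a plain string.

```python
from collections import Counter

import pandas as pd

# Toy data: the five sample rows shown in the question (an assumption, not test_data.txt).
df = pd.DataFrame({
    "FeatureID": ["1_1_1"] * 5,
    "BC": ["GCTATT", "GCCTAT", "GCCTAT", "GATCCT", "GATCCT"],
    "chrom": [12] * 5,
    "pos": [25398138, 25398160, 25398073, 25398128, 25398107],
    "ftm_call": ["NRAS_3"] * 5,
})


def count_basecalls(frame):
    """Hypothetical helper: map (FeatureID, ftm_call) -> Counter keyed by (chrom, position, letter)."""
    counts = {}
    for key, grp in frame.groupby(["FeatureID", "ftm_call"]):
        counter = Counter()
        # itertuples yields one row at a time, so row.BC is a single value, not a Series
        for row in grp.itertuples(index=False):
            # the first letter sits at pos, the next at pos + 1, and so on
            for offset, letter in enumerate(str(row.BC)):
                counter[(row.chrom, row.pos + offset, letter)] += 1
        counts[key] = counter
    return counts


counts = count_basecalls(df)
nras = counts[("1_1_1", "NRAS_3")]

# Flatten the counters into the tabular layout asked for in the question.
tidy = pd.DataFrame(
    [(fid, chrom, pos, letter, weight)
     for (fid, _call), counter in counts.items()
     for (chrom, pos, letter), weight in sorted(counter.items())],
    columns=["FeatureID", "chrom", "pos", "Nucleotide", "Weight"],
)
print(tidy.head(6))
```

From there, each (chrom, position, letter) key and its weight could be copied into the matching graph, for example with target_graph.add_node(key, weight=weight), since add_node accepts arbitrary keyword attributes. Whether the weight belongs on nodes or on edges depends on how you want to model the graph.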
