在 zepid 包的支持下,使用 python 从数据框计算相对风险(风险比)(从 epitools r pack 模拟风险比。)

Using python to compute relative risk (risk ratio) from a dataframe with support of the zepid package (simulate the riskratio from epitools r pack.)

我来自这个 post and this website 的 zEpid 包。我的目标是从 pd.DataFrame 计算相对风险(又名风险比)。自变量具有三个级别(1、2 和 3),而我的因变量(目标)具有两个级别(1 和 0)。

当我改编下面的代码时

import numpy as np
import pandas as pd
from scipy.stats import norm
from zepid import RiskRatio


# calculating p-value
est= rr.results['RiskRatio'][1]
std = rr.results['SD(RR)'][1]
z_score = np.log(est)/std
p_value = norm.sf(abs(z_score))*2

Python 返回:

    106         # Getting unique values and dropping reference
    107         vals = set(df[exposure].dropna().unique())
--> 108         vals.remove(self.reference)
    109         self._c = df.loc[(df[exposure] == self.reference) & (df[outcome] == 1)].shape[0]
    110         self._d = df.loc[(df[exposure] == self.reference) & (df[outcome] == 0)].shape[0]

KeyError: 0

感谢任何帮助。

数据帧如下

df = pd.DataFrame(
{'age_group': {2759: 1, 3852: 1, 631: 1, 5152: 2, 5334: 2, 6364: 2, 5652: 2, 1636: 1, 2869: 1, 4654: 2, 1888: 1, 247: 1, 6699: 1, 6471: 2, 1760: 1, 4182: 1, 6095: 1, 48: 1, 6348: 1, 5129: 2, 4920: 2, 4590: 1, 5892: 2, 5131: 2, 1649: 2, 5940: 2, 3960: 1, 3060: 2, 4852: 1, 4605: 1, 3475: 1, 4406: 1, 1958: 1, 2170: 2, 6478: 1, 5328: 2, 4063: 3, 6827: 1, 5085: 1, 5155: 1, 4879: 2, 3185: 1, 32: 1, 4690: 1, 4109: 1, 4617: 1, 1048: 2, 747: 1, 995: 1, 6454: 1, 3302: 3, 5984: 2, 1127: 2, 2165: 2, 2025: 1, 4985: 2, 227: 3, 5802: 1, 4623: 1, 438: 1, 4401: 3, 7099: 1, 1149: 1, 6772: 1, 5567: 1, 873: 2, 2957: 1, 7060: 1, 4206: 2, 5239: 1, 1557: 1, 6080: 1, 411: 2, 2139: 1, 2408: 2, 1189: 2, 3295: 3, 4728: 3, 2490: 1, 4147: 1, 6768: 1, 6810: 1, 2901: 1, 3981: 2, 4941: 1, 3879: 2, 5819: 1, 6662: 2, 1589: 2, 6170: 1, 4522: 1, 552: 2, 5270: 1, 2722: 2, 34: 1, 5193: 1, 5767: 1, 2670: 1, 3298: 1, 5542: 1}, 'adhd_parent': {2759: 0, 3852: 0, 631: 0, 5152: 1, 5334: 1, 6364: 1, 5652: 1, 1636: 0, 2869: 0, 4654: 1, 1888: 0, 247: 0, 6699: 0, 6471: 1, 1760: 0, 4182: 0, 6095: 0, 48: 0, 6348: 0, 5129: 1, 4920: 1, 4590: 0, 5892: 1, 5131: 1, 1649: 1, 5940: 1, 3960: 0, 3060: 1, 4852: 0, 4605: 0, 3475: 0, 4406: 0, 1958: 0, 2170: 1, 6478: 0, 5328: 1, 4063: 1, 6827: 0, 5085: 0, 5155: 0, 4879: 1, 3185: 0, 32: 0, 4690: 0, 4109: 0, 4617: 0, 1048: 1, 747: 0, 995: 0, 6454: 0, 3302: 1, 5984: 1, 1127: 1, 2165: 1, 2025: 0, 4985: 1, 227: 1, 5802: 0, 4623: 0, 438: 0, 4401: 1, 7099: 0, 1149: 0, 6772: 0, 5567: 0, 873: 1, 2957: 0, 7060: 0, 4206: 1, 5239: 0, 1557: 0, 6080: 0, 411: 1, 2139: 0, 2408: 1, 1189: 1, 3295: 1, 4728: 1, 2490: 0, 4147: 0, 6768: 0, 6810: 0, 2901: 0, 3981: 1, 4941: 0, 3879: 1, 5819: 0, 6662: 1, 1589: 1, 6170: 0, 4522: 0, 552: 1, 5270: 0, 2722: 1, 34: 0, 5193: 0, 5767: 0, 2670: 0, 3298: 0, 5542: 0}}
)

该错误是 RiskRatio 如何在幕后解析您的输入数据集的结果。

使用RiskRatio时,默认参考类别设置为0。所以,当你在内部处理自变量时,zEpid 正在寻找 age_group=0。但是,您的数据集中没有 0 的实例。

要解决此问题,您可以指定可选参数 reference。默认reference=0,但您可以将其设置为1,这将设置age_group=1作为风险比率的参考风险。

下面是一个简单的例子,其中包含一些模拟数据 'A''Y'

import numpy as np
import pandas as pd
from scipy.stats import norm
from zepid import RiskRatio

np.random.seed(20220120)
df = pd.DataFrame()
df['A'] = np.random.randint(1, 4, size=100)
df['Y'] = np.random.binomial(n=1, p=0.25, size=100)

# Generating some generic data
np.random.seed(20220120)
df = pd.DataFrame()
df['A'] = np.random.randint(1, 4, size=80)           # Note: A \in {1,2,3}
df['Y'] = np.random.binomial(n=1, p=0.25, size=80)   # Note: Y \in {0,1}

# Estimating Risk Ratios with zEpid
rr = RiskRatio(reference=1)
rr.fit(df, exposure='A', outcome='Y')

# Calculating P-values
est = rr.results['RiskRatio'][1:]
std = rr.results['SD(RR)'][1:]
z_score = np.log(est)/std
p_value = norm.sf(abs(z_score))*2

# Displaying results
print("RR:     ", list(est))
print("P-value:", p_value)

应该输出以下内容

RR:      [1.0266666666666666, 0.7636363636363636]
P-value: [0.93990517 0.5312407 ]

我生成了一些通用数据,而不是使用提供的示例数据集,因为该数据中还有另一个问题会导致错误。下面是一个 2×3 table 的数据集

adhd_parent   0   1
age_group          
1            62   0
2             0  32
3             0   6

数据中的这些结构零将通过 zEpid 中的 PositivityError。基本上,你不能计算由于被零除的风险(所指对象中的风险为0)。