How to extract data from an ASN.1 data file and load it into a dataframe?
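My end goal is to load metadata received from PubMed into a pyspark dataframe.
So far, I have managed to download the data I want from the PubMed database using a shell script.
The downloaded data is in asn1 format. Here is an example of a data entry: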
Pubmed-entry ::= {
pmid 31782536,
medent {
em std {
year 2019,
month 11,
day 30,
hour 6,
minute 0
},
cit {
title {
name "Impact of CYP2C19 genotype and drug interactions on voriconazole
plasma concentrations: a spain pharmacogenetic-pharmacokinetic prospective
multicenter study."
},
authors {
names std {
{
name ml "Blanco Dorado S",
affil str "Pharmacy Department, University Clinical Hospital
Santiago de Compostela (CHUS). Santiago de Compostela, Spain.; Clinical
Pharmacology Group, University Clinical Hospital, Health Research Institute
of Santiago de Compostela (IDIS). Santiago de Compostela, Spain.; Department
of Pharmacology, Pharmacy and Pharmaceutical Technology, Faculty of Pharmacy,
University of Santiago de Compostela (USC). Santiago de Compostela, Spain."
},
{
name ml "Maronas O",
affil str "Genomic Medicine Group, Centro Nacional de Genotipado
(CEGEN-PRB3), CIBERER, CIMUS, University of Santiago de Compostela (USC),
Santiago de Compostela, Spain."
},
{
name ml "Latorre-Pellicer A",
affil str "Genomic Medicine Group, Centro Nacional de Genotipado
(CEGEN-PRB3), CIBERER, CIMUS, University of Santiago de Compostela (USC),
Santiago de Compostela, Spain."
},
{
name ml "Rodriguez Jato T",
affil str "Pharmacy Department, University Clinical Hospital
Santiago de Compostela (CHUS). Santiago de Compostela, Spain."
},
{
name ml "Lopez-Vizcaino A",
affil str "Pharmacy Department, University Hospital Lucus Augusti
(HULA). Lugo, Spain."
},
{
name ml "Gomez Marquez A",
affil str "Pharmacy Department, University Hospital Ourense
(CHUO). Ourense, Spain."
},
{
name ml "Bardan Garcia B",
affil str "Pharmacy Department, University Hospital Ferrol (CHUF).
A Coruna, Spain."
},
{
name ml "Belles Medall D",
affil str "Pharmacy Department, General University Hospital
Castellon (GVA). Castellon, Spain."
},
{
name ml "Barbeito Castineiras G",
affil str "Microbiology Department, University Clinical Hospital
Santiago de Compostela (CHUS). Santiago de Compostela, Spain."
},
{
name ml "Perez Del Molino Bernal ML",
affil str "Microbiology Department, University Clinical Hospital
Santiago de Compostela (CHUS). Santiago de Compostela, Spain."
},
{
name ml "Campos-Toimil M",
affil str "Department of Pharmacology, Pharmacy and Pharmaceutical
Technology, Faculty of Pharmacy, University of Santiago de Compostela (USC).
Santiago de Compostela, Spain."
},
{
name ml "Otero Espinar F",
affil str "Department of Pharmacology, Pharmacy and Pharmaceutical
Technology, Faculty of Pharmacy, University of Santiago de Compostela (USC).
Santiago de Compostela, Spain."
},
{
name ml "Blanco Hortas A",
affil str "Epidemiology Unit. Fundacion Instituto de Investigacion
Sanitaria de Santiago de Compostela (FIDIS), University Hospital Lucus
Augusti (HULA), Spain."
},
{
name ml "Duran Pineiro G",
affil str "Clinical Pharmacology Group, University Clinical
Hospital, Health Research Institute of Santiago de Compostela (IDIS).
Santiago de Compostela, Spain."
},
{
name ml "Zarra Ferro I",
affil str "Pharmacy Department, University Clinical Hospital
Santiago de Compostela (CHUS). Santiago de Compostela, Spain.; Clinical
Pharmacology Group, University Clinical Hospital, Health Research Institute
of Santiago de Compostela (IDIS). Santiago de Compostela, Spain."
},
{
name ml "Carracedo A",
affil str "Genomic Medicine Group, Centro Nacional de Genotipado
(CEGEN-PRB3), CIBERER, CIMUS, University of Santiago de Compostela (USC),
Santiago de Compostela, Spain.; Galician Foundation of Genomic Medicine,
Health Research Institute of Santiago de Compostela (IDIS), SERGAS, Santiago
de Compostela, Spain."
},
{
name ml "Lamas MJ",
affil str "Clinical Pharmacology Group, University Clinical
Hospital, Health Research Institute of Santiago de Compostela (IDIS).
Santiago de Compostela, Spain."
},
{
name ml "Fernandez-Ferreiro A",
affil str "Pharmacy Department, University Clinical Hospital
Santiago de Compostela (CHUS). Santiago de Compostela, Spain.; Clinical
Pharmacology Group, University Clinical Hospital, Health Research Institute
of Santiago de Compostela (IDIS). Santiago de Compostela, Spain.; Department
of Pharmacology, Pharmacy and Pharmaceutical Technology, Faculty of Pharmacy,
University of Santiago de Compostela (USC). Santiago de Compostela, Spain."
}
}
},
from journal {
title {
iso-jta "Pharmacotherapy",
ml-jta "Pharmacotherapy",
issn "1875-9114",
name "Pharmacotherapy"
},
imp {
date std {
year 2019,
month 11,
day 29
},
language "eng",
pubstatus aheadofprint,
history {
{
pubstatus other,
date std {
year 2019,
month 11,
day 30,
hour 6,
minute 0
}
},
{
pubstatus pubmed,
date std {
year 2019,
month 11,
day 30,
hour 6,
minute 0
}
},
{
pubstatus medline,
date std {
year 2019,
month 11,
day 30,
hour 6,
minute 0
}
}
}
}
},
ids {
pubmed 31782536,
doi "10.1002/phar.2351",
other {
db "ELocationID doi",
tag str "10.1002/phar.2351"
}
}
},
abstract "BACKGROUND: Voriconazole, a first-line agent for the treatment
of invasive fungal infections, is mainly metabolized by cytochrome P450 (CYP)
2C19. A significant portion of patients fail to achieve therapeutic
voriconazole trough concentrations, with a consequently increased risk of
therapeutic failure. OBJECTIVE: To show the association between
subtherapeutic voriconazole concentrations and factors affecting voriconazole
pharmacokinetics: CYP2C19 genotype and drug-drug interactions. METHODS:
Adults receiving voriconazole for antifungal treatment or prophylaxis were
included in a multicenter prospective study conducted in Spain. The
prevalence of subtherapeutic voriconazole troughs were analyzed in the rapid
metabolizer and ultra-rapid metabolizer patients (RMs and UMs, respectively),
and compared with the rest of the patients. The relationship between
voriconazole concentration, CYP2C19 phenotype, adverse events (AEs), and
drug-drug interactions was also assessed. RESULTS: In this study 78 patients
were included with a wide variability in voriconazole plasma levels with only
44.8% of patients attaining trough concentrations within the therapeutic
range of 1 and 5.5 microg/ml. The allele frequency of *17 variant was found
to be 29.5%. Compared with patients with other phenotypes, RMs and UMs had a
lower voriconazole plasma concentration (RM/UM: 1.85+/-0.24 microg/ml versus
other phenotypes: 2.36+/-0.26 microg/ml, ). Adverse events were more common
in patients with higher voriconazole concentrations (p<0.05). No association
between voriconazole trough concentration and other factors (age, weight,
route of administration, and concomitant administration of enzyme inducer,
enzyme inhibitor, glucocorticoids, or proton pump inhibitors) was found.
CONCLUSION: These results suggest the potential clinical utility of using
CYP2C19 genotype-guided voriconazole dosing to achieve concentrations in the
therapeutic range in the early course of therapy. Larger studies are needed
to confirm the impact of pharmacogenetics on voriconazole pharmacokinetics.",
pmid 31782536,
pub-type {
"Journal Article"
},
status publisher
}
}
This is where I am stuck. I don't know how to extract the information from the asn1 and get it into a pyspark dataframe. Can anyone suggest a way to do this?
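Your question is probably not a simple one, but it is worth trying.
Approach 1:
As you have the specification, you can try looking for an ASN.1 tool (a.k.a. an ASN.1 compiler) that will create the data model. In your case, because you downloaded textual ASN.1 values, you need this tool to provide an ASN.1 value decoder.
If the tool generated Java code, it would look something like this: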
// decode a Pubmed-entry
// input is your data
Asn1ValueReader reader = new Asn1ValueReader(input);
PubmedEntry obj = PubmedEntry.readPdu(reader);
// access the data
obj.getPmid();
obj.getMedent();
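Some caveats:
- A tool that can do everything will not be free (if you can find one at all). The problem here is that you have a textual ASN.1 value, whereas tools generally provide binary decoders (BER, DER, etc.).
- You will need to write a lot of glue code to create the records that go into your pyspark dataframe.
I wrote this some time ago, but it does not have a textual ASN.1 value decoder.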
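To give an idea of the glue code mentioned in the caveats, here is a minimal Python sketch. It assumes that one entry has already been decoded, by whatever means, into nested dicts and lists whose keys mirror the field names in the example (`medent`, `cit`, `authors`, and so on). The helper names `dig` and `to_record`, and the choice of output columns, are hypothetical and only for illustration.

```python
def dig(node, *path, default=None):
    """Follow keys through nested dicts; return default as soon as one is missing."""
    for key in path:
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node


def to_record(entry):
    """Flatten one decoded Pubmed-entry (nested dicts/lists) into a flat dict."""
    cit = dig(entry, "medent", "cit", default={})
    authors = dig(cit, "authors", "names", "std", default=[])
    if not isinstance(authors, list):
        authors = []
    return {
        "pmid": dig(entry, "pmid"),
        "title": dig(cit, "title", "name"),
        "journal": dig(cit, "from", "journal", "title", "name"),
        "year": dig(cit, "from", "journal", "imp", "date", "std", "year"),
        "doi": dig(cit, "ids", "doi"),
        "authors": [dig(author, "name", "ml") for author in authors],
        "abstract": dig(entry, "medent", "abstract"),
    }


if __name__ == "__main__":
    # Hand-written stand-in for a decoded entry, trimmed from the example above.
    entry = {
        "pmid": 31782536,
        "medent": {
            "cit": {
                "title": {"name": "Impact of CYP2C19 genotype ..."},
                "authors": {"names": {"std": [
                    {"name": {"ml": "Blanco Dorado S"}},
                    {"name": {"ml": "Maronas O"}},
                ]}},
                "from": {"journal": {
                    "title": {"name": "Pharmacotherapy"},
                    "imp": {"date": {"std": {"year": 2019, "month": 11, "day": 29}}},
                }},
                "ids": {"doi": "10.1002/phar.2351"},
            },
            "abstract": "BACKGROUND: Voriconazole ...",
        },
    }
    print(to_record(entry))
```

A list of such records could then be handed to `spark.createDataFrame(records, schema)`, assuming a `SparkSession` named `spark`. Passing an explicit schema is probably safer than relying on inference, since any of these fields can come back as `None`.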
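Approach 2:
If your data is simple enough, and since it is textual, you can try writing your own parser (using a tool like ANTLR). This is not easy, so if you are not familiar with parsers, evaluate this approach carefully before committing to it.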
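As a rough illustration of the hand-written parser idea, the sketch below uses a small recursive-descent parser in plain Python rather than ANTLR. It is not a conforming ASN.1 value decoder: it only handles the constructs visible in the example entry (braces, commas, integers, quoted strings, and identifiers) and knows nothing about the NCBI schema. Two assumptions are made: a line break inside a quoted string is only wrapping and can be replaced by a single space, and a file may contain several `Type-name ::= { ... }` values one after another. The names `tokenize`, `Parser`, and `parse_entries` are made up for this example.

```python
import re

# One token is a quoted string ("" escapes a quote), "::=", a brace, a comma,
# an integer, or an identifier such as pmid or Pubmed-entry.
TOKEN_RE = re.compile(r'"(?:[^"]|"")*"|::=|[{},]|-?\d+|[A-Za-z][A-Za-z0-9-]*')


def tokenize(text):
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        tokens.append(match.group())
        pos = match.end()
    return tokens


class Parser:
    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self, expected=None):
        tok = self.peek()
        if tok is None or (expected is not None and tok != expected):
            raise ValueError(f"expected {expected or 'a token'}, got {tok!r}")
        self.pos += 1
        return tok

    def value(self):
        tok = self.take()
        if tok == "{":
            return self.block()
        if tok.startswith('"'):
            # Assumption: a line break inside a string is only wrapping.
            return re.sub(r"\s*\n\s*", " ", tok[1:-1].replace('""', '"'))
        if tok.lstrip("-").isdigit():
            return int(tok)
        raise ValueError(f"unexpected token {tok!r}")

    def item(self):
        # An item is zero or more identifiers followed by a value, e.g.
        # `pmid 31782536`, `name ml "Lamas MJ"`, `status publisher`, `{ ... }`.
        names = []
        while (tok := self.peek()) is not None and tok[0].isalpha():
            names.append(self.take())
        if self.peek() in (",", "}"):
            if not names:
                raise ValueError("empty item")
            value = names.pop()  # a trailing identifier is an enumerated value
        else:
            value = self.value()
        for name in reversed(names[1:]):  # `em std {...}` -> {"em": {"std": {...}}}
            value = {name: value}
        return (names[0] if names else None), value

    def block(self):
        # Called just after "{"; returns a list if no item is named, else a dict.
        items = []
        if self.peek() == "}":
            self.take()
            return items
        while True:
            items.append(self.item())
            tok = self.take()
            if tok == "}":
                break
            if tok != ",":
                raise ValueError(f"expected ',' or '}}', got {tok!r}")
        keys = [key for key, _ in items]
        if all(key is None for key in keys):
            return [val for _, val in items]
        if None in keys:
            raise ValueError("block mixes named and unnamed items")
        return dict(items)  # note: a repeated key keeps only its last value


def parse_entries(text):
    """Parse every `Type-name ::= { ... }` value in text into (type_name, data)."""
    parser, entries = Parser(text), []
    while parser.peek() is not None:
        type_name = parser.take()
        parser.take("::=")
        parser.take("{")
        entries.append((type_name, parser.block()))
    return entries


if __name__ == "__main__":
    sample = 'Pubmed-entry ::= { pmid 31782536, medent { status publisher } }'
    print(parse_entries(sample))
```

Named fields become dict keys, so `em std { ... }` ends up as `{"em": {"std": {...}}}`, while blocks of unnamed items such as the author list become Python lists. Because a dict keeps only the last value for a repeated key, a field that can legitimately occur more than once in the same block would need different handling.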
Edit:
Unfortunately, the specification is not valid.
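The data above is definitely in an "ASN.1 format". This format is called ASN.1 value notation and is used to represent ASN.1 values textually. (This format predates the standardization of the JSON encoding rules. Today, one could use JSON for the same purpose, although JSON is not handled in quite the same way as ASN.1 value notation.)
As YaFred pointed out, the ASN.1 schema that YaFred posted above contains some errors. The notation you posted yourself also seems to contain some errors. I looked at the whole set of ASN.1 files from NCBI and noticed that they contain several errors. So, unless they are fixed, they cannot be handled by a standards-conforming ASN.1 tool (such as the ASN.1 Playground). Some of these errors are easy to fix, but fixing others requires knowing the intent of the authors of those files. This situation is probably due to the NCBI project using its own ASN.1 toolkit, which may use ASN.1 in some non-standard way.
I would expect the NCBI toolkit to offer some way of decoding the value notation above, so if I were you I would look into that toolkit. I do not know the NCBI toolkit, so I cannot give you better advice.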