How to create a data frame from multiple XML files containing the same structure?

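I have more than 1000 XML files that probably share the same structure. I want to build a single data set from the data in all of them. Until yesterday I had never seen what an XML file looks like. With Google's help, I tried loading a single XML file in RStudio using R packages, but converting it to a data frame raised an error.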

This is what the files look like. File A:

<?xml-stylesheet type='text/xsl' href='anzctrTransform.xsl'?>
<ANZCTR_Trial requestNumber="42">
  <stage>Registered</stage>
  <submitdate>19/07/2005</submitdate>
  <approvaldate>19/07/2005</approvaldate>
  <dateLastUpdated>14/12/2010</dateLastUpdated>
  <actrnumber>ACTRN12605000026628</actrnumber>
  <trial_identification>
    <studytitle>Phase II study of fixed dose rate Gemcitabine-Oxaliplatin Integrated with concomitant 5FU and 3-D Conformal Radiotherapy for the treatment of localised pancreatic cancer: GOFURTGO</studytitle>
    <scientifictitle>Phase II study of fixed dose rate Gemcitabine-Oxaliplatin Integrated with concomitant 5FU and 3-D Conformal Radiotherapy for the treatment of localised pancreatic cancer: GOFURTGO</scientifictitle>
    <utrn />
    <trialacronym>GOFURTGO</trialacronym>
    <secondaryid>GOFURTGO</secondaryid>
  </trial_identification>
  <conditions>
    <healthcondition>Locally advanced or locally recurrent inoperable pancreatic cancer not previously treated with chemotherapy or radiotherapy.</healthcondition>
    <conditioncode>
      <conditioncode1>Cancer</conditioncode1>
      <conditioncode2>Pancreatic</conditioncode2>
    </conditioncode>
  </conditions>
  <interventions>
    <interventions>All patients enrolled in the study will receive the same treatment consisting of all of the following:
a) 1 cycle of chemotherapy: the cycle is 28 days (gemcitabine on days 1 and 15 and oxaliplatin on days 2 and 16, followed by:
b)radiotherpay plus continuous 5FU infusion: 5FU is given continuously (7 days a week for 6 weeks), radiotherpay is given 5 days a week (Mon-Fri) for 6 weeks followed by:
c) 3 cycles of chemotherapy: each cycle is 28 days (gemcitabine on days 1 and 15 and oxaliplatin on days 2 and 16</interventions>
    <comparator>This is a single group trial</comparator>
    <control>Uncontrolled</control>
    <interventioncode>Treatment: Other</interventioncode>
  </interventions>
  <outcomes>
    <primaryOutcome>
      <outcome>The primary objective is to determine the proportions of patients starting and finishing greater than or equal to 80% of the planned dose on time for each component of the treatment.</outcome>
      <timepoint>The outcome will be measured once all patients have enrolled and have completeed the study treatment.</timepoint>
    </primaryOutcome>
    <secondaryOutcome>
      <outcome>Adverse events</outcome>
      <timepoint>Assessed at the end of ecah treatment cycle, and at end of treatment.</timepoint>
    </secondaryOutcome>
    <secondaryOutcome>
      <outcome>Objective tumour response rates</outcome>
      <timepoint>Before and after radiotherapy, at the end of treatment, and then as clinically indicated.</timepoint>
    </secondaryOutcome>
    <secondaryOutcome>
      <outcome>Time to progression</outcome>
      <timepoint>Before and after radiotherapy, at the end of treatment, and then as clinically indicated.</timepoint>
    </secondaryOutcome>
    <secondaryOutcome>
      <outcome>CA 19-9 response rates</outcome>
      <timepoint>Before and after radiotherapy, at the end of treatment, and then 2 monthly during follow up.</timepoint>
    </secondaryOutcome>
    <secondaryOutcome>
      <outcome>Health-related quality of life.</outcome>
      <timepoint>Before and after radiotherapy, at the end of treatment, and then 2 monthly until progression/disease recurrence.</timepoint>
    </secondaryOutcome>
  </outcomes>
  <eligibility>
    <inclusivecriteria>Patient must have histologically/cytologically proven adenocarcinoma of the pancreas located in the head or the body of the pancreas (primary) or in the pancreatic bed (locally recurrent).Locoregional disease must be confirmed by dual phase CT (arterial and portal phases) without distant metastases (confirmed by CT of the chest, abdomen and pelvis).Patients must be assessed by a surgeon and considered inoperable.Performance status must be ECOG grade 0, 1 or 2.</inclusivecriteria>
    <inclusiveminage>0</inclusiveminage>
    <inclusiveminagetype>Not stated</inclusiveminagetype>
    <inclusivemaxage>0</inclusivemaxage>
    <inclusivemaxagetype>Not stated</inclusivemaxagetype>
    <inclusivegender>Both males and females</inclusivegender>
    <healthyvolunteer>No</healthyvolunteer>
    <exclusivecriteria>1.Histological types other than pancreatic ductal adenocarcinoma
2. Metastatic disease.
3. Tumours of the tail of pancreas
4. Major co-morbid illnesses that, in the opinion of the investigator, would jeopardise the likely completion of the treatment program
5. Patients with peripheral sensory neuropathy with functional impairment.
6. Derangement of LFTs consistent with hepatic cellular dysfunction (ALT and/or AST &gt;3 times upper limit of normal), or a bilirubin &gt;3 times upper limit of normal. Patients with LFTs consistent with hepatic obstruction that is relieved (eg. by stenting, bypass) are eligible, provided the bilirubin has fallen to &lt;3 times upper limit of normal.
7. Patients with significant loss of bodyweight, who, at the investigator’s discretion, is deemed   not suitable for this study (eg.&gt;15% weight loss since surgery or diagnosis)
8. Treatment with a drug within the last 30 days that has not received regulatory approval at the time of study entry.
9. Treatment with any previous cytotoxic chemotherapy for this malignancy. Previous hormonal manipulation (including HRT) is allowed.
10. Previous abdominal radiotherapy
11. A previous history of malignancy other than non-melanomatous skin cancers, in –situ carcinoma, or patients who are disease–free from non-pancreatic tumours treated definitively more than 5 years ago.
12. Pregnant or lactating women, or women of childbearing potential not using adequate contraception.</exclusivecriteria>
  </eligibility>
  <trial_design>
    <studytype>Interventional</studytype>
    <purpose>Treatment</purpose>
    <allocation>Non-randomised trial</allocation>
    <concealment>Paper enrolment through the AGITG Coordinating Centre, NHMRC Clinical Trials Centre</concealment>
    <sequence>n/a</sequence>
    <masking>Open (masking not used)</masking>
    <assignment>Single group</assignment>
    <designfeatures />
    <endpoint>Safety</endpoint>
    <statisticalmethods />
    <masking1 />
    <masking2 />
    <masking3 />
    <masking4 />
    <patientregistry />
    <followup />
    <followuptype />
    <purposeobs />
    <duration />
    <selection />
    <timing />
  </trial_design>
  <recruitment>
    <phase>Phase 2</phase>
    <anticipatedstartdate>13/04/2005</anticipatedstartdate>
    <actualstartdate />
    <anticipatedenddate />
    <actualenddate />
    <samplesize>45</samplesize>
    <actualsamplesize />
    <currentsamplesize />
    <recruitmentstatus>Completed</recruitmentstatus>
    <anticipatedlastvisitdate />
    <actuallastvisitdate />
    <dataanalysis />
    <withdrawnreason />
    <withdrawnreasonother />
    <recruitmentcountry>Australia</recruitmentcountry>
    <recruitmentstate />
  </recruitment>
  <sponsorship>
    <primarysponsortype>Other Collaborative groups</primarysponsortype>
    <primarysponsorname>AGITG</primarysponsorname>
    <primarysponsoraddress>92-94 Parramatta Rd, Camperdown NSW 2050</primarysponsoraddress>
    <primarysponsorcountry>Australia</primarysponsorcountry>
    <fundingsource>
      <fundingtype>Commercial sector/Industry</fundingtype>
      <fundingname>Sanofi-Aventis</fundingname>
      <fundingaddress>Sanofi-Aventis Group 
Talavera Corporate Centre 
Building D 
12-24 Talavera Road 
Macquarie Park NSW 2113</fundingaddress>
      <fundingcountry>Australia</fundingcountry>
    </fundingsource>
    <fundingsource>
      <fundingtype>Other Collaborative groups</fundingtype>
      <fundingname>AGITG</fundingname>
      <fundingaddress>NHMRC Clinical Trials Centre
University of Sydney
Locked Bag 77
CAMPERDOWN NSW 1450</fundingaddress>
      <fundingcountry>Australia</fundingcountry>
    </fundingsource>
    <fundingsource>
      <fundingtype>University</fundingtype>
      <fundingname>CTC</fundingname>
      <fundingaddress>NHMRC Clinical Trials Centre
University of Sydney
Locked Bag 77
CAMPERDOWN NSW 1450</fundingaddress>
      <fundingcountry>Australia</fundingcountry>
    </fundingsource>
    <secondarysponsor>
      <sponsortype>Other Collaborative groups</sponsortype>
      <sponsorname>AGITG</sponsorname>
      <sponsoraddress>NHMRC Clinical Trials Centre
University of Sydney
Locked Bag 77
CAMPERDOWN NSW 1450</sponsoraddress>
      <sponsorcountry>Australia</sponsorcountry>
    </secondarysponsor>
  </sponsorship>
  <ethicsAndSummary>
    <summary />
    <trialwebsite />
    <publication />
    <ethicsreview>Approved</ethicsreview>
    <publicnotes />
    <ethicscommitee>
      <ethicname>University of Sydney</ethicname>
      <ethicaddress>Human Research Ethics Committee
Main Quad
University of Sydney NSW 2006</ethicaddress>
      <ethicapprovaldate />
      <hrec>11-2004/5/7779</hrec>
      <ethicsubmitdate />
      <ethiccountry>Australia</ethiccountry>
    </ethicscommitee>
    <ethicscommitee>
      <ethicname>Prince of Wales Hospital</ethicname>
      <ethicaddress />
      <ethicapprovaldate />
      <hrec />
      <ethicsubmitdate />
      <ethiccountry>Australia</ethiccountry>
    </ethicscommitee>
    <ethicscommitee>
      <ethicname>Border Medical Oncology</ethicname>
      <ethicaddress />
      <ethicapprovaldate />
      <hrec />
      <ethicsubmitdate />
      <ethiccountry>Australia</ethiccountry>
    </ethicscommitee>
    <ethicscommitee>
      <ethicname>St. George Hospital</ethicname>
      <ethicaddress />
      <ethicapprovaldate />
      <hrec />
      <ethicsubmitdate />
      <ethiccountry>Australia</ethiccountry>
    </ethicscommitee>
    <ethicscommitee>
      <ethicname>Newcastle Mater</ethicname>
      <ethicaddress />
      <ethicapprovaldate />
      <hrec />
      <ethicsubmitdate />
      <ethiccountry>Australia</ethiccountry>
    </ethicscommitee>
    <ethicscommitee>
      <ethicname>Alfred Hospital</ethicname>
      <ethicaddress />
      <ethicapprovaldate />
      <hrec />
      <ethicsubmitdate />
      <ethiccountry>Australia</ethiccountry>
    </ethicscommitee>
    <ethicscommitee>
      <ethicname>Nepean Hospital</ethicname>
      <ethicaddress />
      <ethicapprovaldate />
      <hrec />
      <ethicsubmitdate />
      <ethiccountry>Australia</ethiccountry>
    </ethicscommitee>
    <ethicscommitee>
      <ethicname>Royal Adelaide Hospital</ethicname>
      <ethicaddress />
      <ethicapprovaldate />
      <hrec />
      <ethicsubmitdate />
      <ethiccountry>Australia</ethiccountry>
    </ethicscommitee>
  </ethicsAndSummary>
  <attachment />
  <contacts>
    <contact>
      <title />
      <name>Dr David Goldstein</name>
      <address>Department of Medical Oncology
Prince of Wales Hospital
High Street
Randwick NSW 2031</address>
      <phone>+61 2 93822577</phone>
      <fax>+61 2 93822578</fax>
      <email>D.Goldstein@unsw.edu.au</email>
      <country>Australia</country>
      <type>Scientific Queries</type>
    </contact>
    <contact>
      <title />
      <name>Dr David Goldstein</name>
      <address>Department of Medical Oncology
Prince of Wales Hospital
High Street
Randwick NSW 2031</address>
      <phone>+61 2 93822577</phone>
      <fax>+61 2 93822578</fax>
      <email>D.Goldstein@unsw.edu.au</email>
      <country>Australia</country>
      <type>Public Queries</type>
    </contact>
    <contact>
      <title />
      <name />
      <address />
      <phone />
      <fax />
      <email />
      <country />
      <type>Principal Investigator</type>
    </contact>
  </contacts>
</ANZCTR_Trial>

File B:

<?xml-stylesheet type='text/xsl' href='anzctrTransform.xsl'?>
<ANZCTR_Trial requestNumber="6">
  <stage>Registered</stage>
  <submitdate>08/07/2005</submitdate>
  <approvaldate>08/07/2005</approvaldate>
  <dateLastUpdated>24/06/2010</dateLastUpdated>
  <actrnumber>ACTRN12605000003673</actrnumber>
  <trial_identification>
    <studytitle>Bisphosphonate and Anastrozole trial - Bone Maintenance Algorithm Assessment</studytitle>
    <scientifictitle>Maintaining skeletal health in postmenopausal women with surgically resected Stage I-IIIa hormone-receptor positive breast cancer who are receiving anastrozole, through the use of alendronate as determined by the Osteoporosis Australia Bone Maintenance Algorithm</scientifictitle>
    <utrn />
    <trialacronym>BATMAN</trialacronym>
    <secondaryid>Andrew Love Cancer Centre: ALCC 04.02</secondaryid>
  </trial_identification>
  <conditions>
    <healthcondition>Breast Cancer</healthcondition>
    <conditioncode>
      <conditioncode1>Cancer</conditioncode1>
      <conditioncode2>Breast</conditioncode2>
    </conditioncode>
  </conditions>
  <interventions>
    <interventions>This trial aims to assess the utility, through DEXA scans and biochemical markers of bone turnover, of a strategy of monitoring and intervention with oral alendronate in postmenopausal women with hormone-receptor positive breast cancer receiving five years of adjuvant anastrozole. It specifically addressed the issues of osteopaenic and osteoporotic women in this setting and will test three years versus five years of alendronate use.</interventions>
    <comparator>Five years of treatment with 70mg oral alendronate once weekly</comparator>
    <control>Active</control>
    <interventioncode>Treatment: Drugs</interventioncode>
  </interventions>
  <outcomes>
    <primaryOutcome>
      <outcome>Changes in lumbar vertebra and femoral neck BMD T-score after 5 years of anastrozole treatment</outcome>
      <timepoint>After 5 years of anastrozole treatment</timepoint>
    </primaryOutcome>
    <secondaryOutcome>
      <outcome>Percent change in the lumbar vertebrae</outcome>
      <timepoint>Annually for 5 years</timepoint>
    </secondaryOutcome>
    <secondaryOutcome>
      <outcome>Biochemical markers</outcome>
      <timepoint>6 months after commencing alendronate</timepoint>
    </secondaryOutcome>
    <secondaryOutcome>
      <outcome>Evaluate the Osteoporosis Australia strategy for bone protection for this patient group.</outcome>
      <timepoint>At 5 years</timepoint>
    </secondaryOutcome>
  </outcomes>
  <eligibility>
    <inclusivecriteria>Postmenopausal women- Adequately diagnosed and treated Stage I-IIIa early breast cancer- Oestrogen receptor and/or progesterone receptor positive breast cancer- Anastrozole is clinically indicated to be the best adjuvant strategy</inclusivecriteria>
    <inclusiveminage>18</inclusiveminage>
    <inclusiveminagetype>Years</inclusiveminagetype>
    <inclusivemaxage>0</inclusivemaxage>
    <inclusivemaxagetype>Not stated</inclusivemaxagetype>
    <inclusivegender>Females</inclusivegender>
    <healthyvolunteer>No</healthyvolunteer>
    <exclusivecriteria>Clinical or radiological evidence of distant spread- prior treatment with bisphosphonates within the past 12 months</exclusivecriteria>
  </eligibility>
  <trial_design>
    <studytype>Interventional</studytype>
    <purpose>Prevention</purpose>
    <allocation>Randomised controlled trial</allocation>
    <concealment>central randomisation via fax and phone</concealment>
    <sequence>Computer generated stratified blocks</sequence>
    <masking>Open (masking not used)</masking>
    <assignment>Parallel</assignment>
    <designfeatures />
    <endpoint>Efficacy</endpoint>
    <statisticalmethods />
    <masking1 />
    <masking2 />
    <masking3 />
    <masking4 />
    <patientregistry />
    <followup />
    <followuptype />
    <purposeobs />
    <duration />
    <selection />
    <timing />
  </trial_design>
  <recruitment>
    <phase>Phase 3</phase>
    <anticipatedstartdate>05/07/2005</anticipatedstartdate>
    <actualstartdate />
    <anticipatedenddate />
    <actualenddate />
    <samplesize>300</samplesize>
    <actualsamplesize />
    <currentsamplesize />
    <recruitmentstatus>Active, not recruiting</recruitmentstatus>
    <anticipatedlastvisitdate />
    <actuallastvisitdate />
    <dataanalysis />
    <withdrawnreason />
    <withdrawnreasonother />
    <recruitmentcountry>Australia</recruitmentcountry>
    <recruitmentstate />
  </recruitment>
  <sponsorship>
    <primarysponsortype>Hospital</primarysponsortype>
    <primarysponsorname>Barwon Health</primarysponsorname>
    <primarysponsoraddress>272-322 Ryrie Street, Geelong, Victoria 3220</primarysponsoraddress>
    <primarysponsorcountry>Australia</primarysponsorcountry>
    <fundingsource>
      <fundingtype>Commercial sector/Industry</fundingtype>
      <fundingname>Astra Zeneca</fundingname>
      <fundingaddress>P.O Box 131, North Ryde PBC NSW 1670</fundingaddress>
      <fundingcountry>Australia</fundingcountry>
    </fundingsource>
    <secondarysponsor>
      <sponsortype>None</sponsortype>
      <sponsorname>Nil</sponsorname>
      <sponsoraddress>Nil</sponsoraddress>
      <sponsorcountry />
    </secondarysponsor>
  </sponsorship>
  <ethicsAndSummary>
    <summary />
    <trialwebsite />
    <publication />
    <ethicsreview>Approved</ethicsreview>
    <publicnotes />
    <ethicscommitee>
      <ethicname>Barwon Health</ethicname>
      <ethicaddress />
      <ethicapprovaldate />
      <hrec />
      <ethicsubmitdate />
      <ethiccountry>Australia</ethiccountry>
    </ethicscommitee>
  </ethicsAndSummary>
  <attachment />
  <contacts>
    <contact>
      <title />
      <name>Associate Professor Richard Bell</name>
      <address>Andrew Love Cancer Centre
The Geelong Hospital
70 Swanston Street
Geelong VIC 3220</address>
      <phone>+61 3 52267855</phone>
      <fax>+61 3 52465168</fax>
      <email>richardb@barwonhealth.org.au</email>
      <country>Australia</country>
      <type>Scientific Queries</type>
    </contact>
    <contact>
      <title />
      <name>Ms Elaine Yeow</name>
      <address>Andrew Love Cancer Centre
The Geelong Hospital
70 Swanston Street
Geelong VIC 3220</address>
      <phone>+61 3 52267858</phone>
      <fax>+61 3 52465168</fax>
      <email>elainey@barwonhealth.org.au</email>
      <country>Australia</country>
      <type>Public Queries</type>
    </contact>
    <contact>
      <title />
      <name />
      <address />
      <phone />
      <fax />
      <email />
      <country />
      <type>Principal Investigator</type>
    </contact>
  </contacts>
</ANZCTR_Trial>

Here is my code.

library(XML)
library(xml2)
x =  read_xml("ACTRN12605000026628.xml")
print(x)

Attempt 1.

x_df = as.data.frame(x)
Error in as.data.frame.default(x) : 
  cannot coerce class ‘c("xml_document", "xml_node")’ to a data.frame

Attempt 2.

xmlToList(x)
Error in UseMethod("xmlSApply") : 
  no applicable method for 'xmlSApply' applied to an object of class "c('xml_document', 'xml_node')"

Attempt 3.

xmlToDataFrame(x)
Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function ‘xmlToDataFrame’ for signature ‘"xml_document", "missing", "missing", "missing", "missing"’

I need help understanding why these errors occur and how to convert the data from multiple files into a data frame or table in R.

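You cannot convert an XML file directly into a data frame. You need to extract the tags and the data inside them first, and then build the data frame from those.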
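As for why the three attempts fail: read_xml() comes from the xml2 package and returns an object of class xml_document, while xmlToList() and xmlToDataFrame() belong to the XML package and expect a document parsed by that package (for example with xmlParse()). as.data.frame() has no method for either kind of object. The sketch below illustrates the mismatch on a small inline stand-in document (the string xml_txt is made up for the example, not one of your files):

```r
library(XML)
library(xml2)

# Stand-in document for illustration only
xml_txt <- '<ANZCTR_Trial requestNumber="42"><stage>Registered</stage><samplesize>45</samplesize></ANZCTR_Trial>'

# Parsed with xml2: the XML package's functions do not know this class
x2 <- xml2::read_xml(xml_txt)
print(class(x2))

# Parsed with the XML package itself: xmlToList() accepts this object
x1 <- XML::xmlParse(xml_txt, asText = TRUE)
lst <- XML::xmlToList(x1)
print(lst[["stage"]])
```

Either package can do the job; the point is to stay within one package for both parsing and extraction.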

The following code solves the problem:

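library(xml2)  # every function used below comes from xml2

# Parse the file; at this point df is an XML document, not a data frame
df <- read_xml("1.xml")

# Select the parent (root) element
records <- xml_find_all(df, "//ANZCTR_Trial")
records

# Names and text of the tags directly under the parent
nodenames <- xml_name(xml_children(records))
nodevalues <- trimws(xml_text(xml_children(records)))

# One row, with one column per tag
trial_df <- as.data.frame(t(nodevalues), stringsAsFactors = FALSE)
colnames(trial_df) <- nodenames

write.csv(x = trial_df, file = 'trialData.csv', row.names = FALSE)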

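records contains all the tags and data inside the parent tag. In your case that is ANZCTR_Trial, in both files you shared in the question.

nodenames holds the names of the tags directly under that parent tag, and nodevalues holds their text. For a tag that has children of its own (contacts, for example), xml_text() returns the text of all its descendants joined into one string.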
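The question is about 1000+ files, while the code above reads a single one. A minimal sketch of one way to extend the same idea, assuming all files sit in one folder: read_trial() is a hypothetical helper, and the two tiny files written to a temporary folder are stand-ins for your real data. Columns missing from a file are padded with NA so that files with slightly different tags can still be stacked.

```r
library(xml2)

# Hypothetical helper: read one trial file and return a one-row data frame
# with one column per tag directly under the root element.
read_trial <- function(path) {
  doc  <- read_xml(path)
  kids <- xml_find_all(doc, "/*/*")           # direct children of the root
  vals <- as.list(trimws(xml_text(kids)))
  names(vals) <- make.unique(xml_name(kids))  # keep repeated tag names distinct
  as.data.frame(vals, stringsAsFactors = FALSE)
}

# Stand-in data: two small files in a temporary folder
xml_dir <- file.path(tempdir(), "trials_demo")
dir.create(xml_dir, showWarnings = FALSE)
writeLines('<ANZCTR_Trial requestNumber="42"><stage>Registered</stage><actrnumber>ACTRN12605000026628</actrnumber></ANZCTR_Trial>',
           file.path(xml_dir, "a.xml"))
writeLines('<ANZCTR_Trial requestNumber="6"><stage>Registered</stage><actrnumber>ACTRN12605000003673</actrnumber><attachment /></ANZCTR_Trial>',
           file.path(xml_dir, "b.xml"))

# Read every .xml file in the folder
files <- list.files(xml_dir, pattern = "\\.xml$", full.names = TRUE)
rows  <- lapply(files, read_trial)

# Pad missing columns with NA, then stack the one-row data frames
all_cols <- unique(unlist(lapply(rows, names)))
rows <- lapply(rows, function(r) {
  for (nm in setdiff(all_cols, names(r))) r[[nm]] <- NA_character_
  r[all_cols]
})
trials <- do.call(rbind, rows)
print(trials)
```

For the real data, point xml_dir at the folder holding your files and drop the two writeLines() calls.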

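To get data from tags nested more deeply (grandchildren of the parent, e.g. phone and fax inside each contact), change the XPath expression as follows:

records <- xml_find_all(df, "//contacts/contact")  # adjust the XPath to the tags you need
records

Everything else stays the same. Because each file holds several contact elements, the resulting column names (name, phone, fax, ...) repeat once per contact.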
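If one row per contact is more convenient than a single wide row, the nested tags can be pulled out field by field. This is only a sketch under assumptions: the inline document is a shortened stand-in for the contacts section of the files above, and field() is a hypothetical helper. xml_find_first() returns exactly one result per contact (a missing node when the tag is absent), so every column ends up with the same length.

```r
library(xml2)

# Shortened stand-in for the <contacts> section of a trial file
doc <- read_xml('<ANZCTR_Trial><contacts>
  <contact><name>Dr David Goldstein</name><phone>+61 2 93822577</phone><type>Scientific Queries</type></contact>
  <contact><name /><phone /><type>Principal Investigator</type></contact>
</contacts></ANZCTR_Trial>')

contacts <- xml_find_all(doc, "//contacts/contact")

# Hypothetical helper: text of the first child named `tag` in each contact
field <- function(tag) trimws(xml_text(xml_find_first(contacts, tag)))

contact_df <- data.frame(
  name  = field("name"),
  phone = field("phone"),
  type  = field("type"),
  stringsAsFactors = FALSE
)
print(contact_df)
```

The same pattern should carry over to the other repeated sections, such as secondaryOutcome, fundingsource and ethicscommitee.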