extract tagged entities from within <p> elements
My dataset has the following structure:
<p>The <ORGANIZATION>Peter Hall Company</ORGANIZATION>'s production of ''Blithe Spirit,'' directed by <PERSON>Thea Sharrock</PERSON>, is one of those attractively and unimaginatively upholstered productions of brittle classics that become must-have middlebrow tickets every few years. Most notable for <PERSON>Penelope Keith</PERSON>'s startlingly brisk and no-nonsense interpretation of the madcap medium <ORGANIZATION>Madame Arcati</ORGANIZATION>, Ms. <PERSON>Sharrock</PERSON>'s take on <PERSON>Coward</PERSON>'s 1941 comedy of a man visited by his dead wife's impish spirit delivers bright badinage, dazed double takes and marital melees at the same efficient clip.</p>
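As you can see, there are multiple tagged entities between the <p> and </p> tags, such as <ORGANIZATION>Peter Hall Company</ORGANIZATION> and <PERSON>Penelope Keith</PERSON>. I want to use jsoup to list all the entities contained in the <p> tags.
I assume jsoup should be able to handle this. I have seen a few questions related to this specific case, but I could not get their answers to work for me, possibly because <ORGANIZATION> and <PERSON> are not real HTML tags? Do I have to use regular expressions for those? If I can do it with jsoup, how?
Here is what I have tried so far: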
for (Iterator<Element> iterator = contents.iterator(); iterator.hasNext();)
{
    Element content = iterator.next();
    String text = content.text();
    String title = content.select("PERSON").text();
    String output = text.replaceFirst(title, "").trim();
    System.out.println(output);
}
And this:
for (Element content : contents)
{
    String PERSON = content.attr("PERSON");
    String linkText = content.text();
    //print
    System.out.println(PERSON);
    System.out.println(linkText);
}
Neither of them works.
This works, but it is not elegant:
//people
Elements contents_person = doc.getElementsByTag("p").select("PERSON");
for (Element content : contents_person)
{
    //String PERSON = content.attr("PERSON");
    String linkText = content.text();
    //print
    //System.out.println(PERSON);
    System.out.println(linkText);
}
//places
Elements contents_place = doc.getElementsByTag("p").select("LOCATION");
for (Element content : contents_place)
{
    //String PERSON = content.attr("PERSON");
    String linkText = content.text();
    //print
    //System.out.println(PERSON);
    System.out.println(linkText);
}
//things
Elements contents_things = doc.getElementsByTag("p").select("ORGANIZATION");
for (Element content : contents_things)
{
    //String PERSON = content.attr("PERSON");
    String linkText = content.text();
    //print
    //System.out.println(PERSON);
    System.out.println(linkText);
}
You just need a CSS selector for this:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class Foo {
    public static void main(String... args) {
        String xml = "<p>The <ORGANIZATION>Peter Hall Company</ORGANIZATION>'s production of ''Blithe Spirit,'' directed by <PERSON>Thea Sharrock</PERSON>, is one of those attractively and unimaginatively upholstered productions of brittle classics that become must-have middlebrow tickets every few years. Most notable for <PERSON>Penelope Keith</PERSON>'s startlingly brisk and no-nonsense interpretation of the madcap medium <ORGANIZATION>Madame Arcati</ORGANIZATION>, Ms. <PERSON>Sharrock</PERSON>'s take on <PERSON>Coward</PERSON>'s 1941 comedy of a man visited by his dead wife's impish spirit delivers bright badinage, dazed double takes and marital melees at the same efficient clip.</p>";
        // The HTML parser keeps unknown tags such as <PERSON> as ordinary elements
        Document doc = Jsoup.parse(xml);
        // "p > X" matches X elements that are direct children of a <p>; the comma combines both selectors
        for (Element e : doc.select("p > ORGANIZATION, p > PERSON")) {
            System.out.printf("-> %s: %s\n", e.tagName(), e.text());
        }
    }
}
Output:
-> organization: Peter Hall Company
-> person: Thea Sharrock
-> person: Penelope Keith
-> organization: Madame Arcati
-> person: Sharrock
-> person: Coward
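Edit: if you want to strip these tags and keep only their content, you can replace each element with its text content while iterating over the elements, like this: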
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

public class Foo {
    public static void main(String... args) {
        String xml = "<p>The <ORGANIZATION>Peter Hall Company</ORGANIZATION>'s production of ''Blithe Spirit,'' directed by <PERSON>Thea Sharrock</PERSON>, is one of those attractively and unimaginatively upholstered productions of brittle classics that become must-have middlebrow tickets every few years. Most notable for <PERSON>Penelope Keith</PERSON>'s startlingly brisk and no-nonsense interpretation of the madcap medium <ORGANIZATION>Madame Arcati</ORGANIZATION>, Ms. <PERSON>Sharrock</PERSON>'s take on <PERSON>Coward</PERSON>'s 1941 comedy of a man visited by his dead wife's impish spirit delivers bright badinage, dazed double takes and marital melees at the same efficient clip.</p>";
        Document doc = Jsoup.parse(xml);
        for (Element e : doc.select("p > ORGANIZATION, p > PERSON")) {
            System.out.printf("-> %s: %s\n", e.tagName(), e.text());
            // Swap the element for a plain text node holding its text.
            // Older jsoup releases only offer the two-argument form: new TextNode(e.text(), "")
            e.replaceWith(new TextNode(e.text()));
        }
        System.out.println("\nFiltered out:\n" + doc.select("p").html());
    }
}
Output:
-> organization: Peter Hall Company
-> person: Thea Sharrock
-> person: Penelope Keith
-> organization: Madame Arcati
-> person: Sharrock
-> person: Coward
Filtered out:
The Peter Hall Company's production of ''Blithe Spirit,'' directed by Thea Sharrock, is one of those attractively and unimaginatively upholstered productions of brittle classics that become must-have middlebrow tickets every few years. Most notable for Penelope Keith's startlingly brisk and no-nonsense interpretation of the madcap medium Madame Arcati, Ms. Sharrock's take on Coward's 1941 comedy of a man visited by his dead wife's impish spirit delivers bright badinage, dazed double takes and marital melees at the same efficient clip.
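On the regex part of the question: for flat tags like these, a regular expression over the raw string can also pull out the entities, although a parser such as jsoup is generally the more robust choice. The following is only a minimal sketch using the standard library, assuming the tags are limited to PERSON, ORGANIZATION and LOCATION, carry no attributes, and are never nested. The class name `EntityRegexSketch` and the method `extract` are made up for illustration and are not part of jsoup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not a jsoup API): lists tagged entities with a regex.
final class EntityRegexSketch {

    // Group 1 is the tag name, group 2 the enclosed text.
    // The back-reference \1 requires the closing tag to match the opening one.
    // Assumption: only these three tag names occur, without attributes or nesting.
    private static final Pattern ENTITY =
            Pattern.compile("<(PERSON|ORGANIZATION|LOCATION)>(.*?)</\\1>", Pattern.DOTALL);

    // Returns one "TAG: text" string per entity, in document order.
    static List<String> extract(String markup) {
        List<String> entities = new ArrayList<>();
        Matcher m = ENTITY.matcher(markup);
        while (m.find()) {
            entities.add(m.group(1) + ": " + m.group(2).trim());
        }
        return entities;
    }

    public static void main(String[] args) {
        String sample = "<p>The <ORGANIZATION>Peter Hall Company</ORGANIZATION>'s production, "
                + "directed by <PERSON>Thea Sharrock</PERSON>.</p>";
        for (String entity : extract(sample)) {
            System.out.println("-> " + entity);
        }
    }
}
```

Unlike the jsoup version, this keeps the tag names in their original upper case and does not check that the entity sits inside a <p>; it would also break on nested or malformed tags, which is where the selector approach is the safer option.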