Extract text around the reference number in a research document
I want to extract the text surrounding a reference number.
For example, given this text:
The sociological assumption is a constraint on the trust in the underlying social graph: the graph needs to have strong trust as evidenced,
for example, by face to face interaction demonstrating social nodes
knowledge of each other [10, 11]. While the first assumption has
been questioned recently in [8], where it is shown that even the honest subgraph may have some cuts that disrupt the algorithmic property on which Sybil defenses are based, the trust, though being a crucial requirement for these designs, was not considered carefully. Even worse, these defense [10, 11, 2, 4], when verified against real-world networks, have considered samples of online social graphs, which are known to possess weaker value of trust.
Here I want to extract the text that cites reference [8], and likewise the text that cites [10], [11], [2] and [4].
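You haven't actually given an example of the output you want to collect, nor any code showing what you have tried.
Anyway, if I assume you want all the text that precedes each citation, then your regular expression would be something like: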
(.*?)\[(.*?)\]
This captures two groups: the first is the text and the second is the citation(s). You can apply it to your text with the following code:
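import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Backslashes are doubled because the regex lives in a Java string literal.
private static final Pattern pattern = Pattern.compile("(.*?)\\[(.*?)\\]");

public static void extract(String input) {
    Matcher matcher = pattern.matcher(input);
    // Each match pairs the text before a bracket with the bracket's contents.
    while (matcher.find()) {
        String text = matcher.group(1);
        String citation = matcher.group(2);
        System.out.println("The text: '" + text + "'\n\thas citation(s): " + citation);
    }
}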
For the input you provided, this collects the following:
The text: 'The sociological assumption is a constraint on the trust in the underlying social graph: the graph needs to have strong trust as evidenced, for example, by face to face interaction demonstrating social nodes knowledge of each other '
has citation(s): 10, 11
The text: '. While the first assumption has been questioned recently in '
has citation(s): 8
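One caveat if the input keeps the line breaks shown in the question: by default `.` does not match line terminators, so `(.*?)` would only capture text on the same line as the citation. A minimal sketch of one way around that, assuming the Pattern.DOTALL flag is acceptable and that collapsing whitespace is wanted (the class name MultiLineExtract and the returned pair layout are made up for illustration):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MultiLineExtract {
    // DOTALL lets '.' match line terminators, so the text group can span lines.
    private static final Pattern PATTERN =
            Pattern.compile("(.*?)\\[(.*?)\\]", Pattern.DOTALL);

    /** Returns one {text, citations} pair per bracketed citation group. */
    static List<String[]> extract(String input) {
        List<String[]> pairs = new ArrayList<>();
        Matcher matcher = PATTERN.matcher(input);
        while (matcher.find()) {
            // Collapse line breaks and runs of whitespace into single spaces.
            String text = matcher.group(1).replaceAll("\\s+", " ");
            pairs.add(new String[] {text, matcher.group(2)});
        }
        return pairs;
    }

    public static void main(String[] args) {
        String input = "social nodes\nknowledge of each other [10, 11]. While questioned in [8], more text.";
        for (String[] pair : extract(input)) {
            System.out.println("The text: '" + pair[0] + "'\n\thas citation(s): " + pair[1]);
        }
    }
}
```

Text after the last citation is never reported, since the pattern only matches up to a closing bracket.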
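After reading your comment, it seems you want to find the citations that appear in any given sentence. Since sentences end with a period and may contain more than one citation, you need to process the text in two steps: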
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public static void main(String[] args) {
    String input = "...";
    List<CitedSentence> citations = new ArrayList<CitedSentence>();
    // Step 1: split into sentences; step 2: collect the citations in each one.
    for (String sentence : convertToSentences(input)) {
        citations.addAll(findCitations(sentence));
    }
    for (CitedSentence citation : citations) {
        System.out.println(citation);
    }
}

public static String[] convertToSentences(String input) {
    // Split on periods and swallow the whitespace around them.
    // Backslashes are doubled because the regex lives in a Java string literal.
    return input.split("\\s*\\.\\s*");
}

private static final Pattern pattern = Pattern.compile("\\[(.*?)\\]");

public static List<CitedSentence> findCitations(String sentence) {
    Matcher matcher = pattern.matcher(sentence);
    List<CitedSentence> result = new ArrayList<CitedSentence>();
    while (matcher.find()) {
        String citation = matcher.group(1);
        // A bracket such as [10, 11] yields one entry per reference number.
        for (String currentCitation : citation.split(",")) {
            result.add(new CitedSentence(sentence, currentCitation.trim()));
        }
    }
    return result;
}

static class CitedSentence {
    String sentence, citation;

    public CitedSentence(String sentence, String citation) {
        this.sentence = sentence;
        this.citation = citation;
    }

    @Override
    public String toString() {
        return "[" + citation + "]: " + sentence;
    }
}
When I run this, it produces the following:
[10]: The sociological assumption is a constraint on the trust in the underlying social graph: the graph needs to have strong trust as evidenced, for example, by face to face interaction demonstrating social nodes knowledge of each other [10, 11]
[11]: The sociological assumption is a constraint on the trust in the underlying social graph: the graph needs to have strong trust as evidenced, for example, by face to face interaction demonstrating social nodes knowledge of each other [10, 11]
[8]: While the first assumption has been questioned recently in [8], where it is shown that even the Copyright is held by the author/owner(s)
I only used part of the sample text.
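If the goal is to look up every sentence that cites a given reference number, the two steps above could also feed a map keyed by that number. The following is only a sketch under a few assumptions: sentences end in '.', '?' or '!' followed by whitespace, citations are bracketed lists of digits, and the names CitationIndex and index are invented for this example. The lookbehind split keeps the sentence terminator, unlike splitting on the period itself.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CitationIndex {
    // Split at whitespace that follows '.', '?' or '!'; the lookbehind keeps the terminator.
    private static final Pattern SENTENCE_BREAK = Pattern.compile("(?<=[.?!])\\s+");
    // A bracketed list of numbers such as [8] or [10, 11, 2, 4].
    private static final Pattern CITATION = Pattern.compile("\\[([0-9,\\s]+)\\]");

    /** Maps each reference number to the sentences citing it, in order of first appearance. */
    static Map<String, List<String>> index(String input) {
        Map<String, List<String>> byCitation = new LinkedHashMap<>();
        for (String sentence : SENTENCE_BREAK.split(input.trim())) {
            Matcher matcher = CITATION.matcher(sentence);
            while (matcher.find()) {
                for (String number : matcher.group(1).split(",")) {
                    String key = number.trim();
                    if (key.isEmpty()) {
                        continue; // tolerate stray commas such as [10, ]
                    }
                    List<String> sentences =
                            byCitation.computeIfAbsent(key, k -> new ArrayList<>());
                    if (!sentences.contains(sentence)) {
                        sentences.add(sentence); // list a sentence once per reference
                    }
                }
            }
        }
        return byCitation;
    }

    public static void main(String[] args) {
        String input = "A claim [10, 11]. Questioned in [8], yet ignored. "
                + "These defenses [10, 2] use samples.";
        index(input).forEach((number, sentences) ->
                System.out.println("[" + number + "]: " + sentences));
    }
}
```

Abbreviations such as "e.g. " or "et al. " would still be treated as sentence ends by this split; a locale-aware splitter such as java.text.BreakIterator may do better on real papers.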
This should work:
(^.|[.?!])+([^?!.]{0,})\[[0-9, ]{0,}?\]
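To try this expression from Java, the backslashes have to be doubled inside the string literal. A small sketch follows; the class name SentencePrefixRegex and the helper matches are hypothetical, and only the regex itself comes from the answer:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SentencePrefixRegex {
    // The expression from the answer, with backslashes doubled for a Java string literal.
    private static final Pattern PATTERN =
            Pattern.compile("(^.|[.?!])+([^?!.]{0,})\\[[0-9, ]{0,}?\\]");

    /** Returns every full match, i.e. the text from a sentence boundary through a citation. */
    static List<String> matches(String input) {
        List<String> found = new ArrayList<>();
        Matcher matcher = PATTERN.matcher(input);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "One fact [1]. Two in [2], rest. Three [3, 4] end.";
        for (String match : matches(input)) {
            System.out.println("'" + match + "'");
        }
    }
}
```

Two things to be aware of: every match after the first starts with the terminator of the previous sentence, and because the middle group is greedy, a sentence holding several citations yields a single match that runs through to its last citation.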