Why does matcher.find() not return any result? Why does it freeze?
I am building an email scraper. With one particular URL, matcher.find() never returns a boolean result; as far as I can tell, it freezes. For some other URLs the code works fine.
Here is my code:
private Matcher matcher;
private Pattern pattern = null;
// Backslashes must be doubled inside a Java string literal.
private final String emailPattern = "([\\w\\-]([\\.\\w])+[\\w]+@([\\w\\-]+\\.)+[A-Za-z]{2,4})";

public void scrape() {
    pattern = Pattern.compile(emailPattern);
    Document documentTwo = null;
    try {
        documentTwo = Jsoup.connect("https://www.mercurynews.com/2020/03/21/how-can-i-get-tested-for-covid-19-in-the-bay-area/")
                .ignoreHttpErrors(true)
                .userAgent(RandomUserAgent.getRandomUserAgent())
                .header("Content-Language", "en-US")
                .get();
    } catch (IOException ex) {
        return; // nothing to scan if the page could not be fetched
    }
    String pageBody = documentTwo.toString();
    matcher = pattern.matcher(pageBody);
    while (matcher.find()) {
        // this will never execute for the above web address
    }
}
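To check, I added System.out.println(matcher.find()); above the while loop, and it gets stuck there without printing any value. So what am I doing wrong here? I have tried many different email regex patterns, but the one above is the one that has worked for me.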
There is a problem with your regular expression. Given below is the code with a working regex:
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class Main {
    public static void main(String[] args) {
        Document documentTwo = null;
        try {
            documentTwo = Jsoup
                    .connect(
                            "https://www.mercurynews.com/2020/03/21/how-can-i-get-tested-for-covid-19-in-the-bay-area/")
                    .header("Content-Language", "en-US").get();
        } catch (IOException e) {
            e.printStackTrace();
            return; // avoid a NullPointerException below when the fetch fails
        }
        String pageBody = documentTwo.toString();
        // Every quantifier is bounded; backslashes are doubled for the Java string literal.
        Pattern pattern = Pattern.compile(
                "([a-zA-Z0-9\\+\\.\\_\\%\\-\\+]{1,256}\\@[a-zA-Z0-9][a-zA-Z0-9\\-]{0,64}(\\.[a-zA-Z0-9][a-zA-Z0-9\\-]{0,25})+)");
        Matcher matcher = pattern.matcher(pageBody);
        while (matcher.find()) {
            System.out.println(matcher.group());
        }
    }
}
Output:
lkrieger@bayareanewsgroup.com
lkrieger@bayareanewsgroup.com
fkelliher@bayareanewsgroup.com
lkrieger@bayareanewsgroup.com
lkrieger@bayareanewsgroup.com
fkelliher@bayareanewsgroup.com
fkelliher@bayareanewsgroup.com
lkrieger@bayareanewsgroup.com
lkrieger@bayareanewsgroup.com
fkelliher@bayareanewsgroup.com
fkelliher@bayareanewsgroup.com
lkrieger@bayareanewsgroup.com
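A likely reason the pattern in the question appears to hang (an assumption, not something verified against that page): `([\.\w])+` and `[\w]+` can both consume the same characters, so on a long run of word characters with no `@` the engine retries every way of splitting the run between them. The sketch below makes that visible by counting how many characters the regex engine reads for each pattern. `BacktrackingDemo`, `CountingSequence`, and `countReads` are hypothetical names introduced only for this illustration.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical demo class; none of these names come from the original post. */
class BacktrackingDemo {

    // The pattern from the question, with Java string escapes in place.
    static final String ORIGINAL =
            "([\\w\\-]([\\.\\w])+[\\w]+@([\\w\\-]+\\.)+[A-Za-z]{2,4})";

    // The pattern from the answer; its local part is capped at 256 characters.
    static final String BOUNDED =
            "([a-zA-Z0-9\\+\\.\\_\\%\\-\\+]{1,256}\\@[a-zA-Z0-9][a-zA-Z0-9\\-]{0,64}"
            + "(\\.[a-zA-Z0-9][a-zA-Z0-9\\-]{0,25})+)";

    /** Wraps a String and counts how often the regex engine reads a character. */
    static final class CountingSequence implements CharSequence {
        private final String text;
        long reads = 0;

        CountingSequence(String text) {
            this.text = text;
        }

        @Override
        public char charAt(int index) {
            reads++;
            return text.charAt(index);
        }

        @Override
        public int length() {
            return text.length();
        }

        @Override
        public CharSequence subSequence(int start, int end) {
            return text.subSequence(start, end);
        }

        @Override
        public String toString() {
            return text;
        }
    }

    /** Runs find() once over a run of n letters with no '@' and returns the read count. */
    static long countReads(String regex, int n) {
        char[] run = new char[n];
        Arrays.fill(run, 'a');
        CountingSequence input = new CountingSequence(new String(run));
        Matcher matcher = Pattern.compile(regex).matcher(input);
        if (matcher.find()) {
            // The input contains no '@', so neither pattern should match.
            throw new IllegalStateException("unexpected match");
        }
        return input.reads;
    }

    public static void main(String[] args) {
        for (int n : new int[] {100, 200, 400}) {
            System.out.printf("n=%d original=%d bounded=%d%n",
                    n, countReads(ORIGINAL, n), countReads(BOUNDED, n));
        }
    }
}
```

If the assumption holds, the count for the original pattern should grow far faster than the input length as n doubles, while the bounded pattern stays much cheaper. A page containing runs of thousands of word characters, such as inline scripts or encoded data, would then be enough to make find() look frozen. Bounding each quantifier, as the working pattern does, is one way to limit that cost.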