Not able to parse New York Times article using boilerpipe
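I am trying to fetch a news article from a New York Times URL, but I get no output at all, whereas any other newspaper I try works. I want to know whether something is wrong with my code or whether boilerpipe is simply unable to fetch it. Also, the output is sometimes not in English: it comes out as Unicode, mostly for 'daily news'. I would like to know the reason for that as well.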
import java.io.InputStream;
import java.net.URL;
import org.xml.sax.InputSource;
import de.l3s.boilerpipe.document.TextDocument;
import de.l3s.boilerpipe.extractors.ArticleExtractor;
import de.l3s.boilerpipe.extractors.DefaultExtractor;
import de.l3s.boilerpipe.sax.BoilerpipeSAXInput;
class ExtractData
{
    public static void main(final String[] args) throws Exception
    {
        URL url;
        url = new URL(
            "http://www.nytimes.com/2013/03/02/nyregion/us-judges-offer-addicts-a-way-to-avoid-prison.html?hp&_r=0");
        // NOTE We ignore HTTP-based character encoding in this demo...
        final InputStream urlStream = url.openStream();
        final InputSource is = new InputSource(urlStream);
        final BoilerpipeSAXInput in = new BoilerpipeSAXInput(is);
        final TextDocument doc = in.getTextDocument();
        urlStream.close();
        // You have the choice between different Extractors
        //System.out.println(DefaultExtractor.INSTANCE.getText(doc));
        System.out.println(ArticleExtractor.INSTANCE.getText(doc));
    }
}
Nytimes.com has a paywall, and it returns HTTP 303 for your request. You could try to handle the redirect and the cookies. Trying a different user agent string might also work.
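A rough sketch of what that could look like with the plain JDK, in case it helps. The class name `PageFetcher`, the helper `charsetOf` and the `USER_AGENT` value are made up for illustration, and I have not verified that nytimes.com actually serves the article this way. The idea is to install a `CookieManager` so cookies set during the redirect are sent back, send a browser-like User-Agent, and decode the body with the charset from the `Content-Type` header instead of ignoring it (which may also be behind the garbled non-English output).

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

class PageFetcher {
    // Hypothetical browser-like value; any realistic string may do.
    static final String USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64)";

    // Picks the charset out of a Content-Type header such as
    // "text/html; charset=ISO-8859-1". Falls back to UTF-8 (an assumption).
    static Charset charsetOf(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    String name = p.substring("charset=".length()).replace("\"", "").trim();
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        // Unknown or malformed charset name: use the fallback.
                    }
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    // Downloads the page as a String, keeping cookies across redirects.
    static String fetch(String address) throws IOException {
        // Store every cookie the server sets and send it back on later requests.
        CookieHandler.setDefault(new CookieManager(null, CookiePolicy.ACCEPT_ALL));
        HttpURLConnection conn = (HttpURLConnection) new URL(address).openConnection();
        conn.setInstanceFollowRedirects(true); // follows 3xx within the same protocol
        conn.setRequestProperty("User-Agent", USER_AGENT);
        try (InputStream in = conn.getInputStream();
             Reader reader = new InputStreamReader(in, charsetOf(conn.getContentType()))) {
            StringBuilder html = new StringBuilder();
            char[] buf = new char[8192];
            int n;
            while ((n = reader.read(buf)) != -1) {
                html.append(buf, 0, n);
            }
            return html.toString();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // Pass the article URL as the first argument.
        System.out.println(fetch(args[0]).length() + " characters downloaded");
    }
}
```

The returned HTML string can then be handed to boilerpipe with `ArticleExtractor.INSTANCE.getText(html)`. Note that `HttpURLConnection` does not follow a redirect that switches between http and https, so if that happens you would have to read the `Location` header and issue the second request yourself.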