Text mining with the tm.plugin.webmining package using the GoogleFinanceSource function
I am learning text mining from the online book at http://tidytextmining.com/.
In chapter 5:
http://tidytextmining.com/dtm.html#financial
the following code:
library(tm.plugin.webmining)
library(purrr)
library(dplyr)  # supplies data_frame(), mutate() and %>%

company <- c("Microsoft", "Apple", "Google", "Amazon", "Facebook",
             "Twitter", "IBM", "Yahoo", "Netflix")
symbol <- c("MSFT", "AAPL", "GOOG", "AMZN", "FB", "TWTR", "IBM", "YHOO", "NFLX")

# Download the Google Finance articles for one ticker symbol
download_articles <- function(symbol) {
  WebCorpus(GoogleFinanceSource(paste0("NASDAQ:", symbol)))
}

stock_articles <- data_frame(company = company,
                             symbol = symbol) %>%
  mutate(corpus = map(symbol, download_articles))
gives me this error:
StartTag: invalid element name
Extra content at the end of the document
Error: 1: StartTag: invalid element name
2: Extra content at the end of the document
Any hints?
Someone suggested removing the company and symbol for "Twitter", but that does not help: it returns the same error.
Many thanks.
I ran into the same problem, but I have narrowed it down a little. This snippet produces the same error:
GoogleFinanceSource("NASDAQ:MSFT")
StartTag: invalid element name
Extra content at the end of the document
Error: 1: StartTag: invalid element name
2: Extra content at the end of the document
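I have also seen others suggest removing Twitter. I understand why it would fail, since Twitter is not listed on NASDAQ. However, I tried the suggested "NYSE:TWTR" and got the same result.
I tried GoogleNewsSource to see whether I would hit the same problem and got a different error, which this GitHub issue suggests is caused by the parser. I wonder whether the two problems are related. github.com/mannau/tm.plugin.webmining/issues/14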
GoogleNewsSource("Microsoft")
Unknown IO error failed to load external entity "http://news.google.com/news?hl=en&q=Microsoft&ie=utf-8&num=100&output=rss"
Error: 1: Unknown IO error2: failed to load external entity "http://news.google.com/news?hl=en&q=Microsoft&ie=utf-8&num=100&output=rss"
In summary, I found a workaround that uses a shortened company list and YahooFinanceSource, as follows:
company <- c("Microsoft", "Apple", "Google")
symbol <- c("MSFT", "AAPL", "GOOG")

download_articles <- function(symbol) {
  WebCorpus(YahooFinanceSource(symbol))
}

stock_articles <- data_frame(company = company,
                             symbol = symbol) %>%
  mutate(corpus = map(symbol, download_articles))
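Because any single feed can fail with a parser error, it may help to guard the download so that one failure does not abort the whole map() call. Below is a minimal sketch in base R. The names safe_download and fake_download are hypothetical and introduced only for illustration; fake_download stands in for download_articles so that the snippet runs without network access.

```r
# Wrap a downloader so that an error yields NULL (plus a message) instead of
# aborting the whole loop. `downloader` is any function of one symbol.
safe_download <- function(downloader) {
  function(symbol) {
    tryCatch(
      downloader(symbol),
      error = function(e) {
        message("Download failed for ", symbol, ": ", conditionMessage(e))
        NULL
      }
    )
  }
}

# Hypothetical stand-in for download_articles(), used only so this sketch
# runs offline: it fails for one symbol and returns a dummy value otherwise.
fake_download <- function(symbol) {
  if (symbol == "GOOG") stop("StartTag: invalid element name")
  paste("corpus for", symbol)
}

symbols <- c("MSFT", "AAPL", "GOOG")
results <- lapply(symbols, safe_download(fake_download))
names(results) <- symbols

# Keep only the symbols whose download succeeded.
ok <- !vapply(results, is.null, logical(1))
results[ok]
```

In the real pipeline one would pass safe_download(download_articles) to map(); purrr::possibly(download_articles, otherwise = NULL) is a similar ready-made wrapper.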
In the line of code below, try changing the default ie = "utf-8" to ie = "ansi". Apply that to your script and it should work.
WebCorpus(GoogleFinanceSource("NASDAQ:MSFT", params = list(hl = "en", q = "NASDAQ:MSFT", ie = "ansi", start = 0, num = 20, output = "rss")))
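The problem is that the tm.plugin.webmining package is out of date. At the time of this reply, only YahooFinanceSource and YahooNewsSource are still alive.
Here is a quick reference and test.
According to the vignette page written by the author, there should be 8 possible sources:
- GoogleBlogSearchSource
- GoogleFinanceSource
- GoogleNewsSource
- NYTimesSource
- ReutersNewsSource
- YahooFinanceSource
- YahooInplaySource
- YahooNewsSource
However, according to the GitHub page, the first one, GoogleBlogSearchSource, has been confirmed as discontinued. For the remaining 7 sources, I ran a simple test to see whether they still work: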
library(tm)
library(tm.plugin.webmining)
googlefinance <- WebCorpus(GoogleFinanceSource("A"))
googlenews <- WebCorpus(GoogleNewsSource("A"))
nytimes <- WebCorpus(NYTimesSource("A", appid = nytimes_appid))
reutersnews <- WebCorpus(ReutersNewsSource("A"))
yahoofinance <- WebCorpus(YahooFinanceSource("A"))
yahooinplay <- WebCorpus(YahooInplaySource())
yahoonews <- WebCorpus(YahooNewsSource("M"))
The results show that all of the Yahoo sources are technically still running, but YahooInplaySource returns 0 documents no matter which parameters I choose.
> googlefinance <- WebCorpus(GoogleFinanceSource("NASDAQ:MSFT"))
StartTag: invalid element name
Extra content at the end of the document
Error in inherits(x, "WebSource") : 1: StartTag: invalid element name
2: Extra content at the end of the document
> googlefinance <- WebCorpus(GoogleFinanceSource("A"))
StartTag: invalid element name
Extra content at the end of the document
Error in inherits(x, "WebSource") : 1: StartTag: invalid element name
2: Extra content at the end of the document
> googlenews <- WebCorpus(GoogleNewsSource("A"))
Unknown IO errorfailed to load external entity "http://news.google.com/news?hl=en&q=A&ie=utf-8&num=100&output=rss"
Error in inherits(x, "WebSource") :
1: Unknown IO error2: failed to load external entity "http://news.google.com/news?hl=en&q=A&ie=utf-8&num=100&output=rss"
> nytimes <- WebCorpus(NYTimesSource("A", appid = nytimes_appid))
Error in inherits(x, "WebSource") : object 'nytimes_appid' not found
> reutersnews <- WebCorpus(ReutersNewsSource("A"))
Entity 'ldquo' not defined
Entity 'rdquo' not defined
Opening and ending tag mismatch: div line 60 and body
Opening and ending tag mismatch: body line 59 and html
Premature end of data in tag html line 1
Error in inherits(x, "WebSource") : 1: Entity 'ldquo' not defined
2: Entity 'rdquo' not defined
3: Opening and ending tag mismatch: div line 60 and body
4: Opening and ending tag mismatch: body line 59 and html
5: Premature end of data in tag html line 1
> yahoofinance <- WebCorpus(YahooFinanceSource("A"))
> yahoofinance
<<WebCorpus>>
Metadata: corpus specific: 3, document level (indexed): 0
Content: documents: 16
> yahooinplay <- WebCorpus(YahooInplaySource())
> yahooinplay
<<WebCorpus>>
Metadata: corpus specific: 3, document level (indexed): 0
Content: documents: 0
> yahoonews <- WebCorpus(YahooNewsSource("A"))
> yahoonews
<<WebCorpus>>
Metadata: corpus specific: 3, document level (indexed): 0
Content: documents: 0
> yahoonews <- WebCorpus(YahooNewsSource("M"))
> yahoonews
<<WebCorpus>>
Metadata: corpus specific: 3, document level (indexed): 0
Content: documents: 10
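The manual test above could also be collected into one table. The following is only a sketch: probe_source and the three stand-in constructors are hypothetical names introduced here so that the code runs offline, and it assumes that length() on a corpus gives its document count.

```r
# Run one source constructor and report either the document count or the error.
probe_source <- function(name, make_corpus) {
  tryCatch(
    {
      corpus <- make_corpus()
      data.frame(source = name, ok = TRUE, documents = length(corpus),
                 error = NA_character_, stringsAsFactors = FALSE)
    },
    error = function(e) {
      data.frame(source = name, ok = FALSE, documents = NA_integer_,
                 error = conditionMessage(e), stringsAsFactors = FALSE)
    }
  )
}

# Hypothetical stand-ins so the sketch runs without network access: one
# source that returns documents, one that returns none, one that fails.
probes <- list(
  working = function() list("doc1", "doc2"),
  empty   = function() list(),
  broken  = function() stop("StartTag: invalid element name")
)

# One row per source, stacked into a single data frame.
report <- do.call(rbind, Map(probe_source, names(probes), probes))
rownames(report) <- NULL
report
```

With the package loaded, a real probe would be something like function() WebCorpus(YahooFinanceSource("A")) in place of a stand-in.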
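It is also worth mentioning that even though YahooFinanceSource works, it does not return content similar to what GoogleFinanceSource is supposed to return. If you want to try the example, I think you could use YahooNewsSource with a custom list of queries.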