Jaccard distance between tweets
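I am currently trying to measure the Jaccard distance between tweets in a dataset.
The dataset is located here:
http://www3.nd.edu/~dwang5/courses/spring15/assignments/A2/Tweets.json
I have tried a few approaches to measuring the distance.
This is what I have so far.
I saved the linked dataset to a file named Tweets.json: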
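library(jsonlite)  # assumption: fromJSON() is jsonlite's, which simplifies the records to a data frame
json_alldata <- fromJSON(sprintf("[%s]", paste(readLines("Tweets.json"), collapse = ",")))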
Then I copied json_alldata into tweet.features and dropped the geo column:
# get rid of geo column
tweet.features <- json_alldata
tweet.features$geo <- NULL
Here is what the first two tweets look like:
> tweet.features$text[1]
[1] "RT @ItsJennaMarbles: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims. #PrayforBoston"
> tweet.features$text[2]
[1] "RT @NBCSN: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims #PrayforBoston"
The first thing I tried was the stringdist function from the stringdist package:
install.packages("stringdist")
library(stringdist)
#This works?
#
stringdist(tweet.features$text[1], tweet.features$text[2], method = "jaccard")
When I run this, I get:
[1] 0.1621622
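I'm not sure this is correct, though. A intersection B = 23, and A union B = 25. Jaccard distance is (A intersection B) / (A union B), right? So by my calculation, the Jaccard distance should be 0.92?
So I thought I could do it with sets: simply compute the intersection and the union, then divide.
This is what I tried: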
# Jaccard distance is the intersection of A and B divided by the Union of A and B
#
#create set for First Tweet
A1 <- as.set(tweet.features$text[1])
A2 <- as.set(tweet.features$text[2])
When I try the intersection, the output is just list():
Intersection <- intersect(A1, A2)
list()
When I try the union, I get this:
union(A1, A2)
[[1]]
[1] "RT @ItsJennaMarbles: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims. #PrayforBoston"
[[2]]
[1] "RT @NBCSN: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims #PrayforBoston"
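This does not seem to combine the words into a single set.
I figured I could divide the intersection by the union, but I think I need the program to count the number of words in each set and then do the calculation.
Needless to say, I'm a bit stuck, and I'm not sure whether I'm on the right track.
Any help would be appreciated. Thank you.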
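intersect and union expect vectors (as.set is not a base R function). I assume you want to compare words, so you can use strsplit, but how you split is up to you. An example follows: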
tweet.features <- list(tweet1="RT @ItsJennaMarbles: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims. #PrayforBoston",
tweet2= "RT @NBCSN: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims #PrayforBoston")
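jaccard_i <- function(tw1, tw2){
  # split each tweet on spaces or dots (the dot must be escaped as "\\." in an R string)
  tw1 <- unlist(strsplit(tw1, " |\\."))
  tw2 <- unlist(strsplit(tw2, " |\\."))
  # intersect() and union() already drop duplicates, so these are set sizes
  i <- length(intersect(tw1, tw2))
  u <- length(union(tw1, tw2))
  list(i=i, u=u, j=i/u)
}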
jaccard_i(tweet.features[[1]], tweet.features[[2]])
$i
[1] 20
$u
[1] 23
$j
[1] 0.8695652
Is this what you want?
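One note on terminology: intersection over union is usually called the Jaccard similarity, and the Jaccard distance is one minus that ratio. That also seems to explain the 0.1621622 from stringdist: with its default q = 1 it compares sets of single characters rather than words, as far as I can tell. The sketch below uses base R only to compute both variants for the two tweets; jaccard_dist is a hypothetical helper name, not part of any package.

```r
tw1 <- "RT @ItsJennaMarbles: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims. #PrayforBoston"
tw2 <- "RT @NBCSN: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims #PrayforBoston"

# Hypothetical helper: Jaccard distance = 1 - |intersection| / |union|.
# intersect() and union() discard duplicates, so plain vectors act as sets.
jaccard_dist <- function(a, b) {
  1 - length(intersect(a, b)) / length(union(a, b))
}

# Character-level sets: split each tweet into single characters.
chars1 <- strsplit(tw1, "")[[1]]
chars2 <- strsplit(tw2, "")[[1]]
char_dist <- jaccard_dist(chars1, chars2)

# Word-level sets: same split as in jaccard_i (spaces or dots).
words1 <- strsplit(tw1, " |\\.")[[1]]
words2 <- strsplit(tw2, " |\\.")[[1]]
word_dist <- jaccard_dist(words1, words2)

print(char_dist)  # 6/37, i.e. 0.1621622
print(word_dist)  # 3/23, i.e. 0.1304348
```

The character-level value matches what stringdist reported, and the word-level value is 1 minus the j returned by jaccard_i.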
strsplit here splits on every space or dot. You may want to refine the split argument of strsplit and replace " |\\." with something more specific (see ?regex).
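As one possible refinement, here is a minimal sketch that lowercases each tweet, treats any run of characters other than letters, digits, @ and # as a separator, and then builds a pairwise distance matrix for a vector of tweets. The names tokenize, jaccard_dist and jaccard_matrix are hypothetical, and the tokenization rule is only an assumption about what counts as a word in your data.

```r
# Hypothetical tokenizer: lowercase, split on anything that is not
# alphanumeric, "@" or "#", and drop empty tokens and duplicates.
tokenize <- function(x) {
  tokens <- strsplit(tolower(x), "[^[:alnum:]@#]+")[[1]]
  unique(tokens[nzchar(tokens)])
}

# Jaccard distance between two token sets.
jaccard_dist <- function(a, b) {
  u <- length(union(a, b))
  if (u == 0) return(0)  # convention chosen here: two empty sets are identical
  1 - length(intersect(a, b)) / u
}

# Pairwise distance matrix for a character vector of texts.
# This is quadratic in the number of texts, so it suits small datasets.
jaccard_matrix <- function(texts) {
  sets <- lapply(texts, tokenize)
  n <- length(sets)
  d <- matrix(0, n, n)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      d[i, j] <- jaccard_dist(sets[[i]], sets[[j]])
    }
  }
  d
}

tweets <- c(
  "RT @ItsJennaMarbles: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims. #PrayforBoston",
  "RT @NBCSN: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims #PrayforBoston"
)
d <- jaccard_matrix(tweets)
print(d)
```

With this tokenizer the two tweets differ only in the retweeted handle, so the off-diagonal entries are 2/22. On the real data you would call it as jaccard_matrix(tweet.features$text).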