R text mining - intersection between text fields
I would like to know whether there is a fast way to find the directed intersection between two text strings, for example:
t1 <- "I have achieved my goals over the past 20 years and look forward for my next chalanges"
t2 <- " have achieved goals and look my chalanges some other words bla bla"
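t1 isContainedIn t2 would return 7, because 7 of the words that appear in t1 also appear in t2.
In addition, t1 and t2 are two columns in a data frame, so I need to apply the function across the whole data frame and append the result as a new column to my original data frame.
This is what my data frame 'data.selected' looks like: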
keywords title
1 Samsung UN48H6350 48" Samsung UN48H6350 48" Full 1080p Smart HDTV 120Hz with Wi-Fi + Visa Gift Card
2 Samsung UN48H6350 48" Samsung UN48H6350 48" Full HD Smart LED TV -Bundle- (See Below for Contents)
3 Samsung UN48H6350 48" Samsung UN48H6350 48" Class Full HD Smart LED TV -BUNDLE- See below Details
4 Samsung UN48H6350 48" Samsung UN48H6350 48" Full HD Smart LED TV With BD-H5100 Blu-ray Disc Player
5 Samsung UN48H6350 48" Samsung UN48H6350 48" Smart 1080p Clear Motion Rate 240 LED HDTV
6 Samsung UN48H6350 48" Samsung UN48H6350 - 48-Inch Full HD 1080p Smart HDTV 120Hz with Wi-Fi
7 Samsung UN48H6350 48" Samsung 6350 Series UN48H6350 48" 1080p HD LED LCD Internet TV NEW
8 Samsung UN48H6350 48" Samsung Un48h6350af 75" 1080p Led-lcd Tv - 16:9 - Hdtv 1080p - (un75h6350afxza)
9 Samsung UN48H6350 48" Samsung UN48H6350 - 48" HD 1080p Smart HDTV 120Hz Bundle
10 Samsung UN48H6350 48" Samsung UN48H6350 - 48-Inch Full HD 1080p Smart HDTV 120Hz with Wi-Fi, (R#416)
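I'm not quite sure what you mean when you say that direction matters. The length of the intersection should not change unless you change the data. This may be what you are looking for: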
length(Reduce(intersect, strsplit(c(t1, t2), "\\s+")))
# [1] 7
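If you switch c(t1, t2) to c(t2, t1), you can see the difference in the Reduce output. But as I said, the length stays the same; only the order of the elements in the set differs.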
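The question also asks for the count to be computed for every row of the data frame and appended as a new column. A minimal sketch of one way to do that follows. The helper name count_shared_words and the column name n_shared are hypothetical, and the two-row data frame is only a stand-in built from rows 5 and 8 of the question's data.selected. It counts distinct, case-sensitive, whitespace-separated words, which is an assumption about what "word" should mean here.

```r
# Hypothetical helper: number of distinct words shared by two strings.
# trimws() drops leading/trailing blanks so no empty token is produced.
count_shared_words <- function(a, b) {
  words_a <- strsplit(trimws(a), "\\s+")[[1]]
  words_b <- strsplit(trimws(b), "\\s+")[[1]]
  length(intersect(words_a, words_b))
}

# Stand-in for data.selected (rows 5 and 8 from the question).
data.selected <- data.frame(
  keywords = c('Samsung UN48H6350 48"', 'Samsung UN48H6350 48"'),
  title = c('Samsung UN48H6350 48" Smart 1080p Clear Motion Rate 240 LED HDTV',
            'Samsung Un48h6350af 75" 1080p Led-lcd Tv - 16:9 - Hdtv 1080p - (un75h6350afxza)'),
  stringsAsFactors = FALSE
)

# Apply the helper row by row and append the result as a new column.
data.selected$n_shared <- mapply(count_shared_words,
                                 data.selected$keywords,
                                 data.selected$title,
                                 USE.NAMES = FALSE)

data.selected$n_shared
# [1] 3 1
```

If matching should ignore case or punctuation, the strings could be normalised (for example with tolower()) before splitting; that choice is left open here.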
I guess another similar approach would be to use a simple match:
string <- strsplit(c(t1, t2), "\\s+") # similar to @Richard
length(na.omit(match(string[[2]], string[[1]])))
## [1] 7
Or lapply:
length(unlist(lapply(string[[2]], intersect, string[[1]])))
## [1] 7
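One caveat that may be what the question means by "directed": intersect works on sets, so its length is symmetric, but approaches that count tokens (such as match or %in%) can give different results depending on which string is looked up in which, whenever a word is repeated. The sketch below illustrates this with the two example strings; the variable names w1 and w2 are only for illustration.

```r
t1 <- "I have achieved my goals over the past 20 years and look forward for my next chalanges"
t2 <- " have achieved goals and look my chalanges some other words bla bla"

# Split on whitespace; trimws() avoids the empty token caused by
# the leading blank in t2.
w1 <- strsplit(trimws(t1), "\\s+")[[1]]
w2 <- strsplit(trimws(t2), "\\s+")[[1]]

# Tokens of t1 that occur in t2: "my" appears twice in t1, so it counts twice.
sum(w1 %in% w2)
# [1] 8

# Tokens of t2 that occur in t1.
sum(w2 %in% w1)
# [1] 7

# Distinct words of t1 that occur in t2 (same as the intersect length).
sum(unique(w1) %in% w2)
# [1] 7
```

Which of these counts is wanted depends on whether repeated words should be counted once or every time they occur.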