R language: Check if two columns containing text are highly correlated
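In R, we can use the cor function to get the correlation between two columns, but it doesn't work for non-numeric values.
I'm asking because I need to preprocess some data, and I suspect two columns are essentially redundant: looking at the data, whenever the first column shows "A", the second column always shows "B". I want to confirm that knowing the value in the first column is enough to determine the value in the second.
If that's unclear, here is an example to illustrate.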
dataframe <- read.csv(file = 'data/company_product.csv')
where data/company_product.csv is a table like this:
Company Name    Main Product    rest of the data ...
By Apple        A phone         some_other_data ...
By Apple        A phone         some_other_data ...
By Microsoft    A computer      some_other_data ...
By Nokia        A tablet        some_other_data ...
By Nokia        A tablet        some_other_data ...
By Nokia        A tablet        some_other_data ...
...             ...             ...
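As you can see in this file, the Main Product column is useless, because if I know that Company Name is "By Apple", Main Product will always be "A phone".
This means the Company Name column is highly correlated with the Main Product column, but I haven't found a simple way to show this in R.
I'm not sure whether the answer is trivial or whether this is a key problem in text mining, but I don't need an exact correlation. All I want is a Yes/No answer to: "whenever a given value appears in the first column, is the value in the second column always the same?"
Thanks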
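Use table to assess this:
table(df[, 1:2])
This gives the following. Each row and each column has exactly one nonzero value, showing that Apple is associated with A phone, Microsoft with A computer, and Nokia with A tablet.
              second
first          A computer A phone A tablet
  By Apple              0       2        0
  By Microsoft          1       0        0
  By Nokia              0       0        2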
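The question asks for a plain Yes/No, and the table can be reduced to one. The following is a minimal sketch, assuming the two columns are named first and second as in the output above; the variable names tab, first_determines_second and second_determines_first are illustrative only.

```r
# Example data matching the table above (column names are assumptions)
df <- data.frame(
  first  = c("By Apple", "By Microsoft", "By Apple", "By Nokia", "By Nokia"),
  second = c("A phone", "A computer", "A phone", "A tablet", "A tablet")
)

tab <- table(df[, 1:2])

# TRUE if every value of `first` co-occurs with exactly one value of `second`
first_determines_second <- all(rowSums(tab > 0) == 1)

# TRUE if every value of `second` co-occurs with exactly one value of `first`
second_determines_first <- all(colSums(tab > 0) == 1)

print(first_determines_second)
print(second_determines_first)
```

The row check answers the question as asked (does first determine second?). The column check is the reverse direction; when both hold, the two columns are in one-to-one correspondence.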
Or simply count how many times each unique row occurs:
aggregate(list(count = df[[1]]), df, length)
##          first     second count
## 1 By Microsoft A computer     1
## 2     By Apple    A phone     2
## 3     By Nokia   A tablet     2
Or
library(dplyr)
count(df, first, second)
##          first     second n
## 1     By Apple    A phone 2
## 2 By Microsoft A computer 1
## 3     By Nokia   A tablet 2
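The counts above are per (first, second) pair. For the Yes/No question it may be more direct to count how many distinct second values each first value has, which also tells you which values break the rule when the answer is No. This is a sketch in base R, assuming the same column names; n_second, violations and answer are illustrative names.

```r
# Example data (column names are assumptions)
df <- data.frame(
  first  = c("By Apple", "By Microsoft", "By Apple", "By Nokia", "By Nokia"),
  second = c("A phone", "A computer", "A phone", "A tablet", "A tablet")
)

# Number of distinct `second` values observed for each `first` value
n_second <- tapply(df$second, df$first, function(x) length(unique(x)))
print(n_second)

# Values of `first` that map to more than one `second` value
violations <- names(n_second)[n_second > 1]

# Yes/No: TRUE when no value of `first` has more than one `second`
answer <- length(violations) == 0
print(answer)
```

If answer is FALSE, violations lists the values of first worth inspecting before dropping the second column.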
Or, if you don't care about the counts, just look at the unique rows:
unique(df[, 1:2])
##          first     second
## 1     By Apple    A phone
## 2 By Microsoft A computer
## 4     By Nokia   A tablet
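The unique rows can also be checked programmatically: if a value of the first column appears more than once among the unique pairs, it must be paired with at least two different second values. Below is a sketch with a hypothetical helper, maps_to_one (not part of any package), plus a made-up counterexample row to show the failing case.

```r
# Hypothetical helper: does each value of column `a` map to a single value of column `b`?
maps_to_one <- function(data, a, b) {
  pairs <- unique(data[, c(a, b)])  # distinct (a, b) combinations
  !any(duplicated(pairs[[a]]))      # a repeated `a` means two different `b` values
}

# Example data (column names are assumptions)
df <- data.frame(
  first  = c("By Apple", "By Microsoft", "By Apple", "By Nokia", "By Nokia"),
  second = c("A phone", "A computer", "A phone", "A tablet", "A tablet")
)
ok <- maps_to_one(df, "first", "second")

# Made-up extra row: "By Apple" now has a second, different product
df2 <- rbind(df, data.frame(first = "By Apple", second = "A tablet"))
broken <- maps_to_one(df2, "first", "second")

print(ok)
print(broken)
```

Here ok is TRUE and broken is FALSE, because in df2 "By Apple" is paired with both "A phone" and "A tablet".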
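To visualize it:
library(igraph)
# Build a bipartite graph from the contingency table
# (recent igraph releases rename this function to graph_from_biadjacency_matrix)
g <- graph_from_incidence_matrix(table(df[, 1:2]))
plot(g, layout = layout_as_bipartite)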
Perhaps you could try table, xtabs, or dcast from the data.table package:
> table(df)
              second
first          A computer A phone A tablet
  By Apple              0       2        0
  By Microsoft          1       0        0
  By Nokia              0       0        2
> xtabs(~ first + second, df)
              second
first          A computer A phone A tablet
  By Apple              0       2        0
  By Microsoft          1       0        0
  By Nokia              0       0        2
> dcast(data.table::setDT(df), first ~ second)
Using 'second' as value column. Use 'value.var' to override
Aggregate function missing, defaulting to 'length'
          first A computer A phone A tablet
1:     By Apple          0       2        0
2: By Microsoft          1       0        0
3:     By Nokia          0       0        2
Data
df <- data.frame(
  first = c("By Apple", "By Microsoft", "By Apple", "By Nokia", "By Nokia"),
  second = c("A phone", "A computer", "A phone", "A tablet", "A tablet")
)