R language: Check if two columns containing text are highly correlated

In R, we can use the cor function to get the correlation between two columns, but it does not work for non-numeric values.

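I'm asking because I need to preprocess some data and I suspect that two columns are very similar: looking at the data, whenever the first column shows "A", the second column always shows "B". I want to be sure that if I know the value in the first column, I can determine the value in the second column.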

In case that is not clear, here is an example to illustrate.

dataframe <- read.csv(file = 'data/company_product.csv')

where data/company_product.csv is a table like this:

Company Name     Main Product       rest of the data    ...
By Apple         A phone            some_other_data     ...
By Apple         A phone            some_other_data     ...
By Microsoft     A computer         some_other_data     ...
By Nokia         A tablet           some_other_data     ...
By Nokia         A tablet           some_other_data     ...
By Nokia         A tablet           some_other_data     ...
...              ...                ...

As you can see in this file, the Main Product column is useless, because if I know that the Company Name column is "By Apple", then Main Product will always be "A phone".

This means the Company Name column is highly correlated with the Main Product column, but I haven't found a simple way to show this in R.

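I'm not sure whether the answer is trivial or whether this is a key problem in text mining, but I don't need an exact correlation. All I want is a Yes/No answer to: "whenever a given value appears in the first column, is the value in the second column always the same?"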

Thanks

Use table to assess this:

table(df[, 1:2])

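This gives the following, where every row and every column contains only one non-zero value, showing that Apple is associated with A phone, Microsoft with A computer, and Nokia with A tablet.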

              second
first          A computer A phone A tablet
  By Apple              0       2        0
  By Microsoft          1       0        0
  By Nokia              0       0        2
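
Since the question asks for a plain Yes/No, the visual check on this table can be reduced to a single logical value. This is only a minimal sketch, assuming the sample data given at the end of the post; the variable names `tab`, `first_determines_second` and `second_determines_first` are made up for illustration.

```r
# Sample data (same values as the data given at the end of the post)
df <- data.frame(
  first = c("By Apple", "By Microsoft", "By Apple", "By Nokia", "By Nokia"),
  second = c("A phone", "A computer", "A phone", "A tablet", "A tablet")
)

# Contingency table of the two text columns
tab <- table(df[, 1:2])

# TRUE when every value of `first` co-occurs with exactly one value of `second`
first_determines_second <- all(rowSums(tab > 0) == 1)

# TRUE when the mapping also holds in the other direction
second_determines_first <- all(colSums(tab > 0) == 1)

print(first_determines_second)
print(second_determines_first)
```

Note that table drops NA values by default, so rows with missing values are ignored by this check.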

Or simply count the number of times each unique row occurs:

aggregate(list(count = df[[1]]), df, length)
##          first     second count
## 1 By Microsoft A computer     1
## 2     By Apple    A phone     2
## 3     By Nokia   A tablet     2

library(dplyr)
count(df, first, second)
##          first     second n
## 1     By Apple    A phone 2
## 2 By Microsoft A computer 1
## 3     By Nokia   A tablet 2

Or, if you don't care about the counts, just look at the unique rows:

unique(df[, 1:2])
##          first     second
## 1     By Apple    A phone
## 2 By Microsoft A computer
## 4     By Nokia   A tablet
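
The unique rows also allow a direct Yes/No test: if no value of the first column appears more than once among the unique pairs, the first column determines the second. The following is a sketch under that assumption, with the sample data redefined so it runs on its own; `uniq_pairs`, `first_is_key`, `df_bad` and `conflicts` are illustrative names, and the extra "By Apple" row is a made-up counterexample.

```r
df <- data.frame(
  first = c("By Apple", "By Microsoft", "By Apple", "By Nokia", "By Nokia"),
  second = c("A phone", "A computer", "A phone", "A tablet", "A tablet")
)

# One row per distinct (first, second) combination
uniq_pairs <- unique(df[, 1:2])

# TRUE if each value of `first` appears in only one distinct pair
first_is_key <- anyDuplicated(uniq_pairs$first) == 0
print(first_is_key)

# Made-up counterexample: "By Apple" now maps to two different products
df_bad <- rbind(df, data.frame(first = "By Apple", second = "A tablet"))
uniq_bad <- unique(df_bad[, 1:2])
bad_is_key <- anyDuplicated(uniq_bad$first) == 0
print(bad_is_key)

# List the pairs that break the one-to-one relationship
dup_values <- uniq_bad$first[duplicated(uniq_bad$first)]
conflicts <- uniq_bad[uniq_bad$first %in% dup_values, ]
print(conflicts)
```

When the answer is No, `conflicts` shows which values of the first column are responsible.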

This can be visualized as follows:

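library(igraph)
# Build a bipartite graph from the contingency table
# (newer igraph releases name this function graph_from_biadjacency_matrix)
g <- graph_from_incidence_matrix(table(df[, 1:2]))
plot(g, layout = layout_as_bipartite)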

Maybe you can try table, xtabs, or dcast from data.table:

> table(df)
              second
first          A computer A phone A tablet
  By Apple              0       2        0
  By Microsoft          1       0        0
  By Nokia              0       0        2

> xtabs(~ first + second, df)
              second
first          A computer A phone A tablet
  By Apple              0       2        0
  By Microsoft          1       0        0
  By Nokia              0       0        2

> data.table::dcast(data.table::setDT(df), first ~ second)
Using 'second' as value column. Use 'value.var' to override
Aggregate function missing, defaulting to 'length'
          first A computer A phone A tablet
1:     By Apple          0       2        0
2: By Microsoft          1       0        0
3:     By Nokia          0       0        2
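
These cross-tabulations still have to be read by eye. One possible way to get the Yes/No from the question for any pair of columns is to count the distinct values of one column within each group of the other. The helper below is a base R sketch only; the function name `is_determined_by` and its arguments are hypothetical and not part of any package.

```r
df <- data.frame(
  first = c("By Apple", "By Microsoft", "By Apple", "By Nokia", "By Nokia"),
  second = c("A phone", "A computer", "A phone", "A tablet", "A tablet")
)

# Hypothetical helper: TRUE if each value of column `from` always comes
# with the same value of column `to` (rows where `from` is NA are ignored,
# because tapply drops NA groups)
is_determined_by <- function(data, from, to) {
  n_distinct <- tapply(data[[to]], data[[from]], function(x) length(unique(x)))
  all(n_distinct == 1)
}

res_forward <- is_determined_by(df, "first", "second")
res_backward <- is_determined_by(df, "second", "first")
print(res_forward)
print(res_backward)
```

With the real file, the column names would be whatever read.csv produces for "Company Name" and "Main Product", which depends on its check.names setting.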

Data

df <- data.frame(
  first = c("By Apple", "By Microsoft", "By Apple", "By Nokia", "By Nokia"),
  second = c("A phone", "A computer", "A phone", "A tablet", "A tablet")
)