Network analysis with multiple email recipients in R

I am trying to build an R script that takes email metadata, processes it into a network graph, and saves it as an interactive HTML page for exploration.

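I first simplified my data to emails between two people (one sender and one recipient), and my script works with that (see the script below, which loads the data and generates the node and edge lists).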

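In my real data, though, the recipient list can contain multiple recipients, and I want to include those interactions too. The email addresses are separated by spaces, so I should be able to split them easily, but I can't work out how to do it!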

####Load Data#####
library(tidyverse)
library(tcltk)

#Load Base Data File
baseData <- read.csv(tk_choose.files(caption = "Select the main data file"))

#change all email addresses to lower case
#(tolower() coerces factors to character, so this works whether or not
#read.csv created factor columns)
baseData$Sender <- tolower(baseData$Sender)
baseData$Recipients <- tolower(baseData$Recipients)

attrs <- read.csv(tk_choose.files(caption = "Select the attribute data file"))

####Generate Node & Edge Lists####

#Generate Node List
sources <- baseData %>%
  distinct(Sender) %>%
  rename(email = Sender)

destinations <- baseData %>%
  distinct(Recipients) %>%
  rename(email = Recipients)

nodes <- full_join(destinations, sources, by = "email") %>%
  rowid_to_column("id")

#Tag nodes with employee attributes
nodes <- merge(x = nodes, y = attrs, by.x = "email", by.y = "EmailAddress", all.x = TRUE)

#Make graph display name as node label, rather than email address
colnames(nodes)[colnames(nodes) == 'EmployeeName'] <- 'label'

#Replace Gender with whatever field you want to group by
colnames(nodes)[colnames(nodes) == 'Gender'] <- 'group'

#Generate Edge List
per_route <- baseData %>%
  group_by(Sender, Recipients) %>%
  summarise(weight = n()) %>%
  ungroup()

edges <- per_route %>%
  left_join(nodes, by = c("Sender" = "email")) %>%
  rename(from = id)

edges <- edges %>%
  left_join(nodes, by = c("Recipients" = "email")) %>%
  rename(to = id)

edges <- select(edges, from, to, weight)
edges <- mutate(edges, width = weight/20 + 1)


####Generate Network####
#[TRUNCATED]

Edited to add example data

My data currently looks like this:

Timestamp   MessageId   Sender  Recipients  RecipientCount
26/09/2017 16:39    msg1    sender1@sender.com  recip1@recipient.com recip2@recipient.com recip3@recipient.com  3
28/09/2017 13:27    msg2    sender2@sender.com  recip1@recipient.com recip2@recipient.com recip3@recipient.com  3

I would like to get it to look like this, and then my existing code should work as is:

Timestamp   MessageId   Sender  Recipients
26/09/2017 16:39    msg1    sender1@sender.com  recip1@recipient.com
26/09/2017 16:39    msg1    sender1@sender.com  recip2@recipient.com
26/09/2017 16:39    msg1    sender1@sender.com  recip3@recipient.com
28/09/2017 13:27    msg2    sender2@sender.com  recip1@recipient.com
28/09/2017 13:27    msg2    sender2@sender.com  recip2@recipient.com
28/09/2017 13:27    msg2    sender2@sender.com  recip3@recipient.com

It looks like this is not really a question about network analysis; it is about reshaping the data. This should work.

library(tidyverse)

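# `data` stands for the data frame loaded in the question (baseData there)
data_new <- data %>%
  mutate(unique_recipient = str_split(Recipients, " ")) %>%  #ASSUMING THERE IS ONLY ONE SPACE BETWEEN EMAILS
  unnest(unique_recipient)  #name the list column explicitly; a bare unnest() is deprecated in tidyr >= 1.0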
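
As a possible alternative, here is a minimal sketch that uses `tidyr::separate_rows()` to split the `Recipients` column in place, so the result keeps the column names the question's node and edge code expects. The `sample_data` frame below is typed in by hand from the two example rows in the question (with one doubled space added on purpose), and the names `sample_data`, `long_data` and `per_route_check` are only illustrative. Splitting on the regular expression `"\\s+"` is an assumption that any run of whitespace separates two addresses.

```r
library(dplyr)
library(tidyr)

# Two example rows from the question; the second row has a doubled space
# between the first two addresses to show that "\\s+" copes with it.
sample_data <- tibble(
  Timestamp = c("26/09/2017 16:39", "28/09/2017 13:27"),
  MessageId = c("msg1", "msg2"),
  Sender = c("sender1@sender.com", "sender2@sender.com"),
  Recipients = c(
    "recip1@recipient.com recip2@recipient.com recip3@recipient.com",
    "recip1@recipient.com  recip2@recipient.com recip3@recipient.com"
  ),
  RecipientCount = c(3L, 3L)
)

long_data <- sample_data %>%
  # trim stray leading/trailing whitespace so no empty recipient is produced
  mutate(Recipients = trimws(Recipients)) %>%
  # one output row per recipient; the other columns are repeated
  separate_rows(Recipients, sep = "\\s+") %>%
  # the count no longer describes a single row, so drop it
  select(-RecipientCount)

# Same idea as the per_route step in the question: one row per
# sender/recipient pair with the number of emails as the weight.
per_route_check <- long_data %>%
  count(Sender, Recipients, name = "weight")

print(long_data)
```

With the sample rows above, `long_data` should have six rows, one per sender and recipient pair. If this reshaped frame is assigned back to `baseData`, the rest of the question's script should run unchanged, although that has only been checked here against the two sample rows.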