Network Analysis with multiple email recipients in R
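I'm trying to build an R script that takes email metadata, processes it into a network graph, and saves it as an interactive HTML page for exploration.
I started by reducing my data to emails between two people (one sender and one recipient), and my script works with that (see the script below, which loads the data and generates the node and edge lists).
In my real data, however, the recipient list can contain multiple recipients, and I want to include those interactions too. The email addresses are separated by spaces, so I should be able to split them fairly easily, but I can't work out how!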
####Load Data#####
library(tidyverse)
library(tcltk)
#Load Base Data File
baseData <- read.csv(tk_choose.files(caption = "Select the main data file"))
#change all email addresses to lower case
#(tolower() on the column works whether read.csv() returned factors or character vectors)
baseData$Sender <- tolower(baseData$Sender)
baseData$Recipients <- tolower(baseData$Recipients)
attrs <- read.csv(tk_choose.files(caption = "Select the attribute data file"))
####Generate Node & Edge Lists####
#Generate Node List
sources <- baseData %>%
  distinct(Sender) %>%
  rename(email = Sender)
destinations <- baseData %>%
  distinct(Recipients) %>%
  rename(email = Recipients)
nodes <- full_join(destinations, sources, by = "email")
nodes <- nodes %>% rowid_to_column("id")
#Tag nodes with employee attributes
nodes <- merge(x = nodes, y = attrs, by.x = "email", by.y = "EmailAddress", all.x = TRUE)
#Make graph display name as node label, rather than email address
colnames(nodes)[colnames(nodes) == 'EmployeeName'] <- 'label'
#Replace gender for whatever field you want to group by
colnames(nodes)[colnames(nodes) == 'Gender'] <- 'group'
#Generate Edge List
per_route <- baseData %>%
  group_by(Sender, Recipients) %>%
  summarise(weight = n()) %>%
  ungroup()
edges <- per_route %>%
  left_join(nodes, by = c("Sender" = "email")) %>%
  rename(from = id)
edges <- edges %>%
  left_join(nodes, by = c("Recipients" = "email")) %>%
  rename(to = id)
edges <- select(edges, from, to, weight)
edges <- mutate(edges, width = weight/20 + 1)
####Generate Network####
#[TRUNCATED]
Edited to add sample data
My data currently looks like this:
Timestamp MessageId Sender Recipients RecipientCount
26/09/2017 16:39 msg1 sender1@sender.com recip1@recipient.com recip2@recipient.com recip3@recipient.com 3
28/09/2017 13:27 msg2 sender2@sender.com recip1@recipient.com recip2@recipient.com recip3@recipient.com 3
I'd like to get it to look like this, and then my existing code should work:
Timestamp MessageId Sender Recipients
26/09/2017 16:39 msg1 sender1@sender.com recip1@recipient.com
26/09/2017 16:39 msg1 sender1@sender.com recip2@recipient.com
26/09/2017 16:39 msg1 sender1@sender.com recip3@recipient.com
28/09/2017 13:27 msg2 sender2@sender.com recip1@recipient.com
28/09/2017 13:27 msg2 sender2@sender.com recip2@recipient.com
28/09/2017 13:27 msg2 sender2@sender.com recip3@recipient.com
It looks like this isn't really a question about network analysis. It's really about reshaping the data. This should work.
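library(tidyverse)
# `data` stands for your data frame (baseData in the question)
data_new <- data %>%
  # str_split() returns a list column; this assumes exactly one space between emails
  mutate(unique_recipient = str_split(Recipients, " ")) %>%
  # expand the list column so that each recipient gets its own row
  unnest(cols = unique_recipient)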
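If you would rather keep the column name Recipients so the rest of the script runs unchanged, here is a minimal sketch of the same reshaping with tidyr's separate_rows(). It is an alternative to the str_split() approach, not a correction of it. The names long_data and per_route are only illustrative, the sample rows are copied from the question, and the "\\s+" separator is my assumption that addresses may be separated by one or more whitespace characters.

```r
library(dplyr)
library(tidyr)

# Sample rows copied from the question; in the real script this would be
# the data frame returned by read.csv()
baseData <- tibble::tribble(
  ~Timestamp,         ~MessageId, ~Sender,              ~Recipients,                                                      ~RecipientCount,
  "26/09/2017 16:39", "msg1",     "sender1@sender.com", "recip1@recipient.com recip2@recipient.com recip3@recipient.com", 3,
  "28/09/2017 13:27", "msg2",     "sender2@sender.com", "recip1@recipient.com recip2@recipient.com recip3@recipient.com", 3
)

long_data <- baseData %>%
  # lower-case both address columns and drop stray leading/trailing spaces
  mutate(Sender = tolower(Sender),
         Recipients = trimws(tolower(Recipients))) %>%
  # one row per recipient; sep is a regular expression (assumed: any run of whitespace)
  separate_rows(Recipients, sep = "\\s+") %>%
  # the per-message count no longer describes a single row
  select(-RecipientCount)

# Same weighting as the question's per_route step: emails per sender/recipient pair
per_route <- long_data %>%
  count(Sender, Recipients, name = "weight")
```

Because the result still has one Sender and one Recipients value per row, it should be usable in place of baseData in the node and edge list code from the question.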