Separate twitter status/hyperlink/date with R

I would like to automatically split the tweets below so that the tweet text itself, the hyperlink, and the date each end up in a separate column. Can anyone help? My dataset is named DB_YS and it is a txt file.

Here are a few of the tweets from my data frame:

Thank you, everyone!  indyref http://t.co/1kTzqjyGE7 Sep 18, 2014 
  As the polls close, total likes on the @YesScotland Facebook page have passed David Cameron s one.  indyref  voteYes http://t.co/x7IoB1EtfY Sep 18, 2014 
We can be proud of  indyref, which has seen a flourishing of Scotland’s self-confidence as a nation  VoteYes http://t.co/1OqxvbpoS9 Sep 18, 2014 
We can afford world-class public services. A Yes vote means we can strengthen our NHS.  VoteYes  indyref http://t.co/D9Vn5OqStV Sep 18, 2014 
This is a once in a lifetime opportunity to choose a new and better path for Scotland  VoteYes  indyref http://t.co/9knT6Mx4vZ Sep 18, 2014 
Our young people shouldn t have to leave to find decent jobs.  VoteYes  indyref http://t.co/vAE164f0Oy Sep 18, 2014 

Here is a base-R solution that uses a series of regular expressions:

# Assume df is your data frame with a column called txt

# Match the text up to the beginning of the URL
tweet.regex <- regexpr("^.*(?=http)", df$txt, perl = TRUE)

# Extract the tweet text
tweet <- substr(df$txt, tweet.regex, tweet.regex + attr(tweet.regex, "match.length") - 1)

# Match from the beginning of the URL up to the next space
url.regex <- regexpr("http[^ ]+(?= )", df$txt, perl = TRUE)

# Extract the URL
url <- substr(df$txt, url.regex, url.regex + attr(url.regex, "match.length") - 1)

# Match the date (note the doubled backslashes: "\d" on its own is not a valid escape in an R string)
date.regex <- regexpr("[A-Za-z]+ \\d+, \\d{4} *$", df$txt, perl = TRUE)

# Extract the date
date <- substr(df$txt, date.regex, date.regex + attr(date.regex, "match.length") - 1)

# Combine the results
tweet.df <- data.frame(tweet, url, date, stringsAsFactors = FALSE)

For each piece, we use a regular expression to match it, take the index where the match starts, and then use substr() to extract from that start index through to the start index plus the match length minus one.
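
For example, applying this to the first tweet above (a quick illustration added here, with the positions worked out for that specific string):

x <- "Thank you, everyone!  indyref http://t.co/1kTzqjyGE7 Sep 18, 2014 "
m <- regexpr("^.*(?=http)", x, perl = TRUE)
m                                # match starts at position 1
attr(m, "match.length")          # the match is 30 characters long
substr(x, m, m + attr(m, "match.length") - 1)
# gives "Thank you, everyone!  indyref " -- everything before the URL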

The first regex, ^.*(?=http), uses a lookahead to match everything from the start of the string (denoted by ^) up to the last character before http.

The second, http[^ ]+(?= ), matches from http up to the next space, since spaces cannot appear inside a URL.

Since the date has a fixed format, we can use a more direct regex to get it. [A-Za-z]+ matches any letter "a" through "z", regardless of case. \d matches a single digit, 0-9, and adding + means the preceding element must appear at least once. \d{4} then matches exactly four digits in a row. We make sure we do not pick up a date that happens to appear inside the tweet text by matching up to the end of the string: " *" (a space followed by *) absorbs any trailing spaces, and $ anchors the match at the end of the string.
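
To see what the end-of-string anchor buys us, here is a toy string (made up for this illustration) with a date in the middle of the text as well as one at the end; only the trailing date is picked up:

x <- "Count down to Sep 1, 2014 starts now http://t.co/abc123 Sep 18, 2014 "
m <- regexpr("[A-Za-z]+ \\d+, \\d{4} *$", x, perl = TRUE)
substr(x, m, m + attr(m, "match.length") - 1)   # "Sep 18, 2014 ", not "Sep 1, 2014"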

The regexpr() function returns a vector of match indices; that is, it tells you where in each string the match begins. That vector also carries an attribute called match.length giving the length of each match, which we pull out with attr(..., "match.length").
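
As a side note that is not part of the answer above, base R also offers regmatches(), which pulls the matched text straight out of a regexpr() result and saves you the substr() arithmetic:

m <- regexpr("http[^ ]+(?= )", df$txt, perl = TRUE)
urls <- regmatches(df$txt, m)   # note: elements with no match are dropped rather than returned as NA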

Here is a solution using the stringr package.

library("stringr")
dat <- c("Thank you, everyone!  indyref http://t.co/1kTzqjyGE7 Sep 18, 2014 ",
"As the polls close, total likes on the @YesScotland Facebook page have passed David Cameron s one.  indyref  voteYes http://t.co/x7IoB1EtfY Sep 18, 2014 ",
"We can be proud of  indyref, which has seen a flourishing of Scotland’s self-confidence as a nation  VoteYes http://t.co/1OqxvbpoS9 Sep 18, 2014 ",
"We can afford world-class public services. A Yes vote means we can strengthen our NHS.  VoteYes  indyref http://t.co/D9Vn5OqStV Sep 18, 2014 ",
"This is a once in a lifetime opportunity to choose a new and better path for Scotland  VoteYes  indyref http://t.co/9knT6Mx4vZ Sep 18, 2014 ",
"Our young people shouldn t have to leave to find decent jobs.  VoteYes  indyref http://t.co/vAE164f0Oy Sep 18, 2014 ")

dates <- str_extract(dat, "[A-Z]{1}[a-z]{2} [0-9]{1,2}, [0-9]{4}")
url <- str_extract(dat, "http://t.co/[0-9A-Za-z]{10}")
text <- gsub("  indyref.+", "", dat)
df <- data.frame(dates, text, url, stringsAsFactors=F)

Here is another solution that also uses the "stringr" package. It is based on Cory's answer, but it corrects a few errors that occur when a tweet has an unusual shape. It assumes you have a .txt file named DB_YS.txt containing all the tweets in raw text format, and that you have already installed the "stringr" library; otherwise you must first run install.packages("stringr").

library(stringr)

# Load your data into R (quote and comment.char are disabled so that apostrophes
# and "#" characters inside tweets do not break the parse)
RawData <- read.table("DB_YS.txt", sep = "\n", header = FALSE,
                      quote = "", comment.char = "", stringsAsFactors = FALSE)

# Extract the dates into a new vector called dates
# (note the doubled backslashes: "\d" on its own is not a valid escape in an R string)
dates <- str_extract(RawData$V1, "[A-Za-z]+ \\d+, \\d{4} *$")

# Extract the urls, assuming every url starts with http and runs up to the next space,
# and store them in a new vector called url
url <- str_extract(RawData$V1, "http[^ ]+")

# Remove the urls and everything after them (i.e. the date) from the text
# and store the result in a vector called text
text <- gsub("http.+", "", RawData$V1)

# Remove the leftover "indyref" hashtag text and overwrite the text vector
text <- gsub("  indyref", "", text)

# Create a data.frame containing the tidy data
Data <- data.frame(dates, text, url, stringsAsFactors = FALSE)
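
If you then want to save the tidy data back to disk, a minimal sketch (the output file name DB_YS_tidy.csv is just an example):

write.csv(Data, "DB_YS_tidy.csv", row.names = FALSE)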