Extract specific words using regex or substring from a column in R

I have the following data:

Opex_Spend_Month    Opex_Spend_YTD  Major_Category  NBS_Region  Sub_Category
92179.84            113542.84       Contingent Labour   EUROPE  TEMP:OTH.CONT.WORKER
297.82              82392.82        Contingent Labour   EUROPE  TEMP:OTH.CONT.WORKER
13974.8             34917.8         Contingent Labour   EUROPE  TEMP:OTH.CONT.WORKER
138.6               63125.6         Contingent Labour   EUROPE  TEMP:OTH.CONT.WORKER
NA                  73097           Contingent Labour   EUROPE  TEMP:MSP NON IT
NA                  96035           Contingent Labour   EUROPE  TEMP:MSP NON IT
1388.65             68934.65        Contingent Labour   EUROPE  TEMP:MSP NON IT
5393.76             18748.76        Contingent Labour   EUROPE  TEMP:MSP IT
528.38              82195.38        Contingent Labour   EUROPE  TEMP:MSP IT
22369               95468           Contingent Labour   EUROPE  TEMP:MSP IT

From the column Sub_Category I want to be able to select the last part, i.e. Cont Worker, Non IT and IT, and I am not sure what regex or substring function to use.

Desired output:

Opex_Spend_Month    Opex_Spend_YTD  Major_Category  NBS_Region  Sub_Category            Category
92179.84            113542.84       Contingent Labour   EUROPE  TEMP:OTH.CONT.WORKER    Cont Worker
297.82              82392.82        Contingent Labour   EUROPE  TEMP:OTH.CONT.WORKER    Cont Worker
13974.8             34917.8         Contingent Labour   EUROPE  TEMP:OTH.CONT.WORKER    Cont Worker
138.6               63125.6         Contingent Labour   EUROPE  TEMP:OTH.CONT.WORKER    Cont Worker
NA                  73097           Contingent Labour   EUROPE  TEMP:MSP NON IT         Non IT
NA                  96035           Contingent Labour   EUROPE  TEMP:MSP NON IT         Non IT
1388.65             68934.65        Contingent Labour   EUROPE  TEMP:MSP NON IT         Non IT
5393.76             18748.76        Contingent Labour   EUROPE  TEMP:MSP IT             IT
528.38              82195.38        Contingent Labour   EUROPE  TEMP:MSP IT             IT
22369               95468           Contingent Labour   EUROPE  TEMP:MSP IT             IT

Can anyone help me with this?

We can use str_extract:

library(stringr)
str_extract(df1$Sub_Category, "(CONT\\.WORKER|NON IT|IT)$")
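
As a quick check of what this returns, here is a minimal sketch on three sample values taken from the question (the vector name `sub_category` is made up for illustration). Note that the match keeps the dot in CONT.WORKER, so a second step replaces it with a space.

```r
library(stringr)

# Illustrative sample of the Sub_Category values from the question
sub_category <- c("TEMP:OTH.CONT.WORKER", "TEMP:MSP NON IT", "TEMP:MSP IT")

# Extract the trailing label; the alternation is anchored at the end of the string
raw <- str_extract(sub_category, "(CONT\\.WORKER|NON IT|IT)$")

# Replace the literal dot with a space
category <- str_replace_all(raw, fixed("."), " ")
category
```

The result is still upper-case, so it does not yet match the casing of the desired output.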

In base R:

df$Category = trimws(gsub('([A-Z]+:[A-Z]+|\\.)', ' ', df$Sub_Category))
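
To see what that pattern does, here is a small sketch on sample values from the question (the vector `x` is only for illustration). The first alternative removes the TEMP:OTH or TEMP:MSP prefix, the second turns each remaining dot into a space, and trimws() drops the leading whitespace left behind.

```r
# Illustrative sample of Sub_Category values (not the full data frame)
x <- c("TEMP:OTH.CONT.WORKER", "TEMP:MSP NON IT", "TEMP:MSP IT")

# Replace the "WORD:WORD" prefix and every literal dot with a space
step1 <- gsub("([A-Z]+:[A-Z]+|\\.)", " ", x)

# Strip the leading whitespace introduced by the replacements
category <- trimws(step1)
category
```
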
You can do:

 gsub(".*?(\\.|\\s)(\\w+)","\\2 ",dat$Sub_Category)

Here is an example; I select only the last two columns (5:6) so you can see what happens:

transform(dat,category=gsub(".*?(\\.|\\s)(\\w+)","\\2 ",Sub_Category))[5:6]
           Sub_Category     category
1  TEMP:OTH.CONT.WORKER CONT WORKER 
2  TEMP:OTH.CONT.WORKER CONT WORKER 
3  TEMP:OTH.CONT.WORKER CONT WORKER 
4  TEMP:OTH.CONT.WORKER CONT WORKER 
5       TEMP:MSP NON IT      NON IT 
6       TEMP:MSP NON IT      NON IT 
7       TEMP:MSP NON IT      NON IT 
8           TEMP:MSP IT          IT 
9           TEMP:MSP IT          IT 
10          TEMP:MSP IT          IT
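
The result is upper-case and carries a trailing space, whereas the desired output uses the labels Cont Worker, Non IT and IT. One possible way to get there, sketched under the assumption that only these three categories occur, is to trim the result and map it through a named lookup vector (`labels` and `key` are hypothetical names):

```r
# Illustrative sample of Sub_Category values
sub_category <- c("TEMP:OTH.CONT.WORKER", "TEMP:MSP NON IT", "TEMP:MSP IT")

# Same gsub substitution as in the answer, then drop the trailing space
key <- trimws(gsub(".*?(\\.|\\s)(\\w+)", "\\2 ", sub_category))

# Hypothetical lookup table: upper-case key -> label from the desired output
labels <- c("CONT WORKER" = "Cont Worker", "NON IT" = "Non IT", "IT" = "IT")

# unname() drops the keys so only the labels remain; unmatched keys become NA
category <- unname(labels[key])
category
```

Any Sub_Category value not covered by the lookup would come back as NA, which makes unexpected categories easy to spot.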