Extract specific words using regex or substring from a column in R
I have the following data:
Opex_Spend_Month Opex_Spend_YTD Major_Category NBS_Region Sub_Category
92179.84 113542.84 Contingent Labour EUROPE TEMP:OTH.CONT.WORKER
297.82 82392.82 Contingent Labour EUROPE TEMP:OTH.CONT.WORKER
13974.8 34917.8 Contingent Labour EUROPE TEMP:OTH.CONT.WORKER
138.6 63125.6 Contingent Labour EUROPE TEMP:OTH.CONT.WORKER
NA 73097 Contingent Labour EUROPE TEMP:MSP NON IT
NA 96035 Contingent Labour EUROPE TEMP:MSP NON IT
1388.65 68934.65 Contingent Labour EUROPE TEMP:MSP NON IT
5393.76 18748.76 Contingent Labour EUROPE TEMP:MSP IT
528.38 82195.38 Contingent Labour EUROPE TEMP:MSP IT
22369 95468 Contingent Labour EUROPE TEMP:MSP IT
From the column Sub_Category I would like to extract the last part, i.e. Cont Worker, Non IT and IT. I am not sure which regex or substring function to use.
Expected output:
Opex_Spend_Month Opex_Spend_YTD Major_Category NBS_Region Sub_Category Category
92179.84 113542.84 Contingent Labour EUROPE TEMP:OTH.CONT.WORKER Cont Worker
297.82 82392.82 Contingent Labour EUROPE TEMP:OTH.CONT.WORKER Cont Worker
13974.8 34917.8 Contingent Labour EUROPE TEMP:OTH.CONT.WORKER Cont Worker
138.6 63125.6 Contingent Labour EUROPE TEMP:OTH.CONT.WORKER Cont Worker
NA 73097 Contingent Labour EUROPE TEMP:MSP NON IT Non IT
NA 96035 Contingent Labour EUROPE TEMP:MSP NON IT Non IT
1388.65 68934.65 Contingent Labour EUROPE TEMP:MSP NON IT Non IT
5393.76 18748.76 Contingent Labour EUROPE TEMP:MSP IT IT
528.38 82195.38 Contingent Labour EUROPE TEMP:MSP IT IT
22369 95468 Contingent Labour EUROPE TEMP:MSP IT IT
Can someone help me with this?
We can use str_extract:
library(stringr)
# the backslash must be doubled inside an R string literal
str_extract(df1$Sub_Category, "(CONT\\.WORKER|NON IT|IT)$")
In base R:
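Note that str_extract returns the matched text as it appears in the column (for example CONT.WORKER), not the mixed-case labels shown in the expected output. One possible way to get those exact labels is a small lookup vector. This is only a sketch: the sub_category vector and the labels mapping are illustrative names, not part of the original data or answer.

```r
library(stringr)

# Illustrative stand-in for df1$Sub_Category (one value per distinct pattern)
sub_category <- c("TEMP:OTH.CONT.WORKER", "TEMP:MSP NON IT", "TEMP:MSP IT")

# Step 1: pull out the trailing label exactly as it is written in the data
raw <- str_extract(sub_category, "(CONT\\.WORKER|NON IT|IT)$")

# Step 2: map each raw label to the spelling wanted in the output
# (assumed mapping, taken from the expected output in the question)
labels <- c("CONT.WORKER" = "Cont Worker", "NON IT" = "Non IT", "IT" = "IT")
category <- unname(labels[raw])

print(category)
```

Values that match none of the three alternatives come back as NA from str_extract, and stay NA after the lookup, so unexpected sub-categories are easy to spot.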
df$Category = trimws(gsub('([A-Z]+:[A-Z]+|\\.)', ' ', df$Sub_Category))
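The same idea can also be written as two explicit steps, which may be easier to read: first drop the leading "TEMP:XXX" prefix together with its separator, then turn the remaining dots into spaces. This is a hedged variant, not the answerer's code, and the small df below is a made-up stand-in for the real data frame.

```r
# Made-up stand-in for the question's data frame (only the relevant column)
df <- data.frame(
  Sub_Category = c("TEMP:OTH.CONT.WORKER", "TEMP:MSP NON IT", "TEMP:MSP IT"),
  stringsAsFactors = FALSE
)

# Step 1: remove everything up to and including the separator after the
# second code, e.g. "TEMP:OTH." or "TEMP:MSP "
tail_part <- sub("^[^:]+:[A-Z]+[. ]", "", df$Sub_Category)

# Step 2: replace literal dots with spaces (fixed = TRUE avoids regex escaping)
df$Category <- gsub(".", " ", tail_part, fixed = TRUE)

print(df)
```

Like the one-liner, this leaves the labels in upper case (CONT WORKER, NON IT, IT).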
You can do:
gsub(".*?(\\.|\\s)(\\w+)", "\\2 ", dat$Sub_Category)
Here is an example. Only the last two columns (5:6) are selected so you can see what happens:
transform(dat, category = gsub(".*?(\\.|\\s)(\\w+)", "\\2 ", Sub_Category))[5:6]
Sub_Category category
1 TEMP:OTH.CONT.WORKER CONT WORKER
2 TEMP:OTH.CONT.WORKER CONT WORKER
3 TEMP:OTH.CONT.WORKER CONT WORKER
4 TEMP:OTH.CONT.WORKER CONT WORKER
5 TEMP:MSP NON IT NON IT
6 TEMP:MSP NON IT NON IT
7 TEMP:MSP NON IT NON IT
8 TEMP:MSP IT IT
9 TEMP:MSP IT IT
10 TEMP:MSP IT IT
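One detail that the printed data frame hides: because the replacement is "\\2 ", every kept word is followed by a space, so each result ends with a trailing space. A minimal sketch of trimming it, assuming a small illustrative vector x in place of dat$Sub_Category and using perl = TRUE so the lazy quantifier follows PCRE semantics:

```r
# Illustrative stand-in for dat$Sub_Category
x <- c("TEMP:OTH.CONT.WORKER", "TEMP:MSP NON IT", "TEMP:MSP IT")

# Keep each word that follows a dot or whitespace, followed by a space
raw <- gsub(".*?(\\.|\\s)(\\w+)", "\\2 ", x, perl = TRUE)

# Each element of raw ends in a space; trimws() removes it
category <- trimws(raw)

print(category)
```

Trimming matters if the new column is later compared or joined against values such as "NON IT", where a trailing space would silently prevent a match.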