Is there a better way to replace several words in a string with another word? (SAS)

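I have a lot of data in which one column is a free-text description. I am processing this in SAS and, as part of that, I want to correct some spellings and remove some words that don't really add any value to the text (a.k.a. "stopwords").

I have a way of doing this, shown in the code below, and it works fine, but it means I need to write one line of code for every variant of each word I want to change.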

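In the example below I want to:

  1. Replace variants of the word "mug" (i.e. "mug", "mugg", "mugs") with the word "cup".
  2. Remove three words that don't really add to the meaning of the text (known as "stopwords"; here I have listed only 3: "i", "me", "my").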

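For this to work properly I need to replace the words only where they are surrounded by spaces (i.e. replace " mug ", " mugg ", " mugs " rather than every instance of "mug", "mugg", "mugs"). This avoids replacing part of another word that contains the string, such as "muggle". So, before making the spelling changes, I have to remove punctuation and add a space to the start and end of the text string, which is fine.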

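I am sure there must be a better way than the code below, and I am keen to improve my SAS, so does anyone know of one? Is there a way to create a list consisting of "mug", "mugg", "mugs" and then replace all of those words with "cup" in a single line?

Any ideas would be much appreciated :)

The code is below: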

data have;
  infile datalines dsd truncover;
  input ID Description :$200. Col3 $ Col4 Col5 Col6;  /* 200 is an assumed width for Description */
datalines;
1,bla bla my mybla,C1,0,100,0
2,got me tear,C1,0,0,0
3,free text i ,C1,10,100,0
4,house roof tree!?,C1,10,100,0
5,house mugg muggle,C1,10,0,0
6,sky** computer mug mug mugs!,C3,0,20,1
;
/* add a space to the start and end so every word is surrounded by spaces */
data data_1;
set have;
Space = "_";
Description_new = catt(Space, Description, Space);
Description_new = tranwrd(Description_new,"_", " _ ");
run;

/* remove punctuation so every word is surrounded by spaces */
data data_2;
set data_1;
Description_new = COMPRESS(Description_new,,'p');
drop Space;
run;

/* replace the variants of mug with cup */
data data_3;
set data_2;
Description_new = tranwrd(Description_new," mug ", " cup ");
Description_new = tranwrd(Description_new," mugs ", " cup ");
Description_new = tranwrd(Description_new," mugg ", " cup ");
run;

/* remove stopwords */
data data_4;
set data_3;
Description_new = tranwrd(Description_new," i ", " ");
Description_new = tranwrd(Description_new," me ", " ");
Description_new = tranwrd(Description_new," my ", " ");
run;

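You can use a format to convert each word in the original variable. Define a character format that maps every variant to its replacement (and every stopword to a blank), then loop over the words with SCAN, apply the format with PUT, and rebuild the string with CATX, which skips blank pieces.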

data have;
  infile datalines dsd truncover;
  input ID Description :$200. Col3 $ Col4 Col5 Col6;  /* 200 is an assumed width, matching $fix200. below */
datalines;
1,bla bla my mybla,C1,0,100,0
2,got me tear,C1,0,0,0
3,free text i ,C1,10,100,0
4,house roof tree!?,C1,10,100,0
5,house mugg muggle,C1,10,0,0
6,sky** computer mug mug mugs!,C3,0,20,1
;

proc format ;
value $fix (max=200)
  "mug", "mugg", "mugs" = "cup"
  "i", "me", "my" = " "
;
run;

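data want;
  set have;
  /* give FIXED the same length as Description, then start from blank */
  fixed=description;
  fixed=' ';
  do index=1 to countw(description,' ');
    /* strip punctuation from each word so that "mugs!" matches "mugs" */
    fixed=catx(' ',fixed,put(compress(scan(description,index,' '),,'p'),$fix200.));
  end;
  drop index;
run;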
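
Another option, not part of the answer above, is a Perl regular expression through PRXCHANGE, which lets the whole list of variants sit in one pattern. The following is only a minimal sketch under assumptions: the dataset names have_demo and want_regex, the variable fixed, and the three sample rows are made up for illustration, and the length of 200 is assumed.

```sas
/* Hypothetical sample rows, modelled on the question's data. */
data have_demo;
  infile datalines truncover;
  input Description $200.;
datalines;
house mugg muggle
sky** computer mug mug mugs!
got me tear
;

data want_regex;
  set have_demo;
  length fixed $200;   /* assumed length */
  /* Remove punctuation first, as the question does. */
  fixed = compress(Description, , 'p');
  /* \b is a word boundary; -1 asks PRXCHANGE to replace every match. */
  fixed = prxchange('s/\b(mug|mugg|mugs)\b/cup/', -1, fixed);
  /* Delete the stopwords with the same kind of pattern. */
  fixed = prxchange('s/\b(i|me|my)\b//', -1, fixed);
  /* Collapse the repeated blanks left by the deletions and trim the ends. */
  fixed = strip(compbl(fixed));
run;
```

Because \b matches a position rather than a space character, neighbouring targets such as "mug mug" do not have to share a delimiter, and "muggle" is left alone because no word boundary follows "mug" there. If the text is mixed case, adding the i modifier (for example 's/\b(mug|mugg|mugs)\b/cup/i') should make the match case-insensitive.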