Replace all instances of long dashes – with regular minus - signs in R

In my text data below, there is apparently a special character that looks like a long dash. It actually needs to be a regular minus sign -.

Is there a way in R to replace all instances of the long dash – with a regular minus sign -, so that I can then call read.table(text = dat, header = TRUE) on dat?

dat <- "
Study Outcome Subscale g Variance Precision
1 1 1 –.251 .024 41.455
2 1 1 –.069 .001 1,361.067
3 1 5 .138 .001 957.620
4 1 1 –.754 .085 11.809
5 1 1 –.228 .020 49.598
6 1 6 –.212 .004 246.180
6 2 7 .219 .004 246.095
7 1 1 .000 .012 83.367
8 1 2 –.103 .006 162.778
8 2 3 .138 .006 162.612
8 3 4 –.387 .006 160.133
9 1 1 –.032 .023 44.415
10 1 5 –.020 .058 17.110
11 1 1 .128 .017 59.999
12 1 1 –.262 .032 31.505
13 1 1 –.046 .071 14.080
14 1 6 –.324 .003 381.620
14 2 6 –.409 .003 378.611
14 3 7 .080 .003 386.319
14 4 7 –.140 .003 385.542
15 1 1 .311 .005 185.364
16 1 1 .036 .005 205.063
17 1 6 –.259 .001 925.643
17 2 7 .196 .001 928.897
18 1 1 .157 .013 74.094
19 1 1 .000 .056 17.985
20 1 1 .000 .074 13.600
21 1 6 –.013 .039 25.425
21 2 7 –.004 .039 25.426
22 1 1 –.202 .001 1,487.992
23 1 1 .000 .086 11.628
24 1 1 –.221 .001 713.110
25 1 1 –.099 .001 749.964
26 1 5 –.165 .000 6,505.024
27 1 1 –.523 .063 15.856
28 1 1 .000 .001 1,611.801
29 1 6 .377 .045 22.045
29 2 7 .575 .046 21.677
30 1 1 .590 .074 13.477
31 1 1 .020 .001 1,335.991
32 1 1 .121 .043 23.489
33 1 1 –.101 .003 363.163
34 1 1 –.101 .003 369.507
35 1 1 –.104 .004 255.507
36 1 1 –.270 .003 340.761
37 1 1 .179 .150 6.645
38 1 2 .468 .020 51.255
38 2 4 –.479 .020 51.193
39 1 5 –.081 .024 42.536
40 1 1 –.071 .043 23.519
41 1 1 .201 .077 13.036
42 1 6 –.070 .006 180.844
42 2 7 .190 .006 180.168
43 1 1 .277 .013 79.220
44 1 5 –.086 .001 903.924
45 1 5 –.338 .002 469.260
46 1 1 .262 .003 290.330
47 1 5 .000 .003 304.959
48 1 1 –.645 .055 18.192
49 1 5 –.120 .002 461.802
50 1 5 –.286 .009 106.189
51 1 1 –.124 .006 172.261
52 1 1 .023 .028 35.941
53 1 5 –.064 .001 944.600
54 1 1 .000 .043 23.010
55 1 1 .000 .014 72.723
56 1 5 .000 .012 85.832
57 1 1 .000 .012 85.832
"
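Before replacing anything, it can help to confirm which character the dash actually is. The following is a minimal sketch on a two-row stand-in for `dat` (the name `sample_txt` is only illustrative); it lists the code points of any non-ASCII characters in the text.

```r
# A two-row stand-in for `dat`; "\u2013" is R's escape for U+2013 (EN DASH).
sample_txt <- "Study g\n1 \u2013.251\n2 .138"

# Code points of every character, keeping only the non-ASCII ones.
codes <- unique(utf8ToInt(sample_txt))
codes <- codes[codes > 127]

# Format as U+XXXX for lookup in a Unicode chart.
sprintf("U+%04X", codes)
# [1] "U+2013"
```

Running the same three lines on the real `dat` should show whether the en dash is the only unusual character present.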

Use gsub() from base R:

dat <- gsub(pattern = "–", replacement = "-", x = dat)


head(read.table(text = dat, header = TRUE))
  Study Outcome Subscale      g Variance Precision
1     1       1        1 -0.251    0.024    41.455
2     2       1        1 -0.069    0.001 1,361.067
3     3       1        5  0.138    0.001   957.620
4     4       1        1 -0.754    0.085    11.809
5     5       1        1 -0.228    0.020    49.598
6     6       1        6 -0.212    0.004   246.180
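Two details may be worth a sketch here. First, writing the dash as the escape `"\u2013"` with `fixed = TRUE` avoids depending on how the script file itself is encoded. Second, the output above still shows thousands separators in `Precision` (for example `1,361.067`), so that column is presumably read as character rather than numeric. The example below uses a three-row stand-in for `dat`; `sample_txt`, `clean`, and `df` are illustrative names.

```r
# Three-row stand-in for `dat`, written with the "\u2013" escape for EN DASH.
sample_txt <- "Study g Precision
1 \u2013.251 41.455
2 \u2013.069 1,361.067
3 .138 957.620"

# fixed = TRUE treats the pattern as a literal string, not a regex.
clean <- gsub("\u2013", "-", sample_txt, fixed = TRUE)
df <- read.table(text = clean, header = TRUE)

# g now parses as numeric; Precision does not, because of the commas.
str(df)

# Drop the thousands separators, then convert.
df$Precision <- as.numeric(gsub(",", "", df$Precision, fixed = TRUE))
df
```

Whether the commas should be stripped depends on the data; here they are assumed to be thousands separators, not decimal marks.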

An example using stringr:

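library(stringr)
library(dplyr)  # re-exports as_tibble() from tibble

# Replace every en dash with an ASCII hyphen-minus, then parse the text
x <- str_replace_all(dat, "–", "-")
as_tibble(read.table(text = x, header = TRUE))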

Easily normalize all dashes:

dat <- gsub("\\p{Pd}", "-", dat, perl = TRUE)
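As a small illustration of what the `\p{Pd}` class does, the sketch below runs it on a hypothetical vector `x` that mixes several dash characters. The values are made up for the example and are written with Unicode escapes so the code does not depend on file encoding.

```r
# Hypothetical values mixing several dash characters via Unicode escapes:
# EN DASH, EM DASH, NON-BREAKING HYPHEN, and the plain ASCII hyphen-minus.
x <- c("\u2013.251", "\u2014.069", "\u2011.754", "-.228")

# The backslash is doubled because "\p" alone is not a valid escape
# inside an R string literal.
normalized <- gsub("\\p{Pd}", "-", x, perl = TRUE)

# Every element now starts with an ASCII "-" and converts cleanly.
as.numeric(normalized)
```

`perl = TRUE` is needed because Unicode property classes such as `\p{Pd}` are a PCRE feature, not part of R's default regex engine.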


See https://www.fileformat.info/info/unicode/category/Pd/list.htm for the characters in Unicode category Pd (Punctuation, Dash):

Character  Name                                      Glyph
U+002D     HYPHEN-MINUS                              -
U+058A     ARMENIAN HYPHEN                           ֊
U+05BE     HEBREW PUNCTUATION MAQAF                  ־
U+1400     CANADIAN SYLLABICS HYPHEN                 ᐀
U+1806     MONGOLIAN TODO SOFT HYPHEN                ᠆
U+2010     HYPHEN                                    ‐
U+2011     NON-BREAKING HYPHEN                       ‑
U+2012     FIGURE DASH                               ‒
U+2013     EN DASH                                   –
U+2014     EM DASH                                   —
U+2015     HORIZONTAL BAR                            ―
U+2E17     DOUBLE OBLIQUE HYPHEN                     ⸗
U+2E1A     HYPHEN WITH DIAERESIS                     ⸚
U+2E3A     TWO-EM DASH                               ⸺
U+2E3B     THREE-EM DASH                             ⸻
U+2E40     DOUBLE HYPHEN                             ⹀
U+301C     WAVE DASH                                 〜
U+3030     WAVY DASH                                 〰
U+30A0     KATAKANA-HIRAGANA DOUBLE HYPHEN           ゠
U+FE31     PRESENTATION FORM FOR VERTICAL EM DASH    ︱
U+FE32     PRESENTATION FORM FOR VERTICAL EN DASH    ︲
U+FE58     SMALL EM DASH                             ﹘
U+FE63     SMALL HYPHEN-MINUS                        ﹣
U+FF0D     FULLWIDTH HYPHEN-MINUS                    -
U+10EAD    YEZIDI HYPHENATION MARK                   𐺭
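
The list can be spot-checked from R. The sketch below tests a few of the code points above against `\p{Pd}`, and adds U+2212 MINUS SIGN for comparison: that character belongs to category Sm (Symbol, Math), not Pd, so it is not in the list. The extended pattern at the end is an assumed variation for data that may contain a true minus sign; it is not part of the answer above.

```r
# A few code points from the list, plus U+2212 MINUS SIGN for comparison.
cp <- as.integer(c(0x2010, 0x2013, 0x2014, 0xFF0D, 0x2212))
ch <- intToUtf8(cp, multiple = TRUE)

# TRUE where the single character belongs to category Pd.
is_pd <- grepl("^\\p{Pd}$", ch, perl = TRUE)
setNames(is_pd, sprintf("U+%04X", cp))

# Assumed extension: add U+2212 to the character class so that a
# typographic minus sign is normalized along with the dashes.
fixed_txt <- gsub("[\\p{Pd}\u2212]", "-", "\u2212.5 \u2013.3", perl = TRUE)
fixed_txt
```

Only the older, widely supported code points are used in the check, since very recent additions such as U+10EAD may not be recognized by an older PCRE build.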