replace function giving extra character at the end

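I am trying to clean the data in a DataFrame column using the replace function. The output always ends up with an extra character at the end, and the more times I run the same code, the more characters get appended. Can anyone help me solve this?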

import numpy as np
import pandas as pd

owner = ['China', 'Chinese', 'Hong Kong', 'Hongkong', 'China (Taiwan)', 'Japan',
 'Sweden', 'Canada', 'HK', 'United States', 'Indian', 'American', 'Japanese',
 'U.S.', 'Taiwanese', 115, 'Australia', 'HongKong', 'France', 'Taiwan',
 'Malaysia', 'Taiwan, China', 1380, 'Switzerland', 'US', 'Netherlands',
 'Chinese-Hongkong', 208, 447, 153, 151, 'Ireland', 'Taiwan China', 'china',
 'Taiwan of China', 'Mainland China', 'Chinese(HK)', 'HONG KONG OF CHINA',
 'USA', 'Korea, Republic of', 'Chinses', 1834, 'South Africa', 40, 184, 190, 427,
 'German', 'Singapore', 'The philipines', 193, 397, 'Janpan', 'Japan and Taiwan',
 48, 46, 1274, 'Chines', 1641, 89, 'Korea', 50, 43, 380, 'Hong Kong, China',
 'China (HK)', 85, 'Germany', 'English', 205, 'Hongkong, China', 35,
 'South Korean', 'British Virgin Islands', 'China (Hong Kong)', 156, 490, 94, 95,
 138, 319, 'Mandarin', 'Spainish', 'South Korea', 'Hongkong,China', 'U.S.A',
 'Hongkong China', 2583, 98, 'Korean', 5000, 'India', 100, 38, 'chinese', 'The USA',
 'Canadian', 'Taiwan/HK/Macao', 'Chinese(Taiwan)', 'Republic of Korea',
 'China and South Korea', 'South korea', 'China,Korean', 'Denmark and China',
 272, 256, 235, 143, 0, 'UK', 73, 'Sri Lanka', 240, 159, 275, 'Tai Wan', 192,
 'China Taiwan', 225, 146, 78, 200, 'Amazon report', 'Chian', 'Not provided',
 'China, Hongkong', 'Thailand', 37, 97, 77, 191, 2951, 897, 140, 199, 636, 'Macao']

data = {'Owner': owner}

df = pd.DataFrame(data)

df.replace('China', 'Chinese', regex=True, inplace=True)
df.replace('china', 'Chinese', regex=True, inplace=True)
df.replace('Mainland China', 'Chinese', regex=True, inplace=True)
df.replace('Chinses', 'Chinese', regex=True, inplace=True)
df.replace('Chines', 'Chinese', regex=True, inplace=True)

df['Owner new'] = np.where(df['Owner'] != 'Chinese', 'Foreign', df['Owner'])

print(df['Owner'].unique())

The output I get:

['Chinesee' 'Hong Kong' 'Hongkong' 'Chinesee (Taiwan)' 'Japan' 'Sweden'
 'Canada' 'HK' 'United States' 'Indian' 'American' 'Japanese' 'U.S.'
 'Taiwanese' 115 'Australia' 'HongKong' 'France' 'Taiwan' 'Malaysia'
 'Taiwan, Chinesee' 1380 'Switzerland' 'US' 'Netherlands'
 'Chinesee-Hongkong' 208 447 153 151 'Ireland' 'Taiwan Chinesee'
 'Taiwan of Chinesee' 'Mainland Chinesee' 'Chinesee(HK)'
 'HONG KONG OF CHINA' 'USA' 'Korea, Republic of' 1834 'South Africa' 40
 184 190 427 'German' 'Singapore' 'The philipines' 193 397 'Janpan'
 'Japan and Taiwan' 48 46 1274 'Chinese' 1641 89 'Korea' 50 43 380
 'Hong Kong, Chinesee' 'Chinesee (HK)' 85 'Germany' 'English' 205
 'Hongkong, Chinesee' 35 'South Korean' 'British Virgin Islands'
 'Chinesee (Hong Kong)' 156 490 94 95 138 319 'Mandarin' 'Spainish'
 'South Korea' 'Hongkong,Chinesee' 'U.S.A' 'Hongkong Chinesee' 2583 98
 'Korean' 5000 'India' 100 38 'chinese' 'The USA' 'Canadian'
 'Taiwan/HK/Macao' 'Chinesee(Taiwan)' 'Republic of Korea'
 'Chinesee and South Korea' 'South korea' 'Chinesee,Korean'
 'Denmark and Chinesee' 272 256 235 143 0 'UK' 73 'Sri Lanka' 240 159 275
 'Tai Wan' 192 'Chinesee Taiwan' 225 146 78 200 'Amazon report' 'Chian'
 'Not provided' 'Chinesee, Hongkong' 'Thailand' 37 97 77 191 2951 897 140
 199 636 'Macao']

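In your case you are also replacing substrings (as mentioned in my comment). With regex=True the pattern is searched for anywhere inside each string, so 'Chines' also matches the first six characters of 'Chinese'; the leftover 'e' stays behind and you get 'Chinesee', and every rerun adds one more. To replace only whole values, add ^ at the start and $ at the end of the pattern, so that only values matching in full are replaced. For example: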

The case above:

>>> df = pd.DataFrame(["Chinese"])
>>> df.replace("Chines", "China", regex=True)
        0
0  Chinae
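
The growth on every rerun can be reproduced without pandas. The sketch below uses Python's re module, whose pattern syntax is what regex=True relies on; the helper name rerun is made up for illustration and is not part of the question's code.

```python
import re

def rerun(value, times):
    """Apply the unanchored substitution `times` times, like re-running the script."""
    for _ in range(times):
        # "Chines" is found inside "Chinese", "Chinesee", ... so an "e" is always left over.
        value = re.sub("Chines", "Chinese", value)
    return value

for n in range(4):
    print(n, rerun("Chinese", n))
```

Each pass makes the string one character longer, which matches the behaviour described in the question.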

Solution 1: use the regex anchors ^ and $

>>> df.replace("^Chines$", "China", regex=True)
         0
0  Chinese
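
Solution 1 could be extended to cover several spellings in one pass. This is only a sketch: the sample rows are a small subset of the question's data, and the list of alternatives inside the pattern is an assumption about which spellings should count as 'Chinese'.

```python
import pandas as pd

df = pd.DataFrame({"Owner": ["China", "china", "Chinses", "Chines",
                             "Chinese", "China (HK)", 115]})

# (?i) makes the match case-insensitive; ^ and $ force the whole value to match,
# so "China (HK)" and the non-string 115 are left untouched.
pattern = r"(?i)^(?:china|chinses|chines|chinese)$"

cleaned = df.replace(pattern, "Chinese", regex=True)
print(cleaned["Owner"].tolist())
```

Because 'Chinese' maps to itself, running the replacement again does not change the result.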

Solution 2: set regex=False so that a value is replaced only when the whole value equals the search string. (regex defaults to False.)

>>> df.replace("Chines", "China")
         0
0  Chinese
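
Applied to the question's column, Solution 2 might look like the sketch below. The variants list is a hypothetical choice of spellings to normalise, and the sample rows are a small subset of the original data; both would need adjusting for the real column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Owner": ["China", "china", "Chinses", "Chines", "Chinese",
                             "Mainland China", "China (HK)", "Japan", 115]})

# Hypothetical list of spellings to treat as 'Chinese'; extend it as needed.
variants = ["China", "china", "Mainland China", "Chinses", "Chines", "chinese"]

# regex=False is the default, so each entry must equal the whole cell value.
df["Owner"] = df["Owner"].replace(variants, "Chinese")

# Flag everything that is not exactly 'Chinese' as foreign.
df["Owner new"] = np.where(df["Owner"] == "Chinese", "Chinese", "Foreign")
print(df)
```

With exact matching, a compound value such as 'China (HK)' is left as it is, so whether it should count as 'Chinese' has to be decided explicitly by adding it to the list.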