How do I remove non-ASCII characters (e.g. б§•¿µ´‡»Ž®ºÏƒ¶¹) from text in pandas DataFrame columns?
I have tried the following, without success:
df = pd.read_csv(path, index_col=0)
for col in df.columns:
    for j in df.index:
        markup1 = str(df.ix[j, col]).replace("\r", "")
        markup1 = markup1.replace("\n", "")
        markup1 = markup1.decode('unicode_escape').encode('ascii','ignore').strip()
        soup = BeautifulSoup(markup1, 'lxml')
        df.ix[j, col] = soup.get_text()
    print df.ix[j, 'requirements']
I also tried a regular expression, but that did not work either:
markup1 = str(df.ix[j, 'requirements']).replace("\r", "")
markup1 = markup1.replace("\n", "")
markup1 = re.sub(r'[^\x00-\x7F]+', ' ', markup1)
I still keep getting non-ASCII characters. Any suggestions would be appreciated.
I have added the first three rows of df below:
     col1                                               col2  \
1.0  H1B SPONSOR FOR L1/L2/OPT                          US, NY, New York
2.0  Graphic / Web Designer                             US, TX, Austin
3.0  Full Stack Developer (.NET or equivalent + Jav...  GR, ,
     col3                 col4  \
1.0  NaN                  NaN
2.0  Sales and Marketing  NaN
3.0  NaN                  NaN
     col5  \
1.0  i28 Technologies has demonstrated expertise in...
2.0  outstanding people who believe that more is po...
3.0  NaN
     col6  \
1.0  Hello,Wish you are doing good... ...
2.0  The Graphic / Web Designer will manage, popula...
3.0  You?ll have to join the Moosend dojo. But, yo...
     col7  \
1.0  JAVA, .NET, SQL, ORACLE, SAP, Informatica, Big...
2.0  Bachelor?s degree in Graphic Design, Web Desig...
3.0  ? .NET or equivalent (Java etc.)? MVC? Javascr...
     col8                                               col9
1.0  NaN                                                f
2.0  CSD offers a competitive benefits package for ...  f
3.0  You?ll be working with the best team in town.....  f
Option 1 - if you know the complete set of non-ASCII characters:
df
Out[36]:
        col1  col2
0  aaб§•¿µbb  abcd
1        hf4  efgh
2        xxx  ijk9
df.replace(regex=True, to_replace=['б', '§'], value='')  # the list is incomplete here
Out[37]:
      col1  col2
0  aa•¿µbb  abcd
1      hf4  efgh
2      xxx  ijk9
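When the known set is long, or contains characters that are special in regular expressions, one option is to build a single character class from it instead of listing each pattern separately. The sketch below assumes Python 3 and a reasonably recent pandas; the sample frame and the name `unwanted` are illustrative and not taken from the question's data.

```python
import re

import pandas as pd

df = pd.DataFrame({
    "col1": ["aaб§•¿µbb", "hf4", "xxx"],
    "col2": ["abcd", "efgh", "ijk9"],
})

# Hypothetical "complete" set of characters to strip.
unwanted = "б§•¿µ"

# re.escape keeps any regex metacharacters in the set literal;
# the brackets turn the set into one character class.
pattern = "[" + re.escape(unwanted) + "]"

cleaned = df.replace(to_replace=pattern, value="", regex=True)
print(cleaned)
```

Only characters listed in `unwanted` are removed, so anything missing from the set stays in the data.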
Option 2 - if you cannot specify the whole set of non-ASCII characters:
Consider using string.printable:
String of ASCII characters which are considered printable.
from string import printable
printable
Out[38]: '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'
df.applymap(lambda y: ''.join(filter(lambda x: x in printable, y)))
Out[14]:
   col1  col2
0  aabb  abcd
1   hf4  asdf
2   xxx
Note that if an element of the DataFrame consists entirely of non-ASCII characters, it is simply replaced with ''.
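A character filter applied to every cell raises a TypeError on non-string cells such as the NaN values in the question's frame, because a float cannot be iterated. One way around that, sketched here under the assumption of Python 3, is to encode to ASCII with `errors="ignore"`, decode back, and touch only `str` values. The helper name `strip_non_ascii` and the sample data are made up for illustration.

```python
import pandas as pd


def strip_non_ascii(value):
    """Drop non-ASCII characters from strings; return any other value unchanged."""
    if isinstance(value, str):
        return value.encode("ascii", errors="ignore").decode("ascii")
    return value


df = pd.DataFrame({
    "col6": ["You\u2019ll have to join the Moosend dojo.", float("nan")],
    "col9": [1.0, 2.0],
})

# Apply the helper cell by cell, one column at a time.
cleaned = df.apply(lambda col: col.map(strip_non_ascii))
print(cleaned)
```

Unlike the `string.printable` filter, this keeps ASCII control characters as well, so it is slightly more permissive.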
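Inspired by Brad's answer, I solved this with a whitelist of ASCII values: space, comma, period, [A-Z] and [a-z]. Add 48-57 to the list to keep the digits [0-9] as well.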
def remove_non_ascii(text):
    # Allowed code points: space (32), comma (44), period (46), A-Z (65-90), a-z (97-122)
    L = [32, 44, 46, 65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,97,98,99,100,101,102,103, 104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122]
    text = str(text)
    return ''.join(i for i in text if ord(i) in L)
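The same whitelist idea can be written without hand-typed code points by building the allowed set from the `string` module. This is only a sketch of a variant, not the answer's original code: the names `ALLOWED` and `keep_allowed` are hypothetical, and the set here also includes digits.

```python
import string

# Letters, digits, space, comma and period; a frozenset gives fast membership tests.
ALLOWED = frozenset(string.ascii_letters + string.digits + " ,.")


def keep_allowed(text):
    """Return text with every character outside ALLOWED removed."""
    return "".join(ch for ch in str(text) if ch in ALLOWED)


print(keep_allowed("Bachelor\u2019s degree, 2 yrs."))
```

As in the function above, `str(text)` turns a NaN cell into the string 'nan', so missing values may need to be handled before applying it to a column.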