Transform a pandas string column containing Unicode escapes to ASCII to load URLs
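I have a pandas DataFrame with a column of Wikipedia URLs that I want to load. However, some of the strings won't load because they contain percent-encoded Unicode characters. For example, 'Kruskal %E2%80%93Wallis_one-way_analysis_of_variance' raises the following: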
PageError: Page id "Cauchy%E2%80%93Schwarz_inequality" does not match any pages. Try another id!
Is there a way to convert all of the Unicode escapes to ASCII? In this case, I need a function that creates a new column:
old column                           new column
Cauchy%E2%80%93Schwarz_inequality    Cauchy–Schwarz_inequality
Markov%27s_inequality                Markov's_inequality
urllib.parse.unquote should do the trick: it decodes each %XX escape back into the character it stands for. Hope this helps.
In [1]: import urllib.parse
...:
...: import pandas as pd
...:
...:
...: df = pd.DataFrame({'url': ['Markov%27s_inequality', 'Cauchy%E2%80%93Schwarz_inequality']})
...: df['clean_url'] = df['url'].apply(urllib.parse.unquote)
...:
In [2]: df
Out[2]:
                                 url                  clean_url
0              Markov%27s_inequality        Markov's_inequality
1  Cauchy%E2%80%93Schwarz_inequality  Cauchy–Schwarz_inequality
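If the column can contain missing values, something like the sketch below may be safer. It is only an illustration under a few assumptions: the `None` row and the `title` column are made up for the example, and replacing underscores with spaces is a guess at what the page lookup expects. Note that the decoded text is not strictly ASCII, since the en dash in "Cauchy–Schwarz" is still a Unicode character; it is just no longer percent-encoded.

```python
from urllib.parse import unquote

import pandas as pd

# Same sample data as above, plus a hypothetical missing value
df = pd.DataFrame(
    {"url": ["Markov%27s_inequality", "Cauchy%E2%80%93Schwarz_inequality", None]}
)

# na_action="ignore" skips missing values instead of passing them to unquote
df["clean_url"] = df["url"].map(unquote, na_action="ignore")

# Assumed extra step: turn underscores into spaces to get a readable page title
df["title"] = df["clean_url"].str.replace("_", " ", regex=False)

print(df)
```

Using `map` with `na_action="ignore"` avoids the `TypeError` that `unquote` would raise if it were handed a non-string missing value.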