Remove Unicode literals in a Dataframe string
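I have a Dataframe of resumes, but they contain Unicode literals such as "\xe2\x80\x93".
I want to remove all of these values to prepare the text for processing.
My problem is that I have tried many of the recommended ways to remove them, and none of them seems to work when applied to the data in my df.
Example text: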
"COMPETENCIES\nBenefits Administration \xe2\x80\x93 Customer Service
\xe2\x80\x93 Cost Control \xe2\x80\x93 Recruiting \xe2\x80\x93 Acquisition Management"
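The part I find difficult is that if I put this text in a string variable, e.g. y = <text>, and then use one of the following approaches to deal with the Unicode literals:
print(re.sub(r'[^\x00-\x7F+]', ' ', y))
print(y.encode('ascii', errors='ignore').decode('ascii'))
it outputs: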
"CORE COMPETENCIES
Benefits Administration Customer Service Cost Control Recruiting Acquisition Management"
as expected.
When I try the same thing on the values in the Dataframe, it does not seem to work at all.
I have tried the following (the df is called resume):
resume = resume.apply(lambda x: re.sub(r'[^\x00-\x7F+]', ' ', x))
resume = resume.apply(lambda x: x.encode('ascii', errors='ignore').decode('ascii'))
resume = resume.replace(r'[^\x00-\x7F+]', ' ', regex=True)
I even tried:
for x in resume:
    x = str(x)
    x = re.sub(r'[^\x00-\x7F+]', ' ', x)
    print(x)
and:
print(re.sub(r'[^\x00-\x7F+]', ' ', resume[0]))
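just to see whether I could reproduce the change I get when I apply these to a string variable, but still no luck.
The shape of the dataframe is (368,0).
The dtype is object. I tried converting it to string, but I believe it always stays object.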
Can you try this:
df['text_clean'] = df['text'].apply(lambda x: x.decode('unicode_escape').\
encode('ascii', 'ignore').\
strip())
Assumption: the dataframe (df) has a 'text' column containing the resume strings with Unicode literals.
This is how I tested it:
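One likely reason the ASCII filters in the question appear to do nothing, offered here as an assumption rather than something confirmed for your file: the cells may hold the escape sequence as plain characters (a backslash followed by x, e, 2, ...) instead of a real en dash. Every one of those characters is already ASCII, so there is nothing for the filter to remove. A minimal sketch of the difference, using made-up variable names:

```python
import re

# A string that holds a real en dash (one non-ASCII character).
real = "Cost Control \u2013 Recruiting"

# A string that holds the *text* of the escape sequence: backslash, x, e, 2, ...
# The raw-string prefix keeps the backslashes as ordinary characters.
literal = r"Cost Control \xe2\x80\x93 Recruiting"

non_ascii = re.compile(r"[^\x00-\x7F]+")

cleaned_real = non_ascii.sub(" ", real)        # the en dash is replaced
cleaned_literal = non_ascii.sub(" ", literal)  # nothing matches, text is unchanged

print(repr(cleaned_real))
print(repr(cleaned_literal))
```

Printing repr() of a single cell, for example repr(resume.iloc[0]), is a quick way to check which of the two cases you have: doubled backslashes in the repr mean the escapes are stored as literal text.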
import pandas as pd
# created sample data - same example row inserted 5 times. Not ideal but just was trying to test
df = pd.DataFrame({"text": [b"COMPETENCIES\nBenefits Administration \xe2\x80\x93 Customer Service \xe2\x80\x93 Cost Control \xe2\x80\x93 Recruiting \xe2\x80\x93 Acquisition Management",
b"COMPETENCIES\nBenefits Administration \xe2\x80\x93 Customer Service \xe2\x80\x93 Cost Control \xe2\x80\x93 Recruiting \xe2\x80\x93 Acquisition Management",
b"COMPETENCIES\nBenefits Administration \xe2\x80\x93 Customer Service \xe2\x80\x93 Cost Control \xe2\x80\x93 Recruiting \xe2\x80\x93 Acquisition Management",
b"COMPETENCIES\nBenefits Administration \xe2\x80\x93 Customer Service \xe2\x80\x93 Cost Control \xe2\x80\x93 Recruiting \xe2\x80\x93 Acquisition Management",
b"COMPETENCIES\nBenefits Administration \xe2\x80\x93 Customer Service \xe2\x80\x93 Cost Control \xe2\x80\x93 Recruiting \xe2\x80\x93 Acquisition Management"]})
df['text_clean'] = df['text'].apply(lambda x: x.decode('unicode_escape').\
encode('ascii', 'ignore').\
strip())
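To make the chained call easier to follow, here is the same decode/encode sequence split into steps on one short bytes value. This is only a sketch under Python 3; the sample value and the step names are made up for illustration. Note that 'unicode_escape' treats each byte above 0x7F as a Latin-1 character rather than as UTF-8, which does not matter here because those characters are dropped in the next step anyway.

```python
# Real UTF-8 bytes for an en dash inside a bytes object.
raw = b"Cost Control \xe2\x80\x93 Recruiting"

# The same content stored as literal backslash escapes (raw bytes literal).
escaped = rb"Cost Control \xe2\x80\x93 Recruiting"

# Step 1: bytes -> str. Escape sequences are interpreted, and bytes >= 0x80
# are mapped one-to-one to the code points U+00E2, U+0080, U+0093.
step1 = raw.decode("unicode_escape")
step1_from_escaped = escaped.decode("unicode_escape")

# Step 2: str -> bytes, silently dropping every non-ASCII character.
step2 = step1.encode("ascii", "ignore")

# Step 3: bytes -> str again, then trim surrounding whitespace.
step3 = step2.decode("ascii").strip()

print(repr(step1))
print(repr(step2))
print(repr(step3))
```

Both inputs end up as the same string after step 1, which is why the approach handles real bytes and literal escapes alike. Without step 3 the column holds bytes objects, not str.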
Testing the code with the kaggle source file:
# path stores the location of the data file downloaded from kaggle
df = pd.read_csv(path)
# remove the binary 'b' prefix reinstated even though the data is
# read as string during df creation
df['Resume'] = [val[1:].encode('utf-8') for val in df['Resume']]
# create a separate column with multiple decode and encode steps to
# retrieve the final clean version
df['text_clean'] = df['Resume'].apply(lambda x: x.decode('unicode_escape').\
encode('ascii', 'ignore').decode('utf-8').strip())
print(df['text_clean'])
Sample output:
0 'John H. Smith, P.H.R.\n800-991-5187 | PO Box ...
1 'Name Surname\nAddress\nMobile No/Email\nPERSO...
2 'Anthony Brown\nHR Assistant\nAREAS OF EXPERTI...
3 'www.downloadmela.com\nSatheesh\nEMAIL ID:\nCa...
4 "HUMAN RESOURCES DIRECTOR\nExpert in organizat...
5 'John H. Smith, P.H.R.\n800-991-5187 | PO Box ...
6 'Resume of Satheesh\n\nwww.downlo\nSatheesh\n\...
7 "GM HR & ADMINISTRATION Resume Sample www.time...
8 "www.uaehrzone.com\n\nRobert Wales\nDubai\nUni...
9 "Human Resources Coordinator Resume\nExample\n...
10 'RESUME WORLD INC.\n1200 Markham Road, Suite 1...
11 'XXXXX XXXXX\nXXXX, Renton, WA 98059\nHome: XX...
12 'SATHEESH\n\nwww.downloadmela.com\n\nObjective...
13 'Alan Bloggs BE\n1Main Street, Irish Town, Co....
14 'www.downloadmela.com\nSatheesh\nSummary\n4+ y...
15 'Anthony Brown\nHR Assistant\nAREAS OF EXPERTI...
16 'T\n\nAYLOR J ONES\n15 Jinglewood Street Melbo...
17 'Human Resources Manager\nCurriculum Vitae Exa...
18 'EDMONDBRADY\n1900SummersDriveMontello,AZ55996...
19 'Jonathan Burns\n1414 Marcy Drive\n\n\n\nSomet...
20 'Jo Sample\n123 Ocean Drive\nSampleville, FL 1...
21 'Jonathan Burns\n1414 Marcy Drive\n\n\n\nExamp...
22 'Shweta XXX\nMobile: +91-98********\n\nE-mail:...
23 'www.downloadmela.com\n\nSATHEESH\nMobile :\nE...
24 "Steven B. Manning\n3249 Oral Lake Road\nMinne...
25 "www.downloadmela.com\nSatheesh\n\nE-mail:\nHa...
26 "Resume for HR Assistant\nTX\n3 Avenue,\nSale,...
27 'RESUME WORLD INC.\n1200 Markham Road, Suite 1...
28 'HOW TO WRITE A PROFESSIONAL RESUME\n\n RESUM...
29 'RESUME WORLD INC.\n1200 Markham Road, Suite 1...
...
1189 'Joseph Andrade\nACTOR\nEmail: Jfandrade192@ou...
1190 'MARIAH FORD\nHeight: 5 4\nStars Talent Studio...
1191 'Your Name\nPhone number\nEmail address\nHeigh...
1192 'Jarien Sky-Stutts Senior 3D Artist\n\ncontac...
1193 "Gary White\nMake up artist\nAREAS OF EXPERTIS...
1194 'RESUME\nDan Platt\n5134 Oakdale Ave.\nWoodlan...
1195 "Jeff Wolverton, M.S., B.S.Cis.\nVisual FX Art...
1196 'LETA LOU GRAY\t\n\r\n\n\t\n\r\n\n\t\n\r\n\n\t...
1197 'Curriculum Vitae\nPersonal Details\n\nDarren ...
1198 'Your Name\n\nSchool Address\n123 Main Street\...
1199 'Stacy Adams\nSAG/AFTRA\nHeight:\nWeight:\nHai...
1200 'PERFORMING ARTS RESUME\nContent\nA performers...
1201 'ED WEISS\nTeaching Artist Resume\n\ne-mail: W...
1202 '8/23/2016 sample resume for painter. accounti...
1203 'KELSEY PAINTER\nBlonde Hair/Brown Eyes | Alto...
1204 'Wendy Robin\nProfessional Make-up Artist\n(70...
1205 'Chet Bailey\n100 Desert Street\nDrytown, CA 9...
1206 'Chris Flight Attendant\n11223 East South Aven...
1207 'Bilingual Flight Attendant Resume\n\nANGELICA...
1208 'Flight Attendant Resumes\nFlight-Attendant-Ca...
1209 'Emirates Flight Attendant Resume Sample\n\nAn...
1210 'Entry Level Flight Attendant Resume No Exper...
1211 'CURRICULUM VITAE\nMay 11, 2004\n\nNAME\n\nRob...
1212 'JED REDD\n\n_\n003 Boudry Lane\nFriend, TX 77...
1213 'Lauren B. Pires\nMiami, Florida & New York Ci...
1214 "Free Flight Attendant Resume\nDarlene Flint\n...
1215 'Corporate Flight Attendant Resume\nCAITLIN FL...
1216 'MAJOR CONRAD A. PREEDOM\n2354 Fairchild Dr., ...
1217 'STACY SAMPLE\n\n702 800-0000 cell\n\n0000@ema...
1218 'Entry Level Resume Guide\n\nThis packet is in...
Name: text_clean, Length: 1219, dtype: object
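The leading quote characters in the output above come from slicing off only the b prefix. As an alternative sketch, and assuming every cell really is the printed form of a Python bytes object (such as "b'...'"), ast.literal_eval can turn the cell back into bytes so it can be decoded as UTF-8 before the non-ASCII characters are dropped. The helper name clean_resume and the one-row sample frame are hypothetical.

```python
import ast

import pandas as pd


def clean_resume(cell: str) -> str:
    """Parse a "b'...'" cell back into bytes and return ASCII-only text."""
    raw = ast.literal_eval(cell)                 # "b'...'" -> bytes (assumed format)
    text = raw.decode("utf-8", errors="ignore")  # bytes -> str, en dash restored
    # Drop whatever is not ASCII, then trim surrounding whitespace.
    return text.encode("ascii", errors="ignore").decode("ascii").strip()


# One made-up row shaped like the kaggle cells: the repr of a bytes object.
df = pd.DataFrame(
    {"Resume": [r"b'COMPETENCIES\nBenefits Administration \xe2\x80\x93 Customer Service'"]}
)
df["text_clean"] = df["Resume"].apply(clean_resume)
print(repr(df["text_clean"][0]))
```

ast.literal_eval raises ValueError or SyntaxError on a cell that is not a valid literal, so rows in a different format would need to be handled separately.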