Split column and format the column values
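I am trying to format a column of data. I can find options for splitting the column, because the values are separated by a comma (,), but I cannot format it as shown in the expected output.
Input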
TITLE,Issn
NATURE REVIEWS MOLECULAR CELL BIOLOGY,"ISSN 14710072, 14710080"
ANNUAL REVIEW OF IMMUNOLOGY,"ISSN 07320582, 15453278"
NATURE REVIEWS GENETICS,"ISSN 14710056, 14710064"
CA - A CANCER JOURNAL FOR CLINICIANS,"ISSN 15424863, 00079235"
CELL,"ISSN 00928674, 10974172"
ANNUAL REVIEW OF ASTRONOMY AND ASTROPHYSICS,"ISSN 15454282, 00664146"
NATURE REVIEWS IMMUNOLOGY,"ISSN 14741741, 14741733"
NATURE REVIEWS CANCER,ISSN 1474175X
ANNUAL REVIEW OF BIOCHEMISTRY,"ISSN 15454509, 00664154"
REVIEWS OF MODERN PHYSICS,"ISSN 00346861, 15390756"
NATURE GENETICS,ISSN 10614036
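- Split the Issn column into two columns, since it contains a comma (,)
- Remove only the word ISSN from the column
- Put a dash (-) after the first 4 characters of each remaining number
Expected output: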
TITLE,Issn
NATURE REVIEWS MOLECULAR CELL BIOLOGY,1471-0072, 1471-0080
ANNUAL REVIEW OF IMMUNOLOGY,0732-0582, 1545-3278
NATURE REVIEWS GENETICS,1471-0056, 1471-0064
CA - A CANCER JOURNAL FOR CLINICIANS,1542-4863, 0007-9235
CELL,0092-8674, 1097-4172
ANNUAL REVIEW OF ASTRONOMY AND ASTROPHYSICS,1545-4282, 0066-4146
NATURE REVIEWS IMMUNOLOGY,1474-1741, 1474-1733
NATURE REVIEWS CANCER, 1474-175X
ANNUAL REVIEW OF BIOCHEMISTRY,1545-4509, 0066-4154
REVIEWS OF MODERN PHYSICS,0034-6861, 1539-0756
NATURE GENETICS,1061-4036
Any suggestions using pandas would be appreciated. Thanks in advance.
UPDATE:
When trying to run the two programs mentioned in the answers:
import pandas as pd
import re
df = pd.read_csv('new_journal_list.csv', header='TITLE,Issn')
'''
df_split_num = df['Issn'].map(lambda x: x.split('ISSN ')[1].split(', '))
df_dash_num = df_split_num.map(lambda x: [num[:4] + '-' + num[4:] for num in x])
df_split_issn = pd.DataFrame(data=list(df_dash_num), columns=['Issn1', 'Issn2'])
df[['Issn1', 'Issn2']] = df_split_issn
del df['Issn']
print df
'''
df[['Issn1','Issn2']] = (df.pop('Issn').str.extract('ISSN\s+([^,]+),?\s?(.*)', expand=True)
.apply(lambda x: x.str[:4]+'-'+x.str[4:]).replace(r'^-$', '', regex=True))
print df
Running either version with the default Python 2.7, I get the following error:
Traceback (most recent call last):
File "clean_journal_list.py", line 1, in <module>
import pandas as pd
File "/usr/local/lib/python2.7/dist-packages/pandas/__init__.py", line 25, in <module>
from pandas import hashtable, tslib, lib
File "pandas/src/numpy.pxd", line 157, in init pandas.hashtable (pandas/hashtable.c:38364)
When running under Python 3.4, I get the error below:
File "clean_journal_list.py", line 21
print df
^
SyntaxError: invalid syntax
You will need to add some error handling and wrap this in a line-by-line loop, but the gist is:
leader, issns = line.split(" ISSN ")
numbers = issns.split(", ")
# print() called as a function, so this also runs on Python 3
print(leader, ', '.join([num[:4] + '-' + num[4:] for num in numbers]))
The key is to split each line into "the ISSN numbers" and "everything else", then separate the ISSN numbers from each other and reformat them.
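The line-by-line idea could be fleshed out roughly as follows. This is only a sketch: it assumes the input is the quoted CSV shown in the question, so it uses the standard csv module to separate the title from the ISSN field instead of splitting on " ISSN ". The names SAMPLE, add_dash and reformat_rows are made up for illustration, and malformed rows are simply skipped.

```python
import csv
import io

# Two sample rows copied from the question's input.
SAMPLE = '''TITLE,Issn
CELL,"ISSN 00928674, 10974172"
NATURE GENETICS,ISSN 10614036
'''


def add_dash(number):
    """Insert a dash after the first four characters of an ISSN."""
    return number[:4] + '-' + number[4:]


def reformat_rows(lines):
    """Yield (title, [formatted ISSNs]) for each data row."""
    reader = csv.reader(lines)
    next(reader)  # skip the header row
    for row in reader:
        # Basic error handling: skip rows that do not look as expected.
        if len(row) != 2 or not row[1].startswith('ISSN '):
            continue
        title, issns = row
        numbers = issns[len('ISSN '):].split(', ')
        yield title, [add_dash(num) for num in numbers]


rows = list(reformat_rows(io.StringIO(SAMPLE)))
for title, numbers in rows:
    print(title + ',' + ', '.join(numbers))
```

With a real file, an open file object would take the place of io.StringIO(SAMPLE).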
First, split the numbers and add dashes to them, using the handy map function:
df_split_num = df['Issn'].map(lambda x: x.split('ISSN ')[1].split(', '))
df_dash_num = df_split_num.map(lambda x: [num[:4] + '-' + num[4:] for num in x])
Next, create a new DataFrame from the split ISSN numbers and put it back into the original DataFrame:
df_split_issn = pd.DataFrame(data=list(df_dash_num), columns=['Issn1', 'Issn2'])
df[['Issn1', 'Issn2']] = df_split_issn
del df['Issn']
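Put together as one runnable script, the map-based steps might look like the sketch below. It reads a small sample from a string rather than from new_journal_list.csv, and it pads single-ISSN rows with an empty string so every row has exactly two values; that padding, the name split_issn_with_map, and the sample data are assumptions added here, not part of the steps above.

```python
import io

import pandas as pd

# Small sample in the same shape as the question's CSV.
CSV_TEXT = '''TITLE,Issn
CELL,"ISSN 00928674, 10974172"
NATURE REVIEWS CANCER,ISSN 1474175X
'''


def split_issn_with_map(df):
    """Return a copy of df with 'Issn' replaced by 'Issn1' and 'Issn2'."""
    # Drop the leading "ISSN " and split the remainder on ", ".
    parts = df['Issn'].map(lambda x: x.split('ISSN ')[1].split(', '))
    # Insert a dash after the first four characters of each number.
    dashed = parts.map(lambda nums: [n[:4] + '-' + n[4:] for n in nums])
    # Pad rows that have only one ISSN (assumption: at most two per row).
    padded = [(nums + [''])[:2] for nums in dashed]
    out = df.drop(columns='Issn')
    out['Issn1'] = [nums[0] for nums in padded]
    out['Issn2'] = [nums[1] for nums in padded]
    return out


frame = pd.read_csv(io.StringIO(CSV_TEXT))
map_result = split_issn_with_map(frame)
print(map_result)
```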
IIUC, you can use the Series.str.extract(), apply() and replace() methods:
In [33]: df
Out[33]:
                                          TITLE                     Issn
0         NATURE REVIEWS MOLECULAR CELL BIOLOGY  ISSN 14710072, 14710080
1                   ANNUAL REVIEW OF IMMUNOLOGY  ISSN 07320582, 15453278
2                       NATURE REVIEWS GENETICS  ISSN 14710056, 14710064
3          CA - A CANCER JOURNAL FOR CLINICIANS  ISSN 15424863, 00079235
4                                          CELL  ISSN 00928674, 10974172
5   ANNUAL REVIEW OF ASTRONOMY AND ASTROPHYSICS  ISSN 15454282, 00664146
6                     NATURE REVIEWS IMMUNOLOGY  ISSN 14741741, 14741733
7                         NATURE REVIEWS CANCER            ISSN 1474175X
8                 ANNUAL REVIEW OF BIOCHEMISTRY  ISSN 15454509, 00664154
9                     REVIEWS OF MODERN PHYSICS  ISSN 00346861, 15390756
10                              NATURE GENETICS            ISSN 10614036
In [34]: df[['Issn1','Issn2']] = (df.pop('Issn')
    ...:     .str.extract(r'ISSN\s+([^,]+),?\s?(.*)', expand=True)
    ...:     .apply(lambda x: x.str[:4]+'-'+x.str[4:])
    ...:     .replace(r'^-$', '', regex=True))
    ...:
In [35]: df
Out[35]:
                                          TITLE      Issn1      Issn2
0         NATURE REVIEWS MOLECULAR CELL BIOLOGY  1471-0072  1471-0080
1                   ANNUAL REVIEW OF IMMUNOLOGY  0732-0582  1545-3278
2                       NATURE REVIEWS GENETICS  1471-0056  1471-0064
3          CA - A CANCER JOURNAL FOR CLINICIANS  1542-4863  0007-9235
4                                          CELL  0092-8674  1097-4172
5   ANNUAL REVIEW OF ASTRONOMY AND ASTROPHYSICS  1545-4282  0066-4146
6                     NATURE REVIEWS IMMUNOLOGY  1474-1741  1474-1733
7                         NATURE REVIEWS CANCER  1474-175X
8                 ANNUAL REVIEW OF BIOCHEMISTRY  1545-4509  0066-4154
9                     REVIEWS OF MODERN PHYSICS  0034-6861  1539-0756
10                              NATURE GENETICS  1061-4036
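For running this as a script on Python 3, something along the lines of the sketch below may help. It is the same extract/apply/replace chain wrapped in a function. read_csv picks up the TITLE,Issn header row by default, so no header argument is passed, and print is called as a function, which Python 3 requires. The sample data and the name split_issn_with_extract are placeholders, not part of the session above.

```python
import io

import pandas as pd

# Small sample in the same shape as the question's CSV.
CSV_DATA = '''TITLE,Issn
NATURE REVIEWS GENETICS,"ISSN 14710056, 14710064"
NATURE GENETICS,ISSN 10614036
'''


def split_issn_with_extract(df):
    """Return a copy of df with 'Issn' replaced by 'Issn1' and 'Issn2'."""
    # Group 1 captures the first number, group 2 the optional second one.
    extracted = df['Issn'].str.extract(r'ISSN\s+([^,]+),?\s?(.*)', expand=True)
    # Insert a dash after the first four characters in every column.
    dashed = extracted.apply(lambda col: col.str[:4] + '-' + col.str[4:])
    # An empty second group turns into a lone '-'; blank it out again.
    dashed = dashed.replace(r'^-$', '', regex=True)
    dashed.columns = ['Issn1', 'Issn2']
    return pd.concat([df.drop(columns='Issn'), dashed], axis=1)


journals = pd.read_csv(io.StringIO(CSV_DATA))  # header row detected by default
extract_result = split_issn_with_extract(journals)
print(extract_result)  # print is a function in Python 3
```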