Split the distinct values in a list separated by a comma
I have a pandas DataFrame:
index | DevType | Count |
---|---|---|
1 | Developer, back-end | 3086 |
2 | Developer, back-end;Developer, front-end;Devel... | 2227 |
3 | Developer, back-end;Developer, full-stack | 1476 |
4 | Developer, front-end | 1401 |
5 | Developer, back-end;Developer, desktop or ente... | 605 |
6 | Developer, embedded applications or devices | 433 |
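This was produced by applying .value_counts() to the column. As you can see, "Developer" repeats whenever it is combined with other answers. I want to build a list of the possible words from this DataFrame so that I can later count how many times each one is repeated.
I tried the code below to find the unique values first: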
unqlist=list(df_new['DevType'].unique())
Using 'unqlist', I then tried to separate the distinct words with the following code:
possiblewords=[]
for word in unqlist:
    print(word.split(','))
    possiblewords.append(word)
It did not work.
Here is an example:
list(set(','.join(filter(lambda x: isinstance(x, str), devtype_list)).split(',')))
You can use "," and ";" as delimiters to split the list and separate out the unique words.
def split_words(x):
    # Split on commas first, then split each piece on semicolons, and flatten
    return sum(list(map(lambda y: y.split(";"), x.split(','))), [])
devtype_list = ['Developer, desktop or enterprise applications;Developer, full-stack', 'Developer, full-stack;Developer, mobile', 'nan', 'Designer;Developer, front-end;Developer, mobile', 'Developer, back-end;Developer, front-end;Developer, QA or test;DevOps specialist', 'Developer, back-end;Developer, desktop or enterprise applications;Developer, game or graphics', 'Developer, full-stack', 'Database administrator;']
newlist = sum(list(map(lambda x: split_words(x), devtype_list)), [])
# Strip whitespace and drop empty pieces (from trailing ';') before deduplicating
newlist = list(set(w.strip() for w in newlist if w.strip()))
for unique_word in newlist:
    print(unique_word)
Result (the order may vary, because a set is unordered):
Developer
front-end
Designer
desktop or enterprise applications
game or graphics
mobile
Database administrator
QA or test
DevOps specialist
nan
back-end
full-stack
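The same idea can be sketched with a single regular expression instead of nested splits. This is only an illustration under stated assumptions: the sample list is a shortened, hypothetical subset of the one above, and the helper name split_words is reused for convenience. It also counts how often each word occurs, which is what the question ultimately asks for.

```python
import re
from collections import Counter

# Hypothetical sample values, shortened from the list in the answer above
devtype_list = [
    'Developer, full-stack;Developer, mobile',
    'nan',
    'Designer;Developer, front-end;Developer, mobile',
    'Database administrator;',
]

def split_words(value):
    """Split on commas or semicolons, trim whitespace, and drop empty pieces."""
    return [part.strip() for part in re.split(r'[,;]', value) if part.strip()]

# Count every word across all entries
word_counts = Counter(word for value in devtype_list for word in split_words(value))

# Sorted list of the distinct words
unique_words = sorted(word_counts)

print(unique_words)
print(word_counts['Developer'])
```

Note that the string 'nan' is still treated as a word here, as in the answer above; filter it out separately if it stands for a missing value.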
You can use pandas .str.split() to split on commas and semicolons and put the result in a NumPy array. Then use np.unique to get the unique words after flattening the array of lists into a single list, as follows:
import numpy as np
list_all = df_new['DevType'].str.split(r'(?:,|;)\s*').dropna().to_numpy()
list_unique = np.unique(sum(list_all, []))
Result:
print(list_unique)
['Devel...' 'Developer' 'back-end' 'desktop or ente...'
'embedded applications or devices' 'front-end' 'full-stack']
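To go one step further and count each word, one possible sketch stays entirely in pandas by combining .str.split() with .explode() and .value_counts(). The small df_new below is a hypothetical stand-in for the real DataFrame, which is not shown in full, so the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for df_new; the real column holds the full survey answers
df_new = pd.DataFrame({
    'DevType': [
        'Developer, back-end',
        'Developer, back-end;Developer, full-stack',
        'Developer, front-end',
        None,
    ]
})

word_counts = (
    df_new['DevType']
    .dropna()                 # skip missing answers
    .str.split(r'[,;]\s*')    # multi-character patterns are treated as regular expressions
    .explode()                # one row per word
    .value_counts()           # occurrences of each word
)

print(word_counts)
```

The index of word_counts holds the distinct words, so word_counts.index.tolist() gives the list of unique words as well.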