Can I fill NaN values of one column with specific list elements of another column?
For example, I have the following dataframe (called items):
| index | itemID | maintopic | subtopics |
|:----- |:------:|:---------:| ------------------:|
| 1 | 235 | FBR | [FZ, 1RH, FL] |
| 2 | 1787 | NaN | [1RH, YRS, FZ, FL] |
| 3 | 2454 | NaN | [FZX, 1RH, FZL] |
| 4 | 3165 | NaN | [YHS] |
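I want to fill the NaN values in the maintopic column with the first element of the subtopics list that starts with a letter. Does anyone have an idea? (Question 1)
I tried this, but it didn't work: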
import pandas as pd
import string
alphabet = list(string.ascii_lowercase)
items['maintopic'] = items['maintopic'].apply(lambda x : items['maintopic'].fillna(items['subtopics'][x][0]) if items['subtopics'][x][0].lower().startswith(tuple(alphabet)) else x)
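Advanced (Question 2):
Even better would be to look at all elements of the subtopics list: if several elements share the same first letter, or even the same first and second letters, I would like to use that shared prefix. For example, row 2 contains FZ and FL, so I would like to fill maintopic in that row with F. Row 3 contains FZX and FZL, so I would like to fill maintopic with FZ. But if this is too complicated, I would also be happy with an answer to Question 1.
Thanks for any help!
Try: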
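from itertools import chain, combinations

import numpy as np


def commonprefix(m):
    "Given a list of pathnames, returns the longest common leading component"
    if not m:
        return ""
    # The lexicographic min and max differ the most, so the prefix they
    # share is the prefix shared by every string in m.
    s1 = min(m)
    s2 = max(m)
    for i, c in enumerate(s1):
        if c != s2[i]:
            return s1[:i]
    return s1


def powerset(iterable, n=0):
    "powerset([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"
    # n is the minimum subset size; n=2 skips the empty set and singletons.
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(n, len(s) + 1))


def choose(x):
    # Non-list cells (e.g. NaN, which is a float) are passed through unchanged.
    if not isinstance(x, list):
        return x
    if len(x) == 1:
        return x[0]
    # Ignore subtopics that start with a digit.
    filtered = [v for v in x if not v[0].isdigit()]
    if not filtered:
        return np.nan
    # Longest prefix shared by any subset of at least two subtopics.
    longest = ""
    for s in powerset(filtered, 2):
        pref = commonprefix(s)
        if len(pref) > len(longest):
            longest = pref
    # Without a shared prefix, fall back to the first letter-initial subtopic.
    return filtered[0] if longest == "" else longest


# df is the dataframe from the question (called items there).
m = df["maintopic"].isna()
df.loc[m, "maintopic"] = df.loc[m, "subtopics"].apply(choose)
print(df)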
Prints:
index itemID maintopic subtopics
0 1 235 FBR [FZ, 1RH, FL]
1 2 1787 F [1RH, YRS, FZ, FL]
2 3 2454 FZ [FZX, 1RH, FZL]
3 4 3165 YHS [YHS]
Edit: added a check for list/float.
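Since the prefix shared by a larger subset can never be longer than the prefix shared by any pair inside it, comparing pairs alone should give the same result without enumerating the whole powerset. The sketch below assumes that reasoning; `choose_pairwise` is a hypothetical name, and it uses the standard library's `os.path.commonprefix`, which compares strings character by character.

```python
from itertools import combinations
from os.path import commonprefix  # character-wise common prefix of a sequence of strings

import numpy as np


def choose_pairwise(subtopics):
    """Hypothetical variant of choose() that only compares pairs."""
    if not isinstance(subtopics, list):
        return subtopics  # e.g. NaN instead of a list
    if len(subtopics) == 1:
        return subtopics[0]
    # Keep only entries that do not start with a digit, as choose() does.
    filtered = [s for s in subtopics if s and not s[0].isdigit()]
    if not filtered:
        return np.nan
    # Longest prefix shared by any two entries; "" if no pair shares one.
    longest = max(
        (commonprefix(list(pair)) for pair in combinations(filtered, 2)),
        key=len,
        default="",
    )
    # Fall back to the first letter-initial entry when nothing is shared.
    return longest or filtered[0]


rows = [["1RH", "YRS", "FZ", "FL"], ["FZX", "1RH", "FZL"], ["YHS"]]
print([choose_pairwise(r) for r in rows])  # ['F', 'FZ', 'YHS']
```

It can be applied the same way as `choose`, for example `df.loc[m, "subtopics"].apply(choose_pairwise)`.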
For the first question, try this:
import pandas as pd
import numpy as np

def fill_value(sub):
    for i in sub:
        if i[0].isalpha():
            return i
    return sub[0]

data = {
    'maintopic': ['FBR', np.nan, np.nan, np.nan],
    'subtopic': [['FZ', '1RH', 'FL'] , ['1RH', 'YRS', 'FZ', 'FL'], ['FZX', '1RH', 'FZL'], ['YHS']]
}
df = pd.DataFrame(data)
print('Before\n', df)
df['maintopic'] = df.apply(
    lambda row: fill_value(row['subtopic']) if pd.isnull(row['maintopic']) else row['maintopic'],
    axis=1
)
print('\nAfter\n', df)
Output:
Before
maintopic subtopic
0 FBR [FZ, 1RH, FL]
1 NaN [1RH, YRS, FZ, FL]
2 NaN [FZX, 1RH, FZL]
3 NaN [YHS]
After
maintopic subtopic
0 FBR [FZ, 1RH, FL]
1 YRS [1RH, YRS, FZ, FL]
2 FZX [FZX, 1RH, FZL]
3 YHS [YHS]
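You can change the fill_value function to return whatever value you want to fill the NaN values with. For now, it returns the first subtopic that starts with a letter, falling back to the first element of the list if none does.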
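As one illustration of changing what gets filled in, the sketch below leaves the cell as NaN when no subtopic starts with a letter, and uses `Series.fillna` with a second Series instead of a row-wise `apply`. The helper name `first_alpha` is made up for this example; the data is the same as above.

```python
import numpy as np
import pandas as pd


def first_alpha(sub):
    """First element that starts with a letter, or NaN if there is none."""
    return next((s for s in sub if s[:1].isalpha()), np.nan)


df = pd.DataFrame({
    "maintopic": ["FBR", np.nan, np.nan, np.nan],
    "subtopic": [["FZ", "1RH", "FL"], ["1RH", "YRS", "FZ", "FL"],
                 ["FZX", "1RH", "FZL"], ["YHS"]],
})

# fillna aligns on the index and only touches NaN cells,
# so rows that already have a maintopic keep it.
df["maintopic"] = df["maintopic"].fillna(df["subtopic"].apply(first_alpha))
print(df["maintopic"].tolist())  # ['FBR', 'YRS', 'FZX', 'YHS']
```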
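You can do it like this: for every value in the lists of the subtopics column, take all prefixes (substrings that start at the first character), build a counter from them, and then sort the counter's items by frequency. If two items have the same frequency, prefer the longer string.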
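from collections import Counter
from functools import cmp_to_key

import pandas as pd


def get_main_topic_modified(m, l):
    # Keep an existing maintopic; only NaN entries are filled.
    if pd.notna(m):
        return m
    if len(l) == 1:
        return l[0]
    res = []
    for s in l:
        # All proper prefixes of s, e.g. "FZX" -> ["F", "FZ"]
        il = [s[:i+1] for i in range(len(s)-1)]
        res.append(il)
    # Flatten the per-string prefix lists into one list.
    res = [item for s in res for item in s]
    c = Counter(res)
    d = dict(c)
    l = list(d.items())
    # Most frequent prefix first; on equal frequency, the longest first.
    l.sort(key=cmp_to_key(lambda x, y: len(y[0])-len(x[0]) if x[1] == y[1] else y[1] - x[1]))
    return l[0][0]


# df is the dataframe from the question (called items there).
df['maintopic'] = df[['maintopic', 'subtopics']].apply(
    lambda x : get_main_topic_modified(*x), axis = 1)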
Output:
index itemID maintopic subtopics
0 1 235 FBR [FZ, 1RH, FL]
1 2 1787 F [1RH, YRS, FZ, FL]
2 3 2454 FZ [FZX, 1RH, FZL]
3 4 3165 YHS [YHS]
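To see why row 3 ends up as FZ, it may help to look at the counter for that row on its own. The sketch below is a condensed restatement of the counting step, not the code above: `most_common_prefix` is a hypothetical helper, and it picks the winner with a tuple key (count, then length) instead of `cmp_to_key`. It assumes the list contains at least one string of two or more characters.

```python
from collections import Counter


def most_common_prefix(subtopics):
    """Hypothetical helper: most frequent proper prefix, longest on ties."""
    # Proper prefixes only (the full string is excluded), matching
    # range(len(s) - 1) in get_main_topic_modified.
    counts = Counter(s[:i] for s in subtopics for i in range(1, len(s)))
    # Highest count wins; among equal counts, the longest prefix wins.
    return max(counts.items(), key=lambda kv: (kv[1], len(kv[0])))[0]


row = ["FZX", "1RH", "FZL"]
counts = Counter(s[:i] for s in row for i in range(1, len(s)))
print(counts["F"], counts["FZ"], counts["1"], counts["1R"])  # 2 2 1 1
print(most_common_prefix(row))  # FZ
```

Note that this counting approach does not skip entries that start with a digit, so a list such as `["1RH", "1RX", "FZ"]` would yield `1R`.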