Frequency of strings (comma separated) in Python
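I am trying to find the frequency of strings in the "Select Investors" field on this website: https://www.cbinsights.com/research-unicorn-companies
Is there a way to extract the frequency of each comma-separated string?
For example, how often does the term "Sequoia Capital China" occur?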
Here is a correct and more Pythonic way to do it:
import itertools
import collections
import pandas as pd

def fun(x):
    # Split one cell on commas, then strip and lowercase each name
    x = map(lambda y: y.strip().lower(), str(x).split(','))
    # Drop empty strings and the 'nan' that str() yields for missing cells
    return filter(lambda y: y and y != 'nan', x)

# Extract data
url = "https://www.cbinsights.com/research-unicorn-companies"
df = pd.read_html(url)
first_df = df[0]

# Process
investor = first_df['Select Investors'].apply(fun)
investor = investor.values.flatten()
investor = list(itertools.chain(*investor))

# Organize
final_data = list(collections.Counter(investor).items())
final_df = pd.DataFrame(final_data, columns=['Investor', 'Frequency'])
final_df
Output:
      Investor                                    Frequency
0     Sequoia Capital China                       46
1     SIG Asia Investments                        3
2     Sina Weibo                                  2
3     Softbank Group                              9
4     Founders Fund                               16
...   ...                                         ...
1187  Motive Partners. Apollo Global Management   1
1188  JBV Capital                                 1
1189  Array Ventures                              1
1190  AWZ Ventures                                1
1191  Endiya Partners                             1
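The same split, strip, and count steps can also be sketched with pandas string methods alone. This is only an illustration on a small made-up sample: the rows in `sample` are hypothetical and not taken from the site, and the names `tokens`, `counts`, and `freq_df` are introduced here for the example. It relies on `Series.str.split`, `Series.explode`, and `Series.value_counts`.

```python
import pandas as pd

# Hypothetical sample standing in for the 'Select Investors' column.
sample = pd.Series([
    "Sequoia Capital China, SIG Asia Investments, Sina Weibo",
    "Founders Fund, Sequoia Capital China",
    "B Capital Group,, GE Ventures",   # doubled comma yields an empty token
    None,                              # missing cell
])

tokens = (
    sample.dropna()            # skip missing cells
    .str.split(",")            # one list of names per row
    .explode()                 # one name per row
    .str.strip()
    .str.lower()
    .reset_index(drop=True)
)
tokens = tokens[tokens != ""]  # drop empty tokens

counts = tokens.value_counts()
freq_df = counts.rename_axis("Investor").reset_index(name="Frequency")
print(freq_df)
```

On the real table, `sample` would be replaced by the 'Select Investors' column of the DataFrame returned by `pd.read_html`.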
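The solution provided by @Mazhar checks whether a term is a substring of a comma-separated string. As a result, the number of occurrences it returns for 'Sequoia Capital' is the sum of the occurrences of all strings that contain 'Sequoia Capital', namely 'Sequoia Capital', 'Sequoia Capital China', 'Sequoia Capital India', 'Sequoia Capital Israel' and 'and Sequoia Capital China'. The following code avoids that problem: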
import pandas as pd
from collections import defaultdict

url = "https://www.cbinsights.com/research-unicorn-companies"
df = pd.read_html(url)[0]

freqs = defaultdict(int)
for group in df['Select Investors']:
    # Skip non-string cells such as NaN
    if hasattr(group, 'lower'):
        for raw_investor in group.lower().split(','):
            investor = raw_investor.strip()
            # Ignore empty strings produced by wrong data like this:
            # 'B Capital Group,, GE Ventures, McKesson Ventures'
            if investor:
                freqs[investor] += 1
Demo
In [57]: freqs['sequoia capital']
Out[57]: 41
In [58]: freqs['sequoia capital china']
Out[58]: 46
In [59]: freqs['sequoia capital india']
Out[59]: 25
In [60]: freqs['sequoia capital israel']
Out[60]: 2
In [61]: freqs['and sequoia capital china']
Out[61]: 1
These counts sum to 115 (41 + 46 + 25 + 2 + 1), which matches the frequency that the currently accepted solution returns for 'sequoia capital'.
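The difference between substring matching and exact token matching can be reproduced on a tiny example. The sketch below uses made-up rows (`rows` is hypothetical, not data from the site) and a helper `tokenize` introduced only for this illustration. It does not reproduce the other answer's code, only the general effect of counting by substring.

```python
from collections import Counter

# Hypothetical rows standing in for the 'Select Investors' column.
rows = [
    "Sequoia Capital, Founders Fund",
    "Sequoia Capital China, SIG Asia Investments",
    "Sequoia Capital India, Sequoia Capital",
    "Softbank Group, and Sequoia Capital China",
]

def tokenize(row):
    """Split one row on commas and return cleaned, lowercased names."""
    return [name.strip().lower() for name in row.split(",") if name.strip()]

tokens = [name for row in rows for name in tokenize(row)]

term = "sequoia capital"
# Substring matching: every token that merely contains the term is counted.
substring_count = sum(term in name for name in tokens)
# Exact matching: only tokens equal to the term are counted.
exact_count = Counter(tokens)[term]

print(substring_count, exact_count)  # prints: 5 2
```

Here the substring count also picks up 'sequoia capital china', 'sequoia capital india' and 'and sequoia capital china', while the exact count covers only the two tokens equal to the term.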