Frequency of strings (comma-separated) in Python

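I am trying to find the frequency of strings in the "Select Investors" field on this website: https://www.cbinsights.com/research-unicorn-companies

Is there a way to extract the frequency of each comma-separated string?

For example, how often does the term "Sequoia Capital China" appear?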

Here is a correct and more Pythonic way to do it:

import itertools
import collections
import pandas as pd


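def fun(x):
    # Split one cell on commas, then trim and lowercase each piece.
    x = map(lambda y: y.strip().lower(), str(x).split(','))
    # Drop empty pieces and the 'nan' string produced by missing cells.
    return filter(lambda y: y and y != 'nan', x)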


# Extract data
url = "https://www.cbinsights.com/research-unicorn-companies"
df = pd.read_html(url)
first_df = df[0]


# Process
investor = first_df['Select Investors'].apply(fun)
investor = investor.values.flatten()
investor = list(itertools.chain(*investor))

# Organize
final_data = collections.Counter(investor).items()
final_df = pd.DataFrame(final_data, columns=['Investor', 'Frequency'])
final_df

Output:

    Investor                                        Frequency
0   Sequoia Capital China                           46
1   SIG Asia Investments                            3
2   Sina Weibo                                      2
3   Softbank Group                                  9
4   Founders Fund                                   16
...     ...     ...
1187    Motive Partners. Apollo Global Management   1
1188    JBV Capital                                 1
1189    Array Ventures                              1
1190    AWZ Ventures                                1
1191    Endiya Partners                             1
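
To try the counting step without fetching the page, here is a minimal offline sketch of the same idea. The rows in `sample_rows` are made up for illustration and `count_investors` is a hypothetical helper name; unlike `fun` above, this version keeps the original capitalization.

```python
from collections import Counter

# Hypothetical rows standing in for the 'Select Investors' column;
# None plays the role of a missing (NaN) cell.
sample_rows = [
    "Sequoia Capital China, SIG Asia Investments, Sina Weibo",
    "Founders Fund, Sequoia Capital China",
    "B Capital Group,, GE Ventures",
    None,
]


def count_investors(rows):
    """Count each comma-separated name, skipping missing cells and empty pieces."""
    counts = Counter()
    for row in rows:
        if not isinstance(row, str):
            continue  # skip missing values
        for piece in row.split(","):
            name = piece.strip()
            if name:  # ignore the empty piece produced by ',,'
                counts[name] += 1
    return counts


counts = count_investors(sample_rows)
print(counts["Sequoia Capital China"])  # 2
print(counts.most_common(1))            # [('Sequoia Capital China', 2)]
```

Looking up a single name answers the "how often does X appear" question directly, and `most_common()` returns the names sorted by frequency if a ranked list is wanted.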

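The solution provided by @Mazhar checks whether a term is a substring of the comma-separated strings. As a result, the number of occurrences it returns for 'Sequoia Capital' is the sum of the occurrences of every string that contains 'Sequoia Capital', namely 'Sequoia Capital', 'Sequoia Capital China', 'Sequoia Capital India', 'Sequoia Capital Israel' and 'and Sequoia Capital China'. The following code avoids that problem: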

import pandas as pd
from collections import defaultdict

url = "https://www.cbinsights.com/research-unicorn-companies"
df = pd.read_html(url)[0]

freqs = defaultdict(int)
for group in df['Select Investors']:
    if hasattr(group, 'lower'):
        for raw_investor in group.lower().split(','):
            investor = raw_investor.strip()
            # Ignore empty strings produced by wrong data like this:
            # 'B Capital Group,, GE Ventures, McKesson Ventures'
            if investor:
                freqs[investor] += 1

Demo:

In [57]: freqs['sequoia capital']
Out[57]: 41

In [58]: freqs['sequoia capital china']
Out[58]: 46

In [59]: freqs['sequoia capital india']
Out[59]: 25

In [60]: freqs['sequoia capital israel']
Out[60]: 2

In [61]: freqs['and sequoia capital china']
Out[61]: 1

The occurrences add up to 115, which matches the frequency that the currently accepted solution returns for 'sequoia capital'.
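
The difference between the two kinds of matching can be reproduced on a few made-up rows. This is only a sketch: `sample_rows`, `split_names`, `exact_counts` and `substring_count` are hypothetical names, and `substring_count` is a guess at substring-style matching in general, not @Mazhar's actual code.

```python
from collections import Counter

# Hypothetical rows for illustration only.
sample_rows = [
    "Sequoia Capital, Founders Fund",
    "Sequoia Capital China, Softbank Group",
    "Sequoia Capital India, Sequoia Capital China",
    "Tiger Global, and Sequoia Capital China",
]


def split_names(row):
    """Return the trimmed, lowercased, non-empty pieces of one row."""
    return [p.strip().lower() for p in row.split(",") if p.strip()]


def exact_counts(rows):
    """Count each name only when the whole piece matches."""
    return Counter(name for row in rows for name in split_names(row))


def substring_count(rows, term):
    """Count every piece that merely contains the term."""
    term = term.lower()
    return sum(term in name for row in rows for name in split_names(row))


exact = exact_counts(sample_rows)
matching = {name: n for name, n in exact.items() if "sequoia capital" in name}

print(exact["sequoia capital"])                         # 1
print(substring_count(sample_rows, "Sequoia Capital"))  # 5
print(sum(matching.values()))                           # 5
```

On this sample the exact count for 'sequoia capital' is 1, while the substring count is 5, which equals the sum of the exact counts of all names containing the term.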