Extracting many URLs in a python dataframe

Question

I have a dataframe whose text column contains one or more URLs:

user_id          text
  1              blabla... http://amazon.com ...blabla
  1              blabla... http://nasa.com ...blabla
  2              blabla... https://google.com ...blabla ...https://yahoo.com ...blabla
  2              blabla... https://fnac.com ...blabla ...
  3              blabla....

I would like to transform this dataframe into a count of URLs per user ID:

 user_id          count_URL
    1               2 
    2               3
    3               0

Is there a simple way to perform this task in Python?

My code so far:

URL = pd.DataFrame(columns=['A','B','C','D','E','F','G'])

for i in range(data.shape[0]) :
  for j in range(0,8):
     URL.iloc[i,j] = re.findall("(?P<url>https?://[^\s]+)", str(data.iloc[i]))

Thanks

Lionel

Answer 1

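In general, the definition of a URL is much more complex than the one in your example. Unless you are sure that your URLs are very simple, you should look up a good pattern.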

import re
URLPATTERN = r'(https?://\S+)' # Lousy, but...
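
To illustrate why the simple pattern is "lousy": `\S+` keeps any punctuation that follows a URL in running text. The sketch below is one possible mitigation, not a full URL grammar. The helper name `find_urls` and the set of trimmed characters are assumptions made for illustration.

```python
import re

URLPATTERN = r'(https?://\S+)'

def find_urls(text):
    """Return the URLs matched in text, with trailing punctuation trimmed.

    Hypothetical helper: the set of stripped characters is a guess and
    would also trim punctuation that legitimately ends a URL.
    """
    return [url.rstrip('.,;:!?)\'"') for url in re.findall(URLPATTERN, text)]

sample = "See http://amazon.com, then (https://google.com/search?q=x)."

# The raw pattern keeps the trailing comma and the closing ")."
raw_matches = re.findall(URLPATTERN, sample)

# The helper trims them
clean_matches = find_urls(sample)
```

The number of matches is the same either way, so this only matters if the URLs themselves are needed and not just their count.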

First, extract the URLs from each string and count them:

df['urlcount'] = df.text.apply(lambda x: re.findall(URLPATTERN, x)).str.len()

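Next, group the counts by user ID (selecting the column before summing, so that the text column is not summed as well):

df.groupby('user_id')['urlcount'].sum()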
#user_id
#1    2
#2    3
#3    0
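
Putting the two steps together, here is a self-contained sketch built on the sample data from the question. The `urls` column and the final `count_URL` column name are assumptions chosen to match the desired output. `Series.str.findall` is used in place of `apply` so that the extracted URLs are kept alongside the counts.

```python
import pandas as pd

URLPATTERN = r'(https?://\S+)'  # same simple pattern as above

# Sample data reconstructed from the question
df = pd.DataFrame({
    'user_id': [1, 1, 2, 2, 3],
    'text': [
        'blabla... http://amazon.com ...blabla',
        'blabla... http://nasa.com ...blabla',
        'blabla... https://google.com ...blabla ...https://yahoo.com ...blabla',
        'blabla... https://fnac.com ...blabla ...',
        'blabla....',
    ],
})

# Keep the extracted URLs in a list per row, then count them
df['urls'] = df['text'].str.findall(URLPATTERN)
df['urlcount'] = df['urls'].str.len()

# One row per user, with the column name requested in the question
count_url = (
    df.groupby('user_id')['urlcount']
      .sum()
      .reset_index(name='count_URL')
)
print(count_url)
```

Users without any URL, such as user 3, stay in the result with a count of 0 because every row still contributes an empty list.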

Answer 2

Here is another approach:

#read data
import pandas as pd
data = pd.read_csv("data.csv")

#Split the data into text and user_id, each wrapped in its own pandas DataFrame
URL = pd.DataFrame(data.loc[:,"text"].values)
user_id = pd.DataFrame(data.loc[:,"user_id"].values)

#count the occurrences of the substring "http" in each row of data
sub = "http"
count_URL = []
for val in URL.iterrows():
    counter = val[1][0].count(sub)
    count_URL.append(counter)

#list to DataFrame
count_URL = pd.DataFrame(count_URL)

#Concatenate the two data frames and apply @DyZ's groupby code to sum the URL counts per user
finalDF = pd.concat([user_id,count_URL],axis=1)
finalDF.columns=["user_id","urlcount"]
data = finalDF.groupby('user_id').sum()['urlcount']
print(data.head())
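
A possible vectorized variant of the same idea, as a sketch: `Series.str.count` replaces the explicit loop. The small frame below is a hypothetical stand-in for `data.csv`, which is not shown in the thread. It also illustrates one caveat of counting the bare substring "http": the word is counted even where it is not part of a URL.

```python
import pandas as pd

# Hypothetical stand-in for the contents of data.csv
data = pd.DataFrame({
    'user_id': [1, 2, 2, 3],
    'text': [
        'blabla http://amazon.com blabla',
        'https://google.com and https://yahoo.com',
        'the http protocol, no link here',
        'blabla',
    ],
})

# Same logic as the loop: count the substring "http" in each row
substring_counts = data['text'].str.count('http').groupby(data['user_id']).sum()

# Counting a URL-like pattern skips the bare word "http"
pattern_counts = data['text'].str.count(r'https?://\S+').groupby(data['user_id']).sum()

print(substring_counts)
print(pattern_counts)
```

For user 2 the substring count is one higher than the pattern count, because the third row mentions "http" without containing a link.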


python

text-extraction