Extracting many URLs in a Python dataframe
I have a dataframe with a text column that contains one or more URLs:
user_id text
1 blabla... http://amazon.com ...blabla
1 blabla... http://nasa.com ...blabla
2 blabla... https://google.com ...blabla ...https://yahoo.com ...blabla
2 blabla... https://fnac.com ...blabla ...
3 blabla....
I would like to transform this dataframe into a count of URLs per user ID:
user_id count_URL
1 2
2 3
3 0
Is there a simple way to perform this task in Python?
My code so far:
URL = pd.DataFrame(columns=['A','B','C','D','E','F','G'])
for i in range(data.shape[0]):
    for j in range(0,8):
        URL.iloc[i,j] = re.findall("(?P<url>https?://[^\s]+)", str(data.iloc[i]))
Thanks,
Lionel
In general, URLs are much more complex than the ones in your example. Unless you are sure that your URLs are very simple, you should look up a good pattern.
import re
URLPATTERN = r'(https?://\S+)' # Lousy, but...
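To illustrate why the simple pattern is called lousy, here is a minimal sketch comparing it with a slightly stricter one. STRICTER_URLPATTERN and the sample sentence are assumptions made up for illustration; the stricter pattern is still not a full URL grammar, it only refuses to end a match on common trailing punctuation.

```python
import re

# The simple pattern from the answer (without the capturing group).
LOOSE_URLPATTERN = r'https?://\S+'

# Hypothetical stricter pattern (an assumption, not a complete URL grammar):
# same idea, but the final character may not be common trailing punctuation.
STRICTER_URLPATTERN = r'https?://[^\s<>"\']*[^\s<>"\'.,;:!?)\]]'

# Made-up sample text with punctuation glued to the URLs.
sample = "see http://example.com, and (https://example.org/a)."

loose = re.findall(LOOSE_URLPATTERN, sample)
strict = re.findall(STRICTER_URLPATTERN, sample)

print(loose)   # trailing "," and ")." are captured as part of the URLs
print(strict)  # trailing punctuation is left out
```

Both patterns find the same number of URLs here, so for plain counting the simple pattern is often enough; the difference matters mostly when the extracted URL strings themselves are used.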
First, extract the URLs from each string and count them:
df['urlcount'] = df.text.apply(lambda x: re.findall(URLPATTERN, x)).str.len()
Next, group the counts by user ID and sum them:
df.groupby('user_id')['urlcount'].sum()
#user_id
#1 2
#2 3
#3 0
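The two steps above can be put together as a self-contained sketch. The dataframe below is rebuilt by hand from the table in the question, and the final reset_index call (with the column name count_URL taken from the desired output) is an addition for illustration, not part of the original answer.

```python
import re
import pandas as pd

URLPATTERN = r'https?://\S+'  # simple pattern, same idea as above

# Sample data rebuilt from the question's table (an assumption about its exact shape).
df = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3],
    "text": [
        "blabla... http://amazon.com ...blabla",
        "blabla... http://nasa.com ...blabla",
        "blabla... https://google.com ...blabla ...https://yahoo.com ...blabla",
        "blabla... https://fnac.com ...blabla ...",
        "blabla....",
    ],
})

# Step 1: count the regex matches in each row.
df["urlcount"] = df["text"].apply(lambda x: len(re.findall(URLPATTERN, x)))

# Step 2: sum the per-row counts for each user and name the result column.
count_url = df.groupby("user_id")["urlcount"].sum().reset_index(name="count_URL")
print(count_url)
```

Selecting the urlcount column before calling sum keeps the text column out of the aggregation.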
Here is another approach:
# Read the data
import pandas as pd
data = pd.read_csv("data.csv")
# Split the data into text and user_id, each as its own DataFrame
URL = pd.DataFrame(data.loc[:, "text"].values)
user_id = pd.DataFrame(data.loc[:, "user_id"].values)
# Count the occurrences of "http" in each row of text
sub = "http"
count_URL = []
for val in URL.iterrows():
    counter = val[1][0].count(sub)
    count_URL.append(counter)
# Convert the list to a DataFrame
count_URL = pd.DataFrame(count_URL)
# Concatenate the two DataFrames, then group by user as in @DyZ's answer and sum the counts
finalDF = pd.concat([user_id, count_URL], axis=1)
finalDF.columns = ["user_id", "urlcount"]
data = finalDF.groupby('user_id').sum()['urlcount']
print(data.head())
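One caveat with counting the substring "http" is that it also counts mentions of the word that are not URLs. The sketch below, on made-up sample rows (an assumption, not the asker's data), compares the substring count with a regex count using the vectorized Series.str.count, which also avoids the explicit loop.

```python
import pandas as pd

# Hypothetical sample; the second row mentions "http" without containing a URL.
data = pd.DataFrame({
    "user_id": [1, 2, 2],
    "text": [
        "blabla http://amazon.com blabla",
        "the http protocol is old",
        "see https://google.com and https://yahoo.com",
    ],
})

# Series.str.count treats its argument as a regular expression.
substring_counts = data["text"].str.count("http")         # every "http" occurrence
pattern_counts = data["text"].str.count(r"https?://\S+")  # only scheme://... matches

# Sum the per-row regex counts for each user.
per_user = pattern_counts.groupby(data["user_id"]).sum()
print(per_user)
```

On these rows the substring count reports one URL for the second row while the regex count reports none, so the two approaches can disagree on real text.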