Is there a way to modify this code to reduce run time?
So I want to modify this code to bring down the run time of the fuzzywuzzy library. At present it takes around an hour for a dataset with 800 rows, and when I used it on a dataset with 4.5K rows it kept running for almost 6 hours with no result, so I had to stop the kernel.
I need to use this code on at least 20K rows of data. Can anyone suggest any edits to this code so I can get results faster? Here is the code -
import pandas as pd
import numpy as np
from fuzzywuzzy import fuzz, process

df = pd.read_csv(r'path')
df.head()

data = df['Body']
print(data)

clean = []
threshold = 80

for row in data:
    # score each sentence against each other
    # [('string', score),..]
    scores = process.extract(row, data, scorer=fuzz.token_set_ratio)
    # basic idea is if there is a close second match we want to evaluate
    # and keep the longer of the two
    if scores[1][1] > threshold:
        clean.append(max([x[0] for x in scores[:2]], key=len))
    else:
        clean.append(scores[0][0])

# remove dupes
clean = set(clean)

# converting 'clean' list to dataframe and giving the column name for the cleaned column
clean_data = pd.DataFrame(clean, columns=['Body'])
clean_data.to_csv(r'path')
This is what my data looks like -
https://docs.google.com/spreadsheets/d/1p9RC9HznhdJFH4kFYdE_TgnHdoRf8P6gTEAkB3lQWEE/edit?usp=sharing
So if you notice, rows 14 and 15, as well as rows 19 and 20, are partial duplicates; I want the code to identify such sentences and drop the shorter one.
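For context, a tiny check with made-up sentences (not my actual data) illustrates why fuzz.token_set_ratio rates such partial duplicates so highly:

from fuzzywuzzy import fuzz

# Hypothetical example: the shorter sentence's tokens are a subset of the longer one's,
# so token_set_ratio (which ignores word order and extra tokens) returns 100
shorter = "Thank you for your payment"
longer = "Thank you for your payment, your order has been confirmed"
print(fuzz.token_set_ratio(shorter, longer))  # 100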
UPDATE -
I made a small change to the rapidfuzz solution given by @DarrylG, and the code now looks like this -
import pandas as pd
import numpy as np
import openpyxl
from rapidfuzz.fuzz import token_set_ratio as rapid_token_set_ratio
from rapidfuzz import process as process_rapid
from rapidfuzz import utils as rapid_utils
import time

df = pd.read_excel(r'path')
data = df['Body']
print(data)

def excel_sheet_to_dataframe(path):
    '''
    Loads sheet from Excel workbook using openpyxl
    '''
    wb = openpyxl.load_workbook(path)
    ws = wb.active
    data = ws.values
    # Get the first line in file as a header line
    columns = next(data)[0:]
    return pd.DataFrame(data, columns=columns)

clean_rapid = []
threshold = 80

def process_rapid_fuzz(data):
    '''
    Process using rapid fuzz rather than fuzz_wuzzy
    '''
    series = (rapid_utils.default_process(d) for d in data)  # Pre-process: lower-case and remove non-alphanumeric characters (generator)
    processed_data = pd.Series(series)

    for query in processed_data:
        scores = process_rapid.extract(query, processed_data, scorer=rapid_token_set_ratio, score_cutoff=threshold)
        if len(scores) > 1 and scores[1][1] > threshold:
            m = max(scores[:2], key=lambda k: len(k[0]))  # Of up to two matches above threshold, take the longest
            clean_rapid.append(m[0])                      # Saving the matched (processed) string
        else:
            clean_rapid.append(query)

################ Testing
t0 = time.time()
df = excel_sheet_to_dataframe(r'path')  # Using Excel file in working folder

# Desired data in body column
data = df['Body'].dropna()  # Dropping None rows (few None rows at end after Excel import)
result_fuzzy_rapid = process_rapid_fuzz(data)
print(f'Elapsed time {time.time() - t0}')

# remove dupes
clean_rapid = set(clean_rapid)

# converting 'clean' list to dataframe and giving the column name for the cleaned column
clean_data = pd.DataFrame(clean_rapid, columns=['Body'])

# exporting the cleaned data
clean_data.to_excel(r'path')
The problem now is that in the output file all the periods etc. are removed. How can I retain them?
This approach uses RapidFuzz, based on the answer to Vectorizing or Speeding up Fuzzywuzzy String Matching on PANDAS Column.
Results
- OP's FuzzyWuzzy approach: 2565.7 seconds
- RapidFuzz approach: 649.5 seconds
So: ~4X improvement
- Note: the test data of ~2K records came from the OP's Google Sheet Data, downloaded into a local Excel workbook.
RapidFuzz Test
import pandas as pd
import numpy as np
import openpyxl
from rapidfuzz.fuzz import token_set_ratio as rapid_token_set_ratio
from rapidfuzz import process as process_rapid
from rapidfuzz import utils as rapid_utils
import time

def excel_sheet_to_dataframe(path):
    '''
    Loads sheet from Excel workbook using openpyxl
    '''
    wb = openpyxl.load_workbook(path)
    ws = wb.active
    data = ws.values
    # Get the first line in file as a header line
    columns = next(data)[0:]
    return pd.DataFrame(data, columns=columns)

def process_rapid_fuzz(data):
    '''
    Process using rapid fuzz rather than fuzz_wuzzy
    '''
    series = (rapid_utils.default_process(d) for d in data)  # Pre-process: lower-case and remove non-alphanumeric characters (generator)
    processed_data = pd.Series(series)
    clean_rapid = []
    threshold = 80
    for query in processed_data:
        scores = process_rapid.extract(query, processed_data, scorer=rapid_token_set_ratio, score_cutoff=threshold)
        m = max(scores[:2], key=lambda k: len(k[0]))  # Of up to two matches above threshold, take the longest
        clean_rapid.append(m[-1])                     # Saving the match index
    clean_rapid = set(clean_rapid)                    # remove duplicate indexes
    return data[clean_rapid]                          # Get actual values by indexing to Pandas Series

################ Testing
t0 = time.time()
df = excel_sheet_to_dataframe('Duplicates1.xlsx')  # Using Excel file in working folder

# Desired data in body column
data = df['Body'].dropna()  # Dropping None rows (few None rows at end after Excel import)
result_fuzzy_rapid = process_rapid_fuzz(data)
print(f'Elapsed time {time.time() - t0}')
Posted Code Version (for comparison)
import pandas as pd
import numpy as np
from fuzzywuzzy import fuzz, process
import openpyxl
import time

def excel_sheet_to_dataframe(path):
    '''
    Loads sheet from Excel workbook using openpyxl
    '''
    wb = openpyxl.load_workbook(path)
    ws = wb.active
    data = ws.values
    # Get the first line in file as a header line
    columns = next(data)[0:]
    return pd.DataFrame(data, columns=columns)

def process_fuzzy_wuzzy(data):
    clean = []
    threshold = 80
    for idx, query in enumerate(data):
        # score each sentence against each other
        # [('string', score),..]
        scores = process.extract(query, data, scorer=fuzz.token_set_ratio)
        # basic idea is if there is a close second match we want to evaluate
        # and keep the longer of the two
        if len(scores) > 1 and scores[1][1] > threshold:  # If second one is close
            m = max(scores[:2], key=lambda k: len(k[0]))
            clean.append(m[-1])
        else:
            clean.append(idx)
    # remove duplicates
    clean = set(clean)
    return data[clean]  # Get actual values by indexing to Pandas Series

################ Testing
t0 = time.time()
# Get DataFrame for sheet from Excel
df = excel_sheet_to_dataframe('Duplicates1.xlsx')
# Will process data in 'Body' column of DataFrame
data = df['Body'].dropna()  # Dropping None rows (few None rows at end after Excel import)
# Process Data (Pandas Series)
result_fuzzy_wuzzy = process_fuzzy_wuzzy(data)
print(f'Elapsed time {time.time() - t0}')
This answers the second part of your question. processed_data contains the preprocessed strings, so the queries are already preprocessed. By default this preprocessing is done by process.extract. DarrylG moved the preprocessing in front of the loop, so the strings are not preprocessed multiple times. If you do not want the strings to be preprocessed, you can iterate over the original data directly:
Change:
series = (rapid_utils.default_process(d) for d in data)
processed_data = pd.Series(series)
for query in processed_data:
to
for query in data:
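(As a side note on what that preprocessing step does: rapid_utils.default_process lower-cases the string and replaces non-alphanumeric characters, such as periods and commas, with whitespace before trimming the ends, which is why the punctuation disappears from the processed strings. A quick, illustrative one-liner with a made-up string:)

from rapidfuzz import utils as rapid_utils

# Lower-cases the string and replaces the comma and exclamation mark with whitespace
print(rapid_utils.default_process("Thank you, for your Payment!"))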
If you want the original behaviour, but want the unprocessed strings in the result, you can use the index of the result strings to extract the unprocessed strings:
def process_rapid_fuzz(data):
    '''
    Process using rapid fuzz rather than fuzz_wuzzy
    '''
    series = (rapid_utils.default_process(d) for d in data)
    processed_data = pd.Series(series)

    for query in processed_data:
        scores = process_rapid.extract(query, processed_data,
                                       scorer=rapid_token_set_ratio,
                                       score_cutoff=threshold,
                                       limit=2)
        m = max(scores[:2], key=lambda k: len(k[0]))
        clean_rapid.append(data[m[2]])
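(For clarity: each result returned by process_rapid.extract here is a (match, score, key) tuple, where the key is the index of the choice in the Series, which is what makes the data[m[2]] lookup possible. A small illustration with dummy strings:)

import pandas as pd
from rapidfuzz import process as process_rapid
from rapidfuzz.fuzz import token_set_ratio as rapid_token_set_ratio

choices = pd.Series(["payment received", "payment received thank you", "unrelated text"])
# Each result is a (match, score, index) tuple; the index points back into the Series
results = process_rapid.extract("payment received", choices,
                                scorer=rapid_token_set_ratio, limit=2)
print(results)  # e.g. [('payment received', 100.0, 0), ('payment received thank you', 100.0, 1)]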
There are a couple of further possible improvements to the implementation:
- You can make sure the current query is not matched against itself by replacing it with None in processed_data, and then use process.extractOne to find the next best match above the threshold. This is at least as fast as process.extract and will probably be significantly faster.
- You compare each element of processed_data with each element of processed_data. This means you always perform both the comparison data[n] <-> data[m] and data[m] <-> data[n], even though they are guaranteed to have the same result. Performing the comparison only once should save around 50% of the run time.
def process_rapid_fuzz(data):
    '''
    Process using rapid fuzz rather than fuzz_wuzzy
    '''
    series = (rapid_utils.default_process(d) for d in data)
    processed_data = pd.Series(series)

    for idx, query in enumerate(processed_data):
        # None is skipped by process.extract/extractOne, so it will never be part of the results
        processed_data[idx] = None
        match = process_rapid.extractOne(query, processed_data,
                                         scorer=rapid_token_set_ratio,
                                         score_cutoff=threshold)
        # compare the length using the original strings
        # (alternatively len(match[0]) > len(query)
        # if you do want to compare the length of the processed version)
        if match and len(data[match[2]]) > len(data[idx]):
            clean_rapid.append(data[match[2]])
        else:
            clean_rapid.append(data[idx])
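Putting it together, here is a minimal, self-contained sketch of this final version run on a few made-up sentences (the real code would read the 'Body' column from the Excel file instead):

import pandas as pd
from rapidfuzz.fuzz import token_set_ratio as rapid_token_set_ratio
from rapidfuzz import process as process_rapid
from rapidfuzz import utils as rapid_utils

threshold = 80
data = pd.Series([
    "Thank you for your payment.",
    "Thank you for your payment, your order is confirmed.",
    "This sentence is unrelated to the others.",
])

clean_rapid = []
processed_data = pd.Series(rapid_utils.default_process(d) for d in data)
for idx, query in enumerate(processed_data):
    processed_data[idx] = None  # exclude this row (and all earlier rows) from future matching
    match = process_rapid.extractOne(query, processed_data,
                                     scorer=rapid_token_set_ratio,
                                     score_cutoff=threshold)
    # keep the longer of the two original (unprocessed) strings
    if match and len(data[match[2]]) > len(data[idx]):
        clean_rapid.append(data[match[2]])
    else:
        clean_rapid.append(data[idx])

clean_data = pd.DataFrame(sorted(set(clean_rapid)), columns=['Body'])
print(clean_data)  # punctuation is preserved because the original strings are kept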