Is it possible to optimise this sequence/code generation within the given format/syntax constraints?

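I'm trying to generate codes, some parts of which must follow specific predefined rules (see the comments). I only need as many codes as there are rows in the df passed in; the codes are assigned back to the same df on a simple per-row basis. Returning a list 'seems' less ideal than assigning directly to the df inside the function, but I couldn't get that to work. Unfortunately, because of other constraints handled elsewhere, I need to pass in 3 dfs separately, and each one will have a different single-character suffix (e.g. X|Y|Z). The codes don't 'need' to be contiguous across the different dfs, although some ordering within each df might be useful... which is the approach I've attempted so far.

However, my current 'working' attempt, shown here, is functional but takes far too long. I'm hoping someone can point out some possible wins for optimising any part of it. Each df is typically <500k rows, more commonly 100-200k.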

Generating offer codes

Desired outcome:

A sequence of the format: YrCodeMthCode+AAAA+99+[P|H|D] where:

*The uniqueness of YrCodeMthCode+AAAA+99 only needs to cover 500k records per month (as the MthCode will change/refresh x12)
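As a rough check of the uniqueness requirement above, the arithmetic can be sketched as follows. This assumes the letter block uses the 26 uppercase letters and the two digits exclude zero, as in the code that follows; the variable names are illustrative only.

```python
# Capacity of the AAAA+99 part for a single YrCodeMthCode prefix.
# Assumption: 26 uppercase letters per position, digits 1-9 only (no zeros).
LETTERS = 26
NONZERO_DIGITS = 9
REQUIRED_PER_MONTH = 500_000

letter_blocks = LETTERS ** 4           # distinct AAAA blocks
digit_pairs = NONZERO_DIGITS ** 2      # distinct two-digit endings without zeros
capacity = letter_blocks * digit_pairs

print(letter_blocks)                        # 456976
print(capacity)                             # 37015056
print(letter_blocks >= REQUIRED_PER_MONTH)  # False
print(capacity >= REQUIRED_PER_MONTH)       # True
```

So four letters alone fall just short of 500k, and uniqueness at that volume has to come from either the digits as well or a longer letter block.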

import itertools
import random

import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randn(200, 3), columns=list('ABC'))



offerCodeLength = 6
allowedOfferCodeChars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

campaignMonth = 'January'
campaignYear = 2021

yearCodesDict = {2021:'G',2022:'H',2023:'I', 2024:'J', 2025:'K', 2026:'L', 2027:'M'}
monthCodesDict = {'January':'A','February':'B','March':'C',
                  'April':'D','May':'E','June':'F',
                  'July':'G', 'August':'H','September':'I',
                  'October':'J','November':'K','December':'L'}


OfferCodeDateStr = str(yearCodesDict[campaignYear])+str(monthCodesDict[campaignMonth])

iterator = 0
breakPoint = df.shape[0]



def generateOfferCode(OfferCodeDateStr, offerCodeLength, breakPoint, OfferCodeSuffix):
    
    allowedOfferCodeChars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    
    iterator = 0 # limit amount generated
    offerCodesList = []
    for item in itertools.product(allowedOfferCodeChars, repeat=offerCodeLength): 
        
        # generate a 2 digit number, with NO zeros (to avoid 0 vs o call centre issues)
        psuedoRandNum = str(int(''.join(random.choices('123456789',k=random.randint(10,99))))%10**2)
        
        if iterator < breakPoint: # breakpoint as length of associated dataframe/number of codes required
            OfferCodeString = "".join(item)
            OfferCodeString = OfferCodeDateStr+OfferCodeString+psuedoRandNum+OfferCodeSuffix # join Yr,Mth chars to generated rest

            offerCodesList.append(OfferCodeString) 

            iterator +=1 

    return offerCodesList

generateOfferCode(OfferCodeDateStr, offerCodeLength, breakPoint, 'P')

My example timings (offerCodeLength set to 4):

x100: 5.99 s ± 227 ms per loop (mean ± std. dev. of 7 runs, 1 loop each), wall time: 47.5 s

x1000: 5.87 s ± 243 ms per loop (mean ± std. dev. of 7 runs, 1 loop each), wall time: 46.4 s

IIUC, you can try:

import random

def generateOfferCode(OfferCodeDateStr, offerCodeLength, breakPoint, offerCodeSuffix):
    letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    seen = set()  # letter blocks already used, to guarantee uniqueness
    offerCodesList = []

    for _ in range(breakPoint):
        # 2 digit number with NO zeros (to avoid 0 vs O call centre issues)
        psuedoRandNum = ''.join(random.choices('123456789', k=2))
        OfferCodeString = "".join(random.choices(letters, k=offerCodeLength))
        # redraw on the (rare) collision with an earlier letter block
        while OfferCodeString in seen:
            OfferCodeString = "".join(random.choices(letters, k=offerCodeLength))
        seen.add(OfferCodeString)
        offerCodesList.append(f"{OfferCodeDateStr}{OfferCodeString}{psuedoRandNum}{offerCodeSuffix}")
    return offerCodesList

df["offerCode"] = generateOfferCode(OfferCodeDateStr, 6, df.shape[0], 'P')

>>> df
            A         B         C    offerCode
0    1.764052  0.400157  0.978738  GAZGCPGE28P
1    2.240893  1.867558 -0.977278  GADYNNWU69P
2    0.950088 -0.151357 -0.103219  GAEQUFPI48P
3    0.410599  0.144044  1.454274  GAUCSCHW76P
4    0.761038  0.121675  0.443863  GAFMVTBP28P
..        ...       ...       ...          ...
195 -0.470638 -0.216950  0.445393  GAOXGTOU88P
196 -0.392389 -3.046143  0.543312  GAXPQOFI25P
197  0.439043 -0.219541 -1.084037  GACBKIJV93P
198  0.351780  0.379236 -0.470033  GAVYQEQL46P
199 -0.216731 -0.930157 -0.178589  GALNKYVE23P
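To sanity-check generated codes against the format described in the question, a small validator can be sketched as below. The function name `is_valid_offer_code` is hypothetical, and the pattern assumes the year codes G-M and month codes A-L from the question's dictionaries, digits 1-9 only, and a [P|H|D] suffix.

```python
import re

def is_valid_offer_code(code, n_letters=6):
    """Check one code against YrCode + MthCode + letters + 2 non-zero digits + suffix."""
    # Assumed ranges: year codes G-M, month codes A-L, suffix P, H or D.
    pattern = rf"[G-M][A-L][A-Z]{{{n_letters}}}[1-9]{{2}}[PHD]"
    return re.fullmatch(pattern, code) is not None

# A few codes copied from the output shown above.
sample_codes = ["GAZGCPGE28P", "GADYNNWU69P", "GAEQUFPI48P"]

all_valid = all(is_valid_offer_code(c) for c in sample_codes)
all_unique = len(set(sample_codes)) == len(sample_codes)
print(all_valid, all_unique)
```

On a full DataFrame the same check could be applied to the `offerCode` column, with `n_letters` set to whatever length was passed to the generator.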
Performance
>>> %timeit generateOfferCode(OfferCodeDateStr, 6, 500000, 'P')
%timeit generateOfferCode(OfferCodeDateStr, 6, df.shape[0]*1000, 'P')
829 ms ± 22.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
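If ordered letter blocks are preferred over random ones (the question mentions that some ordering within each df might be useful), one alternative is to keep `itertools.product` but stop it early with `itertools.islice`, so only as many combinations as needed are produced. This is a sketch under that assumption, not a drop-in replacement; the function name `generate_sequential_offer_codes` is hypothetical.

```python
import itertools
import random
import string

def generate_sequential_offer_codes(date_prefix, n_letters, n_codes, suffix):
    """Return n_codes codes whose letter blocks follow itertools.product order."""
    if n_codes > 26 ** n_letters:
        raise ValueError("not enough letter combinations for the requested count")
    # islice stops the product after n_codes items instead of exhausting it.
    letter_blocks = itertools.islice(
        itertools.product(string.ascii_uppercase, repeat=n_letters), n_codes
    )
    codes = []
    for block in letter_blocks:
        # two digits with no zeros, as in the original requirement
        digits = "".join(random.choices("123456789", k=2))
        codes.append(f"{date_prefix}{''.join(block)}{digits}{suffix}")
    return codes

codes = generate_sequential_offer_codes("GA", 4, 5, "P")
print(codes)
```

Because each letter block is used exactly once, the codes are unique without a `seen` set; the trade-off is that the letter part becomes predictable (AAAA, AAAB, ...), which may or may not matter for offer codes.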