Implementing for loops as batches
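I am running 2 large for-loop tasks on a dataframe column. The context is what I call "text corruption": turning perfectly structured text into text full of missing punctuation and misspellings, to mimic human error.
I found that running this over 10,000 rows is very slow, even after optimizing the for loops.
I discovered a process called Batching in this post. The top answer provides a concise template that I imagine is much faster than regular for-loop iteration.
How might I use that answer to reimplement the following code? (I added a comment there asking for more information.)
Alternatively, is there any technique that would make my for loops considerably faster?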
import pandas as pd
import random
import re
# example
df = pd.DataFrame(columns=['Forname', 'Surname', 'Sentence'])
df.loc['0'] = ['Bob', 'Smith', 'Hi, this is a perfectly constructred sentence!']
df.loc['1'] = ['Alice', 'Smith', 'Can you tell this is fake data?']
df.loc['2'] = ['John', 'Smith', 'This poster needs help!']
df.loc['3'] = ['Michael', 'Smith', 'Apparently, this poster is sturggling a bit LOL']
df.loc['4'] = ['Daniel', 'Smith', 'More fake data here; ok.']
df.loc['5'] = ['Sarah', 'Smith', 'Will need to think up of better ideas.']
df.loc['6'] = ['Matthew', 'Smith', 'Love a good bit of Python, me.']
df.loc['7'] = ['Jane', 'Smith', 'Is this a sentence?! (I think so).']
df.loc['8'] = ['Peter', 'Smith', "Remarkable - isn't it?"]
df.loc['9'] = ['Chloe', 'Smith', "Foo Bar... that's all that is left to say."]
print(df)
punctuation_marks = ['?', '…', '!', '.', ',', '—', '–', '–', ':', ';', '\"', '\'', '[', ']', '(', ')', '{', '}']
p = 0.5 # changeable
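for idx, string in enumerate(df['Sentence']):
    for punc in punctuation_marks:
        if punc in string:
            CHANCE = (random.randint(1, 100)) / 100
            if CHANCE <= p:
                # reassign `string` so that earlier removals are kept
                string = string.replace(punc, '')
    # write the row back once, by position
    df.iloc[idx, df.columns.get_loc('Sentence')] = string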
with open('misspellings_corpus.txt', 'r') as misspellings_corpus:
    misspellings = misspellings_corpus.readlines()
for idx, string in enumerate(df['Sentence']):
    word_list = re.sub(r"[^\w]", " ", string).split()  # removes punctuation
    for word in word_list:
        CHANCE = (random.randint(1, 100)) / 100
        try:  # break middle for-loop
            for ms in misspellings:
                if (word in ms) and (CHANCE <= p):
                    # each line reads 'wrong1,wrong2->correct'; strip() drops the '\n'
                    wrong, correct = ms.strip().split('->')
                    if ',' in wrong:
                        wrong = random.choice(wrong.split(',')).strip()  # only 1 misspelling
                    if correct in string:
                        string = string.replace(correct, wrong)
                        df.iloc[idx, df.columns.get_loc('Sentence')] = string
                        raise StopIteration
        except StopIteration:
            pass
misspellings_corpus.txt (snippet):
affadvit,affa_dava,afadant,afadavate,afadavid,affidate,affidavent,afftadave,athadavid,affiadait,aphadivode,appidavid,afidaded,affi-davit,affidavat,aphadated,affivadat,afidaviat,affedavit,affiavate,affidaved,afefedavid,affidavate,affavidate,affdated,aphidavit,affevivat,affided,affadavid,attipdavid,affidavidit,affidavite,affadivate,affidavited,afdiodave,affidafet,affidivit,afadafit,affedit,afadavide,afidefed,Affi_David,affividate,affaidivit,afidiated,affidovt,affadavat,avadavate,effidavit,afidavit,aphadavid,afedaved,afardivient,apitated,affividative,affedaivite,afteradeated,Afi_David,acavated,affedated,affidevit,affidivat,afaedaviate,affedaved,afatait,afedative,avidated,afidavid,avidiate,afadavit,affedave,affedavid,afidaved,affavidit,afidated,afidavite,afodivid,affidated,afadiadid,affidaphet,affidatet,athadiet,afidabit,affidait,afadated,affadivit,affadavit,afadivite,affidavid,affadapfed,affdavit,aphedavid,athadavit,adivide,afdavit,afedavit,afadiatet,alpadavid,afadaviate,affadivid,aftedavid,affadavite,affadavate,apadenment,aphadavet->affidavit
anverrsy,aneversary,anneversies,anniversity,anavuature,annevarcery,annerfversy,anervery,annaversary,anverserice,annaversery,Anniversary,anivrsary,ananersery,anaversie,anniverserie,annaversity,anifurcaty,anenany,anavirsary,aniversy,anverseary,annervesary,annerverarcy,anaveres,anerviersy,aneversy,aniversary,anivesery,anneversers,anirversary,anniversy,aniversere,aneversere,annaversrey,anavorasy,annversary,aniversiry,anerversurey,Amanversery,anniversery,aniversery,anniversiory,anniversily,anneversary,aneversiary,anaversery,anaversity,anniverserys,anerversary,anniverseray,aniverseray,anniverary,anivessery,anaversarie,aniversity,Annyver,annervirsary,anniversty,annevyercy,aniverusy,anarversieiy,onniver,anaversy,anversity,anaveje,anversicy,anniversay,anerversee,aneversarry,anifersery,anversy,aneversery,annaversiry,annivirsary,annivercery,anvesy,anvertery,annversy,anevers,anniverisy,aneversory,anternesery,avernity,Eenarcrsity,anivarisy,aniverserary,annaverserie,anniversaries,aniversay,anyversary,ananversery,annivesrey,anniversiry,annivesry,anniverscy,annerversery,amryvercary,anneversery,anerversery,anversa,anmersersy,aneversitey,aniversry,aniverserry->anniversary
Ane->And
agenst->agents
eeg,agg->egg
Note: I can paste more sample lines if needed.
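apply can be used to call a function on each row, and it is much faster than a for loop that writes results back into the DataFrame cell by cell (vectorized functions are faster still). To make life easier and more performant, I did a few things:
- Converted your text file into a dict. This will be more performant and easier to work with than raw text.
- Put all the corruption logic in one function. This will be easier to maintain and lets us use apply.
- Cleaned up/modified the logic a bit. What I show below is not exactly what you asked for, but it should be easy to adapt.
OK, here is the code: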
import io
import random
import pandas as pd
# this generates a dict {'word1': ['list', 'of', 'misspellings']}, where s is a string holding the contents of the file above
df2 = pd.DataFrame(io.StringIO(s), columns=["subs"])
sub_dict = df2.subs.str.strip().str.split("->", expand=True).set_index(1)[0].str.split(",").to_dict()
sub_dict["fake"] = ["fak", "fkae", "fke"]
sub_dict["tell"] = ["tel"]
sub_dict["this"] = ["tis", "htsi"]
sub_dict["data"] = ["dat", "dta"]
def corrupt(sentence, sub_dict, p=0.5):
    # logic is similar but not identical to your code
    for k, v in sub_dict.items():
        if k in sentence and random.random() <= p:
            corrupted_word = random.choice(v)
            sentence = sentence.replace(k, corrupted_word)
    return sentence
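One thing to be aware of: `corrupt` tests `k in sentence`, which is a substring match, so a key such as "egg" would also fire inside "eggs". A minimal sketch of a word-level variant follows, assuming the corpus format shown in the question (`misspelling,misspelling->correct`). The names `parse_corpus` and `corrupt_words` are hypothetical, and the optional `rng` parameter is only there so a seeded `random.Random` can be passed in for repeatable runs.

```python
import random
import re


def parse_corpus(text):
    """Build {correct_word: [misspellings]} from 'wrong1,wrong2->correct' lines."""
    sub_dict = {}
    for line in text.splitlines():
        line = line.strip()
        if "->" not in line:
            continue  # skip blank or malformed lines
        wrong, correct = line.split("->", 1)
        variants = [w.strip() for w in wrong.split(",") if w.strip()]
        sub_dict.setdefault(correct.strip(), []).extend(variants)
    return sub_dict


def corrupt_words(sentence, sub_dict, p=0.5, rng=random):
    """Replace whole words found in sub_dict with a random misspelling."""
    def swap(match):
        word = match.group(0)
        variants = sub_dict.get(word)
        if variants and rng.random() <= p:
            return rng.choice(variants)
        return word  # unknown or unlucky words stay untouched

    # \w+ matches whole words, so punctuation and spacing are preserved
    return re.sub(r"\w+", swap, sentence)


corpus = "eeg,agg->egg\nagenst->agents\nAne->And"
sub_dict = parse_corpus(corpus)
print(corrupt_words("And the agents ate an egg.", sub_dict, p=1.0))
```

Each word costs one dictionary lookup here, so the work per sentence grows with the length of the sentence rather than with the size of the corpus.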
Now the apply bit:
df["corrupted"] = df.Sentence.apply(lambda sentence: corrupt(sentence, sub_dict))
# works as expected, see second sentence
   Forname  Surname  Sentence                                         corrupted
0  Bob      Smith    Hi, this is a perfectly constructred sentence!   Hi, this is a perfectly constructred sentence!
1  Alice    Smith    Can you tell this is fake data?                  Can you tel htsi is fake dta?
2  John     Smith    This poster needs help!                          This poster needs help!
3  Michael  Smith    Apparently, this poster is sturggling a bit LOL  Apparently, this poster is sturggling a bit LOL
4  Daniel   Smith    More fake data here; ok.                         More fke dat here; ok.
5  Sarah    Smith    Will need to think up of better ideas.           Will need to think up of better ideas.
6  Matthew  Smith    Love a good bit of Python, me.                   Love a good bit of Python, me.
7  Jane     Smith    Is this a sentence?! (I think so).               Is this a sentence?! (I think so).
8  Peter    Smith    Remarkable - isn't it?                           Remarkable - isn't it?
9  Chloe    Smith    Foo Bar... that's all that is left to say.       Foo Bar... that's all that is left to say.
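The punctuation step from the question can follow the same pattern. This is a minimal sketch, assuming that each punctuation mark present in a sentence should be dropped (all of its occurrences) with probability `p`. The name `drop_punctuation` is hypothetical, and the mark list is copied from the question with the duplicated dash removed.

```python
import random

import pandas as pd

PUNCTUATION_MARKS = ['?', '…', '!', '.', ',', '—', '–', ':', ';', '"', "'",
                     '[', ']', '(', ')', '{', '}']


def drop_punctuation(sentence, p=0.5, rng=random):
    """Remove every occurrence of each mark with probability p (decided per mark)."""
    for punc in PUNCTUATION_MARKS:
        if punc in sentence and rng.random() <= p:
            sentence = sentence.replace(punc, '')
    return sentence


sentences = pd.Series(["Is this a sentence?! (I think so).", "Remarkable - isn't it?"])
print(sentences.apply(drop_punctuation))         # uses the default p=0.5
print(sentences.apply(drop_punctuation, p=1.0))  # extra keyword is forwarded to the function
```

Because the function returns the new sentence instead of writing into the DataFrame, it can be chained with `corrupt` inside a single `apply` call if both steps are wanted.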
Now let's compare the performance against the for loop:
df_test1 = df.sample(n=10000, replace=True)
df_test2 = df.sample(n=10000, replace=True)
def loop(df):
    for idx, string in enumerate(df['Sentence']):
        corrupted_sentence = corrupt(string, sub_dict)
        df['Sentence'][idx] = corrupted_sentence
%timeit df_test1.Sentence.apply(lambda sentence: corrupt(sentence, sub_dict))
# 36.5 ms ± 1.47 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit loop(df_test2)
# 5.19 s ± 98.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Whoa! It is way faster: roughly 140x on this 10,000-row sample.
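If explicit batches are still wanted, for example to report progress or to hand chunks to separate workers, the same function can be applied one slice at a time. This is a minimal sketch, assuming a hypothetical helper `apply_in_batches` and an arbitrary batch size; batching a single-process `apply` this way is not expected to be faster by itself.

```python
import pandas as pd


def apply_in_batches(series, func, batch_size=1000):
    """Apply func to a Series slice by slice and stitch the results back together."""
    parts = []
    for start in range(0, len(series), batch_size):
        batch = series.iloc[start:start + batch_size]  # positional slice, index is kept
        parts.append(batch.apply(func))
    if not parts:
        return series.copy()  # nothing to process
    return pd.concat(parts)


sentences = pd.Series(["Hi, this is a sentence!", "More fake data here; ok."] * 5)
shouted = apply_in_batches(sentences, str.upper, batch_size=3)
print(shouted.tolist() == sentences.apply(str.upper).tolist())
```

In the setting above, `func` would be something like `lambda sentence: corrupt(sentence, sub_dict)`.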