Using Python Faker to generate different data for 5000 rows
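I would like to use the Python Faker library to generate 500 rows of data, but the code below gives me repeated data. Can you point out where I am going wrong? I believe it has something to do with the for loop. Thanks in advance: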
from faker import Factory
import pandas as pd
import random

def create_fake_stuff(fake):
    df = pd.DataFrame(columns=('name'
                               , 'email'
                               , 'bs'
                               , 'address'
                               , 'city'
                               , 'state'
                               , 'date_time'
                               , 'paragraph'
                               , 'Conrad'
                               , 'randomdata'))
    stuff = [fake.name()
             , fake.email()
             , fake.bs()
             , fake.address()
             , fake.city()
             , fake.state()
             , fake.date_time()
             , fake.paragraph()
             , fake.catch_phrase()
             , random.randint(1000, 2000)]
    for i in range(10):
        df.loc[i] = [item for item in stuff]
    print(df)

if __name__ == '__main__':
    fake = Factory.create()
    create_fake_stuff(fake)
I moved the stuff list inside my for loop, so that the Faker calls run again on every iteration, and got the desired result:
for i in range(10):
    stuff = [fake.name()
             , fake.email()
             , fake.bs()
             , fake.address()
             , fake.city()
             , fake.state()
             , fake.date_time()
             , fake.paragraph()
             , fake.catch_phrase()
             , random.randint(1000, 2000)]
    df.loc[i] = [item for item in stuff]
print(df)
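The difference between the two versions can be shown without Faker at all. The sketch below is only an illustration: make_row is a hypothetical stand-in for the list of fake.* calls, and it uses the standard random module instead of Faker.

```python
import random

rng = random.Random(0)

def make_row():
    # Hypothetical stand-in for [fake.name(), fake.email(), ...]:
    # each call draws new values from the generator.
    return [rng.randint(1000, 2000), rng.random()]

# Built once, before the loop: the generators run a single time,
# so every row repeats the same values.
stuff = make_row()
repeated = [list(stuff) for _ in range(5)]

# Built inside the loop: the generators run again on each iteration.
fresh = [make_row() for _ in range(5)]

print(repeated)
print(fresh)
```

A list literal such as [fake.name(), ...] stores the values returned at the moment it is built, not the calls themselves, which is why it has to be rebuilt for every row.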
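The following script can significantly improve performance with pandas. It builds all rows as a list of dicts and creates the DataFrame once, instead of adding rows one at a time.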
from faker import Faker
import pandas as pd
import random

fake = Faker()

def create_rows(num=1):
    # One dict per row; the fake.* calls run again for every row.
    output = [{"name": fake.name(),
               "address": fake.address(),
               "email": fake.email(),
               "bs": fake.bs(),
               "city": fake.city(),
               "state": fake.state(),
               "date_time": fake.date_time(),
               "paragraph": fake.paragraph(),
               "Conrad": fake.catch_phrase(),
               "randomdata": random.randint(1000, 2000)} for x in range(num)]
    return output
It takes 5.55 s:
%%time
df = pd.DataFrame(create_rows(5000))
Wall time: 5.55 s
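To see the effect on a small scale outside a notebook, here is a rough sketch that compares growing a DataFrame row by row with df.loc against building it once from a list of dicts. The make_row helper and its two columns are hypothetical stand-ins for the Faker calls, and the measured times will differ by machine and pandas version.

```python
import random
import time

import pandas as pd

COLUMNS = ["id", "randomdata"]

def make_row(i):
    # Hypothetical stand-in for the fake.name(), fake.email(), ... calls.
    return {"id": i, "randomdata": random.randint(1000, 2000)}

def build_with_loc(n):
    # Grows the DataFrame one row at a time, as in the question.
    df = pd.DataFrame(columns=COLUMNS)
    for i in range(n):
        row = make_row(i)
        df.loc[i] = [row[c] for c in COLUMNS]
    return df

def build_from_records(n):
    # Collects plain dicts first and creates the DataFrame once.
    return pd.DataFrame([make_row(i) for i in range(n)], columns=COLUMNS)

def timed(func, n):
    # Returns the result together with the elapsed wall-clock time.
    start = time.perf_counter()
    result = func(n)
    return result, time.perf_counter() - start

df_loc, t_loc = timed(build_with_loc, 500)
df_rec, t_rec = timed(build_from_records, 500)
print(f"df.loc: {t_loc:.3f} s, list of dicts: {t_rec:.3f} s")
```

Both functions produce a frame of the same shape; only the way the rows get into it differs.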
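Disclaimer: this answer was added well after the question and adds some new information that does not directly answer it.
There is now a fast new library, Mimesis - Fake Data Generator.
- Pros: it is said to work several times faster than faker (see my test below on data similar to that in the question).
- Cons: it works only with Python 3.6.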
pip install mimesis
>>> from mimesis import Person
>>> from mimesis.enums import Gender
>>> person = Person('en')
>>> person.full_name(gender=Gender.FEMALE)
'Antonetta Garrison'
>>> personru = Person('ru')
>>> personru.full_name()
'Рената Черкасова'
Usage is similar to faker, which was developed earlier:
pip install faker
>>> from faker import Faker
>>> fake_ru=Faker('ru_RU')
>>> fake_jp=Faker('ja_JP')
>>> print (fake_ru.name())
Субботина Елена Наумовна
>>> print (fake_jp.name())
大垣 花子
Below are my recent timings of Mimesis vs. Faker, based on the code from forzer0eight's answer:
from faker import Faker
import pandas as pd
import random

fake = Faker()

def create_rows_faker(num=1):
    # Fields without a counterpart in the Mimesis version are commented out.
    output = [{"name": fake.name(),
               "address": fake.address(),
               "email": fake.email(),
               # "bs": fake.bs(),
               "city": fake.city(),
               "state": fake.state(),
               "date_time": fake.date_time(),
               # "paragraph": fake.paragraph(),
               # "Conrad": fake.catch_phrase(),
               "randomdata": random.randint(1000, 2000)} for x in range(num)]
    return output
%%time
df_faker = pd.DataFrame(create_rows_faker(5000))
CPU times: user 3.51 s, sys: 2.86 ms, total: 3.51 s
Wall time: 3.51 s
from mimesis import Person
from mimesis import Address
from mimesis import Datetime
import pandas as pd
import random

person = Person()
address = Address()
datetime = Datetime()

def create_rows_mimesis(num=1):
    # Same fields as create_rows_faker, generated with Mimesis providers.
    output = [{"name": person.name(),
               "address": address.address(),
               "email": person.email(),
               # "bs": person.bs(),
               "city": address.city(),
               "state": address.state(),
               "date_time": datetime.datetime(),
               # "paragraph": person.paragraph(),
               # "Conrad": person.catch_phrase(),
               "randomdata": random.randint(1000, 2000)} for x in range(num)]
    return output
%%time
df_mimesis = pd.DataFrame(create_rows_mimesis(5000))
CPU times: user 178 ms, sys: 1.7 ms, total: 180 ms
Wall time: 179 ms
Below is a comparison of the resulting data:
df_faker.head(2)
   address                                      city         date_time            email                        name            randomdata  state
0  3818 Goodwin Haven\nBrocktown, GA 06168      Valdezport   2004-10-18 20:35:52  joseph81@gomez-beltran.info  Deborah Garcia  1218        Oklahoma
1  2568 Gonzales Field\nRichardhaven, NC 79149  West Rachel  1985-02-03 00:33:00  lbeck@wang.com               Barbara Pineda  1536        Tennessee
df_mimesis.head(2)
   address             city         date_time                   email                            name      randomdata  state
0  351 Nobles Viaduct  Cedar Falls  2013-08-22 08:20:25.288883  chemotherapeutics1964@gmail.com  Ernest    1673        Georgia
1  517 Williams Hill   Malden       2008-01-26 18:12:01.654995  biochemical1972@yandex.com       Jonathan  1845        North Dakota
Using the farsante and mimesis libraries is the easiest way to create a Pandas DataFrame with fake data.
import random
import farsante
from mimesis import Person
from mimesis import Address
from mimesis import Datetime
person = Person()
address = Address()
datetime = Datetime()
def rand_int(min_int, max_int):
    def some_rand_int():
        return random.randint(min_int, max_int)
    return some_rand_int

df = farsante.pandas_df([
    person.full_name,
    address.address,
    person.name,
    person.email,
    address.city,
    address.state,
    datetime.datetime,
    rand_int(1000, 2000)], 5)
print(df)
   full_name       address              name    ...  state          datetime                    some_rand_int
0  Weldon Durham   1027 Nellie Square   Bruna   ...  West Virginia  2030-06-10 09:21:29.179412  1453
1  Veta Conrad     932 Cragmont Arcade  Betsey  ...  Iowa           2017-08-11 23:50:27.479281  1909
2  Vena Kinney     355 Edgar Highway    Tyson   ...  New Hampshire  2002-12-21 05:26:45.723531  1735
3  Adam Sheppard   270 Williar Court    Treena  ...  North Dakota   2011-03-30 19:16:29.015598  1503
4  Penney Allison  592 Oakdale Road     Chas    ...  Maine          2009-12-14 16:31:37.714933  1175
This approach keeps your code clean.
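The column names in the output (full_name, some_rand_int) match the names of the functions that were passed in, which suggests that each column is named after its callable. The sketch below shows that idea with the standard library only. It is a guess at the pattern, not farsante's actual implementation, and rows_from_funcs and rand_flag are hypothetical names.

```python
import random

def rows_from_funcs(funcs, n):
    # Call every function once per row and key each value by the
    # function's __name__, so the function list doubles as the schema.
    return [{f.__name__: f() for f in funcs} for _ in range(n)]

def rand_int(min_int, max_int):
    # Closure that fixes the bounds and returns a zero-argument generator.
    def some_rand_int():
        return random.randint(min_int, max_int)
    return some_rand_int

def rand_flag():
    return random.choice([True, False])

rows = rows_from_funcs([rand_int(1000, 2000), rand_flag], 5)
print(rows[0])
```

A list of dicts like this can be passed straight to pd.DataFrame, and the Mimesis provider methods from the example above can be used in place of the two toy generators.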