Complex Data Extraction in Python
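I need some help getting a program started. I play several online poker tournaments every week. It turns out that the site I use logs hand histories and saves them as .txt files on my hard drive. Unfortunately, the data is in a rather rough format. I would like to create a program that logs each hand and tells me how much I won or lost. I have pasted a sample of one hand below, and I would like to extract the following information from each hand.
Blinds and antes. As you scroll down the sample, you can see "Player Player8 has small blind (250)", followed by "Player Player1 has big blind (500)". Each player's ante is noted above those lines, e.g. "Player Hero ante (50)". So in this case, small blind = 250, big blind = 500, ante = 50.
My stack size. I have labeled my player "Hero". My stack size is shown on line 6 as "Seat 3: Hero (17595)". In this case, my stack size is 17595.
My hand. In this example it is given as "Player Hero received card: [10c]" and "Player Hero received card: [7h]", so my hand is "10c7h".
Number of players. There are 8 players in the sample.
My position. This one may be a bit tricky. I have decided to start with the big blind and assign it the value 0. Small blind = 1, button = 2, and so on. This goes against "poker logic" somewhat, but it makes more sense to me from a programming standpoint, since there will always be a big blind, whereas some of the other positions depend on how many players are at the table.
Profit / Loss. This is near the bottom of the text, under the "Summary" label: "Player Hero does not show cards.Bets: 50. Collects: 0. Loses: 50." In this case my profit is -50 (i.e. a loss of 50), which means I paid the 50 ante and folded.
Here is what the .txt file looks like. Note that this is a single hand. In the actual .txt file, dozens or hundreds of hands follow this one. The beginning of a hand is always marked by "Game started", and its last line always by "Game ended".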
Game started at: 2018/1/9 10:14:10
Game ID: 1094127759 250/500 ,000 GTD, Table 4 (Hold'em)
Seat 7 is the button
Seat 1: Player1 (9650).
Seat 2: Player2 (19433).
Seat 3: Hero (17595).
Seat 4: Player4 (8900).
Seat 5: Player5 (12670).
Seat 6: Player6 (11187).
Seat 7: Player7 (11300).
Seat 8: Player8 (17603).
Player Player8 ante (50)
Player Player1 ante (50)
Player Player2 ante (50)
Player Hero ante (50)
Player Player4 ante (50)
Player Player5 ante (50)
Player Player6 ante (50)
Player Player7 ante (50)
Player Player8 has small blind (250)
Player Player1 has big blind (500)
Player Player8 received a card.
Player Player8 received a card.
Player Player1 received a card.
Player Player1 received a card.
Player Player2 received a card.
Player Player2 received a card.
Player Hero received card: [10c]
Player Hero received card: [7h]
Player Player4 received a card.
Player Player4 received a card.
Player Player5 received a card.
Player Player5 received a card.
Player Player6 received a card.
Player Player6 received a card.
Player Player7 received a card.
Player Player7 received a card.
Player Player2 folds
Player Hero folds
Player Player4 raises (1000)
Player Player5 folds
Player Player6 folds
Player Player7 folds
Player Player8 folds
Player Player1 folds
Uncalled bet (500) returned to Player4
Player Player4 mucks cards
------ Summary ------
Pot: 1650
Player Player1 does not show cards.Bets: 550. Collects: 0. Loses: 550.
Player Player2 does not show cards.Bets: 50. Collects: 0. Loses: 50.
Player Hero does not show cards.Bets: 50. Collects: 0. Loses: 50.
*Player Player4 mucks (does not show cards). Bets: 550. Collects: 1650. Wins: 1100.
Player Player5 does not show cards.Bets: 50. Collects: 0. Loses: 50.
Player Player6 does not show cards.Bets: 50. Collects: 0. Loses: 50.
Player Player7 does not show cards.Bets: 50. Collects: 0. Loses: 50.
Player Player8 does not show cards.Bets: 300. Collects: 0. Loses: 300.
Game ended at: 2018/1/9 10:14:52
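Any help is appreciated, even just some ideas on how I might go about this or what kind of things I should study. In my mind, the output would look something like this: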
HandNumber = 000001
BigBlind = 500
Ante = 50
Players = 8
StackSize = 17595
Hand = 10c7h
Position = 6 # small blind = 1; add 5 since I'm 5 positions removed
Profit = -50
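My experience level: I have been taking online courses in Python development, data science, and SQL for about 6 months. I know a little about classes, but I don't have much experience creating my own. I have designed a few programs that use regular expressions to help extract data from financial statements.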
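This is most easily solved by using a regular expression to split the file into separate games, and then using more regular expressions to extract the information.
I would make a class to hold all this information. You can then use a database or JSON to store it.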
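As a minimal sketch of the "class plus JSON" idea, assuming the field names from the desired output in the question (`HandRecord`, its attributes, and `to_json` are hypothetical names chosen for illustration), a dataclass can hold one hand and be serialized with the standard `json` module:

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class HandRecord:
    # Hypothetical container; the fields mirror the desired output above.
    hand_number: int
    big_blind: int
    ante: int
    players: int
    stack_size: int
    hand: str
    position: int
    profit: int


def to_json(records):
    """Serialize a list of HandRecord objects to a JSON string."""
    return json.dumps([asdict(r) for r in records], indent=2)


# Values taken from the sample hand in the question.
record = HandRecord(1, 500, 50, 8, 17595, "10c7h", 6, -50)
print(to_json([record]))
```

A plain dict would work just as well; the dataclass only makes the expected fields explicit.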
import datetime
import re
from io import StringIO


def split_file(file_handle):
    """Yield one regex match per hand found in the file."""
    # Every group is non-greedy so that a file holding many hands
    # yields one match per hand instead of one match for the whole file.
    pat_str = '''\
^Game started at: (?P<game_start>.*?)
(?P<game>.*?)
^------ Summary ------
(?P<summary>.*?)
^Game ended at: (?P<game_end>.*?)$\
'''
    pat = re.compile(pat_str, flags=re.MULTILINE | re.DOTALL)
    text = file_handle.read()
    for game in pat.finditer(text):
        yield game


class Pokergame:
    def __init__(self, game_info, playername='Hero'):
        date_format = "%Y/%m/%d %H:%M:%S"
        self.game_start = datetime.datetime.strptime(game_info['game_start'], date_format)
        self.game_end = datetime.datetime.strptime(game_info['game_end'], date_format)
        self.game_info = _parse_game(game_info['game'], playername)
        self.summary = _parse_summary(game_info['summary'], playername)


def _parse_game(game_str, playername):
    name = re.escape(playername)  # the name may contain regex metacharacters
    pattern_seat = rf'Seat (\d+): {name} \((\d+)\)\.'
    seat_match = re.search(pattern=pattern_seat, string=game_str)
    # Fall back to None when the player is not seated in this hand.
    seat, stack = seat_match.groups() if seat_match else (None, None)
    pattern_cards = rf'Player {name} received card: \[(?P<card>\w+)\]'
    cards = tuple(i['card'] for i in re.finditer(pattern_cards, game_str))
    result = {
        'seat': seat,
        'stack': stack,
        'cards': cards,
        'text': game_str,
    }
    return result


def _parse_summary(summary_str, playername):
    # Placeholder: returns the raw summary text for now.
    return summary_str


# hand_text is assumed to hold the sample hand history from the question.
games = []
with StringIO(hand_text) as file_handle:
    for game_info in split_file(file_handle):
        games.append(Pokergame(game_info))
I have used StringIO to mimic open(file). You will have to flesh out __init__ and the _parse_... functions a bit, but this should put you on the right track.
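One way the summary parsing could be fleshed out is sketched below. It assumes every summary line follows the "Bets / Collects" layout of the sample, and that profit equals Collects minus Bets, which holds for every line in the sample. The function name `parse_profit` and the returned keys are illustrative, not part of the answer's code.

```python
import re

# Trimmed copy of the summary section from the sample hand.
SUMMARY = """\
Pot: 1650
Player Hero does not show cards.Bets: 50. Collects: 0. Loses: 50.
*Player Player4 mucks (does not show cards). Bets: 550. Collects: 1650. Wins: 1100.
"""


def parse_profit(summary_str, playername="Hero"):
    """Return bets, collects and profit for one player, or None if absent."""
    pattern = (
        # Winning lines start with '*'; the space after the name keeps
        # "Player1" from matching a longer name such as "Player10".
        rf"^\*?Player {re.escape(playername)} .*?"
        r"Bets: (?P<bets>\d+)\. Collects: (?P<collects>\d+)\."
    )
    match = re.search(pattern, summary_str, flags=re.MULTILINE)
    if match is None:
        return None
    bets = int(match["bets"])
    collects = int(match["collects"])
    return {"bets": bets, "collects": collects, "profit": collects - bets}


print(parse_profit(SUMMARY))             # Hero paid the ante and folded
print(parse_profit(SUMMARY, "Player4"))  # the winner of the hand
```

Computing the profit from the two amounts avoids having to distinguish the "Loses" and "Wins" wording.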
If you have multiple files, you can use itertools.chain to concatenate the games.
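A sketch of the multi-file case, assuming the hand histories are plain .txt files: the helper names `hands_in_file` and `hands_in_files` are hypothetical, and the simplified pattern here only captures the raw text of each hand rather than the named groups used above.

```python
import itertools
import re
from pathlib import Path

# One match per hand: from "Game started at:" to the end of the
# "Game ended at:" line. The non-greedy middle stops at the first end marker.
HAND_PATTERN = re.compile(
    r"^Game started at: .*?^Game ended at: [^\n]*",
    flags=re.MULTILINE | re.DOTALL,
)


def hands_in_file(path):
    """Yield the raw text of every hand in one history file."""
    text = Path(path).read_text()
    for match in HAND_PATTERN.finditer(text):
        yield match.group(0)


def hands_in_files(paths):
    """Chain the per-file generators into a single stream of hands."""
    return itertools.chain.from_iterable(hands_in_file(p) for p in paths)
```

For example, `list(hands_in_files(sorted(Path("hand_histories").glob("*.txt"))))` would collect every hand from a directory, where "hand_histories" is a placeholder for wherever the site saves its files.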
games[0].game_info
{'cards': ('10c', '7h'),
'seat': '3',
'stack': '17595',
'text': "Game ID: 1094127759 250/500 ,000 GTD, ...\nPlayer Player4 mucks cards"}
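The remaining fields from the question (blinds, ante, number of players, position) could be pulled from the 'text' entry in a similar way. The sketch below assumes the line formats of the sample hand, player names without spaces, and the question's numbering scheme (big blind = 0, small blind = 1, button = 2, and so on, counting backwards around the table). `parse_table` and its keys are illustrative names.

```python
import re


def parse_table(game_str, playername="Hero"):
    """Extract blinds, ante, player count and position from one hand."""
    # Player names in seat order, taken from lines like "Seat 3: Hero (17595)."
    names = re.findall(r"^Seat \d+: (\S+) \(\d+\)", game_str, re.MULTILINE)

    def amount(label, who=r"\S+"):
        # Amount in parentheses on the first "Player <who> <label> (N)" line.
        m = re.search(rf"^Player {who} {label} \((\d+)\)", game_str, re.MULTILINE)
        return int(m.group(1)) if m else None

    bb_match = re.search(r"^Player (\S+) has big blind", game_str, re.MULTILINE)
    position = None
    if bb_match and playername in names and bb_match.group(1) in names:
        # Count seats backwards from the big blind, wrapping around the table.
        bb_index = names.index(bb_match.group(1))
        position = (bb_index - names.index(playername)) % len(names)
    return {
        "small_blind": amount("has small blind"),
        "big_blind": amount("has big blind"),
        "ante": amount("ante", re.escape(playername)),
        "players": len(names),
        "position": position,
    }
```

Because the position is derived from the seat list rather than from fixed seat numbers, the same calculation should carry over to tables with fewer players.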