How to read PGN data into a DataFrame
I have a single .pgn (Portable Game Notation) file containing a large number of chess games. The games are stored in the file like this:
[Event "FIDE World Cup 2017"]
[Site "Tbilisi GEO"]
[Date "2017.09.05"]
[Round "1.1"]
[White "Carlsen, Magnus"]
[Black "Balogun, Oluwafemi"]
[Result "1-0"]
[WhiteTitle "GM"]
[BlackTitle "FM"]
[WhiteElo "2822"]
[BlackElo "2255"]
[ECO "B00"]
[Opening "King's pawn opening"]
[WhiteFideId "1503014"]
[BlackFideId "8501246"]
[EventDate "2017.09.03"]
1. e4 d6 2. d4 g6 3. Bc4 Nf6 4. Qe2 Nc6 5. Nf3 Bg7 6. O-O Bg4 7. c3 O-O
8. h3 Bxf3 9. Qxf3 e5 10. Rd1 Qe8 11. d5 Ne7 12. Qe2 Nh5 13. Bb5 Qc8
14. Na3 a6 15. Ba4 f5 16. Bc2 f4 17. Qg4 Qxg4 18. hxg4 Nf6 19. g5 Nd7
20. Nc4 b6 21. b4 h6 22. gxh6 Bxh6 23. g4 Nf6 24. f3 Bg5 25. Kg2 Kg7
26. a4 Bh4 27. Bd2 g5 28. Rh1 Ng6 29. Kf1 Rh8 30. Ke2 Bg3 31. a5 b5 32.
Na3 Ne7 33. c4 c6 34. dxc6 Nxc6 35. Bc3 Rxh1 36. Rxh1 bxc4 37. Nxc4 Rb8
38. Nxd6 Kg6 39. Nf5 1-0
[Event "FIDE World Cup 2017"]
etc...
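I want to create a data frame from this data, where the column headers are the word on the left of each tag and the values are the quoted strings, followed by a separate column for the PGN move string.
Drawing on an earlier answer, I made an attempt and came up with: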
# read the file one line per row (sep="\n" keeps each line whole)
pgn <- read.table("~/Desktop/GitHub/Chess/test.pgn", quote="", sep="\n",
                  stringsAsFactors=FALSE)
# get column names
column_names <- sub("\\[(\\w+).+", "\\1", pgn[1:17,1])
column_names[17] <- "PGN"
# create DF
pgn.df <- data.frame(matrix(sub("\\[\\w+ \"(.+)\"\\]", "\\1",
                            pgn[,1]), byrow=TRUE, ncol=17))
names(pgn.df) <- column_names
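The problem here is that my PGN move text spans multiple lines. Is there a way to account for that in my regex? Or a way to automatically change the file so that the PGN is on a single line?
Thanks!
I would still suggest removing the unwanted line breaks in a preparation step with an (updated) substitution regex, like this: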
/(?:[^\[\]\n\S])\s*\n/ /g
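You can test it in an online regex tester (using the PGN as the input text). But like you, I had some trouble escaping the special characters in R.
So I decided to use Perl instead.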
use strict;
use File::Slurp;
my $text = read_file($ARGV[0]);
$text =~ s/(?:[^\[\]\n\S])\s*\n/ /g;
write_file($ARGV[0], $text);
It can be called from R like this:
system("perl ~/Desktop/regex.pl ~/Desktop/test.pgn")
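For readers who would rather stay out of Perl, the same preparation step can be sketched in Python. This is not a port of the regex above but a plainer line-by-line variant: tag lines are passed through unchanged and every run of move-text lines is joined with single spaces. The function name `join_movetext_lines` is made up for this illustration, and the sketch assumes that no move-text line starts with `[` (a wrapped `[%clk ...]` comment would break that) and that blank separator lines can be dropped.

```python
def join_movetext_lines(text):
    """Return PGN text with each game's move text collapsed onto one line.

    Tag-pair lines (starting with "[") are copied through unchanged; runs of
    consecutive non-tag lines are joined with single spaces.
    Assumption: no move-text line starts with "[".
    """
    out = []
    moves = []  # pieces of the move text currently being collected
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank separator lines are dropped
        if line.startswith("["):
            if moves:  # a tag line after move text starts the next game
                out.append(" ".join(moves))
                moves = []
            out.append(line)
        else:
            moves.append(line)
    if moves:  # flush the move text of the last game
        out.append(" ".join(moves))
    return "\n".join(out) + "\n"
```

With the move text on one line, each game in the sample above occupies exactly 17 lines (16 tags plus the moves), which is what the `matrix(..., ncol=17)` call in the question expects, as long as every game carries the same 16 tags.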
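I haven't tested this on Windows or Linux yet, but the C code base this package is built on claims to be very portable. You'll need an R setup that supports compiling from source (i.e. you need Rtools if you are on Windows).
Installation: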
devtools::install_github("hrbrmstr/pigeon")
Usage (tidyverse isn't actually required for the package to work, but IMO it prints data frames more cleanly than the built-in base R print function):
library(pigeon)
library(tidyverse)
Here is a small test with a built-in data set, one that you may be using as well:
fide <- read_pgn(system.file("extdata", "r7.pgn", package="pigeon"))
fide
## # A tibble: 2 x 12
## Event Site Date Round White Black Result WhiteElo BlackElo ECO
## * <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 World Cup 2017 Tbilisi 2017.09.23 44.1 Aronian Levon (ARM) Ding Liren (CHN) 1/2-1/2 2799 2777 A18
## 2 World Cup 2017 Tbilisi 2017.09.24 45.1 Ding Liren (CHN) Aronian Levon (ARM) 1/2-1/2 2777 2799 E06
## # ... with 2 more variables: LiveChessVersion <chr>, Moves <list>
glimpse(fide)
## Observations: 2
## Variables: 12
## $ Event <chr> "World Cup 2017", "World Cup 2017"
## $ Site <chr> "Tbilisi", "Tbilisi"
## $ Date <chr> "2017.09.23", "2017.09.24"
## $ Round <chr> "44.1", "45.1"
## $ White <chr> "Aronian Levon (ARM)", "Ding Liren (CHN)"
## $ Black <chr> "Ding Liren (CHN)", "Aronian Levon (ARM)"
## $ Result <chr> "1/2-1/2", "1/2-1/2"
## $ WhiteElo <chr> "2799", "2777"
## $ BlackElo <chr> "2777", "2799"
## $ ECO <chr> "A18", "E06"
## $ LiveChessVersion <chr> "1.4.8", "1.4.8"
## $ Moves <list> [c("c4", "Nf6", "Nc3", "e6", "e4", "d5", "cxd5", "exd5", "e5", "Ne4", "Nf3", "Bf5", "Be2"...
Here is a larger test:
tf <- tempfile(fileext = ".zip")
td <- tempdir()
download.file("https://www.pgnmentor.com/players/Adams.zip", tf)
fil <- unzip(tf, exdir = td)
adams <- read_pgn(fil)
adams
## # A tibble: 2,982 x 11
##    Event          Site      Date       Round White              Black              Result  WhiteElo BlackElo ECO
##  * <chr>          <chr>     <chr>      <chr> <chr>              <chr>              <chr>   <chr>    <chr>    <chr>
##  1 Lloyds Bank op London    1984.??.?? 1     Adams, Michael     Sedgwick, David    1-0                       C05
##  2 Lloyds Bank op London    1984.??.?? 3     Adams, Michael     Dickenson, Neil F  1-0              2230     C07
##  3 Lloyds Bank op London    1984.??.?? 4     Hebden, Mark       Adams, Michael     1-0     2480              B10
##  4 Lloyds Bank op London    1984.??.?? 5     Pasman, Michael    Adams, Michael     0-1     2310              D42
##  5 Lloyds Bank op London    1984.??.?? 6     Adams, Michael     Levitt, Jonathan   1/2-1/2          2370     B99
##  6 Lloyds Bank op London    1984.??.?? 9     Adams, Michael     Saeed, Saeed Ahmed 1-0              2430     B56
##  7 BCF-ch         Edinburgh 1985.??.?? 1     Adams, Michael     Singh, Sukh Dave   1/2-1/2 2360     2080     B70
##  8 BCF-ch         Edinburgh 1985.??.?? 2     Abayasekera, Roger Adams, Michael     1-0     2200     2360     B13
##  9 BCF-ch         Edinburgh 1985.??.?? 3     Adams, Michael     Jackson, Sheila    1/2-1/2 2360     2225     C85
## 10 BCF-ch         Edinburgh 1985.??.?? 4     Muir, Andrew J     Adams, Michael     1/2-1/2 2285     2360     E45
## # ... with 2,972 more rows, and 1 more variables: Moves <list>
glimpse(adams)
## Observations: 2,982
## Variables: 11
## $ Event <chr> "Lloyds Bank op", "Lloyds Bank op", "Lloyds Bank op", "Lloyds Bank op", "Lloyds Bank op", "Lloyds ...
## $ Site <chr> "London", "London", "London", "London", "London", "London", "Edinburgh", "Edinburgh", "Edinburgh",...
## $ Date <chr> "1984.??.??", "1984.??.??", "1984.??.??", "1984.??.??", "1984.??.??", "1984.??.??", "1985.??.??", ...
## $ Round <chr> "1", "3", "4", "5", "6", "9", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "?", "1", "...
## $ White <chr> "Adams, Michael", "Adams, Michael", "Hebden, Mark", "Pasman, Michael", "Adams, Michael", "Adams, M...
## $ Black <chr> "Sedgwick, David", "Dickenson, Neil F", "Adams, Michael", "Adams, Michael", "Levitt, Jonathan", "S...
## $ Result <chr> "1-0", "1-0", "1-0", "0-1", "1/2-1/2", "1-0", "1/2-1/2", "1-0", "1/2-1/2", "1/2-1/2", "1-0", "1/2-...
## $ WhiteElo <chr> "", "", "2480", "2310", "", "", "2360", "2200", "2360", "2285", "2360", "2250", "2360", "2225", "2...
## $ BlackElo <chr> "", "2230", "", "", "2370", "2430", "2080", "2360", "2225", "2360", "2245", "2360", "2260", "2360"...
## $ ECO <chr> "C05", "C07", "B10", "D42", "B99", "B56", "B70", "B13", "C85", "E45", "C84", "B10", "C85", "A22", ...
## $ Moves <list> [c("e4", "e6", "d4", "d5", "Nd2", "Nf6", "e5", "Nfd7", "f4", "c5", "c3", "Nc6", "Ndf3", "cxd4", "...
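One benefit of using a full-fledged C "library" (technically it isn't a library, but I shoehorned it into one) is that it does more than pattern matching. If a game file is malformed, it won't parse correctly (as it shouldn't).
I still need to run this through ASAN/UBSAN/Valgrind to make sure there are no memory leaks, but if this ends up being useful, let me know and I'll tidy up the package.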
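The `Moves` list column in the output above holds one token per half-move, without move numbers or the final result. How pigeon's C code produces that is not shown here; the following is only a rough Python sketch of the idea, with a hypothetical function name `movetext_to_moves`. It assumes plain move text: `{...}` and `;` comments are dropped and NAGs such as `$1` are skipped, but variations in parentheses are not handled, and nothing checks that the moves are legal.

```python
import re

# Game termination markers defined by the PGN format.
RESULTS = {"1-0", "0-1", "1/2-1/2", "*"}


def movetext_to_moves(movetext):
    """Split PGN move text into a list of move tokens (e.g. "e4", "Nf6")."""
    # Drop {...} comments and ;-to-end-of-line comments (no nesting handled).
    text = re.sub(r"\{[^}]*\}", " ", movetext)
    text = re.sub(r";[^\n]*", " ", text)
    moves = []
    for token in text.split():
        # Strip a move number such as "12." or "12..." (alone or glued to a move).
        token = re.sub(r"^\d+\.+", "", token)
        # Skip bare move numbers, result markers and NAGs like "$1".
        if not token or token in RESULTS or token.startswith("$"):
            continue
        moves.append(token)
    return moves
```

Because the move-number pattern requires a dot right after the digits, result markers such as `1-0` and `1/2-1/2` are left intact and then filtered out by the `RESULTS` check.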
Another approach you could take is to convert the .pgn to a .csv, which is the easiest file structure for pandas to parse.
https://pypi.org/project/pgn2data/
from converter.pgn_data import PGNData as pgnd
import pandas as pd
# This creates two output files, one for game info
# (white_elo, black_elo, rating_diff, time_control... etc),
# and one for moves.
filename = 'path to .pgn file'
pgn_data = pgnd(filename)
result = pgn_data.export()
result.print_summary()
# Then read the csv with pandas
# Change path to where your files output
path = 'Documents/github/project/folder/'
df_info = pd.read_csv(path + '_game_info.csv')
df_moves = pd.read_csv(path + '_moves.csv')
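If an intermediate .csv is not wanted, the rows can also be collected directly with the standard library and handed to pandas. A minimal sketch, assuming every game has at least one line of move text after its tags; the function name `pgn_to_rows` and the `PGN` key are my own choices, following the column name used in the question:

```python
import re

# A tag pair looks like: [White "Carlsen, Magnus"]
TAG_RE = re.compile(r'^\[(\w+)\s+"(.*)"\]\s*$')


def pgn_to_rows(text):
    """Return one dict per game: tag names as keys plus a "PGN" move-text key.

    Assumption: every game has some move text (at least a result marker)
    after its tag lines; otherwise two games would be merged into one row.
    """
    rows = []
    tags, moves = {}, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank separator lines are ignored
        match = TAG_RE.match(line)
        if match:
            if moves:  # a tag line after move text starts the next game
                tags["PGN"] = " ".join(moves)
                rows.append(tags)
                tags, moves = {}, []
            tags[match.group(1)] = match.group(2)
        else:
            moves.append(line)  # move text may span several lines
    if tags or moves:  # flush the last game
        tags["PGN"] = " ".join(moves)
        rows.append(tags)
    return rows
```

Passing the result to `pd.DataFrame(pgn_to_rows(text))` should then give one column per tag plus a `PGN` column, with NaN wherever a game lacks a tag. Unlike a real PGN parser, this does no validation of the games.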