Does bufio.NewScanner in Golang read the entire file into memory instead of one line at a time?
I tried to read a file line by line with bufio.NewScanner, using the following function:
    func TailFromStart(fd *os.File, wg *sync.WaitGroup) {
        fd.Seek(0, 0)
        scanner := bufio.NewScanner(fd)
        for scanner.Scan() {
            line := scanner.Text()
            offset, _ := fd.Seek(0, 1)
            fmt.Println(offset)
            fmt.Println(line)
            offsetreset, _ := fd.Seek(offset, 0)
            fmt.Println(offsetreset)
        }
        offset, err := fd.Seek(0, 1)
        CheckError(err)
        fmt.Println(offset)
        wg.Done()
    }
I expected it to print the offsets in increasing order. Instead, it prints the same value on every iteration until the file reaches EOF:
127.0.0.1 - - [11/Aug/2016:22:10:39 +0530] "GET /ttt HTTP/1.1" 404 437 "-" "curl/7.38.0"
613
613
127.0.0.1 - - [11/Aug/2016:22:10:42 +0530] "GET /qqq HTTP/1.1" 404 437 "-" "curl/7.38.0"
613
613 is the total number of bytes in the file:
cat /var/log/apache2/access.log | wc
7 84 613
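Am I misunderstanding something, or does bufio.NewScanner read the entire file into memory and then iterate over that in-memory copy? If so, is there a better way to read a file line by line?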
See the documentation for func (s *Scanner) Buffer(buf []byte, max int):
Buffer sets the initial buffer to use when scanning and the maximum size of buffer that may be allocated during scanning. The maximum token size is the larger of max and cap(buf). If max <= cap(buf), Scan will use this buffer only and do no allocation.

By default, Scan uses an internal buffer and sets the maximum token size to MaxScanTokenSize.

Buffer panics if it is called after scanning has started.
And:
MaxScanTokenSize is the maximum size used to buffer a token unless
the user provides an explicit buffer with Scan.Buffer. The actual
maximum token size may be smaller as the buffer may need to include,
for instance, a newline.
MaxScanTokenSize = 64 * 1024
startBufSize = 4096 // Size of initial allocation for buffer.
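
As a rough illustration of the limits quoted above, the sketch below scans a single long line from an in-memory reader, first with the default limit and then after calling Buffer with a larger max. The helper name scanLongLine and the sizes used are assumptions made for this example; they do not come from the question.

```go
package main

import (
	"bufio"
	"errors"
	"fmt"
	"strings"
)

// scanLongLine is a hypothetical helper: it scans one line of lineLen bytes
// and reports how many lines were returned and whether scanning stopped
// with bufio.ErrTooLong. maxToken <= 0 keeps the scanner's default limit.
func scanLongLine(lineLen, maxToken int) (lines int, tooLong bool) {
	input := strings.Repeat("x", lineLen) + "\n"
	scanner := bufio.NewScanner(strings.NewReader(input))
	if maxToken > 0 {
		// Buffer must be called before the first Scan, otherwise it panics.
		scanner.Buffer(make([]byte, 0, 4096), maxToken)
	}
	for scanner.Scan() {
		lines++
	}
	return lines, errors.Is(scanner.Err(), bufio.ErrTooLong)
}

func main() {
	lineLen := 100 * 1024 // longer than bufio.MaxScanTokenSize (64 KiB)

	lines, tooLong := scanLongLine(lineLen, 0)
	fmt.Println("default limit:", lines, "lines, too long:", tooLong)

	lines, tooLong = scanLongLine(lineLen, 1024*1024)
	fmt.Println("1 MiB limit:", lines, "lines, too long:", tooLong)
}
```

The point of the sketch is that the limit applies to a single token (here, one line), not to the size of the file: a file of any size scans fine as long as each line fits within the limit.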
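No. As @JimB said, it only reads one buffer's worth of data at a time. For a file smaller than 4096 bytes, the whole file fits into the buffer on the first read, which is why every offset in your output is 613. For a large file, only the first 4096 bytes are read. Try this test example with a large file: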
    package main

    import (
        "bufio"
        "fmt"
        "os"
    )

    func main() {
        fd, err := os.Open("big.txt")
        if err != nil {
            panic(err)
        }
        defer fd.Close()

        n, err := fd.Seek(0, 0)
        if err != nil {
            panic(err)
        }
        fmt.Println("n =", n) // 0

        scanner := bufio.NewScanner(fd)
        for scanner.Scan() {
            fmt.Println(scanner.Text())
            break
        }

        offset, err := fd.Seek(0, 1)
        if err != nil {
            panic(err)
        }
        fmt.Println("offset =", offset) // 4096

        offsetreset, err := fd.Seek(offset, 0)
        if err != nil {
            panic(err)
        }
        fmt.Println("offsetreset =", offsetreset) // 4096

        offset, err = fd.Seek(0, 1)
        if err != nil {
            panic(err)
        }
        fmt.Println("offset =", offset) // 4096
    }
Output:
n = 0
offset = 4096
offsetreset = 4096
offset = 4096
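
The same behavior can be observed without a file on disk. The sketch below wraps an in-memory reader in a small counting type and checks how many bytes the scanner pulled from it after a single Scan call. The names countingReader and bytesReadAfterFirstScan are hypothetical, introduced only for this illustration.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// countingReader is a hypothetical wrapper that records how many bytes
// have been read from the underlying reader.
type countingReader struct {
	r io.Reader
	n int64
}

func (c *countingReader) Read(p []byte) (int, error) {
	n, err := c.r.Read(p)
	c.n += int64(n)
	return n, err
}

// bytesReadAfterFirstScan builds an input of numLines five-byte lines,
// calls Scan once, and returns how many bytes the scanner pulled from
// the source in order to produce that first line.
func bytesReadAfterFirstScan(numLines int) int64 {
	src := &countingReader{r: strings.NewReader(strings.Repeat("line\n", numLines))}
	scanner := bufio.NewScanner(src)
	scanner.Scan()
	return src.n
}

func main() {
	fmt.Println("50-byte input:", bytesReadAfterFirstScan(10))
	fmt.Println("10000-byte input:", bytesReadAfterFirstScan(2000))
}
```

With the 4096-byte initial buffer quoted above, the small input is consumed entirely on the first read (50 bytes), while the large input yields only the first 4096 bytes. That matches what fd.Seek(0, 1) reports in the file-based test.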
You can increase the scanner's buffer size if your lines are longer than the default limit. For example:

    scanner := bufio.NewScanner(file)
    buf := make([]byte, 0, 64*1024)
    scanner.Buffer(buf, 1024*1024) // allow lines up to 1 MB; raise this value for longer lines
    for scanner.Scan() {
        // do your stuff
    }
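
If the goal is the increasing per-line offset the question expected, the file position cannot provide it, because the file is read ahead in buffer-sized chunks. One possible approach, sketched here under the assumption that lines are newline-terminated, is to wrap bufio.ScanLines in a custom split function and add up the bytes it advances over. The helper lineEnds is hypothetical and reads from a string for brevity; passing an *os.File to bufio.NewScanner should work the same way, with offsets relative to where reading started.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// lineEnds is a hypothetical helper that returns, for each line in input,
// the byte offset just past that line (including its newline, if any).
func lineEnds(input string) []int64 {
	var consumed int64 // bytes the scanner has consumed as tokens so far
	var ends []int64

	scanner := bufio.NewScanner(strings.NewReader(input))
	// Wrap the default line splitter and add up the bytes it advances over.
	scanner.Split(func(data []byte, atEOF bool) (int, []byte, error) {
		advance, token, err := bufio.ScanLines(data, atEOF)
		consumed += int64(advance)
		return advance, token, err
	})
	for scanner.Scan() {
		// consumed now points just past the line returned by scanner.Text().
		ends = append(ends, consumed)
	}
	return ends
}

func main() {
	for i, end := range lineEnds("first\nsecond\nthird\n") {
		fmt.Printf("line %d ends at byte offset %d\n", i+1, end)
	}
}
```

Counting inside the split function keeps the offset tied to what the scanner has actually handed out, independent of how far ahead the underlying reads have gone.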