It is common practice to calculate checksums for files to check their integrity. For large files, the checksum computation is slow. I wondered why it is so slow and whether choosing another tool would help. In this post, I try three common tools, md5sum, sha1sum and crc32, to compute checksums on a relatively large file, to see which checksum tool on Linux is fastest and to help us choose among them.
The file to be checksummed is a 15GB text file:
$ ls -lha wiki.txt
-rw-r--r-- 1 zma zma 15G Jun 14 10:28 wiki.txt
The performance
Now, let’s see how the three tools perform when computing the checksum of the file.
sha1sum speed
$ time sha1sum wiki.txt
251dcb5c08c6a2fabd258f2c8a9b95e15c0cc098 wiki.txt

real 1m21.143s
user 0m21.647s
sys 0m4.668s
crc32 speed
$ time crc32 wiki.txt
0080f7a1

real 1m21.051s
user 0m16.194s
sys 0m4.890s
md5sum speed
$ time md5sum wiki.txt
e2e649030c795ffa9f33a99bcb39dde7 wiki.txt

real 1m27.392s
user 0m25.563s
sys 0m3.936s
From the results, crc32 is the fastest, but it is just a tiny bit faster than sha1sum. md5sum is the slowest, but only a little bit slower.
Why is there not much difference? To compute the checksums, the tools need to read the file and do the computation. Now, let’s check how much time is needed just to read the file content.
$ time dd if=wiki.txt of=/dev/null bs=8192
1953039+1 records in
1953039+1 records out
15999296457 bytes (16 GB) copied, 80.4203 s, 199 MB/s

real 1m20.447s
user 0m0.202s
sys 0m7.091s
The I/O read speed is around 200MB/s. That’s not bad for a single magnetic disk.
So, almost all of the time is spent reading the file content. The algorithms and the tools themselves are not yet the limitation. The disk I/O speed is.
The conclusion: use whichever tool works best for you, without worrying much about speed on a relatively modern computer (although the checksum still takes time). You may need to be aware of the collision properties of these algorithms; see Simard’s comment. If you want higher speed, improve your I/O speed first, until the CPU becomes the bottleneck (CPU usage reaches 100%).
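One way to see how far the CPU is from being the bottleneck is to compare the CPU time (user + sys) with the wall-clock time (real) that `time` reports. The following is a minimal sketch; the helper name `cpu_share` is my own, and the timings are the ones from the runs above, converted to seconds.

```shell
# cpu_share REAL USER SYS
# Prints (user + sys) / real as a percentage, i.e. roughly how busy one CPU
# core was during the run. All three arguments are in seconds.
cpu_share() {
    awk -v real_s="$1" -v user_s="$2" -v sys_s="$3" \
        'BEGIN { printf "%.1f\n", (user_s + sys_s) / real_s * 100 }'
}

# Timings copied from the three runs on wiki.txt (1m21.143s = 81.143 s, etc.).
echo "sha1sum: $(cpu_share 81.143 21.647 4.668)%"
echo "crc32:   $(cpu_share 81.051 16.194 4.890)%"
echo "md5sum:  $(cpu_share 87.392 25.563 3.936)%"
```

For all three tools the share comes out at roughly a quarter to a third, which is consistent with the process waiting on the disk for most of the run.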
What if I/O was not the bottleneck
Pádraig comments that we can avoid the I/O and measure the computational cost. I made a small change to the suggested command and ran the checksums on a file under /dev/shm/ instead, as crc32 does not accept input from STDIN. The system is the same one on which I did the previous tests. Its /dev/shm could only hold 3GB at the time I did this test. The results are as follows.
[zma@host:/dev/shm]$ head -c 3G /dev/zero >test
[zma@host:/dev/shm]$ for chk in crc32 md5sum sha1sum ; do echo $chk; time $chk test; done
crc32
480bbe37

real 0m3.411s
user 0m2.931s
sys 0m0.482s
md5sum
c698c87fb53058d493492b61f4c74189 test

real 0m5.103s
user 0m4.697s
sys 0m0.409s
sha1sum
6e7f6dca8def40df0b21f58e11c1a41c3e000285 test

real 0m4.451s
user 0m4.082s
sys 0m0.372s
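Since the test file has to fit in /dev/shm, it may be worth checking the available space before creating it. This is a minimal sketch, assuming a POSIX `df`; the helper name `has_room` is hypothetical and not part of the original test.

```shell
# has_room DIR BYTES
# Succeeds (exit status 0) if the filesystem holding DIR has at least BYTES
# bytes available. `df -Pk` prints the available space, in 1024-byte blocks,
# in the 4th column of its second line.
has_room() {
    avail_kb=$(df -Pk "$1" 2>/dev/null | awk 'NR == 2 { print $4 }')
    avail_kb=${avail_kb:-0}   # treat a missing directory as "no room"
    [ "$((avail_kb * 1024))" -ge "$2" ]
}

# Only report success if a 3 GiB file (what `head -c 3G` writes) would fit.
size=$((3 * 1024 * 1024 * 1024))
if has_room /dev/shm "$size"; then
    echo "/dev/shm can hold a 3 GiB test file"
else
    echo "/dev/shm is too small for a 3 GiB test file"
fi
```

Remember to delete the test file afterwards, since files under /dev/shm occupy memory.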
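To summarize the speed from the wall-clock times above: taking md5sum’s speed as the baseline (1.00), sha1sum runs at about 1.15 times that speed and crc32 at about 1.5 times.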
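The relative speeds can be worked out from the `real` times of the in-memory runs. A small sketch of that arithmetic follows; the helper name `relative_speed` is my own.

```shell
# relative_speed BASELINE_SECONDS SECONDS
# Prints how many times faster a run is than the baseline on the same input
# (values above 1 mean faster than the baseline).
relative_speed() {
    awk -v base="$1" -v t="$2" 'BEGIN { printf "%.2f\n", base / t }'
}

md5_real=5.103   # md5sum wall-clock seconds from the /dev/shm run (baseline)
echo "md5sum:  $(relative_speed "$md5_real" 5.103)"
echo "sha1sum: $(relative_speed "$md5_real" 4.451)"
echo "crc32:   $(relative_speed "$md5_real" 3.411)"
```

These ratios come from a single run per tool, so they are only a rough indication.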
crc32 is the fastest here. It is a Perl 5 program using Archive::Zip::computeCRC32() to compute the crc32.
The throughput here for md5sum is above 600MB/s. This is a rate that an SSD or a RAID of SSDs can achieve. On the system I tested, if the I/O were much faster, the computation would likely account for much of the time spent.
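The throughput figures follow from dividing the file size by the `real` time of each in-memory run. Here is a minimal sketch of that calculation; the helper name `throughput_mbs` is my own, and it uses 1 MB = 1,000,000 bytes, the same unit dd reports.

```shell
# throughput_mbs BYTES SECONDS
# Prints the processing rate in MB/s, with 1 MB = 1,000,000 bytes.
throughput_mbs() {
    awk -v bytes="$1" -v secs="$2" \
        'BEGIN { printf "%.1f\n", bytes / secs / 1e6 }'
}

size=$((3 * 1024 * 1024 * 1024))   # `head -c 3G` writes 3 GiB
echo "crc32:   $(throughput_mbs "$size" 3.411) MB/s"
echo "md5sum:  $(throughput_mbs "$size" 5.103) MB/s"
echo "sha1sum: $(throughput_mbs "$size" 4.451) MB/s"
```

For md5sum this gives roughly 630 MB/s, several times the 199 MB/s that dd measured for the magnetic disk.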
CPU model and versions of checksum tools used
Here are the CPU model and versions of the checksum tools used during the test.
$ lscpu | grep "Model name"
Model name: Intel(R) Core(TM) i5-4460 CPU @ 3.20GHz
$ md5sum --version
md5sum (GNU coreutils) 8.23
Copyright (C) 2014 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Ulrich Drepper, Scott Miller, and David Madore.
$ sha1sum --version
sha1sum (GNU coreutils) 8.23
Copyright (C) 2014 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Ulrich Drepper, Scott Miller, and David Madore.
$ rpm -qf `which crc32`
perl-Archive-Zip-1.46-1.fc22.noarch