How to efficiently archive a very large sparse file

How to efficiently archive a very large sparse file, say 1TB?

The sparse file may contain only a small amount of data, say 32MB.

asked Jan 19, 2013 by forum (2,150 points)
edited May 1, 2013 by SA

1 Answer

 
Best answer

The SEEK_HOLE functionality does the trick: it lets 'tar' and 'cp' handle a large sparse file very efficiently.

More on SEEK_HOLE:

https://blogs.oracle.com/bonwick/entry/seek_hole_and_seek_data

http://lwn.net/Articles/440778/

http://lwn.net/Articles/260699/

On Fedora 17 with a 3.6.5 kernel:

[zma@office t]$ uname -a
Linux office 3.6.5-1.fc17.x86_64 #1 SMP Wed Oct 31 19:37:18 UTC 201
[zma@office tmp]$ ls -lh pmem-1 

-rw-rw-r-- 1 zma zma 1.0T Nov  7 20:14 pmem-1

[zma@office tmp]$ time tar cSf pmem-1.tar pmem-1

real    0m0.003s
user    0m0.003s
sys     0m0.000s

[zma@office tmp]$ time cp pmem-1 pmem-1-copy

real    0m0.020s
user    0m0.000s
sys     0m0.003s

[zma@office tmp]$ ls -lh pmem*
-rw-rw-r-- 1 zma zma 1.0T Nov  7 20:14 pmem-1
-rw-rw-r-- 1 zma zma 1.0T Nov  7 20:15 pmem-1-copy
-rw-rw-r-- 1 zma zma  10K Nov  7 20:15 pmem-1.tar

[zma@office tmp]$ mkdir t
[zma@office tmp]$ cd t
[zma@office t]$ time tar xSf ../pmem-1.tar 

real    0m0.003s
user    0m0.000s
sys     0m0.002s

[zma@office t]$ ls -lha
total 8.0K
drwxrwxr-x   2 zma  zma  4.0K Nov  7 20:16 .
drwxrwxrwt. 35 root root 4.0K Nov  7 20:16 ..
-rw-rw-r--   1 zma  zma  1.0T Nov  7 20:14 pmem-1

For comparison, on Fedora 12 with a 2.6.32 kernel:

$ du -hs sparse-1
0   sparse-1

$ ls -lha sparse-1
-rw-rw-r-- 1 user1 user1 1.0T 2012-11-03 11:17 sparse-1

$ time tar cSf sparse-1.tar sparse-1

real    96m19.847s
user    22m3.314s
sys     52m32.272s

$ time gzip sparse-1

real    200m18.714s
user    164m33.835s
sys     10m39.971s

$ ls -lha sparse-1*
-rw-rw-r-- 1 user1 user1 1018M 2012-11-03 11:17 sparse-1.gz
-rw-rw-r-- 1 user1 user1   10K 2012-11-06 23:13 sparse-1.tar

$ time rsync --sparse sparse-1 sparse-1-copy

real    124m46.321s
user    107m15.084s
sys     83m8.323s

$ du -hs sparse-1-copy 
4.0K    sparse-1-copy
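
The gap between what `ls` and `du` report above (1.0T apparent size, 0 or 4.0K on disk) can also be read programmatically. Here is a minimal Python sketch, assuming Linux, where `st_blocks` counts 512-byte units. The helper name `apparent_and_allocated` and the 256 MiB demo size are only illustrative.

```python
import os
import tempfile


def apparent_and_allocated(path):
    """Return (apparent size, bytes allocated on disk) for path."""
    st = os.stat(path)
    # st_size is what `ls -l` shows; st_blocks * 512 is roughly what `du` shows.
    # Assumption: st_blocks is in 512-byte units, as on Linux.
    return st.st_size, st.st_blocks * 512


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sparse-demo")
        with open(path, "wb") as f:
            f.truncate(256 * 1024 * 1024)  # extend with a hole, write nothing
        apparent, allocated = apparent_and_allocated(path)
        print("apparent:", apparent, "allocated:", allocated)
```

On a filesystem that supports sparse files, the allocated figure should stay far below the apparent size.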

Related discussion: http://stackoverflow.com/questions/13252682/copying-a-1tb-sparse-file

answered Jan 20, 2013 by SA (14,760 points)
edited May 1, 2013 by SA

Well, I can run some virtual machines.

Another crazy thing that I have noticed.

dd if=/dev/zero of=xen-guest.img bs=1 count=0 seek=10G
dd if=/dev/zero of=xen-guest.img bs=1 count=1 seek=10G

Notice the difference in the "count" parameter. They will both produce an empty sparse file.

However, only the one with "count=1" will be fast with bsdtar. The other one will take ages.

commented May 2, 2013 by daniele (100 points)

Btw, I also noticed that bsdtar 3.0.3 does not support this at all. You have to use at least 3.0.4 for it to work.

I find it very strange that it's so hard to find information about this online.

With clouds popping up everywhere, you would think that people would care about copying 200GB sparse files in minutes instead of hours.

commented May 2, 2013 by daniele (100 points)

This observation is interesting. There are some differences:

$ dd if=/dev/zero of=xen-guest.img bs=1 count=0 seek=10G
0+0 records in
0+0 records out
0 bytes (0 B) copied, 1.2637e-05 s, 0.0 kB/s

$ dd if=/dev/zero of=xen-guest2.img bs=1 count=1 seek=10G
1+0 records in
1+0 records out
1 byte (1 B) copied, 4.2406e-05 s, 23.6 kB/s

With count=0, dd copies 0 bytes; with count=1, it copies 1 byte. As a result, xen-guest2.img is 1 byte larger than xen-guest.img, and xen-guest.img contains no data at all.

My suspicion is that bsdtar either is not designed to handle a file that contains no data at all, or has a bug in that case.
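
The two dd invocations can be mimicked without dd to check the "no data at all" case directly. This is only a sketch, assuming Linux 3.1+ and Python 3.3+ (for `os.SEEK_DATA`). The function names and the scaled-down 64 MiB offset are made up for illustration. On a filesystem that does not track holes, `has_data` may return True for both files.

```python
import errno
import os
import tempfile

OFFSET = 64 * 1024 * 1024  # scaled down from the 10G used with dd


def make_like_count0(path):
    """Like dd count=0 seek=N: the file is extended to N bytes, nothing is written."""
    with open(path, "wb") as f:
        f.truncate(OFFSET)


def make_like_count1(path):
    """Like dd count=1 seek=N: one zero byte is written at offset N."""
    with open(path, "wb") as f:
        f.seek(OFFSET)
        f.write(b"\0")


def has_data(path):
    """True if lseek(SEEK_DATA) finds any data region in the file."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.lseek(fd, 0, os.SEEK_DATA)
        return True
    except OSError as exc:
        if exc.errno == errno.ENXIO:  # no data at or after offset 0
            return False
        raise
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        a = os.path.join(tmp, "count0.img")
        b = os.path.join(tmp, "count1.img")
        make_like_count0(a)
        make_like_count1(b)
        for p in (a, b):
            print(os.path.basename(p), os.path.getsize(p), has_data(p))
```

Where holes are tracked, the first file should report no data (SEEK_DATA fails with ENXIO right at offset 0), which is the all-hole case that a reader of sparse files has to handle explicitly.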

commented May 2, 2013 by SA (14,760 points)

SEEK_HOLE has been supported since Linux 3.1. From the man page for lseek:

   Seeking file data and holes
       Since version 3.1, Linux supports the following additional values for whence:

       SEEK_DATA
              Adjust the file offset to the next location in the file greater than or equal to offset containing data.  If offset points to data, then the file offset is set to offset.

       SEEK_HOLE
              Adjust the file offset to the next hole in the file greater than or equal to offset.  If offset points into the middle of a hole, then the file offset is  set  to  offset.
              If there is no hole past offset, then the file offset is adjusted to the end of the file (i.e., there is an implicit hole at the end of any file).
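
To make the man page text concrete, here is a rough sketch of how a tool could walk the data regions of a file with these two whence values. It is written in Python (3.3+ exposes `os.SEEK_DATA` and `os.SEEK_HOLE`) and is not how tar or libarchive are actually implemented; `data_extents` is a hypothetical helper name.

```python
import errno
import os
import tempfile


def data_extents(path):
    """Return a list of (start, end) byte ranges the filesystem reports as data."""
    extents = []
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        offset = 0
        while offset < size:
            try:
                # Next offset >= `offset` that contains data.
                start = os.lseek(fd, offset, os.SEEK_DATA)
            except OSError as exc:
                if exc.errno == errno.ENXIO:  # only a hole remains until EOF
                    break
                raise
            # Next hole after that data; EOF counts as an implicit hole.
            end = os.lseek(fd, start, os.SEEK_HOLE)
            extents.append((start, end))
            offset = end
    finally:
        os.close(fd)
    return extents


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sparse-demo")
        with open(path, "wb") as f:
            f.seek(8 * 1024 * 1024)  # leave an 8 MiB hole at the front
            f.write(b"payload")
        print(data_extents(path))
```

An archiver only needs to read the returned ranges and can skip everything in between. On a filesystem without hole support, the kernel typically reports the whole file as a single data range, so the loop still works, just without the speedup.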
commented May 3, 2013 by SA (14,760 points)

That's true. There is little documentation about SEEK_HOLE and SEEK_DATA support.

I dug a little in the libarchive source tree and found that it was possibly added in this commit:

commit d216d028a78e56a37bab9e42a2f17f28714a6535
Author: Michihiro NAKAJIMA <ggcueroad@gmail.com>
Date:   Tue Feb 2 06:09:17 2010 -0500

    Determine sparse files through API such as lseek(HOLE).
    
    SVN-Revision: 1856

https://github.com/libarchive/libarchive/commit/d216d028a78e56a37bab9e42a2f17f28714a6535#libarchive/archive_read_disk_entry_from_file.c

After that, there are bug fixes. For example:

$ git show b76da87985101f7acdcc0d84490bb4f6a736d210
commit b76da87985101f7acdcc0d84490bb4f6a736d210
Author: Michihiro NAKAJIMA <ggcueroad@gmail.com>
Date:   Sat Feb 25 18:38:13 2012 +0900

    Fix a wrong check on a result of lseek.

diff --git a/libarchive/archive_read_disk_entry_from_file.c b/libarchive/archive_read_disk_entry_from_file.c
index 8fcd0ab..0fef3c7 100644
--- a/libarchive/archive_read_disk_entry_from_file.c
+++ b/libarchive/archive_read_disk_entry_from_file.c
@@ -1033,7 +1033,7 @@ setup_sparse(struct archive_read_disk *a,
                        goto exit_setup_sparse;
                }
                off_e = lseek(*fd, off_s, SEEK_HOLE);
-               if (off_s == (off_t)-1) {
+               if (off_e == (off_t)-1) {
                        if (errno == ENXIO) {
                                off_e = lseek(*fd, 0, SEEK_END);
                                if (off_e != (off_t)-1)

My guess is that it works starting from v3.0.4, which was tagged after that bug fix (v3.0.3 predates it):

$ git show v3.0.4 | head -n4
tag v3.0.4
Tagger: Andres Mejia <amejia004@gmail.com>
Date:   Wed Mar 28 09:53:16 2012 -0400

$ git show v3.0.3 | head -n4
commit e235511e964cf8b13bf49a1e343bfdc5c11014da
Author: Tim Kientzle <kientzle@gmail.com>
Date:   Fri Jan 13 00:32:07 2012 -0500
commented May 3, 2013 by SA (14,760 points)


Copyright © SysTutorials. User contributions licensed under cc-wiki with attribution required.

...