Save disk space by using bestcompress to select the best compression program

As part of my job, I often need to archive and retain copies of very large files for my clients. To save disk space, I compress these files before they are archived. I could just use the same compression program for every file, but since different compression algorithms excel at different kinds of input, I find it much better to use a handy shell script called bestcompress that I came across several years ago in Wicked Cool Shell Scripts by Dave Taylor. It compresses the file in question with three compression programs available on most Unix implementations (including Mac OS X) and compares the sizes of the results. When the script finishes, the user is left with the smallest compressed file produced by compress, gzip, or bzip2.

The script is available on the book’s website, along with a detailed explanation of exactly how it works and how to use it.
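The gist of the technique is easy to sketch. What follows is not the book's script, just a rough illustration of the idea; it assumes compress, gzip, and bzip2 are all installed and that the filename contains no spaces:

    #!/bin/sh
    # Illustrative sketch only, not the bestcompress script from the book.
    # Compress $1 three ways and keep only the smallest result; the original
    # file is left in place.
    f=$1
    compress -c "$f" > "$f.Z"
    gzip -c "$f" > "$f.gz"
    bzip2 -c "$f" > "$f.bz2"
    # wc -c prints one "size filename" line per file (plus a total line that
    # sorts last), so the first line after a numeric sort is the smallest.
    best=$(wc -c "$f.Z" "$f.gz" "$f.bz2" | sort -n | awk 'NR==1 {print $2}')
    for candidate in "$f.Z" "$f.gz" "$f.bz2"; do
        [ "$candidate" = "$best" ] || rm -f "$candidate"
    done
    echo "Kept $best"

The real script does considerably more than this, which is why it's worth grabbing the real thing from the book's website mentioned above.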

6 Comments for “Save disk space by using bestcompress to select the best compression program”

  1. posted by Olli on

    The script doesn’t work as intended when run on Ubuntu (and maybe other Linux distros).

    The problem lies in the call to
    ls -l "$name" $Zout $gzout $bzout

    which doesn’t print the files in the order they are given, but automatically sorts them alphanumerically.

    This will work:
    ls -f -l "$name" $Zout $gzout $bzout
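If the script’s size check depends on that ordering, another way around the problem is to read each compressed size on its own rather than parsing a single ls listing, so no ordering assumption is needed at all. A sketch, reusing the script’s $Zout, $gzout, and $bzout variables (the Zsize/gzsize/bzsize names are just illustrative):

    # Read each size individually; wc -c is POSIX, so it behaves the same
    # on Ubuntu and Mac OS X.
    Zsize=$(wc -c < "$Zout")
    gzsize=$(wc -c < "$gzout")
    bzsize=$(wc -c < "$bzout")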

  2. posted by alfora on

    Is such a feature really required in the 21st century, when you can buy hard disks in TB chunks?

    Is the amount of space saved really that significant after you’ve used the “optimal” compression? Besides, there is nothing special about compress, gzip, or bzip2. They are all general-purpose compression algorithms. We are not talking about special algorithms that excel at “text”, “image” or “sound”.

    Another point is that most of your everyday documents are already compressed:

    * MS Office 2007, 2010 documents are zip files
    * OpenOffice documents are zip files
    * Video is already compressed
    * Audio is already compressed
    * PDF files are already compressed
    * Pictures are already compressed

    So the amount of space you’ll gain by such an “optimisation” is in the order of a few percent of the original (compressed) size. Granted, this will add up if you are talking about 1 TB hard disks, but do you really fill up your HD to the last byte?

    What you’ll lose is this:
    * It takes 3 times as long to compress files
    * It might lead to more HD fragmentation

    I’d rather install a script that checks the access time of the files and automatically compresses any that have not been accessed within the last few days/weeks/months. (Some operating systems can do that automatically, but it would be better to restrict such a feature to a few directories and use file compression instead of block-wise compression.)

    This automatic script has the advantage that you don’t have to do anything at all. Just dump your files in your “archive” directory and let the computer do the rest.
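The automatic, access-time-based approach alfora describes can be sketched in a single find command, assuming GNU or BSD find, a hypothetical ~/archive directory, and a filesystem that actually records access times (many are mounted with noatime these days):

    # Sketch: gzip any regular file under ~/archive that has not been read
    # in the last 30 days; already-gzipped files are skipped.
    find "$HOME/archive" -type f ! -name '*.gz' -atime +30 -exec gzip {} \;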

  3. posted by PJ Doland on

    alfora-

    I use this script to compress hundreds of log files totaling roughly 40 GB (before compression) each month. For that kind of task, it’s actually quite useful.

    No argument that it’s unlikely to yield significant benefits for already compressed file types.

  4. posted by Kees Reuzelaar on

    This is actually quite useful, as I back up my data to an off-site storage provider which charges by the total disk space I use on their servers.

    Saving a couple of gigabytes will save me money as well.

    Not to mention shorter transfer times, which are quite long as it is.

    Thanks for the tip, I’m sure I can use it.

  5. posted by Pat on

    Seems like you’re skipping the higher-compression schemes currently available, and costing yourself a lot of CPU cycles to do it. I would think that LZMA/7zip would give better compression in many of these cases, and it isn’t platform-specific either.
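If xz (a widely available LZMA2 implementation) is installed, it slots into the same compare-and-keep-the-smallest pattern as just one more candidate; a sketch, where $f is the file being compressed as in the earlier example:

    # Sketch: add an LZMA2 candidate to the comparison (assumes xz is installed).
    # Higher presets such as -9 trade considerably more CPU time and memory
    # for the extra compression.
    xz -9 -c "$f" > "$f.xz"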

  6. posted by alfora on

    I have to take back my statement, “So the amount of space you’ll gain by such an “optimisation” is in the order of a few percent of the original (compressed) size”, after running some tests.

    First, I tested one of the largest log files on my system and compressed it with compress, gzip, and bzip2:

    -rw-r----- 1 alfora staff 2601873 13 Nov 10:34 system.log
    -rw-r----- 1 alfora staff 579263 13 Nov 10:35 system.log.Z
    -rw-r----- 1 alfora staff 147528 13 Nov 10:36 system.log.bz2
    -rw-r----- 1 alfora staff 227978 13 Nov 10:35 system.log.gz

    There is a huge difference between compress and the rest and a smaller but still large difference between gzip and bzip2. system.log is a typical log file and contains just plain text.

    I found an even larger log file on my Windows partition, namely MPLog-xxxx.log from the Microsoft Anti-Malware software:

    -rwxr-xr-x 1 alfora staff 12986186 13 Nov 10:55 MPLog.log
    -rwxr-xr-x 1 alfora staff 1361229 13 Nov 10:57 MPLog.log.Z
    -rwxr-xr-x 1 alfora staff 434348 13 Nov 10:56 MPLog.log.bz2
    -rwxr-xr-x 1 alfora staff 696208 13 Nov 10:56 MPLog.log.gz

    Again, huge difference between compress and the rest, smaller difference between gzip and bzip2.

    My third test uses pagefile.sys from the Windows partition, which contains 6 GB of binary data, but in uncompressed form (unlike already-compressed pictures or movies):

    -rwxr-xr-x 1 alfora staff 6430035968 13 Nov 10:47 pagefile.sys
    -rwxr-xr-x 1 alfora staff 517531039 13 Nov 11:17 pagefile.sys.Z
    -rwxr-xr-x 1 alfora staff 329262782 13 Nov 10:58 pagefile.sys.bz2
    -rwxr-xr-x 1 alfora staff 351123320 13 Nov 10:54 pagefile.sys.gz

    You can still see a difference, but it is not so large anymore. bzip2 still “wins” this contest, but gzip is only about 7% worse (351,123,320 bytes vs. 329,262,782 bytes). There is no question that compress is old and outdated. On the other hand, bzip2 might be better than gzip most of the time.

    The question remains: are you willing to wait approximately three times as long for your compressed data, and to use three times as much HD space during compression, just to throw away the results of compress and gzip most of the time?

Comments are closed.