DATA COMPRESSION



By Lance Jensen

Executive Software Technical Support Director



Compression is a very valuable tool. It can dramatically increase the

amount of data you can store on a disk, which saves thousands of dollars for

many sites. But it is not intended to be used indiscriminately. The

information in this article will help you decide when and when not to

compress files.


Benefits and Costs of Compression


Compression has one invariable benefit: You can fit up to twice the data on

a disk. In addition, on small, simple partitions (under 2GB, and not part of

a RAID array, stripe set, or mirror set), you may even get faster I/O.

But set against these benefits are several costs:


- CPU Utilization. Accessing a compressed file is a CPU-intensive action.

The mere act of reading such files can use over 60% of your CPU. This is no

problem if reading the file is all the system is doing, but it can slow down

other operations that are running at the same time.


- Access Time. On volume sets and on partitions over 2GB, reading and

writing generally take longer if compression is used. In my experience,

reading and writing on 4.3GB or larger partitions always take longer if the

file is compressed; on partitions over 8GB, they take at least twice as long

because of fragmentation inherent in compressed files. This is explained in

detail later in this article.


- Fragmentation. When you compress a file, it is written to a different

part of the disk; it may be written contiguously, or it may not. When you

compress an entire partition, it always fragments badly. In fact, you can

run the analysis tool in our Diskeeper defragmenter before and after running

compression and see for yourself the results on your system.


- MFT (Master File Table) Fragmentation. File compression is achieved by

taking the first 16 clusters of the file, packing the data into as small a

space as possible, and writing it to the disk. The Logical Cluster Number

(LCN) where it is written and the number of clusters the compressed data

uses are stored in the MFT. This is repeated for each remaining 16-cluster

increment of the file.

There is also a last entry which stores -1 instead of an LCN, along with how

many clusters are needed to decompress the last increment. (For more

information on the MFT, see "The Master File Table: What It Is and What

It's For", eLetter Volume 2, Issue 5.)


Now, the actual compressed file may or may not be fragmented; it doesn't

really matter, because NTFS must always access a compressed file as if it

were fragmented. You see, the MFT entry of an uncompressed file contains

the LCN and size in clusters of the first fragment. If the file is

contiguous, that is all the data needed. But if the file is fragmented,

there is another set of LCN and cluster count data required for each

fragment. When the file is accessed, the system has to do an I/O for each

LCN and cluster count. The MFT entry of a compressed file looks just like

that of a fragmented file, so the system must do an I/O for each 16-cluster

increment.
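
The entry counts described above can be sketched in a few lines of Python.
This is a simplified model of the article's description, not NTFS code: the
function names are hypothetical, the 4KB cluster size in the example is an
assumption, and the final -1 entry is left out.

```python
CLUSTERS_PER_INCREMENT = 16  # compression increment size given in the article


def clusters_in_file(file_bytes: int, cluster_bytes: int) -> int:
    """Number of clusters needed to hold file_bytes, rounding up."""
    return -(-file_bytes // cluster_bytes)


def entries_uncompressed(fragment_count: int) -> int:
    """One (LCN, cluster count) pair per fragment; a contiguous file has one."""
    if fragment_count < 1:
        raise ValueError("a file has at least one fragment")
    return fragment_count


def entries_compressed(file_clusters: int) -> int:
    """One (LCN, cluster count) pair per 16-cluster increment, rounding up."""
    if file_clusters < 0:
        raise ValueError("cluster count cannot be negative")
    return -(-file_clusters // CLUSTERS_PER_INCREMENT)


# Example: a 1MB file on a partition with (assumed) 4KB clusters.
clusters = clusters_in_file(1024 * 1024, 4096)
print(entries_uncompressed(1))       # contiguous, uncompressed
print(entries_compressed(clusters))  # same file, compressed
```

Under these assumptions, the contiguous file needs one entry when it is
uncompressed and 16 entries when it is compressed, and by the reasoning
above each entry costs an I/O.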


But the real drawback of compression as regards the MFT is the size of the

MFT entries. If the file is large enough, there will be so many LCN and

cluster count records that the MFT entry will overflow, requiring at least

one additional MFT entry. If you compress enough files, and almost

certainly if you compress the entire partition, the MFT will fill the MFT

zone (the pre-allocated space, usually at the beginning of the disk). Any

new MFT entries will be written wherever the Next Free Space pointer happens

to point. These tiny fragments of MFT take longer to read because the

read/write head has to move to access them, and the fragments break up the

free space permanently. At this time there is no way to get rid of them or

to defragment the MFT short of reformatting the partition.


This may not seem serious, but it can get out of hand very quickly. In one

of our tests, we compressed a 271MB file; it resulted in 467 extra MFT

entries!


All of these costs add up to slower performance, and therefore lower

productivity. Adding more hard disk space lets you avoid compressing files,

and makes your system run faster. If spending $1,000 for new disks allows

you to bring in another $100 per week, then the disks will pay for

themselves in ten weeks, and the rest is pure profit.
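
The payback arithmetic above can be written as a small Python helper for
trying your own figures. The function name is hypothetical, and the sketch
assumes the weekly gain stays constant.

```python
def payback_weeks(disk_cost: float, weekly_gain: float) -> float:
    """Weeks until the extra weekly income covers the cost of the new disks."""
    if weekly_gain <= 0:
        raise ValueError("weekly_gain must be positive for the disks to pay off")
    return disk_cost / weekly_gain


# The article's figures: $1,000 of disks, $100 more income per week.
print(payback_weeks(1000, 100))  # 10.0
```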


Stated simply, you can weigh the value of compression against the

performance hit you will take in using it. If you lose little to no

performance, fine. But if compressing to save disk space is going to

seriously hurt performance, you won't save money on disk space; you will

lose it in performance and man-hours.


When Is Compression Worth Doing?


The basic underlying reason for using compression has always been to reduce

the cost of data storage in those cases where there would be little or no

price to pay in terms of system performance. These are the most common

cases where compression is worthwhile:


- If a partition is used for archiving and you don't access it frequently,

compression may be worthwhile.


- If a partition is under 2GB and is not a volume set of any kind, and if

you never exceed 40% CPU utilization, compression should be worthwhile.


- If the value of the disk space saved outweighs the cost of the performance

hit, then compression is worthwhile.
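
The rules of thumb above can be collected into a rough Python checklist.
This is only a sketch: the Partition fields and the function name are
hypothetical, and the thresholds are simply the ones quoted in this article.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    size_gb: float
    is_volume_set: bool      # RAID, stripe set, mirror set, or other volume set
    is_archive: bool         # used for archiving
    rarely_accessed: bool    # not accessed frequently
    peak_cpu_percent: float  # highest CPU utilization you ever reach


def compression_advice(p: Partition) -> str:
    """Apply the article's rules of thumb in the order they are listed."""
    if p.is_archive and p.rarely_accessed:
        return "may be worthwhile"
    if p.size_gb < 2 and not p.is_volume_set and p.peak_cpu_percent <= 40:
        return "should be worthwhile"
    # No rule applies outright: fall back to the cost/benefit comparison.
    return "weigh the performance hit against the disk space saved"


small = Partition(size_gb=1.5, is_volume_set=False, is_archive=False,
                  rarely_accessed=False, peak_cpu_percent=35)
print(compression_advice(small))  # should be worthwhile
```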


-------------------------------------------------------------


Lance Jensen is Executive Software's ace Tech Support Director and has extensive experience with both Windows NT and Digital's OpenVMS operating systems. He can be reached at dknt_support@executive.com. Please feel free to write to him with questions or comments about this article.


CONTACT EXECUTIVE SOFTWARE

http://www.execsoft.com

