Click to See Complete Forum and Search --> : Huffman compression in C++ (Summer project)
slavik
07-07-2005, 01:09 AM
Well, I decided to post progress that I make on my project ...
One or a few things to note is that all code is in C++ and that it is on paper, so I have no clue whether it will work at all, although I think it should.
In this thread, I will not post any code at all. ("It's MINE!!! ALL MINE!!!" - Daffy Duck :p) I will however, post which parts of the projects I have completed and what they do ^^.
Ok, 1st off:
Wrote the "BitArray" to hold an array of bits (uses an array of chars, and stores 8 bits in each char). There will be a large array in the end holding the entire coded string which will be very easy to print to file and then read if needed (for decompression).
There will be an array of smaller ones for each char, so that when the tree is created, you only have to find a char to encode once in the tree. There will be an array of the smaller bit arrays.
2nd:
Wrote the node class for the huffman tree, hold pointers to left and right, current weight, byte value (character).
3rd:
outlined process for compression and decompression via flow charts.
4th:
wrote a sorting function utilizing linear sort, I know it is very inefficient. I might change it later to tree or heap sort. Most likely heap, since it would be faster.
To DO:
- remove_dead function to remove all 0 weights from the array of nodes.
- combine function to combine the first 2 elements in the array of nodes under a new parent node.
- Use a better sorting algorithm*
- give BitArray class a new method which accepts another BitArray and adds the value of the argument (the bit string) to itself.
- give BitArray stack/queue behavior (push, pop, remove_front) to add in the previous and for searching (since the tree traversal will be recursive, push and pop is needed).
- IMPORTANT!!! Re-read the K&R book on how to add CLI options/arguments :)
That's it for now :).
When I finish this project, if it doesn't give good compression (comparable to ZIP/RAR/ACE), I might release the source code, if it does good, then I won't for obvious reasons :). (I dunno what they are :p)
EDIT: questions/comments/complaints/suggestions welcome :).
EDIT2: the way I am going to have my util is it will only be able to handle a single file ... which is fine ... but for folders and stuff, I will write "wrapper" scripts (probably in perl, maybe another C/C++ program) ... the idea is that the compression program does just that and nothing else.
slavik
10-31-2005, 09:02 PM
well, didn't do it over summer, but over the past weekend I have advanced in leaps and bounds :)
I got it to the point where for encidng, it just has to output the encoded file in a smart way ... so far it can analyze the file it is given and tell you the file size, encoded file size, compression ratio in % (encoded size / original file size)*100, and how much space you can save ...
for file I/O I (will) use C stuff and for other stuff I will use C++ (since it has vector which is nice to work with and the sort algorithm which I don't have to code either).
I4one
11-01-2005, 12:25 AM
hi;
forgive my ignorance in all programming languages - and i'm interested in trying to follow your progress, but for those of us less enlightened, could you plz provide a brief synopsis of what/why you are doing this....and the benefits ?
Thanks - and hope it works out for ya (chi-ching ;))
Candyman
11-01-2005, 05:18 AM
When I finish this project, if it doesn't give good compression (comparable to ZIP/RAR/ACE), I might release the source code, if it does good, then I won't for obvious reasons :). (I dunno what they are :p)
No offense, but I see that as kind of a backward attitude toward the issue. I mean, if it does work well, won't it be more useful to publish it so other people can improve on it? Do people really pay for general compression software anymore? There are already algorithms out there that are very widespread and very open that work very, very well. See 7zip, which is one of the most popular projects on SourceForge. Not to mention all the GNU utilities...I'm not trying to get down on you, I'm just trying to figure out the motivation behind that particular part.
Many existing compression schemes are built with Huffman encoding as part of them. What kind of improvements does your implementation offer (if you don't mind the question)? How are the results for your compression testing so far?
As far as why he would do it, I'd take the liberty of assuming it has something to do with the phrase "because he can". :)
For references on Huffman, see here (http://en.wikipedia.org/wiki/Huffman_coding) and here (http://www.prepressure.com/techno/compressionhuffman.htm).
slavik
11-01-2005, 07:59 AM
actually, I changed that stance and will open the source after I am done with it ...
as for compression testing my program thinks it can compress a 3.3GB ISO DVD image to 123MB in 10 minutes ...
this week(end) I hope to write 2 more classes that will buffer the input and output.
the reason "why" I am doing this is to learn how to implement the huffman coding algorithm so then I have something "real" to show for my code ...
the bzip2 project has only 1 main author, too. and bzip2 is considered to be the best available ...
btw, winrar would take over 1 hour to compress the same DVD ISO image ... I wasn't willing to wait.
And my implementation so far has a limit of length being 2^32 bytes long, which is 4GB.
edit: after I am done with this, I plan on working to implement arithmetic coding (claimed to be slower but better compression than huffman).
edit2: here's a "log" of a run on a 3.3GB DVD ISO image
D:\Dev-Cpp\compression>huffman.exe F:\KNOPPIX_V4.0.2DVD-2005-09-23-EN.iso
Reading F:\KNOPPIX_V4.0.2DVD-2005-09-23-EN.iso
File size = 3344640000 bytes
Encoded size = 123414528 bytes
Compression ratio = 3.68992% ((encoded file size)/(original file size))*100%
Space Saved = 3221225472 bytes
the program is available at http://bcacm.org/~slavik/huffman.exe
it doesn't encode yet, since it lacks the file output functionality, but otherwise it's pretty much done for encoding ...
slavik
11-01-2005, 08:08 PM
today, I hit a milestone :)
I wrote the output part and the decode part ... although it decodes a file right after encoding, it was done this way (very dirty right now) to make sure that it can decode the files it encodes ...
and it decodes them properly ... :)
to do:
-clean up the code
-make decode and encode separate functions
-allow encode/decode command from command line.
I4one
11-01-2005, 08:55 PM
bravo! bravo!....
what i find really interesting is the knoppix dvd as a test. The 3.3GB on that DVD usually has about 8GB + of data /apps since it's already compressed, it easily fits on a DVD-5 (4.7GB, actual 4.3). What am i misunderstanding ? -- and is this going to be for x86 ?
excellent work - keep on keepin' on ;) :D
slavik
11-01-2005, 11:20 PM
my program will be for anything that has a C++ compiler for it ...
since I compile with mingw (g++ port for windows) ... anything that g++ runs on will work :)
this is an earlier version that can analyse the file and tell you how many bytes it would need for the compressed version ... of course, the header is not being included, which would be upto 1.5KB
http://bcacm.org/~slavik/huffman.exe
usage ...
the executable is for win32 only ...
to use, open up command promt and go to the directory where you have it, then enter the name "huffman.exe" without quotes and then a space and the file name or path to a file you want to test it on. if the filepath contains spaces put it in quotes ...
you can also specify the full path to the program ...
examples:
huffman.exe C:\foo\foo.bar
huffman.exe "C:\blah glah\someother.dat"
C:\huffman.exe "D:\My Documents\docfile.odt"
"C:\foo blah\huffman.exe" "D:\Your Documents\someother.doc"
Very cool. Any chance you'll increase the maximum filesize, considering that most DVDs are just a bit over 2^32 bytes.
I4one
11-02-2005, 07:02 AM
some results, when testing on a typical DVD on E: drive;
the 1st instance is on a DIR itself - 2nd the Drive itself - 3rd is a file within a folder
C:\WINDOWS\Desktop>huffman.exe E:\video_ts
Problem with the file
You either didn't specify the file pathname properly or
I do not have read permissions to it
C:\WINDOWS\Desktop>huffman.exe E:\
Problem with the file
You either didn't specify the file pathname properly or
I do not have read permissions to it
C:\WINDOWS\Desktop>huffman.exe E:\Video_TS\vts_01_1.vob
Reading E:\Video_TS\vts_01_1.vob
File size = 0 bytes
Encoded size = 0 bytes
Compression ratio = -1.#IND% ((encoded file size)/(original file size))*100%
Space Saved = 0 bytes
slavik
11-02-2005, 09:02 AM
well, the program accepts only files (no directories at this point in time) ... and how big is the VOB?
that is an interesting problem ... I4one
As for increasing the size, it WILL be done later and backward compatability won't be broken ...
when the larger file size is supported though, the size of the header will increase ... in theory, I could make my own 5byte data type.
EDIT: If you want, go to http://bcacm.org/forums/index.php and register there ... and keep an eye on the "Huffman Project" ... that is where most of the updates will be.
I tested this on a dumped GBA rom-image.
2199 - Gunstar Super Heroes (u)(Trashman).gba
8192kb
I got 84.97. I guess the compression gets better at higher file sizes.
I4one
11-02-2005, 04:56 PM
well, the program accepts only files (no directories at this point in time) ... and how big is the VOB?
that is an interesting problem ... I4one
.vob tested = 1,048,574 KB - ~1 GB
but obviously it's a DVD, and Read Only (Archive attribute is set also - hmmm)...yet you're test of knoppix was very similar -- i will try some other files, and post results when i'm not drinking ;)
btw - this is all Fat32, and (CDFS) and UDF actually for the DVD;
http://i12.photobucket.com/albums/a217/iforone/Casablanca2.png
slavik
11-02-2005, 05:26 PM
well, if you give it a file over 4GB, it won't complain, but the size display for the file will be wrong ...
Also, my tests had the file on the hard drive ...
btw, it is possible to give an input to my program that will not be compressable, only thing is that the usefulness of such a file to anyone else is rather low ... there is also a way to give my program an input of 2KB that it outputs a file of 8KB ... I didn't actually test it but calculated it on paper ... so far, it seems that there is a problem with my buffered output ...
edit: fixed ... forgot to tell it to open the file for binary output :P (bytes are interpreted as bytes, not characters)
edit2: this weekend, I hope to get something working that will encode a file with the proper header and decode it with a properly ...
edit3: still not fixed, I think ...
I4one
11-03-2005, 03:02 AM
here's some more results;
* 1st using a very small CDA file on a burnt CD
* 2nd using a d/l WinXP SP2 package (on local HDD)
C:\WINDOWS\Desktop>huffman.exe "E:\track01.cda
Reading E:\track01.cda
File size = 44 bytes
Encoded size = 21 bytes
Compression ratio = 47.7273% ((encoded file size)/(original file size))*100%
Space Saved = 23 bytes
C:\WINDOWS\Desktop>huffman.exe "D:\XP Install\WindowsXP-KB835935-SP2-ENU.exe"
Reading D:\XP Install\WindowsXP-KB835935-SP2-ENU.exe
File size = 278927592 bytes
Encoded size = 278927592 bytes
Compression ratio = 100% ((encoded file size)/(original file size))*100%
Space Saved = 0 bytes
slavik
11-03-2005, 07:45 AM
seems like the second file is already compressed ...
I also figured out why I was having problems ... when encoding the 3.3gb file, the variable would overflow and start from beginning ...
seems like the second file is already compressed ...
I also figured out why I was having problems ... when encoding the 3.3gb file, the variable would overflow and start from beginning ...
In that test, the 2nd file is indeed already compressed, essentially an executable zip file.
And we've all done the int instead of long in a loop thing at some point ;)
I4one
11-03-2005, 09:05 AM
..sooo
i need to run it on a non-compressed file ?
i eluded to the knoppix 4.0.2 DVD (ISO) as already being compressed as well - i got no response about it, so i presumed it doesn't matter....i mean if klaus knoppix could release a 128MB CD. instead of a 3.3GB DVD, don't you think he would've already ?
i also tried it on a 950MB+ Nero .NRG file (Image) - and it just hung there "reading contents" - though admittedly, i didn't wait very long.
what ? for ISO files only ?
and why don't i see any other results from fellow sharks ?
and why don't i see any other results from fellow sharks ?
What? And take time away from gaming?
Note that ISO files (and NRG) are more of a format spec than having anything to do with compression.
Some Linux distros will release CDs full of .tar (no compression) or .rpm (modest compression sometimes). Those that include lots of gzipped files will compress much less. I'm not familiar with Knoppix, so can't say where they fall in the range.
slavik
11-03-2005, 04:45 PM
well, the thing is that long int are same size as int ... at least on my system with mingw ...
I am working on code to calculate everything properly ... also, for a 3.3GB ISO it took my system about 10 minutes ... and 1 min for a 614MB ISO image ...
the reason why it's a bad idea to run it on a compressed file is that usually, you cannot compress the file any better anyway ... so find some large install file or something ...
Candyman
11-03-2005, 05:14 PM
and why don't i see any other results from fellow sharks ?
'Cuz it's an exe, and therefore I'd have to reboot into Windows to make it work. ;)
Slavik, http://developer.apple.com/documentation/DeveloperTools/gcc-4.0.0/gcc/Long-Long.html
Think that'll be big enough?
slavik
11-03-2005, 05:44 PM
I'll give it a try after my current run goes through ...
I4one
11-03-2005, 06:10 PM
alright - the last guinea pig entry for awhile
C:\WINDOWS\Desktop>huffman.exe C:\windows\desktop\knoppix\knoppix
Reading C:\windows\desktop\knoppix\knoppix
File size = 725253780 bytes
Encoded size = 188382868 bytes
Compression ratio = 25.9748% ((encoded file size)/(original file size))*100%
Space Saved = 536870912 bytes
C:\WINDOWS\Desktop>
like i said - this file was already compressed on the Knoppix 3.6 CD-R...as any knoppix ISO 4.0.2 DVD will surely contain many "compressed" files...perhaps the jump from CD (700+MB) to DVD (3300MB) allowed him to keep the least used apps (or extra stuff) only to remain highly compressed - (up to a total of over 8.0GB of data on the DVD). I copied this 725MB file back over to desktop from CD to run this test
slavik
11-03-2005, 06:36 PM
nice ... once I can get it to compute the encoded size properly, I will start working on releasing the program in 3 versions ...
1 to compress, 1 to decompress and 1 to analyze ... eventually I will merge them into 1 and then allow command line options (-c, -d, -a)
EDIT: Good news ... thanks to Candyman, the calculation works properly :) you can go to the site and redownload the program ... now it should read everything properly ...
EDIT2: I was thinking ... to avoid the problems of int ... to implement my own "wrapper" classes so that I do not have to rely on the basic data type, exept for char and pointers to char and my "custom" classes.