Huffman compression in C++ (Summer project)

Sharky Forums


Page 1 of 2 12 LastLast
Results 1 to 15 of 24

Thread: Huffman compression in C++ (Summer project)

  1. #1
    Mako Shark slavik's Avatar
    Join Date
    May 2001
    Location
    Brooklyn
    Posts
    3,308

    Huffman compression in C++ (Summer project)

    Well, I decided to post progress that I make on my project ...

    One or a few things to note is that all code is in C++ and that it is on paper, so I have no clue whether it will work at all, although I think it should.

    In this thread, I will not post any code at all. ("It's MINE!!! ALL MINE!!!" - Daffy Duck ) I will however, post which parts of the projects I have completed and what they do ^^.

    Ok, 1st off:
    Wrote the "BitArray" to hold an array of bits (uses an array of chars, and stores 8 bits in each char). There will be a large array in the end holding the entire coded string which will be very easy to print to file and then read if needed (for decompression).
    There will be an array of smaller ones for each char, so that when the tree is created, you only have to find a char to encode once in the tree. There will be an array of the smaller bit arrays.

    2nd:
    Wrote the node class for the huffman tree, hold pointers to left and right, current weight, byte value (character).

    3rd:
    outlined process for compression and decompression via flow charts.

    4th:
    wrote a sorting function utilizing linear sort, I know it is very inefficient. I might change it later to tree or heap sort. Most likely heap, since it would be faster.

    To DO:
    - remove_dead function to remove all 0 weights from the array of nodes.
    - combine function to combine the first 2 elements in the array of nodes under a new parent node.
    - Use a better sorting algorithm*
    - give BitArray class a new method which accepts another BitArray and adds the value of the argument (the bit string) to itself.
    - give BitArray stack/queue behavior (push, pop, remove_front) to add in the previous and for searching (since the tree traversal will be recursive, push and pop is needed).
    - IMPORTANT!!! Re-read the K&R book on how to add CLI options/arguments

    That's it for now .

    When I finish this project, if it doesn't give good compression (comparable to ZIP/RAR/ACE), I might release the source code, if it does good, then I won't for obvious reasons . (I dunno what they are )

    EDIT: questions/comments/complaints/suggestions welcome .

    EDIT2: the way I am going to have my util is it will only be able to handle a single file ... which is fine ... but for folders and stuff, I will write "wrapper" scripts (probably in perl, maybe another C/C++ program) ... the idea is that the compression program does just that and nothing else.
    Last edited by slavik; 08-12-2005 at 11:01 PM.
    Activation? What activation?
    Quote Originally Posted by Geekkit (from ubuntu forums regarding 'goto' statement)
    Yep it sure does. So does crack cocaine. Existence is not a valid endorsement for being acceptable.
    Quote Originally Posted by Linus Torvalds
    Only wimps use tape backup: _real_ men just upload their important stuff on ftp, and let the rest of the world mirror it

  2. #2
    Mako Shark slavik's Avatar
    Join Date
    May 2001
    Location
    Brooklyn
    Posts
    3,308
    well, didn't do it over summer, but over the past weekend I have advanced in leaps and bounds

    I got it to the point where for encidng, it just has to output the encoded file in a smart way ... so far it can analyze the file it is given and tell you the file size, encoded file size, compression ratio in % (encoded size / original file size)*100, and how much space you can save ...

    for file I/O I (will) use C stuff and for other stuff I will use C++ (since it has vector which is nice to work with and the sort algorithm which I don't have to code either).
    Last edited by slavik; 10-31-2005 at 10:03 PM.
    Activation? What activation?
    Quote Originally Posted by Geekkit (from ubuntu forums regarding 'goto' statement)
    Yep it sure does. So does crack cocaine. Existence is not a valid endorsement for being acceptable.
    Quote Originally Posted by Linus Torvalds
    Only wimps use tape backup: _real_ men just upload their important stuff on ftp, and let the rest of the world mirror it

  3. #3
    BozoKiller
    Join Date
    Oct 2003
    Location
    Zoso
    Posts
    7,636
    hi;
    forgive my ignorance in all programming languages - and i'm interested in trying to follow your progress, but for those of us less enlightened, could you plz provide a brief synopsis of what/why you are doing this....and the benefits ?
    Thanks - and hope it works out for ya (chi-ching )
    Delete the Electoral College - Support
    www.NationalPopularVote.com

    "The world according to DRM Bozos"

    I am a consumer, I'll buy anything
    I am a sheep, I am cattle, I follow the herd
    I am ignorant, a dumbass, and I am a bozo...
    I am the epitome of the 'rank and file'
    I am your next door neighbor
    I am 95% of American Consumers
    I will consume you

    • If the light in your head hasn't come on yet,
      I suggest you go get a new bulb!

  4. #4
    Hammerhead Shark Candyman's Avatar
    Join Date
    Aug 2002
    Location
    Redwood City, CA
    Posts
    1,386
    Quote Originally Posted by slavik
    When I finish this project, if it doesn't give good compression (comparable to ZIP/RAR/ACE), I might release the source code, if it does good, then I won't for obvious reasons . (I dunno what they are )
    No offense, but I see that as kind of a backward attitude toward the issue. I mean, if it does work well, won't it be more useful to publish it so other people can improve on it? Do people really pay for general compression software anymore? There are already algorithms out there that are very widespread and very open that work very, very well. See 7zip, which is one of the most popular projects on SourceForge. Not to mention all the GNU utilities...I'm not trying to get down on you, I'm just trying to figure out the motivation behind that particular part.

    Many existing compression schemes are built with Huffman encoding as part of them. What kind of improvements does your implementation offer (if you don't mind the question)? How are the results for your compression testing so far?

    As far as why he would do it, I'd take the liberty of assuming it has something to do with the phrase "because he can".

    For references on Huffman, see here and here.
    Core 2 Duo 6750
    Antec P182
    Abit IP35 Pro
    4 GB DDR2 800 RAM
    Asus Xonar D2
    Panasonic SA-XR55 / Audio Technica ATH-A900

  5. #5
    Mako Shark slavik's Avatar
    Join Date
    May 2001
    Location
    Brooklyn
    Posts
    3,308
    actually, I changed that stance and will open the source after I am done with it ...

    as for compression testing my program thinks it can compress a 3.3GB ISO DVD image to 123MB in 10 minutes ...

    this week(end) I hope to write 2 more classes that will buffer the input and output.

    the reason "why" I am doing this is to learn how to implement the huffman coding algorithm so then I have something "real" to show for my code ...

    the bzip2 project has only 1 main author, too. and bzip2 is considered to be the best available ...

    btw, winrar would take over 1 hour to compress the same DVD ISO image ... I wasn't willing to wait.

    And my implementation so far has a limit of length being 2^32 bytes long, which is 4GB.

    edit: after I am done with this, I plan on working to implement arithmetic coding (claimed to be slower but better compression than huffman).

    edit2: here's a "log" of a run on a 3.3GB DVD ISO image

    D:\Dev-Cpp\compression>huffman.exe F:\KNOPPIX_V4.0.2DVD-2005-09-23-EN.iso
    Reading F:\KNOPPIX_V4.0.2DVD-2005-09-23-EN.iso
    File size = 3344640000 bytes
    Encoded size = 123414528 bytes
    Compression ratio = 3.68992% ((encoded file size)/(original file size))*100%
    Space Saved = 3221225472 bytes

    the program is available at http://bcacm.org/~slavik/huffman.exe

    it doesn't encode yet, since it lacks the file output functionality, but otherwise it's pretty much done for encoding ...
    Last edited by slavik; 11-01-2005 at 10:04 AM.
    Activation? What activation?
    Quote Originally Posted by Geekkit (from ubuntu forums regarding 'goto' statement)
    Yep it sure does. So does crack cocaine. Existence is not a valid endorsement for being acceptable.
    Quote Originally Posted by Linus Torvalds
    Only wimps use tape backup: _real_ men just upload their important stuff on ftp, and let the rest of the world mirror it

  6. #6
    Mako Shark slavik's Avatar
    Join Date
    May 2001
    Location
    Brooklyn
    Posts
    3,308
    today, I hit a milestone

    I wrote the output part and the decode part ... although it decodes a file right after encoding, it was done this way (very dirty right now) to make sure that it can decode the files it encodes ...

    and it decodes them properly ...

    to do:
    -clean up the code
    -make decode and encode separate functions
    -allow encode/decode command from command line.
    Activation? What activation?
    Quote Originally Posted by Geekkit (from ubuntu forums regarding 'goto' statement)
    Yep it sure does. So does crack cocaine. Existence is not a valid endorsement for being acceptable.
    Quote Originally Posted by Linus Torvalds
    Only wimps use tape backup: _real_ men just upload their important stuff on ftp, and let the rest of the world mirror it

  7. #7
    BozoKiller
    Join Date
    Oct 2003
    Location
    Zoso
    Posts
    7,636
    bravo! bravo!....

    what i find really interesting is the knoppix dvd as a test. The 3.3GB on that DVD usually has about 8GB + of data /apps since it's already compressed, it easily fits on a DVD-5 (4.7GB, actual 4.3). What am i misunderstanding ? -- and is this going to be for x86 ?

    excellent work - keep on keepin' on
    Last edited by I4one; 11-01-2005 at 10:02 PM.
    Delete the Electoral College - Support
    www.NationalPopularVote.com

    "The world according to DRM Bozos"

    I am a consumer, I'll buy anything
    I am a sheep, I am cattle, I follow the herd
    I am ignorant, a dumbass, and I am a bozo...
    I am the epitome of the 'rank and file'
    I am your next door neighbor
    I am 95% of American Consumers
    I will consume you

    • If the light in your head hasn't come on yet,
      I suggest you go get a new bulb!

  8. #8
    Mako Shark slavik's Avatar
    Join Date
    May 2001
    Location
    Brooklyn
    Posts
    3,308
    my program will be for anything that has a C++ compiler for it ...

    since I compile with mingw (g++ port for windows) ... anything that g++ runs on will work

    this is an earlier version that can analyse the file and tell you how many bytes it would need for the compressed version ... of course, the header is not being included, which would be upto 1.5KB
    http://bcacm.org/~slavik/huffman.exe

    usage ...
    the executable is for win32 only ...
    to use, open up command promt and go to the directory where you have it, then enter the name "huffman.exe" without quotes and then a space and the file name or path to a file you want to test it on. if the filepath contains spaces put it in quotes ...

    you can also specify the full path to the program ...

    examples:

    huffman.exe C:\foo\foo.bar
    huffman.exe "C:\blah glah\someother.dat"
    C:\huffman.exe "D:\My Documents\docfile.odt"
    "C:\foo blah\huffman.exe" "D:\Your Documents\someother.doc"
    Last edited by slavik; 11-02-2005 at 12:39 AM.
    Activation? What activation?
    Quote Originally Posted by Geekkit (from ubuntu forums regarding 'goto' statement)
    Yep it sure does. So does crack cocaine. Existence is not a valid endorsement for being acceptable.
    Quote Originally Posted by Linus Torvalds
    Only wimps use tape backup: _real_ men just upload their important stuff on ftp, and let the rest of the world mirror it

  9. #9
    Richard M. Nixon '08 PCJ's Avatar
    Join Date
    Jul 2003
    Posts
    8,187
    Very cool. Any chance you'll increase the maximum filesize, considering that most DVDs are just a bit over 2^32 bytes.
    +++ F.O.R.U.M. I.L.L.U.M.I.N.A.T.I. +++
    Quote Originally Posted by Zvi
    How come you always choose the wrong side, first HD DVD, now Russians.

  10. #10
    BozoKiller
    Join Date
    Oct 2003
    Location
    Zoso
    Posts
    7,636
    some results, when testing on a typical DVD on E: drive;

    the 1st instance is on a DIR itself - 2nd the Drive itself - 3rd is a file within a folder
    Code:
    C:\WINDOWS\Desktop>huffman.exe E:\video_ts
    Problem with the file
    You either didn't specify the file pathname properly or
    I do not have read permissions to it
    
    C:\WINDOWS\Desktop>huffman.exe E:\
    Problem with the file
    You either didn't specify the file pathname properly or
    I do not have read permissions to it
    
    C:\WINDOWS\Desktop>huffman.exe E:\Video_TS\vts_01_1.vob
    Reading E:\Video_TS\vts_01_1.vob
    File size = 0 bytes
    Encoded size = 0 bytes
    Compression ratio = -1.#IND% ((encoded file size)/(original file size))*100%
    Space Saved = 0 bytes
    Last edited by I4one; 11-02-2005 at 08:05 AM.
    Delete the Electoral College - Support
    www.NationalPopularVote.com

    "The world according to DRM Bozos"

    I am a consumer, I'll buy anything
    I am a sheep, I am cattle, I follow the herd
    I am ignorant, a dumbass, and I am a bozo...
    I am the epitome of the 'rank and file'
    I am your next door neighbor
    I am 95% of American Consumers
    I will consume you

    • If the light in your head hasn't come on yet,
      I suggest you go get a new bulb!

  11. #11
    Mako Shark slavik's Avatar
    Join Date
    May 2001
    Location
    Brooklyn
    Posts
    3,308
    well, the program accepts only files (no directories at this point in time) ... and how big is the VOB?

    that is an interesting problem ... I4one

    As for increasing the size, it WILL be done later and backward compatability won't be broken ...

    when the larger file size is supported though, the size of the header will increase ... in theory, I could make my own 5byte data type.

    EDIT: If you want, go to http://bcacm.org/forums/index.php and register there ... and keep an eye on the "Huffman Project" ... that is where most of the updates will be.
    Last edited by slavik; 11-02-2005 at 10:06 AM.
    Activation? What activation?
    Quote Originally Posted by Geekkit (from ubuntu forums regarding 'goto' statement)
    Yep it sure does. So does crack cocaine. Existence is not a valid endorsement for being acceptable.
    Quote Originally Posted by Linus Torvalds
    Only wimps use tape backup: _real_ men just upload their important stuff on ftp, and let the rest of the world mirror it

  12. #12
    Richard M. Nixon '08 PCJ's Avatar
    Join Date
    Jul 2003
    Posts
    8,187
    I tested this on a dumped GBA rom-image.

    2199 - Gunstar Super Heroes (u)(Trashman).gba
    8192kb

    I got 84.97. I guess the compression gets better at higher file sizes.
    +++ F.O.R.U.M. I.L.L.U.M.I.N.A.T.I. +++
    Quote Originally Posted by Zvi
    How come you always choose the wrong side, first HD DVD, now Russians.

  13. #13
    BozoKiller
    Join Date
    Oct 2003
    Location
    Zoso
    Posts
    7,636
    Quote Originally Posted by slavik
    well, the program accepts only files (no directories at this point in time) ... and how big is the VOB?

    that is an interesting problem ... I4one
    .vob tested = 1,048,574 KB - ~1 GB

    but obviously it's a DVD, and Read Only (Archive attribute is set also - hmmm)...yet you're test of knoppix was very similar -- i will try some other files, and post results when i'm not drinking

    btw - this is all Fat32, and (CDFS) and UDF actually for the DVD;
    http://i12.photobucket.com/albums/a2...asablanca2.png
    Delete the Electoral College - Support
    www.NationalPopularVote.com

    "The world according to DRM Bozos"

    I am a consumer, I'll buy anything
    I am a sheep, I am cattle, I follow the herd
    I am ignorant, a dumbass, and I am a bozo...
    I am the epitome of the 'rank and file'
    I am your next door neighbor
    I am 95% of American Consumers
    I will consume you

    • If the light in your head hasn't come on yet,
      I suggest you go get a new bulb!

  14. #14
    Mako Shark slavik's Avatar
    Join Date
    May 2001
    Location
    Brooklyn
    Posts
    3,308
    well, if you give it a file over 4GB, it won't complain, but the size display for the file will be wrong ...

    Also, my tests had the file on the hard drive ...

    btw, it is possible to give an input to my program that will not be compressable, only thing is that the usefulness of such a file to anyone else is rather low ... there is also a way to give my program an input of 2KB that it outputs a file of 8KB ... I didn't actually test it but calculated it on paper ... so far, it seems that there is a problem with my buffered output ...

    edit: fixed ... forgot to tell it to open the file for binary output :P (bytes are interpreted as bytes, not characters)

    edit2: this weekend, I hope to get something working that will encode a file with the proper header and decode it with a properly ...

    edit3: still not fixed, I think ...
    Last edited by slavik; 11-02-2005 at 06:54 PM.
    Activation? What activation?
    Quote Originally Posted by Geekkit (from ubuntu forums regarding 'goto' statement)
    Yep it sure does. So does crack cocaine. Existence is not a valid endorsement for being acceptable.
    Quote Originally Posted by Linus Torvalds
    Only wimps use tape backup: _real_ men just upload their important stuff on ftp, and let the rest of the world mirror it

  15. #15
    BozoKiller
    Join Date
    Oct 2003
    Location
    Zoso
    Posts
    7,636
    here's some more results;
    * 1st using a very small CDA file on a burnt CD

    * 2nd using a d/l WinXP SP2 package (on local HDD)

    Code:
    C:\WINDOWS\Desktop>huffman.exe "E:\track01.cda
    Reading E:\track01.cda
    File size = 44 bytes
    Encoded size = 21 bytes
    Compression ratio = 47.7273% ((encoded file size)/(original file size))*100%
    Space Saved = 23 bytes
    
    C:\WINDOWS\Desktop>huffman.exe "D:\XP Install\WindowsXP-KB835935-SP2-ENU.exe"
    Reading D:\XP Install\WindowsXP-KB835935-SP2-ENU.exe
    File size = 278927592 bytes
    Encoded size = 278927592 bytes
    Compression ratio = 100% ((encoded file size)/(original file size))*100%
    Space Saved = 0 bytes
    Last edited by I4one; 11-03-2005 at 10:12 AM.
    Delete the Electoral College - Support
    www.NationalPopularVote.com

    "The world according to DRM Bozos"

    I am a consumer, I'll buy anything
    I am a sheep, I am cattle, I follow the herd
    I am ignorant, a dumbass, and I am a bozo...
    I am the epitome of the 'rank and file'
    I am your next door neighbor
    I am 95% of American Consumers
    I will consume you

    • If the light in your head hasn't come on yet,
      I suggest you go get a new bulb!

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •