Determine binary file type

Sharky Forums



  1. #1
    NullPointerException rock's Avatar
    Join Date
    Sep 2000
    Location
    York, PA
    Posts
    6,203

    Determine binary file type

    I've written a nice little application that goes through a Firefox cache folder and tags each file with an appropriate extension. It does a wonderful job so far, pulling out many kinds of images and text types. I can manually identify some Office docs, too, but those are harder to do automatically because there aren't 'magic bytes' at the top, and all I'm parsing is the first line.

    Anyway, I've come upon a few binary files that I can't seem to identify or open. They all start with the bytes 1F 8B 08 00. The first 2 bytes seem to indicate a GZip file, but first of all, that would be a bit surprising. Second, they won't open as archives anyway (no archive program will open them). And IrfanView does a good job of recognizing misnamed extensions, and it doesn't know this one either.
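
    To make the byte check concrete, here is a minimal sketch, assuming the file has already been read into a byte array. It tests for the 1F 8B lead-in plus 08 (the "deflate" compression method in the gzip format) and then tries to inflate the data with GZipStream from System.IO.Compression. The names GzipProbe, LooksLikeGzip and TryGunzip are made up for illustration and are not from my app.

```csharp
using System;
using System.IO;
using System.IO.Compression;

static class GzipProbe
{
    // True when the buffer starts with the gzip magic bytes 1F 8B
    // followed by 08, the "deflate" compression method.
    public static bool LooksLikeGzip(byte[] head)
    {
        return head != null && head.Length >= 3
            && head[0] == 0x1F && head[1] == 0x8B && head[2] == 0x08;
    }

    // Tries to inflate the whole buffer. Returns null when GZipStream
    // rejects the data with an InvalidDataException.
    public static byte[] TryGunzip(byte[] data)
    {
        try
        {
            using (var input = new MemoryStream(data))
            using (var gz = new GZipStream(input, CompressionMode.Decompress))
            using (var output = new MemoryStream())
            {
                gz.CopyTo(output);
                return output.ToArray();
            }
        }
        catch (InvalidDataException)
        {
            return null;
        }
    }
}
```

    If TryGunzip returns data, running the usual first-bytes checks on the result might show what the file really is.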

    Does anyone with enough experience, or who understands what I'm talking about, have a suggestion?

    Open Source is free like a puppy is free.

    It's only when you look at an ant through a magnifying glass on a sunny day that you realise how often they burst into flames.

    Understanding Evolution

  2. #2
    Hammerhead Shark
    Join Date
    Feb 2001
    Posts
    1,612
    Have you tried looking at your history for the files' creation date? If it's just a few, it could be something weird done by one or two sites.

  3. #3
    BozoKiller
    Join Date
    Oct 2003
    Location
    Zoso
    Posts
    7,636
    i take it you found this already ? http://www.garykessler.net/library/file_sigs.html

    definitely seems like gzip
    http://64.233.161.104/search?q=cache...8B+08+00&hl=en
    google cache given only for the 'bold' highlighted search term purposes -- easily locate info in page
    Delete the Electoral College - Support
    www.NationalPopularVote.com

    "The world according to DRM Bozos"

    I am a consumer, I'll buy anything
    I am a sheep, I am cattle, I follow the herd
    I am ignorant, a dumbass, and I am a bozo...
    I am the epitome of the 'rank and file'
    I am your next door neighbor
    I am 95% of American Consumers
    I will consume you

    • If the light in your head hasn't come on yet,
      I suggest you go get a new bulb!

  4. #4
    NullPointerException rock's Avatar
    Maybe they're just partial or corrupted GZ files, because that does look like the only format with that lead-in.

    Thanks for the File Signatures link -- I hadn't seen that one yet. It confirmed my issue with Office files -- they all start the same; the identifying bits are at the end of the file, so parsing just the first line can't do it.
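
    As a rough sketch of what "they all start the same" means in code: legacy Office documents are OLE2 compound files, and the shared header is the 8 bytes D0 CF 11 E0 A1 B1 1A E1. The check below can only say "some OLE2 file", not whether it's Word, Excel or PowerPoint. OfficeProbe and LooksLikeOle2 are hypothetical names used only for this example.

```csharp
using System;

static class OfficeProbe
{
    // Header shared by legacy Office documents (OLE2 compound files).
    static readonly byte[] Ole2Magic =
        { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 };

    // True when the buffer begins with the full 8-byte OLE2 header.
    public static bool LooksLikeOle2(byte[] head)
    {
        if (head == null || head.Length < Ole2Magic.Length) return false;
        for (int i = 0; i < Ole2Magic.Length; i++)
        {
            if (head[i] != Ole2Magic[i]) return false;
        }
        return true;
    }
}
```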

    I've only got 2 of those files in the whole cache. Checking dates, it puts them alongside a bunch of PNG images from maps.google.com. Maybe they're part of the client-side bits that let you scroll around the maps.

    I'll just tag these as .gz so at least they're set to something.


  5. #5
    BozoKiller
    cool rock....
    Out of curiosity, I checked some random Moz cache files and found GIF files easily with my hex editor. The signature bytes are there (GIF89), everything cool. BUT when I checked some DAT files on my HDD, the headers (lead-ins) of all my DAT files are different... what up with dat? They certainly don't match the info at the URL I posted.

    Yes, IrfanView is pretty great at noticing the file type before opening (it prompts to rename the file extension). I noticed that a while ago. I didn't find a WinZip sig at the URL I posted, nor a bzip2 one, but like I said, who knows, since my DAT file experiment turned up non-matching sig bytes.

    FWIW, I realize that Adobe Acrobat docs, for example, have a sig header, because I looked with a hex editor when I couldn't open some downloaded PDF files. They all start with "%PDF-1.5" (or similar), and my 4.0 reader can't open/view them. They get deleted (CTRL + DEL) immediately.
    Last edited by I4one; 12-23-2005 at 11:18 PM.

  6. #6
    NullPointerException rock's Avatar
    The 4.0 reader is pretty old, so it's not too surprising that you can't open newer PDFs.

    As for '.dat' files, that's an extension used by many different programs, so I don't think there's a general rule to apply there.

    If anyone is interested, this is what my checks look like so far. This is simple C#, though this bit is just a bunch of conditionals after reading the first 10 bytes. This code is just string comparisons; to check for raw bytes like the ones above, there's a little logic before I cast the bytes into a String, and then these tests are skipped.

    Code:
    	// str holds the first 10 bytes of the file, cast to a String.
    	// Every test runs independently, so the last matching test wins.

    	// Signatures of binary formats
    	if (str.IndexOf("PDF") >= 0) ext = ".pdf";
    	if (str.IndexOf("JFIF") >= 0) ext = ".jpg";
    	if (str.IndexOf("Exif") >= 0) ext = ".jpg";
    	if (str.IndexOf("EPS") >= 0) ext = ".jpg";
    	if (str.IndexOf("GIF") >= 0) ext = ".gif";
    	if (str.IndexOf("PNG") >= 0) ext = ".png";
    	if (str.IndexOf("CWS") >= 0) ext = ".swf";
    	if (str.IndexOf("FWS") >= 0) ext = ".swf";

    	// Heuristics for text formats
    	if (str.IndexOf("link") >= 0) ext = ".txt";
    	if (str.IndexOf(".") == 0) ext = ".css";
    	if (str.IndexOf("<H") >= 0) ext = ".html";
    	if (str.IndexOf("<h") >= 0) ext = ".html";
    	if (str.IndexOf("<!") >= 0) ext = ".html";
    	if (str.IndexOf("<!DOC") >= 0) ext = ".html";
    	if (str.IndexOf("<!doc") >= 0) ext = ".html";
    	if (str.IndexOf("XML") >= 0) ext = ".xml";
    	if (str.IndexOf("<?") >= 0) ext = ".xml";
    	if (str.IndexOf("/") == 0) ext = ".js";
    	if (str.IndexOf("\t") == 0) ext = ".html";
    These seem to cover close to 100% of the files.
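
    For the byte-level checks that run before the string tests, a small signature table is one way to go. This is only a sketch and not the code from my app: the class name ByteSignatures, the Guess method and the choice of entries are assumptions for illustration. The signatures themselves are the well-known leading bytes for PNG, GIF, JPEG, PDF and gzip.

```csharp
using System;

static class ByteSignatures
{
    // One known lead-in and the extension to tag it with.
    sealed class Sig
    {
        public readonly string Ext;
        public readonly byte[] Magic;
        public Sig(string ext, params byte[] magic) { Ext = ext; Magic = magic; }
    }

    static readonly Sig[] Known =
    {
        new Sig(".png", 0x89, 0x50, 0x4E, 0x47),   // \x89 P N G
        new Sig(".gif", 0x47, 0x49, 0x46, 0x38),   // G I F 8
        new Sig(".jpg", 0xFF, 0xD8, 0xFF),
        new Sig(".pdf", 0x25, 0x50, 0x44, 0x46),   // % P D F
        new Sig(".gz",  0x1F, 0x8B),
    };

    static bool StartsWith(byte[] head, byte[] magic)
    {
        if (head.Length < magic.Length) return false;
        for (int i = 0; i < magic.Length; i++)
        {
            if (head[i] != magic[i]) return false;
        }
        return true;
    }

    // Returns the extension of the first matching signature, or null
    // when nothing matches and the string tests should run instead.
    public static string Guess(byte[] head)
    {
        if (head == null) return null;
        foreach (Sig sig in Known)
        {
            if (StartsWith(head, sig.Magic)) return sig.Ext;
        }
        return null;
    }
}
```

    Comparing at offset 0 avoids false hits from a stray "PNG" or "GIF" somewhere later in the first bytes, which the IndexOf tests would accept.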
    Last edited by rock; 12-24-2005 at 12:38 PM.


  7. #7
    Mako Shark slavik's Avatar
    Join Date
    May 2001
    Location
    Brooklyn
    Posts
    3,308
    umm, make your code more efficient by using a switch statement
    Activation? What activation?
    Quote Originally Posted by Geekkit (from ubuntu forums regarding 'goto' statement)
    Yep it sure does. So does crack cocaine. Existence is not a valid endorsement for being acceptable.
    Quote Originally Posted by Linus Torvalds
    Only wimps use tape backup: _real_ men just upload their important stuff on ftp, and let the rest of the world mirror it

  8. #8
    BozoKiller
    hey rock;
    out of curiosity, why are you doing this?
    I could've used this a long time ago when I wanted to wipe the cache without having to go through the long, painful GUI every time. So I came up with a simple .BAT file to clear the whole thing. AFAIK, there's no way to get a DOS command prompt to filter and delete randomly named files that have no file extensions while excluding others that reside in the same folder. In the beginning I wanted to leave the _CACHE_001_ (2)(3) and the _CACHE_MAP_ files alone, as I didn't know what the repercussions of wiping those would be, if any. Found out it don't matter.

    I would assume you want to help people who would like to save some of these temporary internet files... yet there are some, if not many, extensions that Moz recognizes but that aren't in your 'almost' 100% list. Enlighten me, plz.
    Last edited by I4one; 12-26-2005 at 07:01 PM.

  9. #9
    Tiger Shark UmneyDurak's Avatar
    Join Date
    Apr 2004
    Location
    Escaped from 2nd floor of Soda Hall.
    Posts
    671
    Nevermind.
    Last edited by UmneyDurak; 12-27-2005 at 05:59 AM.
    01+01=10

  10. #10
    BozoKiller
    Quote Originally Posted by UmneyDurak
    Nevermind.

    no - no...plz, speak up
    i know you have programming knowledge

  11. #11
    NullPointerException rock's Avatar
    Quote Originally Posted by slavik
    umm, make your code more efficient by using a switch statement
    But I don't think readability would improve. It would also be faster if they were all if/else. There are both >= and == tests, I'm testing on a String, and with this single block it's very obvious what's going on. And since this might work on a couple hundred files in a folder, saving a fraction of a fraction of a second isn't of much concern.
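
    One catch with the if/else idea, shown as a rough sketch with only three of the tests: with independent ifs the last match wins, while an else-if chain stops at the first match, so the chain has to list the tests in reverse order to give the same answers. ExtRules, LastMatchWins and FirstMatchWins are made-up names for this example, and I've used ordinal comparisons here to keep the behavior predictable.

```csharp
using System;

static class ExtRules
{
    // Independent ifs: every test runs, so the LAST match wins.
    public static string LastMatchWins(string str)
    {
        string ext = "";
        if (str.IndexOf("PNG", StringComparison.Ordinal) >= 0) ext = ".png";
        if (str.IndexOf("<!", StringComparison.Ordinal) >= 0) ext = ".html";
        if (str.IndexOf("/", StringComparison.Ordinal) == 0) ext = ".js";
        return ext;
    }

    // else-if chain: testing stops at the FIRST match, so the tests are
    // listed in reverse order to reproduce the results above.
    public static string FirstMatchWins(string str)
    {
        string ext = "";
        if (str.IndexOf("/", StringComparison.Ordinal) == 0) ext = ".js";
        else if (str.IndexOf("<!", StringComparison.Ordinal) >= 0) ext = ".html";
        else if (str.IndexOf("PNG", StringComparison.Ordinal) >= 0) ext = ".png";
        return ext;
    }
}
```

    A file starting with "/* PNG" shows the difference: it matches both the "PNG" and the "/" test, and both versions above tag it .js only because the chain is reversed.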

    Quote Originally Posted by I4one
    out of curiosity, why are you doing this?
    Mostly because I wanted to pull out some images after the fact. I was also curious about how much html is cached when using web email systems. I've since learned it's an easy way to get snapshots when using maps.google.com.

    As for retrieving things like .wmv or .mpg files played in the browser, this could work there as well; but on the rare occasion I need to do that, it's easy enough to manually find it based on size and date.

    So basically, it was an exercise to see if I could do it. It's just a simple command line application, and I'm not planning on investing any more time into it to wrap a gui around it or make it work anywhere except the local directory.

