|
-
NullPointerException
Determine binary file type
I've written a nice little application that goes through a Firefox cache folder and tags each file with an appropriate extension. It does a wonderful job so far, pulling out many kinds of images and text types. I can manually identify some Office docs, too, but those are harder to do automatically because there aren't 'magic bytes' at the top, and all I'm parsing is the first line.
Anyway, I've come upon a few binary files that I can't seem to identify or open. They all start with the bytes 1F 8B 08 00. The first 2 bytes seem to indicate a GZip file, but that would be a bit surprising, first of all. And second, no archive program will open them anyway. IrfanView does a good job of recognizing misnamed extensions, and it doesn't know this one either.
Does anyone have enough experience, or understand what I'm talking about well enough, to offer a suggestion?
Open Source is free like a puppy is free.
It's only when you look at an ant through a magnifying glass on a sunny day that you realise how often they burst into flames.
Understanding Evolution
-
Have you tried looking at your history for the files' creation date? If it's just a few, it could be something weird done by one or two sites.
-
i take it you found this already ? http://www.garykessler.net/library/file_sigs.html
definitely seems like a gzip file
http://64.233.161.104/search?q=cache...8B+08+00&hl=en
(google cache link given only because it highlights the search terms in bold, which makes the info easy to locate in the page)
Delete the Electoral College - Support
www.NationalPopularVote.com
"The world according to DRM Bozos"
I am a consumer, I'll buy anything
I am a sheep, I am cattle, I follow the herd
I am ignorant, a dumbass, and I am a bozo...
I am the epitome of the 'rank and file'
I am your next door neighbor
I am 95% of American Consumers
I will consume you
- If the light in your head hasn't come on yet,
I suggest you go get a new bulb!
-
NullPointerException
Maybe they're just partial or corrupted GZ files, because that does look like the only format with that lead-in.
Thanks for the File Signatures link -- I hadn't seen that one yet. It confirmed my issue with Office files -- they all start the same; the identifying bits are at the end of the file, so parsing just the first line can't do it.
I've only got 2 of those files in the whole cache. Checking dates, it puts them alongside a bunch of PNG images from maps.google.com. Maybe they're part of the client-side bits that let you scroll around the maps.
I'll just tag these as .gz so at least they're set to something.
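One way to go a step beyond tagging is to check the header and then try to inflate the data. The sketch below is only an illustration, not the application from this thread: the class and method names are made up, and it assumes a .NET version that has `GZipStream` and `Stream.CopyTo`. A possible explanation for these files (an assumption, nothing in the thread confirms it) is that the server sent the response gzip-compressed and the cache kept the compressed bytes, in which case the content inside would be an ordinary page or script and not an archive.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

static class GzipProbe
{
    // True when the buffer starts with the gzip ID bytes 1F 8B followed by
    // compression method 08 (deflate), as in the files from the thread.
    public static bool LooksLikeGzip(byte[] head)
    {
        return head.Length >= 3
            && head[0] == 0x1F && head[1] == 0x8B && head[2] == 0x08;
    }

    // Attempts to inflate the data. Returns null when the runtime rejects it
    // as invalid gzip; a truncated file may instead yield partial output.
    public static byte[] TryGunzip(byte[] data)
    {
        try
        {
            using (var input = new GZipStream(new MemoryStream(data), CompressionMode.Decompress))
            using (var output = new MemoryStream())
            {
                input.CopyTo(output);
                return output.ToArray();
            }
        }
        catch (InvalidDataException)
        {
            return null;
        }
    }

    static void Main()
    {
        // Build a small gzip sample in memory so the sketch runs on its own.
        byte[] sample;
        using (var buffer = new MemoryStream())
        {
            using (var gz = new GZipStream(buffer, CompressionMode.Compress))
            {
                byte[] text = Encoding.ASCII.GetBytes("<html>cached page</html>");
                gz.Write(text, 0, text.Length);
            }
            sample = buffer.ToArray();
        }

        Console.WriteLine(LooksLikeGzip(sample));
        byte[] plain = TryGunzip(sample);
        Console.WriteLine(plain == null ? "not valid gzip" : Encoding.ASCII.GetString(plain));
    }
}
```

If the inflated bytes turn out to be text, the usual first-line checks could then be run on them instead of on the compressed data.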
-
cool rock....
i noticed, out of curiosity, i checked some random Moz cache files and found GIF files easily with my hex editor... the signature bytes are there, everything cool, GIF89. BUT when i checked some DAT files on my HDD, the headers (lead-ins) of all my DAT files are different... what up with dat? they certainly don't match the info at the URL i posted.
Yes, IrfanView is pretty great at noticing the file type before opening (it prompts to rename the file extension), i noticed that a while ago. and i didn't find a WinZip sig at the URL i posted, nor a bzip2 one. but like i said, who knows, since my DAT file experiment turned up sig bytes that don't match anything listed there.
FWIW, i realize that Adobe Acrobat docs, for example, have a sig header, b/c i looked with a hex editor when i couldn't open some downloaded PDF files... they all start with "%PDF-1.5" (or similar), and my 4.0 reader can't open/view them. they get deleted (CTRL + DEL) immediately.
Last edited by I4one; 12-23-2005 at 11:18 PM.
-
NullPointerException
The 4.0 reader is pretty old, so it's not too surprising that you can't open newer PDFs.
As for '.dat' files, that's an extension used by many different programs, so I don't think there's a general rule to apply there.
If anyone is interested, this is what my checks look like so far. This is simple C#, though this bit is just a bunch of conditionals after reading the first 10 bytes. This code is just string comparisons; to check for raw bytes like the ones above, there's a little logic before I convert the bytes into a String, and when that matches, these tests are skipped.
Code:
// Every test runs in turn, so a later match overwrites an earlier one.
// Binary formats: ASCII markers that show up within the first 10 bytes.
if (str.IndexOf("PDF") >= 0) ext = ".pdf";
if (str.IndexOf("JFIF") >= 0) ext = ".jpg";
if (str.IndexOf("Exif") >= 0) ext = ".jpg";
if (str.IndexOf("EPS") >= 0) ext = ".jpg";
if (str.IndexOf("GIF") >= 0) ext = ".gif";
if (str.IndexOf("PNG") >= 0) ext = ".png";
if (str.IndexOf("CWS") >= 0) ext = ".swf";
if (str.IndexOf("FWS") >= 0) ext = ".swf";
// Text formats: guessed from the leading characters.
if (str.IndexOf("link") >= 0) ext = ".txt";
if (str.IndexOf(".") == 0) ext = ".css";
if (str.IndexOf("<H") >= 0) ext = ".html";
if (str.IndexOf("<h") >= 0) ext = ".html";
if (str.IndexOf("<!") >= 0) ext = ".html";
if (str.IndexOf("<!DOC") >= 0) ext = ".html";
if (str.IndexOf("<!doc") >= 0) ext = ".html";
if (str.IndexOf("XML") >= 0) ext = ".xml";
if (str.IndexOf("<?") >= 0) ext = ".xml";
if (str.IndexOf("/") == 0) ext = ".js";
if (str.IndexOf("\t") == 0) ext = ".html";
These seem to cover close to 100% of the files.
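For the binary formats, the same idea can also be written as a lookup on the raw bytes, before any conversion to String. The following is a minimal sketch under assumptions: the table layout and the names `SignatureTable` and `Guess` are invented for illustration, and `.ole` is just a placeholder tag, since Word, Excel and PowerPoint files of that era share one container signature and cannot be told apart from the header alone.

```csharp
using System;
using System.Collections.Generic;

static class SignatureTable
{
    // Byte prefix paired with the extension to assign. The byte values are
    // the commonly documented signatures; the table itself is hypothetical.
    static readonly KeyValuePair<byte[], string>[] Signatures =
    {
        new KeyValuePair<byte[], string>(new byte[] { 0x89, 0x50, 0x4E, 0x47 }, ".png"),
        new KeyValuePair<byte[], string>(new byte[] { 0x47, 0x49, 0x46, 0x38 }, ".gif"),
        new KeyValuePair<byte[], string>(new byte[] { 0xFF, 0xD8, 0xFF }, ".jpg"),
        new KeyValuePair<byte[], string>(new byte[] { 0x25, 0x50, 0x44, 0x46 }, ".pdf"),
        new KeyValuePair<byte[], string>(new byte[] { 0x46, 0x57, 0x53 }, ".swf"),
        new KeyValuePair<byte[], string>(new byte[] { 0x43, 0x57, 0x53 }, ".swf"),
        new KeyValuePair<byte[], string>(new byte[] { 0x1F, 0x8B }, ".gz"),
        // Shared by old Office documents (placeholder extension).
        new KeyValuePair<byte[], string>(
            new byte[] { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 }, ".ole"),
    };

    // Returns the extension of the first signature that prefixes the data,
    // or null when nothing matches.
    public static string Guess(byte[] head)
    {
        foreach (var sig in Signatures)
        {
            if (StartsWith(head, sig.Key)) return sig.Value;
        }
        return null;
    }

    static bool StartsWith(byte[] data, byte[] prefix)
    {
        if (data.Length < prefix.Length) return false;
        for (int i = 0; i < prefix.Length; i++)
        {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }

    static void Main()
    {
        Console.WriteLine(Guess(new byte[] { 0x1F, 0x8B, 0x08, 0x00 }));        // prints .gz
        Console.WriteLine(Guess(new byte[] { 0x47, 0x49, 0x46, 0x38, 0x39 })); // prints .gif
    }
}
```

Text formats have no fixed signature, so they would still need heuristics like the string tests above.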
Last edited by rock; 12-24-2005 at 12:38 PM.
-
Mako Shark
umm, make your code more efficient by using a switch statement
Activation? What activation?
 Originally Posted by Geekkit (from ubuntu forums regarding 'goto' statement)
Yep it sure does. So does crack cocaine. Existence is not a valid endorsement for being acceptable.
 Originally Posted by Linus Torvalds
Only wimps use tape backup: _real_ men just upload their important stuff on ftp, and let the rest of the world mirror it 
-
hey rock;
out of curiosity, why are you doing this?
i could've used this a long time ago when i wanted to wipe the cache without having to go through the long painful GUI every time. So i came up with a simple .BAT file to clear the whole thing... AFAIK there's no way to get a DOS command prompt to filter and delete randomly named files that have no file extensions, and exclude others that reside in the same folder. In the beginning i wanted to leave the _CACHE_001_ (2)(3) and the _CACHE_MAP_ files alone, as i didn't know what the repercussions would be from wiping those, if any. Found out - it don't matter
i would assume you want to help ppl who would like to save some of these temporary inet files... yet there are some, if not many, extensions that Moz recognizes but that aren't listed in your 'almost' 100% list. Enlighten me plz.
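The filtering described above (pick the randomly named files without an extension, leave the `_CACHE_` files alone) is easy to express outside of a batch file. This is only a sketch with assumed names: `CacheFilter` and `SelectEntries` are hypothetical, the sample file names are made up, and the code only lists names, it does not delete anything.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class CacheFilter
{
    // Keeps names that have no extension and do not start with "_CACHE_".
    public static List<string> SelectEntries(IEnumerable<string> fileNames)
    {
        var picked = new List<string>();
        foreach (string name in fileNames)
        {
            bool hasExtension = Path.GetExtension(name).Length > 0;
            bool isIndexFile = name.StartsWith("_CACHE_", StringComparison.OrdinalIgnoreCase);
            if (!hasExtension && !isIndexFile) picked.Add(name);
        }
        return picked;
    }

    static void Main()
    {
        // Made-up names standing in for a directory listing.
        var names = new[] { "_CACHE_001_", "_CACHE_MAP_", "0A1B2C3Dd01", "notes.txt" };
        foreach (string name in SelectEntries(names))
        {
            Console.WriteLine(name); // prints 0A1B2C3Dd01
        }
    }
}
```

On a real folder, the input could come from `Directory.GetFiles` with each path reduced to its file name via `Path.GetFileName`.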
Last edited by I4one; 12-26-2005 at 07:01 PM.
-
Tiger Shark
Last edited by UmneyDurak; 12-27-2005 at 05:59 AM.
01+01=10
-
 Originally Posted by UmneyDurak
Nevermind.

no - no...plz, speak up 
i know you have programming knowledge
-
NullPointerException
 Originally Posted by slavik
umm, make your code more efficient by using a switch statement 
But I don't think readability would improve. It would also be faster if the tests were chained as if/else. The block mixes >= and == tests on a String, which doesn't map cleanly onto a switch, and as a single block of conditionals it's very obvious what's going on. And since this might run on a couple hundred files in a folder, saving a fraction of a fraction of a second isn't of much concern.
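For comparison, here is roughly what the chained variant could look like. It is a sketch with a hypothetical `GuessExtension` helper covering only a few of the tests, not a rewrite of the real application. One thing to watch: in the original block every test runs and the last match wins, whereas in a chain the first match wins, so the more specific tests have to come first.

```csharp
using System;

static class FirstMatch
{
    // A subset of the original tests as a first-match-wins chain.
    // Returns null when none of the rules apply.
    public static string GuessExtension(string str)
    {
        if (str.IndexOf("PDF", StringComparison.Ordinal) >= 0) return ".pdf";
        if (str.IndexOf("JFIF", StringComparison.Ordinal) >= 0) return ".jpg";
        if (str.IndexOf("GIF", StringComparison.Ordinal) >= 0) return ".gif";
        if (str.IndexOf("PNG", StringComparison.Ordinal) >= 0) return ".png";
        // "<?" is tested before "<!" so it keeps priority, as in the original
        // block where the later XML test overwrote the HTML result.
        if (str.IndexOf("<?", StringComparison.Ordinal) >= 0) return ".xml";
        if (str.IndexOf("<!", StringComparison.Ordinal) >= 0) return ".html";
        if (str.IndexOf("<h", StringComparison.OrdinalIgnoreCase) >= 0) return ".html";
        if (str.IndexOf("/", StringComparison.Ordinal) == 0) return ".js";
        return null;
    }

    static void Main()
    {
        Console.WriteLine(GuessExtension("%PDF-1.4"));   // prints .pdf
        Console.WriteLine(GuessExtension("<?xml vers")); // prints .xml
        Console.WriteLine(GuessExtension("<HTML><HEA")); // prints .html
    }
}
```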
 Originally Posted by I4one
out of curiosity, why are you doing this?
Mostly because I wanted to pull out some images after the fact. I was also curious about how much html is cached when using web email systems. I've since learned it's an easy way to get snapshots when using maps.google.com.
As for retrieving things like .wmv or .mpg files played in the browser, this could work there as well; but on the rare occasion I need to do that, it's easy enough to manually find it based on size and date.
So basically, it was an exercise to see if I could do it. It's just a simple command line application, and I'm not planning on investing any more time into it to wrap a gui around it or make it work anywhere except the local directory.
|
|