First bytes of Binary files on http.response are wrong
Hello,
I've made a plugin that hooks into spider.on_each_response; it saves binary files (images at the moment) so I can compare the images when I launch a new crawl.
This is its run method:
def run
  spider.on_each_response do |response|
    ct = response.headers_hash['Content-type']
    fname = File.basename(response.effective_url)
    if ct == 'image/jpeg'
      File.open(fname, 'w') {|f| f << response.body}
    end
  end
end
The plugin works like a charm ... except that response.body contains wrong bytes and the saved file is unreadable :( Let me explain:
A valid JPEG starts with: FF D8 FF E0 00 10 4A
But the written file always starts with (for JPEG): C3 98 C3 A0 00 10 4A
And the written file is always bigger than expected (by about 30%).
I tried to save the file using open(response.effective_url).read instead, and then the written file is readable and OK.
I know that this is not part of the Arachni framework, but this behaviour is very strange to me.
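For anyone who wants to repeat the check described above, here is a minimal sketch that writes a body byte-for-byte and tests whether the saved file starts with the JPEG magic bytes. The helper names save_binary and jpeg? are hypothetical and not part of Arachni; only Ruby's standard File.binwrite and File.binread are used.

```ruby
# First three bytes of a JPEG file: start-of-image marker (FF D8)
# followed by the FF that opens the next marker.
JPEG_MAGIC = [0xFF, 0xD8, 0xFF].pack('C*')

# Hypothetical helper: write the body in binary mode, so no newline
# or encoding conversion is applied on the way to disk.
def save_binary(path, body)
  File.binwrite(path, body)
end

# Hypothetical helper: true if the file begins with the JPEG magic bytes.
def jpeg?(path)
  File.binread(path, JPEG_MAGIC.bytesize) == JPEG_MAGIC
end
```

Note that binary mode only protects the write itself. If the body already contains altered bytes when it reaches the plugin, jpeg? will still report false, which is a quick way to tell the two cases apart.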
Support Staff 1 Posted by Tasos Laskos on 06 Nov, 2012 12:20 PM
Actually, I've sort of seen this before when using the Proxy plugin; for some reason the images get corrupted. I'll look into it.
Support Staff 2 Posted by Tasos Laskos on 06 Nov, 2012 10:50 PM
Fixed in the distributed crawler's branch.
Tasos Laskos closed this discussion on 06 Nov, 2012 10:50 PM.
beunwa re-opened this discussion on 07 Nov, 2012 07:16 AM
3 Posted by beunwa on 07 Nov, 2012 07:16 AM
Nice job!
Could you explain what was wrong? I've dug through your code and can't find the problem.
Support Staff 4 Posted by Tasos Laskos on 07 Nov, 2012 11:03 AM
https://github.com/Arachni/arachni/commit/867d2e2ea848f71de3cd042d1...
Some time ago a user was reporting encoding errors, so we started sanitizing everything in an effort to find out how the bad chars were getting through.
Turns out the problem was environmental, because it eventually went away on its own, but the sanitization code remained in the repo and was corrupting binary files.
Maybe repacking the bytes is too harsh; I may dial it down to forcing the HTTP responses to UTF-8.
PS. See String#repack: https://github.com/Arachni/arachni/blob/feature/distributed-crawlin...
Tasos Laskos closed this discussion on 07 Nov, 2012 11:03 AM.
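To illustrate why sanitizing a response body can corrupt an image, here is a rough sketch in plain Ruby. It is not a reproduction of String#repack; it only assumes that some step treats each raw byte as a character and re-encodes it as UTF-8. The method names transcode_as_latin1 and relabel_as_utf8 are hypothetical. Under that assumption, every byte of 0x80 or above becomes two bytes, which would also explain a larger file, and the C3 98 and C3 A0 pairs from the report are exactly the UTF-8 encodings of D8 and E0.

```ruby
# JPEG header bytes from the report, as a binary (ASCII-8BIT) string.
JPEG_HEADER = [0xFF, 0xD8, 0xFF, 0xE0, 0x00, 0x10, 0x4A].pack('C*')

# Format a string's bytes as space-separated uppercase hex.
def hex(str)
  str.bytes.map { |b| format('%02X', b) }.join(' ')
end

# Assumed sanitization step: read each byte as a Latin-1 character and
# transcode to UTF-8. This changes the underlying bytes.
def transcode_as_latin1(bytes)
  bytes.dup.force_encoding('ISO-8859-1').encode('UTF-8')
end

# Only change the encoding label; the underlying bytes stay the same.
def relabel_as_utf8(bytes)
  bytes.dup.force_encoding('UTF-8')
end

puts hex(JPEG_HEADER)                       # FF D8 FF E0 00 10 4A
puts hex(transcode_as_latin1(JPEG_HEADER))  # C3 BF C3 98 C3 BF C3 A0 00 10 4A
puts hex(relabel_as_utf8(JPEG_HEADER))      # FF D8 FF E0 00 10 4A
```

The last line shows that merely relabelling a string with force_encoding leaves the bytes untouched, so binary payloads survive it, whereas any transcoding step rewrites them.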