First bytes of Binary files on http.response are wrong

Beunwa

06 Nov, 2012 12:16 PM

Hello,

I've made a plugin that hooks into spider.on_each_response. It saves binary files (images at the moment) so that I can compare the images when I launch a new crawl.
This is its run method:

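def run
  spider.on_each_response do |response|
    ct = response.headers_hash['Content-type']
    fname = File.basename(response.effective_url)
    # Save JPEG responses so they can be compared on the next crawl.
    if ct == 'image/jpeg'
      # 'wb' writes the body in binary mode, without newline or encoding conversion.
      File.open(fname, 'wb') { |f| f << response.body }
    end
  end
end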

The plugin works like a charm, except that response.body contains wrong bytes and the saved file is unreadable :( Let me explain:
A valid JPEG starts with: FF D8 FF E0 00 10 4A
But the written file always starts with (for JPEG): C3 98 C3 A0 00 10 4A
And the written file is always bigger than expected (by 30%).
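
The byte pairs in that dump have a recognisable shape: C3 98 is the UTF-8 encoding of U+00D8 and C3 A0 is the UTF-8 encoding of U+00E0, which suggests the raw bytes were re-encoded as text somewhere along the way. The following is a minimal sketch of that general mechanism only. The helper names are hypothetical and it is not taken from Arachni's code. It also does not reproduce the reported prefix exactly, since the dump above shows no counterpart for the FF bytes.

```ruby
# Format a string's raw bytes as space-separated uppercase hex.
def hex_dump(str)
  str.bytes.map { |b| format('%02X', b) }.join(' ')
end

# Hypothetical stand-in for an over-eager sanitizer: treat every byte as a
# Unicode code point and emit that code point as UTF-8.
def reencode_bytes_as_utf8(str)
  str.unpack('C*').pack('U*')
end

jpeg_header = [0xFF, 0xD8, 0xFF, 0xE0, 0x00, 0x10, 0x4A].pack('C*')
mangled     = reencode_bytes_as_utf8(jpeg_header)

puts hex_dump(jpeg_header) # FF D8 FF E0 00 10 4A
puts hex_dump(mangled)     # C3 BF C3 98 C3 BF C3 A0 00 10 4A
puts "#{jpeg_header.bytesize} bytes -> #{mangled.bytesize} bytes" # 7 bytes -> 11 bytes
```

Every byte at or above 0x80 turns into two bytes under this kind of re-encoding, while bytes below 0x80 are unchanged, which would also be consistent with the written file coming out larger than the original.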

I tried saving the file using open(response.effective_url).read instead, and then the written file is readable and OK.

I know that it's not part of the Arachni framework, but this behaviour seems very strange to me.
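
For comparing saved images between crawls, it can help to check each file's leading bytes right after writing it. Below is a minimal sketch, assuming hypothetical helper names (save_binary, jpeg_file?) that are not part of Arachni. Binary mode alone would not repair a body that was already altered before reaching the plugin, but the signature check would flag it.

```ruby
require 'tmpdir'

# First three bytes of any JPEG stream (SOI marker plus the start of the next marker).
JPEG_MAGIC = [0xFF, 0xD8, 0xFF].pack('C*').freeze

# Write the body in binary mode so no newline or encoding conversion is applied.
def save_binary(path, body)
  File.open(path, 'wb') { |f| f.write(body) }
end

# Read only the leading bytes and compare them with the JPEG signature.
def jpeg_file?(path)
  File.binread(path, JPEG_MAGIC.bytesize) == JPEG_MAGIC
end

Dir.mktmpdir do |dir|
  path = File.join(dir, 'sample.jpg')
  body = [0xFF, 0xD8, 0xFF, 0xE0, 0x00, 0x10, 0x4A].pack('C*')
  save_binary(path, body)
  puts jpeg_file?(path) # true
end
```

A file that starts with C3 98 C3 A0, like the one described above, would fail this check immediately instead of surfacing later as an unreadable image.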

  1. Support Staff 1 Posted by Tasos Laskos on 06 Nov, 2012 12:20 PM

    Actually, I've sort of seen this before when using the Proxy plugin: for some reason the images get corrupted. I'll look into it.

  2. Support Staff 2 Posted by Tasos Laskos on 06 Nov, 2012 10:50 PM

    Fixed in the distributed crawler's branch.

  3. Tasos Laskos closed this discussion on 06 Nov, 2012 10:50 PM.

  4. beunwa re-opened this discussion on 07 Nov, 2012 07:16 AM

  5. 3 Posted by beunwa on 07 Nov, 2012 07:16 AM

    Nice job!

    Could you explain what was wrong? I've dug through your code and can't find the problem.

  6. Support Staff 4 Posted by Tasos Laskos on 07 Nov, 2012 11:03 AM

    https://github.com/Arachni/arachni/commit/867d2e2ea848f71de3cd042d1...

    Some time ago a user was reporting encoding errors, so we started sanitizing everything in an effort to find out how the bad chars were getting through.
    It turns out the problem was environmental, because it eventually went away on its own, but the sanitization code remained in the repo and was corrupting binary files.

    Maybe repacking the bytes is too harsh; I may dial it down to forcing the HTTP responses to UTF-8.

    PS. See String#repack: https://github.com/Arachni/arachni/blob/feature/distributed-crawlin...

  7. Tasos Laskos closed this discussion on 07 Nov, 2012 11:03 AM.
