Software as a Service (SaaS)
People have asked about the best way to set up Arachni to operate in a SaaS model, so hopefully this article will clarify a few things and point you in the right direction.
Default model (On-Demand)
Arachni's default distributed model is On-Demand: you ask a Dispatcher for an Instance, take the Instance that's been assigned to you, configure it with the scan options and start the scan.
Once the scan is running you poll it periodically to check its progress, and once the scan has finished you grab the report, save it somewhere and then shut down your Instance.
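To make that concrete, here's a minimal sketch of the On-Demand lifecycle using Arachni's Ruby RPC client. The class and method names follow the v1.x RPC API examples and may differ between versions, and the Dispatcher address and target URL are placeholders:

```ruby
require 'json'
require 'arachni'
require 'arachni/rpc/client'

options    = Arachni::Options.instance
dispatcher = Arachni::RPC::Client::Dispatcher.new( options, 'localhost:7331' )

# Ask the Dispatcher for an Instance and connect to it.
info     = dispatcher.dispatch
instance = Arachni::RPC::Client::Instance.new( options, info['url'], info['token'] )

# Configure the Instance with the scan options and start the scan.
instance.service.scan( url: 'http://testfire.net', checks: '*' )

# Poll periodically until the scan finishes.
sleep 5 while instance.service.busy?

# Grab the report, save it somewhere, then shut the Instance down.
File.write( 'report.json', instance.service.report.to_json )
instance.service.shutdown
```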
Seasoned distributed-systems engineers will immediately think: "What a stupid way of doing things; this will be a pain to scale."
And they would, of course, be right.
In order to load-balance your jobs you would have to keep track of the workload of all your Dispatchers and then assign each new job to the least burdened one -- which is not the most elegant solution.
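For illustration, a naive balancer could look something like the sketch below -- note that the `statistics` call and its `running_jobs` key are assumptions about what a Dispatcher reports, not a documented guarantee:

```ruby
require 'arachni'
require 'arachni/rpc/client'

# Placeholder Dispatcher addresses for your own deployment.
DISPATCHERS = %w(node1.example.com:7331 node2.example.com:7331)

def least_burdened_dispatcher
  DISPATCHERS.min_by do |url|
    client = Arachni::RPC::Client::Dispatcher.new( Arachni::Options.instance, url )
    # Assumed to be a count of running scans -- verify against your version.
    client.statistics['running_jobs'].to_i
  end
end
```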
On the other hand, the On-Demand model allows you to perform scans...well...on demand and is perfect for developing user interfaces -- like the WebUI for example.
Plus, the On-Demand model is the simplest to work with and any other model can be built on top of it, so you don't lose any flexibility (in case you're looking to do fancier stuff) while keeping the design neat, simple and straightforward.
Typical SaaS model (Producer/Consumer)
So far I've explained the default way of doing things and the reason for that design choice; let's now look at a deployment better suited to a SaaS endeavor.
What you need in this case is the Producer/Consumer
model.
The Producer would be an interface for pushing scan configurations to a queue, and the Consumer(s) would be a system which maintains a pool of Instances, pops items off the queue, assigns them to Instances and generally manages them.
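The Producer side needs almost nothing Arachni-specific. As a sketch, here's one way to push a scan configuration onto a work queue, using Redis as an arbitrary concrete choice -- the queue name and payload shape are placeholders:

```ruby
require 'json'
require 'redis'  # any queue or DB works; Redis is just one concrete choice

redis = Redis.new( host: 'queue.example.com' )

# Serialize a scan configuration and push it onto the work queue.
job = {
  url:    'http://testfire.net',
  checks: '*',
  audit:  { forms: true, links: true }
}

redis.rpush( 'arachni:scan_queue', job.to_json )
```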
The way to do that in Arachni is to add custom RPCD Handlers to your Dispatchers.
For example, you can add an RPCD Handler that waits for items on a work queue (which can be some sort of DB, for example) and pops them off, monitors and manages running Instances, saves their reports to a DB once they're done scanning, and then shuts them down.
Since the RPCD Handler will run on the same machine as the Instances it monitors, there'll be no appreciable I/O overhead.
Moreover, you could have it limit the number of running scans: once the number of running scans drops below the limit, it pops the next job off the queue, and so on -- which means you won't have to worry about load balancing either.
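Here's a rough sketch of that consumer loop, written as plain Ruby rather than a drop-in handler, since the exact base class and helpers an RPCD Handler gets vary by version; `spawn_instance` and `save_report` are hypothetical helpers you'd implement yourself:

```ruby
require 'json'
require 'redis'

MAX_CONCURRENT = 5
redis   = Redis.new( host: 'queue.example.com' )
running = {}  # Instance client => job hash

loop do
  # Reap finished scans: save each report to your DB and shut the Instance down.
  running.reject! do |instance, job|
    next false if instance.service.busy?

    save_report( job, instance.service.report )  # hypothetical persistence helper
    instance.service.shutdown
    true
  end

  # Top the pool up while we're below the concurrency limit.
  while running.size < MAX_CONCURRENT && (raw = redis.lpop( 'arachni:scan_queue' ))
    job      = JSON.parse( raw, symbolize_names: true )
    instance = spawn_instance  # hypothetical: dispatch + connect, as in the first sketch
    instance.service.scan( job )
    running[instance] = job
  end

  sleep 5
end
```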
So you end up with each Dispatcher managing its own Instances, and the only thing you need to do is push jobs to a work queue and then read the reports from a DB.
And you can have your RPCD Handler provide an aggregate of what's going on (progress-wise) -- or split distinct functionality across a few RPCD Handlers, since it's better for components to have only a single responsibility from a software-engineering point of view.
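As a sketch, such an aggregate could be as simple as the method below; `service.progress` follows the Instance API assumed in the first sketch, and the shape of the returned data will depend on your Arachni version:

```ruby
# Hypothetical handler method: aggregate the progress of every running scan,
# given the `running` hash (Instance client => job) from the consumer sketch.
def aggregate_progress( running )
  {
    running: running.size,
    scans:   running.map { |instance, job| [job[:url], instance.service.progress] }.to_h
  }
end
```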
Unfortunately (or fortunately), beyond the support for RPCD Handlers, there isn't anything else you can leverage to configure Arachni this way.
How you structure your work queue or your DB, and how or when you run the scans, is your business.
Fortunately though, Arachni is flexible enough to accommodate whatever you think would be the best solution.
(OK, I'm not 100% sure about that last remark, to be honest, but if you find yourself restricted in some manner then drop us a line and we'll see what we can do.)