Scanning the web efficiently
Introduction
In the recent years, I’ve been taking part in more and more Capture The Flag (CTF) competitions. One of the most common tasks in these competitions is to find hidden flags on a web server. This usually involves scanning the web server for hidden files and directories. This is a very long and boring process, and I felt like the existing tools were not efficient enough.
I decided to write my own tool to scan the web efficiently, and I called it rwalk
.
I find it very important to have a good understanding of the inner workings of the tools we use, so this article will explain both the process of writing the tool and the features it offers.
Existing tools
Before starting to write my own tool, let’s take a look at the existing ones and see what they offer.
ffuf
ffuf
is a fast web fuzzer written in Go.
It is based on a FUZZ
keyword that is replaced by the words in a wordlist. It can thus be used to scan for parameters, headers, and directories.
ffuf -u http://ffuf.me/cd/basic/FUZZ -w common.txt
Dirsearch
dirsearch
is a simple command line tool designed to brute force directories and files in web servers.
python3 dirsearch.py -u http://ffuf.me/cd/basic/ -w common.txt
Writing the tool
I chose Rust as the programming language for this tool because it’s my favorite at the moment and has a great ecosystem for writing command line tools,
with libraries like clap
and reqwest
.
The first functionality I needed was to scan a target recursively. This was achieved by using a simple tree structure.
impl<T> Tree<T> {
/// Insert a new data into the tree, at the root if no parent provided.
pub fn insert(
&mut self,
data: T,
parent: Option<Arc<Mutex<TreeNode<T>>>>,
) -> Arc<Mutex<TreeNode<T>>> {
let new_node = Arc::new(Mutex::new(TreeNode {
data,
children: Vec::new(),
}));
match parent {
Some(parent) => {
parent.lock().children.push(new_node.clone());
}
None => {
self.root = Some(new_node.clone());
}
}
new_node
}
/// Recursively get all nodes at a given depth
fn get_nodes_at_depth(
node: &Option<Arc<Mutex<TreeNode<T>>>>,
depth: usize,
nodes: &mut Vec<Arc<Mutex<TreeNode<T>>>>,
) {
if let Some(node) = node {
if depth == 0 {
nodes.push(node.clone());
} else {
for child in &node.lock().children {
Self::get_nodes_at_depth_recursive(&Some(child.clone()), depth - 1, nodes);
}
}
}
}
}
while depth < max_depth {
let previous_nodes = tree.get_nodes_at_depth(depth);
// Iterate over the previous nodes
for previous_node in previous_nodes {
// Iterate over the wordlist
for word in wordlist {
let response = reqwest::blocking::get(&format!("{}/{}", node, word););
if response.is_ok() {
let status = response.unwrap().status();
if status.is_success() {
// Add the new node to the tree, with the previous node as its parent
tree.insert(word, previous_node);
}
}
}
}
// Go to the next depth (/a/b -> /a/b/c)
depth += 1;
}
The algorithm is quite simple:
- We retrieve the nodes at the current depth. (The first time, it’s the root node
/
) - We iterate over the wordlist and make a request to the server for each word.
- If the request is successful, we add the new node to the tree, with the previous node as its parent.
- We go to the next depth.
This single-threaded approach was already quite efficient, but I wanted to make it even faster with multi-threading.
Such an approach requires a bit of refactoring, but Rust makes it easy to do so with the tokio
library.
Chunks of the wordlist are distributed to different threads, and the results are collected at the end.
let chunks = wordlist.chunks(wordlist.len() / num_threads);
while depth < max_depth {
let previous_nodes = tree.get_nodes_at_depth(depth);
let mut rxs = vec![];
// Iterate over the previous nodes
for previous_node in previous_nodes {
// Iterate over the wordlist
for chunk in chunks {
let (tx, rx) = tokio::sync::mpsc::channel(1);
tokio::spawn(async move {
for word in chunk {
let response = // ...
if response.is_ok() {
let status = response.unwrap().status();
if status.is_success() {
tree.insert(word, previous_node);
}
}
}
// Send the signal to the main thread
tx.send(()).await.unwrap();
});
rxs.push(rx);
}
}
// Wait for all the threads to finish
for mut rx in rxs {
rx.recv().await;
}
// Go to the next depth (/a/b -> /a/b/c)
depth += 1;
}
Notice that the tx
+ rx
pattern is used to signal the main thread when a thread has finished its work.
We do not need to send any result back as the tree is already wrapped in an Arc<Mutex<T>>
and can be accessed from any thread.
Adding more features
Filtering
One of the most important features of a web scanner is to be able to filter the responses.
Basic filters include:
- Status code
- Response body
- Response size
- Response body hash
- Response headers
- Response time
I added those to rwalk
and made them configurable via the command line.
let mut outs = vec![];
for filter in filters {
let negated = filter.0.starts_with('!');
let out = match filter.0.trim_start_matches('!') {
"time" => check_range(&parse_range(&filter.1), time) ^ negated,
"status" => check_range(&parse_range(&filter.1), status_code) ^ negated,
"contains" => !res_text.contains(&filter.1) ^ negated,
"starts" => !res_text.starts_with(&filter.1) ^ negated,
"ends" => !res_text.ends_with(&filter.1) ^ negated,
"regex" => regex::Regex::new(&filter.1).is_match(res_text) ^ negated,
"size" => check_range(&parse_range(&filter.1), res_text.len()) ^ negated
_ => true,
}
outs.push(out);
}
if opts.or {
outs.iter().any(|&x| x)
} else {
outs.iter().all(|&x| x)
}
The check_range()
and parse_range()
functions are used to parse and check a specific type of filter, which is a range.
Format | Description |
---|---|
5 | Exactly 5 |
5-10 | Between 5 and 10 (inclusive) |
5,10 | Exactly 5 or 10 |
>5 | Greater than 5 |
<5 | Less than 5 |
5,10,15 | Exactly 5 , 10 , or 15 |
>5,10,15 | Greater than 5 , or exactly 10 or 15 |
5-10,15-20 | Between 5 and 10 or between 15 and 20 (inclusive) |
This allows for a lot of flexibility when filtering the responses, as most of them require a numeric comparison (more, less, equal, etc.).
For example, let’s say we want to filter the responses that only took less than 100
milliseconds:
--filter "time:<100"
We also needed a way to negate the filters, which is why the !
character is used to do so with a XOR operation on the base result:
Filter Output | Negated | Result |
---|---|---|
true | false | true |
false | false | false |
true | true | false |
false | true | true |
The default behavior is to and
all the filters, but the or
behavior can be enabled with the --or
flag.
Most of the modern API responses are JSON, and we might want to filter them based on their content.
We need to be able to easily access the JSON fields and compare them to a value.
The following format is used to filter a JSON response:
--filter "json:obj.key=value1|value2"
To filter only the responses that contain deadbeef
as the first item of the data
array:
--filter "json:data.0=deadbeef"
{
"data": ["deadbeef", "cafebabe"],
"other": "stuff",
"answer": 42
}
Example
Let’s test out these filters with ffuf.me/cd/no404 !
Note: See the install section to install the tool and try it out yourself.
We start out with a basic scan as follows:
rwalk http://ffuf.me/cd/no404 common.txt
And we get a bunch of 200s:
✓ 200 http://ffuf.me/cd/no404/wp-dbmanager 142ms
✓ 200 http://ffuf.me/cd/no404/header_logo 142ms
✓ 200 http://ffuf.me/cd/no404/signon 142ms
✓ 200 http://ffuf.me/cd/no404/private2 142ms
✓ 200 http://ffuf.me/cd/no404/mantis 152ms
✓ 200 http://ffuf.me/cd/no404/authorization 152ms
✓ 200 http://ffuf.me/cd/no404/technical 130ms
✓ 200 http://ffuf.me/cd/no404/popup_content 130ms
✓ 200 http://ffuf.me/cd/no404/columns 130ms
✓ 200 http://ffuf.me/cd/no404/pre 132ms
✓ 200 http://ffuf.me/cd/no404/103 131ms
✓ 200 http://ffuf.me/cd/no404/inline 132ms
✓ 200 http://ffuf.me/cd/no404/read 142ms
✓ 200 http://ffuf.me/cd/no404/ss_vms_admin_sm 142ms
...
Let’s make use of the --show
flag, that allows us to display additional information on each request.
rwalk http://ffuf.me/cd/no404 common.txt --show hash
This yields the body hash of each response:
✓ 200 http://ffuf.me/cd/no404/brochures 138ms | hash: 1c3af61e88de6618c0fabbe48edf5ed9
✓ 200 http://ffuf.me/cd/no404/append 138ms | hash: 1c3af61e88de6618c0fabbe48edf5ed9
✓ 200 http://ffuf.me/cd/no404/reqs 141ms | hash: 1c3af61e88de6618c0fabbe48edf5ed9
✓ 200 http://ffuf.me/cd/no404/json 141ms | hash: 1c3af61e88de6618c0fabbe48edf5ed9
✓ 200 http://ffuf.me/cd/no404/done 146ms | hash: 1c3af61e88de6618c0fabbe48edf5ed9
✓ 200 http://ffuf.me/cd/no404/command 144ms | hash: 1c3af61e88de6618c0fabbe48edf5ed9
✓ 200 http://ffuf.me/cd/no404/path 144ms | hash: 1c3af61e88de6618c0fabbe48edf5ed9
✓ 200 http://ffuf.me/cd/no404/pro 145ms | hash: 1c3af61e88de6618c0fabbe48edf5ed9
✓ 200 http://ffuf.me/cd/no404/webadm 147ms | hash: 1c3af61e88de6618c0fabbe48edf5ed9
✓ 200 http://ffuf.me/cd/no404/th 146ms | hash: 1c3af61e88de6618c0fabbe48edf5ed9
✓ 200 http://ffuf.me/cd/no404/shopstat 146ms | hash: 1c3af61e88de6618c0fabbe48edf5ed9
We can now try our luck and search for a response that does not match 1c3af61e88de6618c0fabbe48edf5ed9
:
rwalk http://ffuf.me/cd/no404 common.txt --filter "!hash:1c3af61e88de6618c0fabbe48edf5ed9"
And voilà !
✓ 200 http://ffuf.me/cd/no404/secret 114ms
If you are curious, you can also see what was the content of the response with the --show body
option.
rwalk http://ffuf.me/cd/no404 common.txt --show body --filter "!hash:1c3af61e88de6618c0fabbe48edf5ed9"
✓ 200 http://ffuf.me/cd/no404/secret 130ms | body: Controller does not exist
Output
You might want to save the results of your scan to a file, to be able to analyze them with another tool or to share them with someone else.
The available formats are:
- JSON
- CSV
- Markdown
- Plain text
But as I was using a tree structure to search for the paths, I thought it would be a good idea to print a nice tree a the end of each scan.
The library ptree
was perfect for this.
✓ Done in 7 seconds with an average of 1990 req/s
✓ 200 /cd/recursion
└─ ✖ 403 /admin
└─ ✖ 403 /users
└─ ✓ 200 /96
Permutations
In its current state, rwalk
is only able to scan a target at the end of a path.
But what if we needed to fuzz a path like /admin/.../login/.../
?
This is why I added a new scanning mode called classic
, which allowed for both regular and permutation scanning.
The separation between these modes is due to the fact that they rely on different algorithms.
The permutation scanning mode is based on the itertools
crate’s .permutations()
method.
if opts.permutations {
let token_count = self
.url
.matches(opts.fuzz_key)
.count();
let combinations: Vec<_> = words.iter().permutations(token_count).collect();
combinations
.iter()
.map(|c| {
let mut url = url.clone();
for word in c {
url = url.replacen(opts.fuzz_key, word, 1);
}
url
})
.collect()
} else {
words
.iter()
.map(|w| {
url.replace(opts.fuzz_key, w)
})
.collect()
}
We first compute the number of tokens in the URL, then generate all the n
-permutations of the wordlist and replace the tokens with the permutations.
Wordlists
When you need to quickly scan a target, you might not want to use a huge wordlist.
rwalk
comes with a bunch of filters and transformations to help you generate a wordlist on the fly.
Filters
Let’s say you need to scan a target for PHP files, you can use the --filter
flag to only keep the words that end with .php
.
--wordlist-filter ends=.php
or you might want to keep only 4-letter words:
--wordlist-filter size=4
Transformations
You might want to transform the words in the wordlist to lowercase, uppercase, or capitalize them.
--transform lowercase
or perhaps you want to replace each instance of a
with @
:
--transform replace:a=@
These two features combined allow for a lot of flexibility when generating a granular wordlist.
You can try this feature with ffuf.me/cd/ext/logs and the following command:
rwalk http://ffuf.me/cd/ext/logs common.txt --transform suffix:.log
This will generate a wordlist with the .log
suffix and scan the target for files with this extension.
_ _
| | |
_ ____ ____ _| | | __
| '__\ \ /\ / / _` | | |/ /
| | \ V V / (_| | | <
|_| \_/\_/ \__,_|_|_|\_\
Version: 0.4.11
Author: cstef
ℹ 4613 words loaded
ℹ Starting crawler with 110 threads
ℹ Press Ctrl+C to save state and exit
✓ 200 http://ffuf.me/cd/ext/logs/users.log 118ms
✓ Done in 5 seconds with an average of 1916 req/s
✖ 403 /cd/ext/logs
└─ ✓ 200 /users.log
Throttling
When scanning a target, you sometimes need to throttle the requests to avoid being blocked by the server.
Let’s take a concrete example with ffuf.me/cd/rate.
We start with a basic scan:
rwalk http://ffuf.me/cd/rate common.txt
This yields us an empty scan result:
_ _
| | |
_ ____ ____ _| | | __
| '__\ \ /\ / / _` | | |/ /
| | \ V V / (_| | | <
|_| \_/\_/ \__,_|_|_|\_\
Version: 0.4.11
Author: cstef
ℹ 4613 words loaded
ℹ Starting crawler with 110 threads
ℹ Press Ctrl+C to save state and exit
✓ Done in 5 seconds with an average of 2008 req/s
✓ 200 /cd/rate
This means that nothing has been matched, let’s take a look at the response codes by overriding the default filter:
rwalk http://ffuf.me/cd/rate common.txt --filter status:"100-599"
✖ 429 http://ffuf.me/cd/rate/wpau-backup 122ms
✖ 404 http://ffuf.me/cd/rate/video 124ms
✖ 429 http://ffuf.me/cd/rate/transaction 138ms
✖ 429 http://ffuf.me/cd/rate/furl 134ms
✖ 404 http://ffuf.me/cd/rate/calling 140ms
✖ 429 http://ffuf.me/cd/rate/laptops 136ms
✖ 429 http://ffuf.me/cd/rate/resellers 136ms
✖ 429 http://ffuf.me/cd/rate/juniper 136ms
✖ 429 http://ffuf.me/cd/rate/award 136ms
✖ 404 http://ffuf.me/cd/rate/pnadodb 141ms
We can see that we are being rate-limited by the server, so we need to throttle the requests. The number of requests per second is equal to the throttle times the number of threads.
For example, if we have 110 threads and a throttle of 10, we will have 1100 requests per second. Let’s go with 10 threads and a 5 second throttle (50 requests per second).
rwalk http://ffuf.me/cd/rate common.txt --throttle 5 --threads 10
After a few minutes, we get the following:
_ _
| | |
_ ____ ____ _| | | __
| '__\ \ /\ / / _` | | |/ /
| | \ V V / (_| | | <
|_| \_/\_/ \__,_|_|_|\_\
Version: 0.4.11
Author: cstef
ℹ 4613 words loaded
ℹ Starting crawler with 10 threads
ℹ Press Ctrl+C to save state and exit
✓ 200 http://ffuf.me/cd/rate/oracle 202ms
✓ Done in 2 minutes with an average of 49 req/s
✓ 200 /cd/rate
└─ ✓ 200 /oracle
Final result
We now have a tool that is able to scan a target efficiently, approaching 2000 requests per second on a regular WiFi connection.
Test it out yourself
You can find the source code on GitHub.
The tool can be installed with brew
or cargo
:
brew install cestef/tap/rwalk
or
cargo install rwalk
I like to use ffuf.me to test out the tool, as it’s a great target for web scanning.
A good wordlist to start with is common.txt.
rwalk http://ffuf.me/cd/recursion common.txt --depth 3
rwalk http://ffuf.me/cd/basic/$ common.txt --mode classic
I hope you find this tool as useful as I do
Feel free to open an issue or join the Discord if you have any questions or suggestions !