Scanning the web efficiently

cstef,Fri Feb 23 2024•11 min read

Introduction

In the recent years, I’ve been taking part in more and more Capture The Flag (CTF) competitions. One of the most common tasks in these competitions is to find hidden flags on a web server. This usually involves scanning the web server for hidden files and directories. This is a very long and boring process, and I felt like the existing tools were not efficient enough.

I decided to write my own tool to scan the web efficiently, and I called it rwalk.

I find it very important to have a good understanding of the inner workings of the tools we use, so this article will explain both the process of writing the tool and the features it offers.

Existing tools

Before starting to write my own tool, let’s take a look at the existing ones and see what they offer.

ffuf

ffuf is a fast web fuzzer written in Go. It is based on a FUZZ keyword that is replaced by the words in a wordlist. It can thus be used to scan for parameters, headers, and directories.

ffuf -u http://ffuf.me/cd/basic/FUZZ -w common.txt

Dirsearch

dirsearch is a simple command line tool designed to brute force directories and files in web servers.

python3 dirsearch.py -u http://ffuf.me/cd/basic/ -w common.txt

Writing the tool

I chose Rust as the programming language for this tool because it’s my favorite at the moment and has a great ecosystem for writing command line tools, with libraries like clap and reqwest.

The first functionality I needed was to scan a target recursively. This was achieved by using a simple tree structure.

tree.rs

impl<T> Tree<T> {
    /// Insert a new data into the tree, at the root if no parent provided.
    pub fn insert(
        &mut self,
        data: T,
        parent: Option<Arc<Mutex<TreeNode<T>>>>,
    ) -> Arc<Mutex<TreeNode<T>>> {
        let new_node = Arc::new(Mutex::new(TreeNode {
            data,
            children: Vec::new(),
        }));
 
        match parent {
            Some(parent) => {
                parent.lock().children.push(new_node.clone());
            }
            None => {
                self.root = Some(new_node.clone());
            }
        }
 
        new_node
    }
 
    /// Recursively get all nodes at a given depth
    fn get_nodes_at_depth(
        node: &Option<Arc<Mutex<TreeNode<T>>>>,
        depth: usize,
        nodes: &mut Vec<Arc<Mutex<TreeNode<T>>>>,
    ) {
        if let Some(node) = node {
            if depth == 0 {
                nodes.push(node.clone());
            } else {
                for child in &node.lock().children {
                    Self::get_nodes_at_depth_recursive(&Some(child.clone()), depth - 1, nodes);
                }
            }
        }
    }
}

single_threaded.rs

while depth < max_depth {
    let previous_nodes = tree.get_nodes_at_depth(depth);
    // Iterate over the previous nodes
    for previous_node in previous_nodes {
        // Iterate over the wordlist
        for word in wordlist {
            let response = reqwest::blocking::get(&format!("{}/{}", node, word););
            if response.is_ok() {
                let status = response.unwrap().status();
                if status.is_success() {
                    // Add the new node to the tree, with the previous node as its parent
                    tree.insert(word, previous_node);
                }
            }
        }
    }
    // Go to the next depth (/a/b -> /a/b/c)
    depth += 1;
}

Pseudocode for the search algorithm

The algorithm is quite simple:

We retrieve the nodes at the current depth. (The first time, it’s the root node /)
We iterate over the wordlist and make a request to the server for each word.
If the request is successful, we add the new node to the tree, with the previous node as its parent.
We go to the next depth.

This single-threaded approach was already quite efficient, but I wanted to make it even faster with multi-threading. Such an approach requires a bit of refactoring, but Rust makes it easy to do so with the tokio library.

Chunks of the wordlist are distributed to different threads, and the results are collected at the end.

multi_threaded.rs

let chunks = wordlist.chunks(wordlist.len() / num_threads);
 
while depth < max_depth {
    let previous_nodes = tree.get_nodes_at_depth(depth);
    let mut rxs = vec![];
    // Iterate over the previous nodes
    for previous_node in previous_nodes {
        // Iterate over the wordlist
        for chunk in chunks {
            let (tx, rx) = tokio::sync::mpsc::channel(1);
 
            tokio::spawn(async move {
                for word in chunk {
                    let response = // ...
                    if response.is_ok() {
                        let status = response.unwrap().status();
                        if status.is_success() {
                            tree.insert(word, previous_node);
                        }
                    }
                }
                // Send the signal to the main thread
                tx.send(()).await.unwrap();
            });
 
            rxs.push(rx);
        }
    }
 
    // Wait for all the threads to finish
    for mut rx in rxs {
        rx.recv().await;
    }
    // Go to the next depth (/a/b -> /a/b/c)
    depth += 1;
}

Pseudocode for the multi-threaded version of the previous algorithm

Notice that the tx + rx pattern is used to signal the main thread when a thread has finished its work. We do not need to send any result back as the tree is already wrapped in an Arc<Mutex<T>> and can be accessed from any thread.

Adding more features

Filtering

One of the most important features of a web scanner is to be able to filter the responses.

Basic filters include:

Status code
Response body
Response size
Response body hash
Response headers
Response time

I added those to rwalk and made them configurable via the command line.

filters.rs

let mut outs = vec![];
for filter in filters {
    let negated = filter.0.starts_with('!');
    let out = match filter.0.trim_start_matches('!') {
        "time" => check_range(&parse_range(&filter.1), time) ^ negated,
        "status" => check_range(&parse_range(&filter.1), status_code) ^ negated,
        "contains" => !res_text.contains(&filter.1) ^ negated,
        "starts" => !res_text.starts_with(&filter.1) ^ negated,
        "ends" => !res_text.ends_with(&filter.1) ^ negated,
        "regex" => regex::Regex::new(&filter.1).is_match(res_text) ^ negated,
        "size" => check_range(&parse_range(&filter.1), res_text.len()) ^ negated
        _ => true,
    }
    outs.push(out);
}
if opts.or {
    outs.iter().any(|&x| x)
} else {
    outs.iter().all(|&x| x)
}

Pseudocode for the filtering of the responses

The check_range() and parse_range() functions are used to parse and check a specific type of filter, which is a range.

Format	Description
`5`	Exactly `5`
`5-10`	Between `5` and `10` (inclusive)
`5,10`	Exactly `5` or `10`
`>5`	Greater than `5`
`<5`	Less than `5`
`5,10,15`	Exactly `5`, `10`, or `15`
`>5,10,15`	Greater than `5`, or exactly `10` or `15`
`5-10,15-20`	Between `5` and `10` or between `15` and `20` (inclusive)

Range format

This allows for a lot of flexibility when filtering the responses, as most of them require a numeric comparison (more, less, equal, etc.).

For example, let’s say we want to filter the responses that only took less than 100 milliseconds:

--filter "time:<100"

We also needed a way to negate the filters, which is why the ! character is used to do so with a XOR operation on the base result:

Filter Output	Negated	Result
`true`	`false`	`true`
`false`	`false`	`false`
`true`	`true`	`false`
`false`	`true`	`true`

Truth table for the negation of the filters (XOR)

The default behavior is to and all the filters, but the or behavior can be enabled with the --or flag. Most of the modern API responses are JSON, and we might want to filter them based on their content. We need to be able to easily access the JSON fields and compare them to a value.

The following format is used to filter a JSON response:

--filter "json:obj.key=value1|value2"

To filter only the responses that contain deadbeef as the first item of the data array:

--filter "json:data.0=deadbeef"

{
	"data": ["deadbeef", "cafebabe"],
	"other": "stuff",
	"answer": 42
}

Example of a JSON response

Example

Let’s test out these filters with ffuf.me/cd/no404 !

Note: See the install section to install the tool and try it out yourself.

We start out with a basic scan as follows:

rwalk http://ffuf.me/cd/no404 common.txt

And we get a bunch of 200s:

✓ 200 http://ffuf.me/cd/no404/wp-dbmanager 142ms
✓ 200 http://ffuf.me/cd/no404/header_logo 142ms
✓ 200 http://ffuf.me/cd/no404/signon 142ms
✓ 200 http://ffuf.me/cd/no404/private2 142ms
✓ 200 http://ffuf.me/cd/no404/mantis 152ms
✓ 200 http://ffuf.me/cd/no404/authorization 152ms
✓ 200 http://ffuf.me/cd/no404/technical 130ms
✓ 200 http://ffuf.me/cd/no404/popup_content 130ms
✓ 200 http://ffuf.me/cd/no404/columns 130ms
✓ 200 http://ffuf.me/cd/no404/pre 132ms
✓ 200 http://ffuf.me/cd/no404/103 131ms
✓ 200 http://ffuf.me/cd/no404/inline 132ms
✓ 200 http://ffuf.me/cd/no404/read 142ms
✓ 200 http://ffuf.me/cd/no404/ss_vms_admin_sm 142ms
...

Let’s make use of the --show flag, that allows us to display additional information on each request.

rwalk http://ffuf.me/cd/no404 common.txt --show hash

This yields the body hash of each response:

✓ 200 http://ffuf.me/cd/no404/brochures 138ms | hash: 1c3af61e88de6618c0fabbe48edf5ed9
✓ 200 http://ffuf.me/cd/no404/append 138ms | hash: 1c3af61e88de6618c0fabbe48edf5ed9
✓ 200 http://ffuf.me/cd/no404/reqs 141ms | hash: 1c3af61e88de6618c0fabbe48edf5ed9
✓ 200 http://ffuf.me/cd/no404/json 141ms | hash: 1c3af61e88de6618c0fabbe48edf5ed9
✓ 200 http://ffuf.me/cd/no404/done 146ms | hash: 1c3af61e88de6618c0fabbe48edf5ed9
✓ 200 http://ffuf.me/cd/no404/command 144ms | hash: 1c3af61e88de6618c0fabbe48edf5ed9
✓ 200 http://ffuf.me/cd/no404/path 144ms | hash: 1c3af61e88de6618c0fabbe48edf5ed9
✓ 200 http://ffuf.me/cd/no404/pro 145ms | hash: 1c3af61e88de6618c0fabbe48edf5ed9
✓ 200 http://ffuf.me/cd/no404/webadm 147ms | hash: 1c3af61e88de6618c0fabbe48edf5ed9
✓ 200 http://ffuf.me/cd/no404/th 146ms | hash: 1c3af61e88de6618c0fabbe48edf5ed9
✓ 200 http://ffuf.me/cd/no404/shopstat 146ms | hash: 1c3af61e88de6618c0fabbe48edf5ed9

We can now try our luck and search for a response that does not match 1c3af61e88de6618c0fabbe48edf5ed9:

rwalk http://ffuf.me/cd/no404 common.txt --filter "!hash:1c3af61e88de6618c0fabbe48edf5ed9"

And voilà !

✓ 200 http://ffuf.me/cd/no404/secret 114ms

If you are curious, you can also see what was the content of the response with the --show body option.

rwalk http://ffuf.me/cd/no404 common.txt --show body --filter "!hash:1c3af61e88de6618c0fabbe48edf5ed9"

✓ 200 http://ffuf.me/cd/no404/secret 130ms | body: Controller does not exist

Output

You might want to save the results of your scan to a file, to be able to analyze them with another tool or to share them with someone else.

The available formats are:

JSON
CSV
Markdown
Plain text

But as I was using a tree structure to search for the paths, I thought it would be a good idea to print a nice tree a the end of each scan. The library ptree was perfect for this.

✓ Done in 7 seconds with an average of 1990 req/s
✓ 200 /cd/recursion
└─ ✖ 403 /admin
   └─ ✖ 403 /users
      └─ ✓ 200 /96

Example of the tree output

Permutations

In its current state, rwalk is only able to scan a target at the end of a path. But what if we needed to fuzz a path like /admin/.../login/.../ ?

This is why I added a new scanning mode called classic, which allowed for both regular and permutation scanning. The separation between these modes is due to the fact that they rely on different algorithms.

The permutation scanning mode is based on the itertools crate’s .permutations() method.

generate_urls.rs

if opts.permutations {
    let token_count = self
        .url
        .matches(opts.fuzz_key)
        .count();
    let combinations: Vec<_> = words.iter().permutations(token_count).collect();
 
    combinations
        .iter()
        .map(|c| {
            let mut url = url.clone();
            for word in c {
                url = url.replacen(opts.fuzz_key, word, 1);
            }
            url
        })
        .collect()
} else {
    words
        .iter()
        .map(|w| {
            url.replace(opts.fuzz_key, w)
        })
        .collect()
}

URL generation pseudocode

We first compute the number of tokens in the URL, then generate all the n-permutations of the wordlist and replace the tokens with the permutations.

Wordlists

When you need to quickly scan a target, you might not want to use a huge wordlist.

rwalk comes with a bunch of filters and transformations to help you generate a wordlist on the fly.

Filters

Let’s say you need to scan a target for PHP files, you can use the --filter flag to only keep the words that end with .php.

--wordlist-filter ends=.php

or you might want to keep only 4-letter words:

--wordlist-filter size=4

Transformations

You might want to transform the words in the wordlist to lowercase, uppercase, or capitalize them.

--transform lowercase

or perhaps you want to replace each instance of a with @:

--transform replace:a=@

These two features combined allow for a lot of flexibility when generating a granular wordlist.

You can try this feature with ffuf.me/cd/ext/logs and the following command:

rwalk http://ffuf.me/cd/ext/logs common.txt --transform suffix:.log

This will generate a wordlist with the .log suffix and scan the target for files with this extension.

                    _ _
                   | | |
 _ ____      ____ _| | | __
| '__\ \ /\ / / _` | | |/ /
| |   \ V  V / (_| | |   <
|_|    \_/\_/ \__,_|_|_|\_\
 
Version: 0.4.11
Author: cstef
 
ℹ 4613 words loaded
ℹ Starting crawler with 110 threads
ℹ Press Ctrl+C to save state and exit
✓ 200 http://ffuf.me/cd/ext/logs/users.log 118ms
✓ Done in 5 seconds with an average of 1916 req/s
✖ 403 /cd/ext/logs
└─ ✓ 200 /users.log

Throttling

When scanning a target, you sometimes need to throttle the requests to avoid being blocked by the server.

Let’s take a concrete example with ffuf.me/cd/rate.

We start with a basic scan:

rwalk http://ffuf.me/cd/rate common.txt

This yields us an empty scan result:

                    _ _
                   | | |
 _ ____      ____ _| | | __
| '__\ \ /\ / / _` | | |/ /
| |   \ V  V / (_| | |   <
|_|    \_/\_/ \__,_|_|_|\_\
 
Version: 0.4.11
Author: cstef
 
ℹ 4613 words loaded
ℹ Starting crawler with 110 threads
ℹ Press Ctrl+C to save state and exit
✓ Done in 5 seconds with an average of 2008 req/s
✓ 200 /cd/rate

This means that nothing has been matched, let’s take a look at the response codes by overriding the default filter:

rwalk http://ffuf.me/cd/rate common.txt --filter status:"100-599"

✖ 429 http://ffuf.me/cd/rate/wpau-backup 122ms
✖ 404 http://ffuf.me/cd/rate/video 124ms
✖ 429 http://ffuf.me/cd/rate/transaction 138ms
✖ 429 http://ffuf.me/cd/rate/furl 134ms
✖ 404 http://ffuf.me/cd/rate/calling 140ms
✖ 429 http://ffuf.me/cd/rate/laptops 136ms
✖ 429 http://ffuf.me/cd/rate/resellers 136ms
✖ 429 http://ffuf.me/cd/rate/juniper 136ms
✖ 429 http://ffuf.me/cd/rate/award 136ms
✖ 404 http://ffuf.me/cd/rate/pnadodb 141ms

We can see that we are being rate-limited by the server, so we need to throttle the requests. The number of requests per second is equal to the throttle times the number of threads.

For example, if we have 110 threads and a throttle of 10, we will have 1100 requests per second. Let’s go with 10 threads and a 5 second throttle (50 requests per second).

rwalk http://ffuf.me/cd/rate common.txt --throttle 5 --threads 10

After a few minutes, we get the following:

                    _ _
                   | | |
 _ ____      ____ _| | | __
| '__\ \ /\ / / _` | | |/ /
| |   \ V  V / (_| | |   <
|_|    \_/\_/ \__,_|_|_|\_\
 
Version: 0.4.11
Author: cstef
 
ℹ 4613 words loaded
ℹ Starting crawler with 10 threads
ℹ Press Ctrl+C to save state and exit
✓ 200 http://ffuf.me/cd/rate/oracle 202ms
✓ Done in 2 minutes with an average of 49 req/s
✓ 200 /cd/rate
└─ ✓ 200 /oracle

Final result

We now have a tool that is able to scan a target efficiently, approaching 2000 requests per second on a regular WiFi connection.

Test it out yourself

You can find the source code on GitHub.

The tool can be installed with brew or cargo:

brew install cestef/tap/rwalk

cargo install rwalk

I like to use ffuf.me to test out the tool, as it’s a great target for web scanning.

A good wordlist to start with is common.txt.

rwalk http://ffuf.me/cd/recursion common.txt --depth 3

Recursive scan of ffuf.me

rwalk http://ffuf.me/cd/basic/$ common.txt --mode classic

Classic scan of ffuf.me

I hope you find this tool as useful as I do

Feel free to open an issue or join the Discord if you have any questions or suggestions !