The genesis and architecture of my CyberGordon project

CyberGordon, an aggregator of your cyber reputation checks

7 min readJun 8, 2020

This post was originaly published on my blog.

CyberGordon quickly provides you threat and risk information about observables such as IP addresses or domain names by querying multiple threat intelligence sources.

Thanks to each source that provides free access to great Threat Intelligence against phishing and malware. Without them, Gordon would have not been there.

Why CyberGordon ?

Whether it be during my investigations at work or personal surfing sessions, I’m too lazy to use several sources to check if a domain or email address is suspicious or malicious. Some awesome OSINT tools exist, but I didn’t have one to aggregates them all into one simple web interface. On top of that, I wanted to start by building a usable and useful tool on AWS infrastructure that I could share with my entourage. I would have liked to share CyberGordon widely, but I’m constrained by the query limits that free API sources provide. Lastly, the lock down during the COVID-19 crisis gave me a lot of time, a rare resource that considerably contributed in completing CyberGordon.

Well why “Gordon” ?
As a Batman fan, I chose the commissioner James Gordon, a friend and reliable informant of the Dark Knight :-)
On April 2021, Gordon became CyberGordon to better reflect its function.

Objectives

When I built CyberGordon, I tried to follow several rules:

Simple: get results after pasting observable(s)

Neither I or my entourage would use a tool that has an extremely complex GUI or that requires sophisticated information for submission. The aim is to copy/paste one or more observables even if they are in a messy format (listed, quoted or in CSV format) submit on a unique form and get a readable summary. I’m still working on improving this last point.

Scalable: add easily new sources

Scalability often referred to the ability to manage automatically the capacity depending on the user’s demand. My intention was humble: I wanted an evolving system where I can add, or update, easily a source (called engine) without impacting the existing ones and without adding delay during processing of the user’s observable submission.

Almost free: use adapted and cost-effective services

For my first tool on AWS and as a non-profit service, the cost was of course the most important criteria. All major cloud service providers offer free tiers use for few of their main services for a duration of one year (sometimes less) or for life. I spent some time looking for simple and almost free AWS services before building a draft.

Serverless: minimal maintenance

It started as a challenge: being a relatively simple tool, I tried to avoid managing a Linux server, even though if I loved managing Debian servers previously… I wanted to test functions (containers of code), where you only manage the code, while the underlying layers (runtime, OS, hardware) are managed by AWS.
Of course, code maintenance has to be done at least for each runtime update (Python 3.8 to Python 3.9 for example).

Secure: apply best practices

Last but not least, even tough the manipulated data is not confidential, I have applied some principles: all data — in motion and at rest — is encrypted with AWS managed key (free), permission’s resources are restricted to the minimal needs (least privileges), public exposure is limited and management actions (API) and users’ HTTP request logs are stored for a duration of 6 months.

Attempt with Slack

Before hosting CyberGordon entirely on AWS, I tried to build a front-end on Slack as a ‘bot’ using Slack Commands feature and processing them on AWS. It works like a charm with one engine, but with two or more it is a mess and unusable. Slack is not suitable for presenting multiple results ; it is a good chat tool for “one question, one short response” capability, but not as a reporting tool…

Slack request is sent to an HTTPS endpoint (hosted on AWS API Gateway) and forwarded to the back-end. Results are then returned using the URL incoming webhook included in the Slack request. As you can see below, results are quickly unreadable when using 2 engines…

A teammate suggested I generate reports on a webpage and send links to Slack user. However, after a lot thinking, this hybrid solution didn’t suit me because it limits the user’s scope only to my Slack workspace.

Current architecture and how it works

To get all AWS capabilities and cheapest prices, all resources are hosted in “US East (Northern Virginia)” region, except for 2 resources invoked near the user location.

Short summary
On website (1) you paste an observable and submits it ; request is parsed (2), sent to a queue (3), dispatched to engines (4) that queries API sources. During this background work, you’re forwarded to the results page (5).

Components are represented on this diagram and described in detail later on.

1/ Static website

All front-web assets are stored on a single S3 bucket (object storage). To provide cache, reduced latency and encrypted traffic (HTTPS TLS 1.3), a CloudFront distribution (CDN) is used ; the S3 Bucket policy only allows traffic from CloudFront.

The domain cybergordon.com points to a CloudFront distribution and the DNS zone “cybergordon.com” is entirely managed by AWS Route 53 (DNS service).

2/ Request pipeline

By clicking on “Analyze!”, an HTTP POST request with observable(s), passes through the Gordon-Request Lambda@Edge function: a Python code deployed on multiple geographical points to be executed closer to the user.

The Gordon-Request function generates a request ID (UUID version 4) and parses the observable(s) into 7 predefined types list: IPv4, FQDN, URL, MD5, SHA-1, SHA-256 and Email address. Basically, the function compares the request body with 7 flexible regex that accept new line (\n) or space between each observable.

Then the user is forwarded (HTTP 302 Found) to the results page that is described below.

3/ Queue pipeline

Small but powerful part to dispatch observables to engines depending of the types they can check against sources. The Gordon-Queue SNS topic receives message and send immediately a copy to each subscriber that accepts submitted observables type(s).

4/ Engine pipeline

Each Gordon-Engines Lambda functions receives simultaneously the observable(s) list. The engine controls the integrity of the request (using the SHA-1 function), then gets, if applicable, the API token of the remote source in encrypted variables and queries it then in HTTPS. Finally results are stored in a DynamoDB Table (no-SQL database). Results from all engines are stored in a unique database record. In previous implementation, individual results were stored on S3 objects, with a fourfold increase in lead times to retrieve them!

All engines query remote sources to get live and fresh information, except for the Offline Feeds engine (E23): an hourly CloudWatch Event Rule (scheduled task) invokes a Lambda function (Python code) that downloads, transforms in a JSON format and overwrites the existing feed content stored in the main S3 bucket.

5/ Result pipeline

Returning to the Request stage explained earlier (2/ Request pipeline), the user is forwarded to the result page. When loaded, a JavaScript gets the request ID in URL Query Parameter (see example below) and then makes a call to the result endpoint (HTTP GET /get-result).
This HTTP call is caught by the Gordon-Results Lambda@Edge function (like the Request function). This function reads the database record that contains all results and return it as a JSON Document.
This JavaScript script is “Datatables”, a great free JQuery plugin that generates super-easy HTML table from a JSON input. Datatables provides exporting capabilities: all results are exportable in Excel, CSV, PDF files or in your clipboard for further analysis or archiving purposes if needed.

Result URL example : https://cybergordon.com/result.html?id=0d9f8bf8-ae79-45b3-a759-19f0cec7caaa

Improvements (to do)

The current architecture is far from being perfect and suffers from several issues that are more or less obvious:

Slowness when getting and merging results from each object result. I could merge engines to one function that could generates only one result object ; in this case the Gordon-Result function can be spiked.
Re-enforce the security (input control)
The quality of the Python code, long way…
Provide a front-end API and so User Account system

Since 2020 I did some improvements which will be the subject of a future article: