Gordon, an aggregator of your cyber reputation checks
This post introduces the beginnings and his current architecture
Gordon quickly provides you threat and risk information about observables such as IP addresses or domain names by querying multiple threat intelligence sources.
Thanks to each source that provides free access to great Threat Intelligence against phishing and malware. Without them, Gordon would have not been there.
Why Gordon ?
Whether it be during my investigations at work or personal surfing sessions, I’m too lazy to use several sources to check if a domain or email address is suspicious or malicious. Some awesome OSINT tools exist, but I didn’t have one to aggregates them all into one simple web interface. On top of that, I wanted to start by building a usable and useful tool on AWS infrastructure that I could share with my entourage. I would have liked to share Gordon widely, but I’m constrained by the query limits that free API sources provide. Lastly, the lock down during the COVID-19 crisis gave me a lot of time, a rare resource that considerably contributed in completing Gordon.
Well why “Gordon” ?
As a Batman fan, I chose the commissioner James Gordon, a friend and reliable informant of the Dark Knight :-)
When I built Gordon, I tried to follow several rules:
- Simple: get results after pasting observable(s)
Neither I or my entourage would use a tool that has an extremely complex GUI or that requires sophisticated information for submission. The aim is to copy/paste one or more observables even if they are in a messy format (listed, quoted or in CSV format) submit on a unique form and get a readable summary. I’m still working on improving this last point.
- Scalable: add easily new sources
Scalability often referred to the ability to manage automatically the capacity depending on the user’s demand. My intention was humble: I wanted an evolving system where I can add, or update, easily a source (called engine) without impacting the existing ones and without adding delay during processing of the user’s observable submission.
- Almost free: use adapted and cost-effective services
For my first tool on AWS and as a non-profit service, the cost was of course the most important criteria. All major cloud service providers offer free tiers use for few of their main services for a duration of one year (sometimes less) or for life. I spent some time looking for simple and almost free AWS services before building a draft.
- Serverless: minimal maintenance
It started as a challenge: being a relatively simple tool, I tried to avoid managing a Linux server, even though if I loved managing Debian servers previously… I wanted to test functions (containers of code), where you only manage the code, while the underlying layers (runtime, OS, hardware) are managed by AWS.
Of course, code maintenance has to be done at least for each runtime update (Python 3.8 to Python 3.9 for example).
- Secure: apply best practices
Last but not least, even tough the manipulated data is not confidential, I have applied some principles: all data — in motion and at rest — is encrypted with AWS managed key (free), permission’s resources are restricted to the minimal needs (least privileges), public exposure is limited and management actions (API) and users’ HTTP request logs are stored for a duration of 6 months.
Attempt with Slack
Before hosting Gordon entirely on AWS, I tried to build a front-end on Slack as a ‘bot’ using Slack Commands feature and processing them on AWS. It works like a charm with one engine, but with two or more it is a mess and unusable. Slack is not suitable for presenting multiple results ; it is a good chat tool for “one question, one short response” capability, but not as a reporting tool…
Slack request is sent to an HTTPS endpoint (hosted on AWS API Gateway) and forwarded to the back-end. Results are then returned using the URL incoming webhook included in the Slack request. As you can see below, results are quickly unreadable when using 2 engines…
A teammate suggested I generate reports on a webpage and send links to Slack user. However, after a lot thinking, this hybrid solution didn’t suit me because it limits the user’s scope only to my Slack workspace.
Current architecture and how it works
To get all AWS capabilities and cheapest prices, all resources are hosted in “US East (Northern Virginia)” region, except for 2 resources invoked near the user location.
Imagine Batman meeting Gordon (summary)
On website (1) Batman pastes an observable and submits it ; request is parsed (2), sent to a queue (3), dispatched to engines (4) that queries API sources. During this background work, Batman is forwarded to the results page (5).
Components are represented on this diagram and described in detail later on.
1/ Static website
All front-web assets (results included) are stored on a single S3 bucket (object storage). To provide cache, reduced latency and encrypted traffic (HTTPS TLS 1.2), a CloudFront distribution (CDN) is used ; the S3 bucket policy only allows traffic from CloudFront.
The sub-domain “gordon.mhg.ovh” points to the CloudFront hostname “dazwqtt2uso07.cloudfront.net” as an Alias record. The DNS zone “mhg.ovh” is entirely managed by AWS Route 53 (DNS service).
2/ Request pipeline
By clicking on “Analyze!”, an HTTP POST request with observable(s), passes through the Gordon-Request Lambda@Edge function: a Python code deployed on multiple geographical points to be executed closer to the user.
The Gordon-Request function generates a request ID (UUID version 4) and parses the observable(s) into 7 predefined types list: IPv4, FQDN, URL, MD5, SHA-1, SHA-256 and Email address. Basically, the function compares the request body with 7 flexible regex that accept new line (\n) or space between each observable.
If no type is recognized, the request is canceled and an error message is sent back to the user by the function, otherwise the user is forwarded (HTTP 302 Found) to a waiting page that shows the request’s ID and amount of each recognized observable types (passed on URL query parameters). After 4 seconds, he is forwarded again to the results page that is described below.
3/ Queue pipeline
Small but powerful part to dispatch observables to engines depending of the types they can check against sources. The Gordon-Queue SNS topic receives message and send immediately a copy to each subscriber that accepts submitted observables type(s).
4/ Engine pipeline
Each Gordon-Engines Lambda functions receives simultaneously the observable(s) list. The engine controls the integrity of the request (using the SHA-1 function), then gets, if applicable, the API token of the remote source in encrypted Environment Variables and queries it then in HTTPS.
All engines query remote sources to get live and fresh information, except for the Offline Feeds engine (E23): an hourly CloudWatch Event Rule (scheduled task) invokes a Lambda function (Python code) that downloads, transforms in a JSON format and overwrites the existing feed content stored in a dedicated S3 bucket. Only the Lambda function has permissions on this private bucket.
5/ Result pipeline
Result URL example : https://gordon.mhg.ovh/result.html?id=0d9f8bf8-ae79-45b3-a759-19f0cec7caaa
Improvements (to do)
The current architecture is far from being perfect and suffers from several issues that are more or less obvious:
- Slowness when getting and merging results from each object result. I could merge engines to one function that could generates only one result object ; in this case the Gordon-Result function can be spiked.
- Backup config and code on S3.
- Re-enforce the security (input control)
- The quality of the Python code, long way…
- Industrialize deployment with CI/CD pipeline and Infrastructure as Code with a framework (SAM, Serverless, …).
I’m open to any remarks that will help improve (or fix) Gordon !