Scanning Engine v2

Introduction

Scanning Engine v2 (SEv2) is a complete rewrite of BinaryEdge's scanning engine. SEv2 implements both a new coordinator, Krang, and a new distributed execution environment, Minion. These new services have new APIs and new job definitions, but the scanning modules from SEv1 have been ported over and, in most cases, improved. For the tl;dr, please see our SEv2 Quick Start Guide.

Rationale

Scanning Engine v1 (SEv1) was built with a different set of priorities that matched our needs and the needs of our customers at the time. The three main things that it was missing were comprehensive observability, auditing, and data validation.

While a job was running in SEv1, there was only so much that could be seen regarding its progress. The scan results from scanning modules could be streamed via the Stream API, and the state of the job (e.g., active) could be retrieved, but little more was possible. Internally, we could also view the logs from the Minions, but that was the extent of our observability. SEv2 has metrics integrated in both Krang and Minion, tracking everything we could think of. Metrics exposed to users via Krang endpoints include scanning module execution times, topic depths, and error rates within jobs.

In SEv1 it was difficult to answer the questions:

Did the scanning engine contact on at ?
Why did the scanning engine perform ?
Can be trusted?

To answer those questions, SEv2 tracks executions and scanning node (Minion) lifetimes much more carefully. Minions have their connectivity constantly monitored to detect networking problems that could lead to untrustworthy results. Finally, the entire path from job submission to result emission is tagged in every dimension so that it can be traced.
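As a rough illustration of how tagged executions can make audit questions answerable, the sketch below stores one record per execution and looks up which executions touched a given target around a given time. The record fields (`job_id`, `minion_id`, `module`, `target`, `port`, `started_at`) and the function `find_contacts` are hypothetical names chosen for this example; they are not part of the SEv2 API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class ExecutionRecord:
    # All field names are assumptions made for illustration only.
    job_id: str
    minion_id: str
    module: str
    target: str
    port: int
    started_at: datetime


def find_contacts(records, target, port, around, window=timedelta(minutes=5)):
    """Return executions that contacted target:port within `window` of `around`."""
    return [
        r for r in records
        if r.target == target
        and r.port == port
        and abs(r.started_at - around) <= window
    ]


# Example data using documentation-reserved addresses (192.0.2.0/24).
records = [
    ExecutionRecord("job-1", "minion-a", "http", "192.0.2.10", 443,
                    datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)),
    ExecutionRecord("job-2", "minion-b", "ssh", "192.0.2.10", 22,
                    datetime(2024, 1, 1, 12, 1, tzinfo=timezone.utc)),
]

hits = find_contacts(records, "192.0.2.10", 443,
                     datetime(2024, 1, 1, 12, 3, tzinfo=timezone.utc))
```

Because every record carries both the job and the Minion that ran it, a hit can be traced back to the job that requested it and to the node whose connectivity was being monitored at the time.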

SEv1 and SEv2 take similar approaches to handling the output of a scanning module, with one key difference. SEv1 took the JSON output of a scanning module, wrapped it in another JSON object, and sent it to the result ingestion system. SEv2 adds two stages of validation: output schemas and results schemas. Output schemas operate on the standard output of a module, checking each JSON object written against a schema defined for that module. Any violation of that schema causes the execution of the scanning module to be retried (up to three times, by a different Minion) in the hope that it was caused by a transient error. The output is then processed, tagged, and put into the SEv2 results envelope, for which there is yet another schema, the results schema. Any violation at this final stage also causes the execution to fail and be retried.
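The two validation stages and the retry behavior can be sketched roughly as follows. This is an illustration under stated assumptions, not SEv2's implementation: the schema format (required field name mapped to a Python type), the line-delimited output, the helper names (`violations`, `process_stdout`, `run_with_validation`), and the envelope fields (`job_id`, `minion_id`, `result`) are all invented for the example. "Up to three times" is read here as three retries after the first attempt.

```python
import json

# Hypothetical schemas: required field name -> expected Python type.
OUTPUT_SCHEMA = {"ip": str, "port": int}
ENVELOPE_SCHEMA = {"job_id": str, "minion_id": str, "result": dict}

MAX_RETRIES = 3  # assumption: retries allowed after the first attempt


def violations(obj, schema):
    """Return a list of schema violations (empty if the object is valid)."""
    if not isinstance(obj, dict):
        return ["not a JSON object"]
    problems = []
    for field, expected in schema.items():
        if field not in obj:
            problems.append(f"missing field: {field}")
        elif not isinstance(obj[field], expected):
            problems.append(f"wrong type for field: {field}")
    return problems


def process_stdout(stdout, job_id, minion_id):
    """Validate each JSON line of module output, then build and validate envelopes.

    Raises ValueError on a violation at either stage.
    """
    envelopes = []
    for line in stdout.splitlines():
        if not line.strip():
            continue
        try:
            obj = json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"output is not valid JSON: {exc}") from exc
        problems = violations(obj, OUTPUT_SCHEMA)  # stage 1: output schema
        if problems:
            raise ValueError(f"output schema: {problems}")
        envelope = {"job_id": job_id, "minion_id": minion_id, "result": obj}
        problems = violations(envelope, ENVELOPE_SCHEMA)  # stage 2: results schema
        if problems:
            raise ValueError(f"results schema: {problems}")
        envelopes.append(envelope)
    return envelopes


def run_with_validation(run_module, minions, job_id):
    """Run the module on successive Minions until its output validates."""
    last_error = None
    for minion_id in minions[: MAX_RETRIES + 1]:
        try:
            return process_stdout(run_module(minion_id), job_id, minion_id)
        except ValueError as exc:
            last_error = exc  # treat as possibly transient; try a different Minion
    raise RuntimeError(f"execution failed after retries: {last_error}")
```

Here `run_module` stands in for whatever actually executes a scanning module on a Minion and returns its standard output as a string. Moving to the next entry in `minions` on each failure mirrors the "by a different Minion" part of the description.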