Scanning Engine v2 - Glossary¶
Overview¶
The new scanning engine contains many concepts and objects whose names have been carefully chosen and should be consistently used by all documentation. Unlike the names of modules, these words will not be linked back to this document elsewhere in the documentation, but each item in the glossary is permalinked to facilitate communicating these concepts whenever necessary.
Terminology¶
API¶
(noun)
The set of public HTTPS endpoints that can be used to interact with Krang.
Batch¶
(noun)
A batch refers to multiple jobs that are submitted to the scanning engine at the same time by the same client. Batches are often from a single suite, but that is not a requirement.
Blocklist¶
(noun)
The scanning engine contains a blocklist of network prefixes that it will filter from the target lists of jobs. Entities can request to be blocklisted here.
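As a minimal sketch of prefix filtering, assuming targets have already been reduced to individual IP addresses: the `BLOCKLIST` contents and the `filter_targets` name are hypothetical and use documentation-reserved prefixes, not real blocklist entries.

```python
import ipaddress

# Hypothetical blocklist entries (documentation-reserved prefixes).
BLOCKLIST = [ipaddress.ip_network(p) for p in ("192.0.2.0/24", "2001:db8::/32")]


def filter_targets(targets):
    """Return the targets that do not fall inside any blocklisted prefix."""
    kept = []
    for target in targets:
        addr = ipaddress.ip_address(target)
        # A v4 address is never "in" a v6 network (and vice versa),
        # so mixed-version comparisons simply evaluate to False.
        if not any(addr in net for net in BLOCKLIST):
            kept.append(target)
    return kept
```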
Cascade¶
(verb)
Cascading is the process of generating new shards from the results of a module. Most modules do not cascade: their results are terminal and do not generate new work. The only modules that currently cascade are bootstrap and portscan.
Client¶
(noun)
Clients are applications, services, or individuals that are known to the system by their human-readable client IDs (e.g., binaryedge). When a client authenticates to Krang, its token is used to associate its actions with its client ID. Clients have policies and topics associated with them that limit their activities.
All requests to Krang are authorized based on the effective client ID associated with the request. The effective client ID may be different from the real client ID of the token used to authenticate the request. For example, a user with the administrative policy can impersonate any other client, such as submitting jobs associated with that client's ID. Both the scanning engine instance's logs and the job store track both real and effective client IDs for auditing purposes. It is also possible to have a group of users auto-impersonate one effective client ID, causing all results to have the same client_id field while also maintaining the ability to audit who truly performed an operation.
Client Reference¶
(noun)
Client references are strings associated with jobs when they are submitted to Krang. References must be unique to the effective client ID that submitted the job. If a job is submitted with a client reference that has already been used by that effective client ID, the job will be rejected. Different clients are free to use the same client reference without conflict.
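The uniqueness rule can be illustrated with a small in-memory sketch. The `ClientReferenceRegistry` class is hypothetical and only models the rule described above; it says nothing about how Krang actually stores references.

```python
class ClientReferenceRegistry:
    """Tracks client references per effective client ID (illustrative only)."""

    def __init__(self):
        self._seen = set()

    def accept(self, effective_client_id, client_reference):
        """Return True if the job may be accepted, False if it must be rejected."""
        key = (effective_client_id, client_reference)
        if key in self._seen:
            # Same effective client ID reused a reference: reject the job.
            return False
        self._seen.add(key)
        return True
```

Two different clients can both submit a job with the reference `nightly-1`, but the second submission of `nightly-1` by the same effective client ID is rejected.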
Cluster¶
(noun)
A cluster refers to a group of minions with configurations that differ only in their naming and addressing. Minions within a cluster will be hosted at the same provider, in the same region and country, and with the same ordered list of topics to service. The hostname, IPv4 address, and IPv6 address are unique to each minion.
Clusters are rarely discussed because our scanning engine instances focus on metaclusters.
Country¶
(noun)
Countries are represented by their lowercase ISO 3166-1 alpha-2 codes (e.g., ca, pt, us). For historical reasons, the United Kingdom is always referred to and represented by the code uk, which was reserved by the standard to prevent confusion. Any instance of gb will be transformed to uk.
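A minimal sketch of that normalization, assuming the input is already a two-letter code; the `normalize_country` name is hypothetical.

```python
def normalize_country(code):
    """Lowercase a two-letter country code and map gb to uk."""
    code = code.strip().lower()
    return "uk" if code == "gb" else code
```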
Emit¶
(verb)
Emit is used to describe the writing of a JSON object to a stream. Modules emit output while minions emit results.
Hostname¶
(noun)
Hostnames are represented in ASCII format and without the root domain (i.e., example.com not example.com.).
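A hedged sketch of producing that representation with Python's built-in `idna` codec. It assumes that "ASCII format" means the IDNA (Punycode) encoding for internationalized names, which this glossary does not state explicitly; `normalize_hostname` is a hypothetical name.

```python
def normalize_hostname(name):
    """Strip the trailing root dot and encode the name as ASCII."""
    name = name.strip().rstrip(".")
    # Assumption: non-ASCII labels are represented in their IDNA form.
    return name.encode("idna").decode("ascii")
```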
Instance¶
(noun)
An instance of the scanning engine refers to a related set of minions and their manager. Instances do not interact with one another, and guarantees of uniqueness (e.g., job IDs) do not apply between instances. We maintain multiple instances of the scanning engine, but only one is available to the public.
IP Address¶
(noun)
IP addresses, both v4 and v6, are represented as strings in the common dotted-quad (v4) and ::-compressed (v6) formats.
Note: In SEv1 the ip field could contain either an IP address or a hostname.
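Both string formats can be produced with the standard `ipaddress` module. This is only a sketch of the representation; `canonical_ip` is a hypothetical helper name.

```python
import ipaddress


def canonical_ip(text):
    """Return the dotted-quad (v4) or ::-compressed (v6) form of an address.

    Raises ValueError if the text is not an IP address (e.g., a hostname).
    """
    return str(ipaddress.ip_address(text))
```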
Job¶
(noun)
A job is a unit of scanning work submitted by a client. Jobs are primarily a list of targets and a list of module invocations, with some additional fields to control the execution of the job. Several jobs together may comprise a suite. The work defined by a job is represented by tasks.
Job ID¶
(noun)
Job IDs are UUID v4 strings such as 919108f7-52d1-4320-9bac-f847db4148a8 that uniquely identify a job within an instance of the scanning engine. Job IDs are assigned during the submission of a job to Krang.
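For integrations that need to generate test identifiers or validate a job ID before using it, a sketch with the standard `uuid` module follows. The function names are hypothetical, and in practice job IDs are assigned by Krang, not by the client.

```python
import uuid


def new_job_id():
    """Generate a random UUID v4 string in the same format as a job ID."""
    return str(uuid.uuid4())


def is_job_id(value):
    """Return True if value is a canonical, lowercase UUID v4 string."""
    try:
        parsed = uuid.UUID(value)
    except ValueError:
        return False
    # Require version 4 and the canonical hyphenated form.
    return parsed.version == 4 and str(parsed) == value
```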
Krang¶
(noun)
See manager.
Manager¶
(noun)
Managers coordinate all metaclusters in an instance of the scanning engine, and provide the scanning engine's API. Managers accept jobs from clients, track tasks and their shards, and allocate shards to minions. Managers also track the liveness and behavior of minion instances.
Mangle¶
(verb)
Mangling is the process of converting an output into a result.
Metacluster¶
(noun)
A metacluster is a group of clusters of minions that need not share the same provider, region, or country. Metaclusters are a more useful abstraction than clusters in most cases, because all clusters in a metacluster contribute to the same workloads by sharing the same ordered list of topics. Metaclusters are configured explicitly, while the vendor and geographical choices involved in creating their clusters are made probabilistically.
It is common for two different metaclusters to share a topic, but with different priorities. This allows metaclusters to help each other when the higher-priority topics for some metaclusters are empty.
Minion¶
(noun)
Minion refers to our distributed execution environment software, and its Docker image. Minions request shards from Krang, stream the results of performing the work described by the shard, and report back to Krang whether the work was successfully completed.
Minion Instance¶
(noun)
When a minion Docker image boots, it assigns itself a UUID v4 instance identifier that it will include in all of its interactions with Krang throughout its lifetime. This value is used to track reboots and allows for multiple minions to run on a single scanner if desired.
Module¶
(noun)
Modules are named, standalone programs that perform work against scan targets. That work usually involves scanning a service hosted on a target, but may involve referencing data sources or other services to query information related to a target (e.g., DNS TXT records). Modules are often written to interface with a particular product/protocol to extract version and configuration information. Modules may also attempt to determine if a product/protocol is vulnerable to a specific exploit by performing the exploratory or non-invasive portions of an exploit chain against a target.
Modules are written in Go, Python, or Node.js. Some modules contain custom scanning logic (e.g., the exchange-owa module), while others act as thin wrappers around existing scanning tools (e.g., the service module runs nmap).
Note: In SEv1, the module field contained either grabber or scanner, while the type field contained the name of the module.
Module Invocation¶
(noun)
Module invocations describe how to configure a module to scan a target. Each invocation specifies the module’s name, applicable ports, and additional configuration values (e.g., HTTP request paths, SSH usernames, TLS versions). The list of applicable ports is optional because some modules do not contact targets directly, and instead contact external services to query for information related to the target (e.g., DNS RBL entries). Configuration values are optional because not all modules are configurable, and those that are configurable have carefully chosen defaults that apply to common cases.
Results include a module_index field that contains a 1-based index into the modules key from the job from which they originated. Without this it would not be possible to determine which module invocation produced which result in jobs that invoke the same module multiple times.
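The lookup can be sketched as follows. Only the `modules` key and the `module_index` field come from this glossary; the `name`, `ports`, and `config` keys, the example values, and the `invocation_for` helper are hypothetical.

```python
# Hypothetical job that invokes the same module twice with different settings.
job = {
    "modules": [
        {"name": "webv2", "ports": ["TCP/80"], "config": {"path": "/"}},
        {"name": "webv2", "ports": ["TCP/80"], "config": {"path": "/login"}},
    ],
}


def invocation_for(job, result):
    """Return the module invocation that produced a result."""
    index = result["module_index"]
    if not 1 <= index <= len(job["modules"]):
        raise IndexError(f"module_index {index} is out of range")
    # module_index is 1-based, Python lists are 0-based.
    return job["modules"][index - 1]
```

The explicit range check matters because a `module_index` of 0 would otherwise silently select the last invocation through Python's negative indexing.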
Origin¶
See scanner.
Output¶
(noun)
Output, or outputs, are JSON objects emitted by a module. Output is a raw data format that exists only in the stream between a module and its minion. Output is non-uniform, varying widely between modules in both the naming of fields (e.g., addr vs host vs ip vs target) and the representation of values (e.g., ports as 80 vs "80" vs http). Minions mangle output to produce results.
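To make the idea of mangling concrete, here is a rough sketch that normalizes the field-name and port variations mentioned above. The result layout, the `SERVICE_PORTS` table, and the function names are all assumptions made for illustration; they are not the minion's actual logic or the real result schema.

```python
# Field names that different modules might use for the scan target.
TARGET_KEYS = ("addr", "host", "ip", "target")
# Illustrative service-name mapping; a real table would be far larger.
SERVICE_PORTS = {"http": 80, "https": 443}


def normalize_port(value):
    """Accept 80, "80", or "http" and return the integer port."""
    if isinstance(value, int):
        return value
    text = str(value).strip().lower()
    if text.isdigit():
        return int(text)
    return SERVICE_PORTS[text]


def mangle(output, minion):
    """Wrap a module's raw output in a uniform, hypothetical result object."""
    data = dict(output)  # copy so the caller's object is not modified
    target = None
    for key in TARGET_KEYS:
        if key in data:
            target = data.pop(key)
            break
    return {
        "target": target,
        "port": normalize_port(data.pop("port")),
        "provider": minion["provider"],
        "country": minion["country"],
        "region": minion["region"],
        "output": data,  # whatever the module emitted beyond promoted fields
    }
```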
Perspective¶
(noun)
Perspective refers to the network address, geographical location, and organization from which a scan is performed. Scans from different perspectives will often give different results due to geoblocking, blocklisting, anycast, and many other factors.
Policy¶
(noun)
Policies are tags associated with tokens that the API uses to determine whether a request should be processed. The current policies in use are:
- user: policy given to any user so they can submit jobs
- maintenance: policy given to periodic processes, grants access to only maintenance endpoints
- admin: policy given to employees, allows impersonation, access to all endpoints, and access to all topics
Port¶
(noun)
Ports are numerical divisions within a transport protocol (e.g., TCP, UDP). In general, while targets (e.g., addresses, hostnames) refer to hosts or network interfaces, ports refer to services (i.e., applications). Port numbers exist only within transport protocols, so discussing a port number in the abstract makes no sense. For this reason port numbers in the scanning engine are always paired as either:
Provider¶
(noun)
Providers are vendors of computer and network capacity that host scanners. All results contain the name of the provider from which they were produced.
Region¶
(noun)
Regions are named divisions of capacity within a provider. A region may refer to a single datacenter, or several datacenters within a geographical area. All results contain the name of the region from which they were produced.
Result¶
(noun)
Results are JSON objects that envelop most of a module's output object. Result objects contain metadata about the minion (e.g., external address, provider, country, region) and promote certain fields in the output (e.g., target address, port, protocol, timestamp). Results are processed by cascaders to possibly produce new shards, and then written to a sink.
Scan¶
(noun, verb)
A scan refers to the actions performed against a target by a scanning module, and is the smallest unit of work performed by the scanning engine. Scans should not be confused with jobs or shards.
Scanner¶
(noun)
Scanners are hosts running Docker from which scanning is performed. Due to the restrictive Acceptable Use Policies of most cloud providers, scanners must be hosted in datacenters agreeable to our scanning activities.
Scanners have no privileged access to either BinaryEdge or Coalition’s systems beyond those necessary to perform the worker loop. All credentials on a scanner are ephemeral and frequently rotated. Scanners are intermittently torn down and replaced.
Shard¶
(noun)
Shards are the smallest units of work in the scanning engine, and are the objects that describe work to be executed by a minion. A shard is defined as a subset of the work described by a task, sliced in one or more dimensions. The vast majority of tasks are sliced by targets, such as breaking a task of scanning 1000 targets into 10 shards of 100 targets. Rarely, a module may require shards to be sliced using an additional parameter (e.g., webv2 is sliced by both targets and HTTP request paths).
Note that for large, worldwide scans we may see tasks that are composed of hundreds of thousands of shards.
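Slicing by targets can be sketched in a few lines. The function names and the shard layout in the two-dimensional variant are hypothetical; the real shard size is chosen by the scanning engine based on the module.

```python
def slice_targets(targets, shard_size):
    """Split a task's target list into consecutive chunks of shard_size."""
    return [
        targets[start:start + shard_size]
        for start in range(0, len(targets), shard_size)
    ]


def slice_targets_and_paths(targets, paths, shard_size):
    """Slice in two dimensions: one group of target chunks per request path."""
    return [
        {"path": path, "targets": chunk}
        for path in paths
        for chunk in slice_targets(targets, shard_size)
    ]
```

With 1000 targets and a shard size of 100, `slice_targets` yields 10 shards of 100 targets each; adding two request paths in the second function doubles that to 20 shards.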
Sink¶
(noun)
Sinks are datastores to which results are sent by a minion. The sink used in the production instance of the scanning engine feeds into a data-processing pipeline that lands results in Amazon S3 and Apache Kafka. The results of jobs can be replayed from S3 or streamed live from Kafka using the Stream API.
Submit¶
(verb)
Submitting refers to the act of sending a job to a manager for execution by the scanning engine.
Suite¶
(noun)
Suites are groups of jobs that are logically connected in some way. The scanning engine itself is unaware of suites, or any connection between jobs. Suites are a concept used by systems that integrate with the scanning engine. Common suites we see in practice are the hundreds of worldscan jobs, or all of the jobs that are created to scan a single organization's online assets.
Target¶
(noun, verb)
Targets refer to hosts or groups of hosts on the Internet:
- ASN (asn:15133)
- Country names (country:pt)
- DNS names (e.g., example.com)
- IPv4 Addresses (e.g., 93.184.216.34)
- IPv6 Addresses (e.g., 2606:2800:220:1:248:1893:25c8:1946)
- IPv4 Prefixes (e.g., 93.184.216.0/24)
- IPv6 Prefixes (e.g., 2606:2800:220::/48)
ASNs and country names get exploded when a job is submitted. Prefixes get exploded during portscans.
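A client preparing a target list might classify its entries along these lines. This is a sketch only: the returned labels and the `classify_target` name are hypothetical, and anything that is not an ASN, country, address, or prefix is simply assumed to be a DNS name without further validation.

```python
import ipaddress


def classify_target(target):
    """Return a label describing which kind of target a string represents."""
    for scheme in ("asn", "country"):
        if target.startswith(scheme + ":"):
            return scheme
    try:
        if "/" in target:
            return f"ipv{ipaddress.ip_network(target).version}-prefix"
        return f"ipv{ipaddress.ip_address(target).version}"
    except ValueError:
        # Not an address or prefix: assume it is a DNS name.
        return "dns"
```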
Task¶
(noun)
Tasks are the reified objects described by a job's module invocation, formed into trees of work to be performed by minions, with the relationships between parent and child tasks defined by cascading. Tasks are broken down into shards, with the breakdown based on the scanning module they invoke.
One module invocation may explode into multiple tasks, usually one per port. For example, if TCP/8001-8008 appeared in a module invocation it would explode into 8 tasks, one per port in the range.
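The port-range expansion can be sketched like this. The `explode_ports` name is hypothetical, and the single-port form (e.g., `UDP/53`) is assumed to follow the same `PROTOCOL/port` pattern as the range shown above.

```python
def explode_ports(spec):
    """Expand "TCP/8001-8008" into one "TCP/<port>" entry per port."""
    protocol, _, ports = spec.partition("/")
    first, _, last = ports.partition("-")
    last = last or first  # a single port is a range of length one
    return [f"{protocol}/{port}" for port in range(int(first), int(last) + 1)]
```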
Timestamp¶
(noun)
Wherever possible the scanning engine provides timestamps in UTC with millisecond precision.
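As an illustration of UTC with millisecond precision, the sketch below formats a timezone-aware `datetime` as an ISO 8601 string. The ISO 8601 rendering is an assumption; this glossary does not specify the exact string format used in results.

```python
from datetime import datetime, timezone


def format_timestamp(moment):
    """Render a timezone-aware datetime in UTC, truncated to milliseconds."""
    return moment.astimezone(timezone.utc).isoformat(timespec="milliseconds")


# Example: the current time in UTC.
now_text = format_timestamp(datetime.now(timezone.utc))
```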
Token¶
(noun)
A token is a value submitted in an X-Token HTTP request header to a manager as part of API requests. Tokens allow the manager to associate a client with each API request.
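A minimal sketch of attaching the header with the standard library. Only the `X-Token` header name comes from this glossary; the URL below is a placeholder, not a real endpoint, and the request is constructed but never sent.

```python
import urllib.request


def build_request(url, token):
    """Create an HTTPS request that carries the client's token."""
    return urllib.request.Request(url, headers={"X-Token": token})


# Placeholder URL and token for illustration only.
request = build_request("https://krang.example/placeholder", "my-secret-token")
```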
Topic¶
(noun)
Topics are ordered stores of work that contain shards available for execution by minions. Every job is associated with a topic, and its tasks and their shards are also associated with that topic.
Minions are configured at boot with a list of topics ordered by their relative priority, inherited from the minion's metacluster. When a minion requests a shard from its manager, it sends its ordered list of topics and receives the next available shard from the first non-empty topic in that list. Minions may be configured for a single topic, ensuring fully-dedicated resources capable of picking up a job at any time, or configured with multiple topics so that when no work is available in one topic they assist elsewhere.
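The selection rule (first non-empty topic in the minion's ordered list) can be sketched with in-memory queues. The `next_shard` function and the topic names are hypothetical stand-ins for the manager's real topic stores.

```python
from collections import deque


def next_shard(topics, ordered_topic_names):
    """Return (topic, shard) from the first non-empty topic, or None."""
    for name in ordered_topic_names:
        queue = topics.get(name)
        if queue:  # skips unknown and empty topics
            return name, queue.popleft()
    return None


# A minion that prefers "dedicated" but assists with "shared" when idle.
topics = {"dedicated": deque(), "shared": deque(["shard-1", "shard-2"])}
picked = next_shard(topics, ["dedicated", "shared"])
```

Because the "dedicated" topic is empty in this example, the minion receives `shard-1` from "shared"; as soon as work appears in "dedicated", the same call returns that work first.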
Transport Protocol¶
(noun)
All mentions of protocols in our results and documentation refer to transport layer protocols such as TCP and UDP. The scanning engine always operates over the Internet Protocol (IP).
Worker¶
(noun)
Workers are threads of parallel execution of scanning modules within a minion. Each worker runs in a loop:
- Request a shard, within the minion's configured topics, from the manager
- Execute a scanning module against the targets described in the shard
- Capture output from a module
- Mangle output to produce results
- Cascade results to produce new shards
- Upload any new shards to the manager
- Report back to the manager either the success or failure of the minion to execute the shard to completion
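The loop above can be sketched as a single iteration with its collaborators passed in as plain callables. Everything here is a hypothetical stand-in: `InMemoryManager` replaces the real manager API, and `run_module`, `mangle`, and `cascade` are placeholders for the minion's actual behavior.

```python
class InMemoryManager:
    """Hypothetical stand-in for the manager, backed by a simple list."""

    def __init__(self, shards):
        self.pending = list(shards)
        self.reports = []

    def request_shard(self):
        return self.pending.pop(0) if self.pending else None

    def upload_shards(self, shards):
        self.pending.extend(shards)

    def report(self, shard, success):
        self.reports.append((shard, success))


def run_worker_once(manager, run_module, mangle, cascade):
    """Run one iteration of the worker loop; return False if no work exists."""
    shard = manager.request_shard()            # request a shard
    if shard is None:
        return False
    try:
        outputs = run_module(shard)            # execute module, capture output
        results = [mangle(o) for o in outputs]  # mangle output into results
        new_shards = cascade(results)          # cascade results
        if new_shards:
            manager.upload_shards(new_shards)  # upload any new shards
    except Exception:
        manager.report(shard, success=False)   # report failure
        return True
    manager.report(shard, success=True)        # report success
    return True
```

Passing the steps in as callables keeps the sketch independent of any particular module or cascader, and makes the success and failure reporting paths easy to exercise in isolation.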
Worldscan¶
(noun, verb)
Worldscans refer to jobs that target every public IPv4 address, or every known IPv6 prefix, after filtering those targets through the blocklist. BinaryEdge runs worldscans regularly on several hundred ports.