License: arXiv.org perpetual non-exclusive license
arXiv:2403.01314v1 [cs.NI] 02 Mar 2024

Superflows: A New Tool for Forensic Network Flow Analysis

Michael Collins$^{1}$    Jyotirmoy V. Deshmukh$^{1}$    Dristi Dinesh$^{1}$    Mukund Raghothaman$^{1}$    Srivatsan Ravi$^{1}$    Yuan Xia$^{1}$
$^{1}$University of Southern California
Abstract

Network security analysts gather data from diverse sources, from high-level summaries of network flow and traffic volumes to low-level details such as service logs from servers and the contents of individual packets. They validate and check this data against traffic patterns and historical indicators of compromise. Based on the results of this analysis, a decision is made to either automatically manage the traffic or report it to an analyst for further investigation. Unfortunately, due to rapidly increasing traffic volumes, there are far more events to check than operational teams can handle for effective forensic analysis. However, just as packets are grouped into flows that share a commonality, we argue that a high-level construct for grouping network flows into a set of flows that share a hypothesis is needed to significantly improve the quality of operational network response by increasing Events Per Analyst Hour (EPAH).

In this paper, we propose a formalism for describing a superflow construct, which we characterize as an aggregation of one or more flows based on an analyst-specific hypothesis about traffic behavior. We demonstrate simple superflow constructions and representations, and perform a case study to explain how the formalism can be used to reduce the volume of data for forensic analysis.

1 Introduction

We propose a new form of NetFlow-like constructs for security analysis, which we call SuperFlows. SuperFlows group multiple individual flows together around a common hypothesis, such as that all the flows represent a single webpage fetch, a scan or a DGA exploit.

Superflows are built out of flows, which were originally developed for traffic measurement [3]. Security analysts adopted flow [9], developing tools for analyzing flow and adding new flow attributes for collection. For security analysis, flow provides enormous "bang for the buck": flows provide a compact summary of the most important information about a session. This compact information is critical because flows enable analysts to quickly examine large sets of traffic and infer potentially hostile behavior. The amount of information an analyst needs to examine for a particular flow, in terms of the footprint on disk, is far smaller for a NetFlow record than for the corresponding full pcap session, easily by three or more orders of magnitude.

We contend that there are now two classes of NetFlow analysis with different data collection needs: traffic and security. Traffic analysis, which is focused on billing and continuity needs, uses sampled NetFlow to understand the normal course of operations; this has led to new statistical summary techniques, notably sketches [16], which are sampling-based at the point of collection and operate under soft real-time constraints. Forensic analysis requires the ability to reconstruct rare events, leading to a specific forensic need for unsampled NetFlow [10, 11, 23] and requiring new summary constructs to reduce the data footprint while still providing evidence for every network session.

SuperFlows are motivated by the need for traffic summaries describing modern network traffic. The characteristic traffic of the early 1990s was the telnet session: a TCP-moderated, long-lived connection where a single user communicated from a single client on a single host to a single server. The characteristic traffic of the modern era is the webpage: an assemblage of files fetched from dozens of servers, many of which are geographically distributed clones. This characteristic single client/multiple server behavior also defines many other behaviors, from simple client-server interaction (because DNS TTLs have dropped to values so low that name lookups are continuous), to scanning, to torrenting, to serverless architectures.

We envision superflows as a new class of summaries that supplement raw flow data; when presented with a set of traffic for analysis, the analyst may be presented initially with multiple superflows that group the constituent flows together through their hypotheses of what the traffic represents. For example, instead of seeing a dozen individual HTTPS flows, the analyst may see a single superflow marked "webpage fetch: news website.com", along with equivalent high-value summary information. If the analyst needs to examine that phenomenon in more depth, they can then pull up the individual flows based on guidance provided by the superflow. For this approach to be effective, the superflow must be compact and the hypothesis guiding its creation must be clear, unambiguous and easily communicated to other users.

Contributions and roadmap The motivation for developing SuperFlows is to create a universal formalism for expressing such hypotheses over flow sequences. Furthermore, we provide an algorithm for efficiently identifying the maximal subsets of a flow sequence that satisfy large classes of superflow hypotheses. We introduce two case studies for demonstrating the superflow construct: (i) analysing scan data from an institution's dark spaces, and (ii) performing a modern webpage analysis for understanding the interaction of protocols in multiple flows to produce a single web page.

The paper’s contributions are structured as follows: we provide a historical motivation for the superflow construct and prior attempts at traffic analysis of aggregated flows (section 2). We introduce a language based on relational logic to express superflow hypotheses and show how we can constructively identify the maximal subsets of a flow sequence that satisfy the hypothesis (section 3). We present the efficacy of the formalism to improve the quality of operational network response through two case studies (section 4). We conclude with an important discussion of the missing dimensions in our formalism: how the vantage points for data collection will need to be incorporated into our formalism, as well as how confounders (like NAT boxes) will need to be expressible within our framework for widespread applicability of the superflow construct.

2 Motivation and Related Work

While originally developed for traffic reporting, NetFlow rapidly developed into a forensic tool with the development of analysis packages such as Fullmer and Romig’s Flow-Tools [9] and the CERT’s SiLK Suite [12]. These tools effectively mapped the relational calculus to NetFlow format flat files and, in the course of developing security analysis, identified additional fields needed during flow collection. This work partially informed the development of the IPFIX [4] standard, whose reference implementation, YAF [13], was developed in-house at the CERT. IPFIX includes fields for forensics, notably initial packet flags for TCP.

Outside of this initial work, we have seen two distinct classes of traffic summarization which differ over the role of sampling and estimation. We view the primary difference between these two classes as a comfort with statistical estimation. The first class is focused on traffic summarization techniques and relies heavily on sampling and estimation; the second class is focused on forensic reconstruction and operational security, which increasingly views unsampled NetFlow as mission critical [22, 8].

The first class of traffic analyses is comfortable with sampling, and is focused on developing highly efficient soft-realtime summaries, primarily using streaming approaches. The largest group of these techniques is based around various sketch-based algorithms [16, 18], which are focused on highly efficient streaming estimates of specific traffic characteristics (e.g., entropy [5], traffic changes [15], heavy hitters [21]).

The second class consists of techniques to identify and behaviorally summarize different traffic classes [19]. These techniques include systems which create constructs from traffic data, including the SiLK Set and Bag [17], which group together arbitrary collections of IP addresses, and the FlowTuple [2], developed as part of CAIDA’s Corsaro toolkit [1]. Other work involves techniques for identifying specific traffic classes such as botnets [25, 24], scanning [7], or peer-to-peer filesharing [6]. These approaches represent different ways of identifying traffic phenomena, but each one is a separate detector; superflows are intended to unify these different cases into a common data reduction format to improve analyst workflow.

3 Superflow Decompositions

Superflows provide a mechanism for security analysts to group NetFlow records by means of user-provided hypotheses. Formally, a superflow hypothesis is a predicate over sets of flows, $h : 2^{\operatorname{Flow}} \to \operatorname{Bool}$. For example, a set of flows $F = \{f_1, f_2, \dots, f_k\}$ may arise from a scan if they all share the same source IP address, attempt to reach hosts within the same subnet, occur within a short time of each other, and probe a sufficiently large set of destinations. The analyst may operationalize this by declaring that the set of flow records $F$ satisfies the superflow hypothesis $h_{\text{scan}}$ when:

$$
\begin{aligned}
h_{\text{scan}}(F) &= \forall f, f' \in F,\ \operatorname{srcip}(f) = \operatorname{srcip}(f') \land {} \\
&\qquad\qquad \operatorname{dstip}(f) \sim \texttt{192.168.1.*} \land {} \\
&\qquad\qquad t_{\text{start}}(f') - t_{\text{start}}(f) \leq 10\text{ s} \land {} \\
&\qquad\qquad |\{\operatorname{dstip}(f) \mid f \in F\}| \geq c,
\end{aligned}
\tag{1}
$$

where $c$ is an analyst-supplied threshold value.
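As an illustration, the scan hypothesis of Equation 1 can be sketched as an executable predicate. This is a minimal sketch, not the authors' implementation: the `Flow` record, its field names, and the `h_scan` signature are assumptions made here, and the pattern `192.168.1.*` is interpreted as the network `192.168.1.0/24`.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class Flow:
    """Hypothetical minimal flow record; field names are illustrative."""
    srcip: str
    dstip: str
    t_start: float  # start time in seconds


def h_scan(flows, c, subnet="192.168.1.0/24", window=10.0):
    """Return True if `flows` satisfies the scan hypothesis of Equation 1."""
    flows = list(flows)
    if not flows:
        # All universal conditions hold vacuously; only |{}| >= c remains.
        return c <= 0
    net = ip_network(subnet)
    same_source = len({f.srcip for f in flows}) == 1
    in_subnet = all(ip_address(f.dstip) in net for f in flows)
    starts = [f.t_start for f in flows]
    # A pairwise bound on start-time differences is equivalent to
    # bounding the spread between the earliest and latest start.
    close_in_time = max(starts) - min(starts) <= window
    enough_targets = len({f.dstip for f in flows}) >= c
    return same_source and in_subnet and close_in_time and enough_targets


# Five probes from one source to five distinct hosts within 2 seconds.
probe = [Flow("10.0.0.7", f"192.168.1.{i}", 0.5 * i) for i in range(1, 6)]
print(h_scan(probe, c=5))
```

The pairwise time condition is checked through the minimum and maximum start times, which avoids comparing every pair of flows.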

Upon examining a stream of NetFlow records, the analyst might mentally group flows that appear to arise from a common underlying event such as $h_{\text{scan}}$. They may then proceed either to further analyze individual hypothesized scan events, or examine flows which do not appear to match the scan hypothesis.

Given a superflow hypothesis $h$, a superflow decomposition $\operatorname{decomp}_h(F)$ of a set of flows $F = \{f_1, f_2, \dots, f_n\}$ is a partition of $F$ into disjoint subsets,

$$F = F_1 \cup F_2 \cup \dots \cup F_k \cup F_{\text{rest}},$$

such that $h(F_i) = \operatorname{true}$ for $i = 1, 2, \dots, k$. Naturally, the analyst might be interested in maximally decomposing $F$, so they may additionally stipulate that: (a) $h(F_r') = \operatorname{false}$, and that (b) $h(F_i \cup F_r') = \operatorname{false}$, for all subsets $F_r' \subseteq F_{\text{rest}}$, and for $i = 1, 2, \dots, k$. We call each of the partitions, $F_i$, for $i = 1, 2, \dots, k$, a superflow.
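The definition can be checked mechanically on small examples. The following is a brute-force sketch under stated assumptions: flows are hashable values (here `(srcip, dstip)` tuples), the function names are hypothetical, and conditions (a) and (b) are tested only over non-empty subsets of the remainder, which is an interpretation adopted here. The maximality check is exponential in the size of the remainder, so it is suitable only for illustration.

```python
from itertools import chain, combinations


def nonempty_subsets(items):
    """Yield every non-empty subset of `items` as a tuple."""
    items = list(items)
    return chain.from_iterable(
        combinations(items, r) for r in range(1, len(items) + 1)
    )


def is_decomposition(h, flows, parts, rest):
    """Check that parts and rest partition `flows` and each part satisfies h."""
    pieces = [set(p) for p in parts] + [set(rest)]
    total = sum(len(p) for p in pieces)
    union = set().union(*pieces)
    disjoint_cover = total == len(union) and union == set(flows)
    return disjoint_cover and all(h(set(p)) for p in parts)


def is_maximal(h, parts, rest):
    """Brute-force check of conditions (a) and (b) over subsets of rest."""
    for sub in nonempty_subsets(rest):
        sub = set(sub)
        if h(sub):  # violates (a): the remainder hides a superflow
            return False
        if any(h(set(p) | sub) for p in parts):  # violates (b)
            return False
    return True


def same_source_pair(fs):
    """Toy hypothesis: at least two flows, all from one source address."""
    return len(fs) >= 2 and len({src for src, _ in fs}) == 1


flows = {("a", "x"), ("a", "y"), ("b", "z")}
parts, rest = [{("a", "x"), ("a", "y")}], {("b", "z")}
print(is_decomposition(same_source_pair, flows, parts, rest),
      is_maximal(same_source_pair, parts, rest))
```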

As we will observe in Section 4, grouping flows in this manner massively shrinks both the set of observed events and the anomalous flows needing further investigation. We will now describe a simple language for analysts to describe rich superflow hypotheses, and an efficient algorithm to perform maximal superflow decompositions.

3.1 A Language for Superflow Hypotheses

Attributes and predicates over flows.

Unlike traditional flow grouping constructs such as rwgroup, which is included as part of the SiLK analysis suite, superflows allow us to group flows based on complex properties. Our language for superflow hypotheses is inspired by relational logic, similar to that used in modeling systems such as Alloy [14]. We begin by fixing a set of flow attributes:

$$
\begin{array}{rcl}
\operatorname{attr} & ::= & \operatorname{srcip} \mid \operatorname{dstip} \\
& \mid & t_{\text{start}} \mid t_{\text{end}} \\
& \mid & \operatorname{\#bytes} \mid \operatorname{\#packets} \\
& \mid & \cdots
\end{array}
$$

Each of these attributes is a function which returns the corresponding property of the flow in question, $\operatorname{attr}(f)$. A natural choice for these attributes is the set of properties exported as part of the IPFIX record. We also fix a set of atomic predicates over flows, $p$, $q$, …, of varying arities. Examples include unary predicates such as $\operatorname{dstip}(f) \sim \texttt{192.168.1.*}$ and binary predicates such as $\operatorname{srcip}(f) = \operatorname{srcip}(f')$ and $t_{\text{start}}(f) - t_{\text{start}}(f') \leq 10\text{ s}$.

Relational constraints over multiple flows.

Superflow hypotheses may now be constructed as closed first-order logical formulas with cardinality constraints:

$$
\begin{array}{rcl}
h & ::= & \forall f, h \mid \exists f, h \\
& \mid & h_1 \land h_2 \mid h_1 \lor h_2 \mid \lnot h \\
& \mid & p(f_1, f_2, \dots, f_k) \\
& \mid & |\{\operatorname{attr}(f) \mid p(f)\}| \bowtie c, \text{ for } {\bowtie} \in \{<, >, =\}.
\end{array}
$$

The constructions include the standard first-order logical quantifiers, which range over the set of flows $F$ being examined for the superflow hypothesis in question, the Boolean connectives, and simple cardinality constraints over $F$. An example of such cardinality constraints would be the constraint $|\{\operatorname{dstip}(f) \mid f \in F\}| \geq 200$, indicating that the flows in a superflow grouping reach at least 200 distinct destination addresses (and hence that the grouping contains at least 200 constituent flows).
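One way to make this grammar concrete is as a set of combinators that build an evaluator for a hypothesis. The sketch below is only an illustration of the grammar's semantics: all function names (`forall`, `exists`, `conj`, `disj`, `neg`, `atom`, `card`, `evaluate`) are hypothetical, flows are represented as plain dictionaries, and no attempt is made at efficiency.

```python
import operator


# A formula is a function (F, env) -> bool, where F is the list of flows
# under examination and env maps bound variable names to flows.
def forall(var, body):
    return lambda F, env: all(body(F, {**env, var: f}) for f in F)


def exists(var, body):
    return lambda F, env: any(body(F, {**env, var: f}) for f in F)


def conj(*hs):
    return lambda F, env: all(h(F, env) for h in hs)


def disj(*hs):
    return lambda F, env: any(h(F, env) for h in hs)


def neg(h):
    return lambda F, env: not h(F, env)


def atom(p, *variables):
    """Apply the atomic predicate p to the flows bound to `variables`."""
    return lambda F, env: p(*(env[v] for v in variables))


def card(attr, p, op, c):
    """Cardinality constraint |{attr(f) | p(f)}| op c, with f ranging over F."""
    return lambda F, env: op(len({attr(f) for f in F if p(f)}), c)


def evaluate(h, flows):
    """Evaluate a closed hypothesis on a collection of flows."""
    return h(list(flows), {})


# Example: all flows share a source, and more than two destinations are hit.
same_src = atom(lambda f, g: f["srcip"] == g["srcip"], "f", "g")
hyp = conj(
    forall("f", forall("g", same_src)),
    card(lambda f: f["dstip"], lambda f: True, operator.gt, 2),
)
flows = [{"srcip": "10.0.0.7", "dstip": f"192.168.1.{i}"} for i in range(1, 4)]
print(evaluate(hyp, flows))
```

Evaluating nested quantifiers this way costs time polynomial in the number of flows with the nesting depth as the exponent, which is the cost that the restrictions in Section 3.2 aim to avoid.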

Example 1 (Chat session hypothesis).

One characteristic of a chat session between two hosts would be the exchange of back and forth messages, each of which is smaller than the maximum transmission unit:

$$
\begin{aligned}
h_{\text{chat}}(F) &= \forall f, f' \in F, \\
&\qquad (\operatorname{srcip}(f) = \operatorname{srcip}(f') \land \operatorname{dstip}(f) = \operatorname{dstip}(f') \lor {} \\
&\qquad \operatorname{srcip}(f) = \operatorname{dstip}(f') \land \operatorname{dstip}(f) = \operatorname{srcip}(f')) \land {} \\
&\qquad \operatorname{\#bytes}(f) \leq 1500.
\end{aligned}
\tag{2}
$$
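A direct, unoptimized rendering of the chat hypothesis in Python might look as follows. The `Flow` record, its field names, and the function names are assumptions made for illustration; the pairwise loop mirrors the quantifier structure of Equation 2 rather than any efficient implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    """Hypothetical minimal flow record; field names are illustrative."""
    srcip: str
    dstip: str
    nbytes: int


def same_conversation(f, g):
    """True if f and g join the same two hosts, in either direction."""
    same_direction = f.srcip == g.srcip and f.dstip == g.dstip
    reverse_direction = f.srcip == g.dstip and f.dstip == g.srcip
    return same_direction or reverse_direction


def h_chat(flows, max_bytes=1500):
    """Return True if every pair of flows satisfies Equation 2."""
    flows = list(flows)
    return all(
        same_conversation(f, g) and f.nbytes <= max_bytes
        for f in flows
        for g in flows
    )


chat = [
    Flow("10.0.0.1", "10.0.0.2", 120),
    Flow("10.0.0.2", "10.0.0.1", 300),
    Flow("10.0.0.1", "10.0.0.2", 80),
]
print(h_chat(chat))
```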
Example 2 (Webpage fetch hypothesis).

Another example would be to delineate sessions involving a webpage fetch. A simple webpage fetch may be modeled as a sequence of HTTP requests (heading either to TCP port 80 or port 443), each of which closely follows a preceding DNS request (heading to UDP port 53):

$$
\begin{aligned}
h_{\text{web}}(F) &= \forall f \in F,\ \operatorname{dstport}(f) \in \{80, 443, 53\} \land {} \\
&\quad (\operatorname{dstport}(f) \in \{80, 443\} \implies \\
&\quad\quad \exists f' \in F,\ 0 \leq t_{\text{start}}(f) - t_{\text{start}}(f') \leq 300\text{ s} \land {} \\
&\quad\quad\qquad \operatorname{dstport}(f') = 53).
\end{aligned}
\tag{3}
$$
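The webpage-fetch hypothesis can be sketched in the same style. The sketch reads the time bound as the gap between an HTTP(S) flow $f$ and a DNS flow $f'$ that precedes it, as the prose describes; the `Flow` record and the names `h_web` and `WEB_PORTS` are illustrative assumptions.

```python
from dataclasses import dataclass

WEB_PORTS = {80, 443}
DNS_PORT = 53


@dataclass(frozen=True)
class Flow:
    """Hypothetical minimal flow record; field names are illustrative."""
    dstport: int
    t_start: float  # start time in seconds


def h_web(flows, window=300.0):
    """Return True if `flows` looks like a webpage fetch per Equation 3."""
    flows = list(flows)
    dns_starts = [f.t_start for f in flows if f.dstport == DNS_PORT]
    for f in flows:
        if f.dstport not in WEB_PORTS and f.dstport != DNS_PORT:
            return False  # only web and DNS traffic may appear
        if f.dstport in WEB_PORTS:
            # Some DNS flow must start at most `window` seconds earlier.
            if not any(0 <= f.t_start - t <= window for t in dns_starts):
                return False
    return True


fetch = [Flow(53, 0.0), Flow(443, 0.2), Flow(53, 1.0), Flow(80, 1.1)]
print(h_web(fetch))
```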

3.2 Efficiently Decomposing Flow Streams

The central computational problem with superflow hypotheses is to identify a superflow decomposition. The generality and nested quantifiers in the language of the previous section make this challenging. We will now identify some restrictions on superflow hypotheses that make the decomposition problem tractable.

We start by focusing on the model-checking problem: Given a hypothesis $h$ and a set of flows $F$, determine whether it is the case that $h(F) = \operatorname{true}$. We say that a hypothesis $h$ can be efficiently monitored if the model checking problem can be solved in a streaming manner in time linear in $|F|$ and with memory that is independent of the size of the flow stream.

Claim 3 (Efficient hypothesis monitoring).

Let $p(f_1, f_2)$ be a binary predicate which is either: (a) transitive (i.e., for all flows $f_1$, $f_2$, $f_3$, $p(f_1, f_2) \land p(f_2, f_3) \implies p(f_1, f_3)$), or (b) satisfies the property that $p(f_1, f_2) = \operatorname{true} \land p(f_1, f_3) = \operatorname{true} \implies p(f_2, f_3) = \operatorname{true}$.
In both cases, the hypothesis $h_{\text{trans}}(F) = \forall f_1, f_2 \in F, p(f_1, f_2)$ can be efficiently monitored.

This follows because upon seeing a new element $f_3$ of the flow stream, the monitoring algorithm only needs to evaluate $p(f_1, f_3)$ for some arbitrarily pre-selected element $f_1$ of the flow stream rather than comparing each pair of flow records previously encountered. As an example, both binary constraints appearing in $h_{\text{scan}}$ are of this form. The constraint in $h_{\text{chat}}$ is also transitive, so it follows that both hypotheses $h_{\text{scan}}$ and $h_{\text{chat}}$ can be efficiently monitored.
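The representative-element argument can be sketched as a small streaming monitor. This is an illustrative sketch, with the class name `PairwiseMonitor` chosen here; it assumes that $p$ either satisfies property (b) or is symmetric as well as transitive (as equality of source addresses is), so that a single comparison against the representative stands in for all pairwise comparisons.

```python
class PairwiseMonitor:
    """Streaming check of `forall f1, f2 in F, p(f1, f2)`.

    Keeps one representative flow and one Boolean, so memory use does not
    grow with the length of the stream.
    """

    def __init__(self, p):
        self.p = p
        self.rep = None  # arbitrarily chosen representative (first flow seen)
        self.ok = True   # verdict for the stream observed so far

    def observe(self, flow):
        """Consume one flow record and return the current verdict."""
        if self.rep is None:
            self.rep = flow
            self.ok = self.p(flow, flow)  # the quantifiers include f1 = f2
        elif self.ok:
            self.ok = self.p(self.rep, flow)
        return self.ok


# Flows as (srcip, dstip) tuples; the predicate compares source addresses.
stream = [
    ("10.0.0.7", "192.168.1.1"),
    ("10.0.0.7", "192.168.1.2"),
    ("10.0.0.9", "192.168.1.3"),
]
monitor = PairwiseMonitor(lambda f, g: f[0] == g[0])
verdicts = [monitor.observe(f) for f in stream]
print(verdicts)
```

Once the verdict becomes false it stays false, since adding flows cannot repair a violated universal constraint.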

We next turn our attention to the problem of constructing maximal decompositions. Observe that all three hypotheses from Section 3.1 lend themselves to a greedy superflow construction procedure, by which one can repeatedly add new flow records to an existing candidate superflow until it fails to satisfy the hypothesis in question. In particular, we have:

Claim 4 (Superflow decomposition).

Let $h$ be an efficiently monitorable superflow hypothesis which is also subset closed: i.e., whenever $h(F) = \operatorname{true}$, it is also the case that $h(F') = \operatorname{true}$, for all subsets $F' \subseteq F$. In this case, a maximal superflow decomposition of a set of flows $F = \{f_1, f_2, \dots, f_n\}$ can be constructed in linear time, and with memory proportional to the number of reported superflows.
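A greedy construction in the spirit of this claim can be sketched as follows. This is not necessarily the authors' algorithm: the function name `greedy_decompose` is hypothetical, the hypothesis is assumed to have the pairwise form $\forall f_1, f_2 \in F, p(f_1, f_2)$ with $p$ symmetric and transitive, and the sketch performs one predicate evaluation per open superflow for each record.

```python
def greedy_decompose(flows, p):
    """Greedily group `flows` for the hypothesis `forall f1, f2, p(f1, f2)`.

    Returns (superflows, rest). Each superflow is tracked through a single
    representative, following the monitoring argument of Claim 3.
    """
    superflows = []  # list of (representative, members)
    rest = []
    for f in flows:
        for rep, members in superflows:
            if p(rep, f):
                # Members are kept here only for inspection; a streaming
                # version would keep a counter or summary instead.
                members.append(f)
                break
        else:
            if p(f, f):
                superflows.append((f, [f]))  # open a new candidate superflow
            else:
                rest.append(f)
    return [members for _, members in superflows], rest


# Flows as (srcip, dstip) tuples, grouped by shared source address.
flows = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("c", 5)]
groups, rest = greedy_decompose(flows, lambda f, g: f[0] == g[0])
print(groups, rest)
```

When the predicate is equality on a key, as in this example, replacing the inner loop with a dictionary lookup on that key would give a single pass whose running time is linear in the number of flows.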

4 Superflow-guided Data Reduction

In this section, we discuss several case studies for a superflow construct and discuss the effectiveness of a superflow construct. Recall from §3 that a SuperFlow represents a hypothesis about a class of network traffic, and that the same individual flows may be present in an ensemble of superflows representing competing hypotheses. In order to argue for the efficiency of a superflow implementation, we must demonstrate that a superflow will increase EPAH (events per analyst-hour) processed. In the absence of an operations floor, we use on-disk footprint as a proxy for EPAH on the thesis that reducing the on-disk footprint reduces the query time for an associated phenomenon, and by reducing that query time, we increase the number of events an analyst can process.

In order to explore the superflow concept, we have created traffic traces on a large networking testbed using different client/server and service combinations in isolation. By creating clean traces focusing exclusively on specific classes of traffic, we can examine the construction and description of superflows and determine which attributes are necessary for effectively describing the superflow. For this work, we have considered two scenarios: a website and scanning. The websites we examine in this study are multi-hosted, CDN-based and spread across multiple servers providing images, multimedia data, advertising, user tracking and JavaScript. Modern webpages are often comprised of fetches from dozens or hundreds of different websites.

Scanning, the systematic targeting of open ports on a network in order to determine the presence of vulnerable services, is an excellent target for a superflow formulation due to the disproportionate footprint scans leave in network traffic summaries. Scans consist of a large number of packets with slightly different addresses, meaning that a small scan (256 hosts) may take up hundreds more records than a long, multi-terabyte data transfer.

The remainder of this section is structured as follows. §4.1 discusses the data footprints of NetFlows; this discussion lays the groundwork for discussing how different SuperFlow representations will compare against equivalent NetFlow formulations and shows how efficient the superflow has to be in order to substitute for a NetFlow. §4.2 examines a modern website as a potential superflow construct; here we use data from web browsing sessions to show that a potential superflow is constructible and that it will substantially reduce the data footprint. Finally, §4.3 examines scanning data collected in situ from our darkspaces to show expected values for data reduction based on the types of scans seen.

4.1 Estimating Data Footprints

NetFlow is a compact fixed-size representation of traffic data, which lends itself to random access and to representation in highly efficient data stores such as columnar databases. Figure 1 shows this compact representation; each grid in this figure and in the footprint diagrams that follow shows the footprint of NetFlow records and our theoretical superflow constructs. As this figure shows, standard V5 NetFlow as collected by the router has a 48-byte footprint. However, as discussed in §1, there are a number of pcap-to-flow tools which generate their own NetFlow representations; these representations can throw away router-generated data such as the next hop IP and the input or output interfaces. This smaller, 32-byte footprint is what we will consider the default flow size for our data reductions.

Figure 1: The Basic Footprints for NetFlow as collected by the router or through PCAP
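The two footprints can be reproduced with a short size calculation over the NetFlow v5 flow record layout. The 48-byte layout below follows the standard v5 record; the set of fields dropped to reach 32 bytes is an assumption made here for illustration (the text names only the next hop and the interfaces), and the reduced layout in Figure 1 may differ.

```python
import struct

# NetFlow v5 flow record fields with struct format codes
# (I = 4 bytes, H = 2 bytes, B = 1 byte; "!" selects network byte order
# with no alignment padding).
V5_FIELDS = [
    ("srcaddr", "I"), ("dstaddr", "I"), ("nexthop", "I"),
    ("input", "H"), ("output", "H"),
    ("dPkts", "I"), ("dOctets", "I"), ("first", "I"), ("last", "I"),
    ("srcport", "H"), ("dstport", "H"),
    ("pad1", "B"), ("tcp_flags", "B"), ("prot", "B"), ("tos", "B"),
    ("src_as", "H"), ("dst_as", "H"),
    ("src_mask", "B"), ("dst_mask", "B"), ("pad2", "H"),
]

# Assumption: router-generated fields that a pcap-derived record might omit.
ROUTER_ONLY = {
    "nexthop", "input", "output",
    "src_as", "dst_as", "src_mask", "dst_mask", "pad2",
}


def record_size(fields):
    """Return the packed size in bytes of a record with the given fields."""
    fmt = "!" + "".join(code for _, code in fields)
    return struct.calcsize(fmt)


v5_size = record_size(V5_FIELDS)
reduced_size = record_size([f for f in V5_FIELDS if f[0] not in ROUTER_ONLY])
print(v5_size, reduced_size)
```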

4.2 Modern Webpage Analysis

Modern webpages are assembled from dozens of files stored on discrete webservers; examples include the homepage of a news site like the New York Times or a commerce site such as eBay. These sites rely on the browser to fetch information from multiple discrete locations, resulting in multiple flows across different protocols (i.e., DNS, HTTP, HTTPS and QUIC, along with streaming video protocols) to produce a single page. These interactions can become quite complex and large; for example, Figure 2 shows the number of discrete sites contacted by a browser when it fetches a page from CNN. This fetch was constructed using a single browser fetching in private mode, limiting the potential for unrelated page fetches. As Figure 2 shows, the browser contacts 36 different sites during the course of operations. A single page fetch consists of 228 flows to 36 IP addresses.

Figure 2: Discrete Sites Contacted to Fetch CNN’s Homepage

Figure 3 shows the footprint for a potential website superflow. This superflow has a footprint of $(16 + 16 \cdot \text{dcount})$ bytes, where dcount is the number of subsidiary sites contacted during website construction. This representation also merges destination port and service (which can include UDP/53 (DNS), UDP/443 (QUIC), TCP/443 (HTTPS), TCP/80 (HTTP), and variations such as TCP/8080) into a single byte value.

Figure 3: Footprints for Modern Website representation

The efficiency of a webpage superflow is driven by two factors: the number of sites comprising the webpage and the number of sites encoded in the superflow representation. The CNN example, as noted, requires 36 addresses, which, using the NetFlow format given in §4.1, results in a total footprint of 1152 bytes, compared to 592 bytes for the corresponding superflow formulation. Craigslist, by comparison, requires access to only 4 additional sites, yielding a 160-byte footprint using flows and a 96-byte footprint using superflows.
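The footprint comparison can be reproduced with a few lines of arithmetic. This sketch assumes one 32-byte flow record per contacted address, as the comparison above does, and it counts Craigslist as 5 sites (the primary site plus its 4 additional sites), which is the value of dcount that reproduces the quoted figures. The function names are illustrative only.

```python
FLOW_BYTES = 32          # default flow record size from §4.1
SUPERFLOW_BASE = 16      # fixed part of the website superflow
SUPERFLOW_PER_SITE = 16  # bytes added per contacted site


def flow_footprint(site_count: int) -> int:
    """Bytes needed when each contacted address is stored as one flow record."""
    return FLOW_BYTES * site_count


def web_superflow_footprint(dcount: int) -> int:
    """Bytes needed by a website superflow covering dcount sites."""
    return SUPERFLOW_BASE + SUPERFLOW_PER_SITE * dcount


for name, sites in (("CNN", 36), ("Craigslist", 5)):
    as_flows = flow_footprint(sites)
    as_superflow = web_superflow_footprint(sites)
    saving = 1 - as_superflow / as_flows
    print(f"{name}: {as_flows} bytes as flows, "
          f"{as_superflow} bytes as a superflow ({saving:.0%} smaller)")
```

Under this accounting the relative saving approaches 50% as the number of sites grows, since the per-site cost falls from 32 bytes to 16 bytes while the fixed 16-byte part is amortized.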

4.3 Scan Analysis

To examine the efficacy of superflow recognition and usage in situ, we collected and analyzed a set of candidate scan data from our institution’s dark spaces. A dark space is a collection of contiguous IP addresses which are routable, but do not have a responding host or DNS name. Traffic to a dark space is suspicious because it must have been initiated by an outside party, as a result of phenomena such as scanning, backscatter and misconfiguration. Given our specific interest in scan summarization, we filtered the traffic to contain the most obvious scanning packets: TCP packets with the ACK flag unset (indicating that the packet is not a response). Out of the total traffic observed in any 24-hour sample period, this class of TCP traffic makes up 62% of overall traffic; other TCP traffic accounts for 32%, while ICMP, DNS and GRE make up the remaining 6%.

Using those packets, we developed an estimate of the potential savings from replacing flows with a superflow. To do so, we examine the number of flows we expect to eliminate as a function of the likelihood of encountering a full 256-address scan. The reduction is the expected number of flows within the dataset that would be replaced by a superflow. For example, a superflow representing a scan across a /24 would replace 256 flows.

Given this assumption, we define a scan-256 superflow as a superflow which describes scanning between an individual host and a /24. This superflow has a disk footprint shown in Figure 4 and is detected using the hypothesis in Equation 1 with $c = 256$. We note in passing that this approach treats scan detection as a given; that is, the scan-256 superflow is identifying and isolating obvious scanners as opposed to differentiating very slow and subtle scanners. While such scanners exist, identifying them becomes easier once the scan-256 flow has removed the obvious and noisy scanners from the analyst’s workflow.

Applying the initial scan-256 superflow rules to our darkspace data results in a small reduction, shown in Figure 5. As this figure shows, replacing qualifying flows with scan-256 superflows reduces the total flow footprint by between 0.5% and 2.5%. This shows that while the on-disk reduction for a full 256-address scan is substantial, there are not enough such scans to significantly reduce the total on-disk footprint. Scanners rarely scan every address in a /24; often they skip addresses such as x.255 or x.0. To compensate for this, we considered an alternative structure that we call an allotted scan-256. The allotted scan-256 allots a table of IP addresses to record the subset of the subnet that the attacker skipped; for the purposes of this paper we set the allotment to 32 IP addresses, creating a scan-256 superflow for flowsets with as few as 224 addresses or as many as 256, effectively using Equation 1 with $c = 224$. The impact of this change is substantial: as Figure 6 shows, the estimated flow footprint reduction for allotted scan-256 superflows is now between 12% and 32%.
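The difference between the full and allotted variants can be illustrated with a small sketch. It assumes that the hypothesis in Equation 1 amounts to a threshold c on the number of distinct destination addresses that a single source contacts within one /24; that reading, the function name find_scan256, and the (source, destination) tuple representation of a flow are all assumptions made for this example, and the addresses come from documentation ranges.

```python
from collections import defaultdict
from ipaddress import ip_address, ip_network


def find_scan256(flows, c=256):
    """Group (src, dst) flows by source and destination /24.

    Returns {(src, subnet): set of destinations} for every group in which
    the source touched at least c distinct addresses of the /24.
    """
    targets = defaultdict(set)
    for src, dst in flows:
        subnet = ip_network(f"{dst}/24", strict=False)
        targets[(src, subnet)].add(ip_address(dst))
    return {key: seen for key, seen in targets.items() if len(seen) >= c}


# A scanner that skips x.0 and x.255, plus a source making three stray contacts.
flows = [("198.51.100.7", f"192.0.2.{host}") for host in range(1, 255)]
flows += [("203.0.113.9", f"192.0.2.{host}") for host in (10, 20, 30)]

full = find_scan256(flows, c=256)      # requires every address in the /24
allotted = find_scan256(flows, c=224)  # tolerates up to 32 skipped addresses

for (src, subnet), seen in allotted.items():
    skipped = sorted(set(subnet) - seen)
    print(src, subnet, "skipped:", [str(addr) for addr in skipped])
```

With c = 256 the scanner goes unsummarized because it skipped two addresses; with c = 224 its 254 flows collapse into one candidate superflow, and the skipped addresses are what the allotment table would record.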

Figure 4: Footprints for Full and Allotted Scan-256

Figure 4 shows potential footprints for full and allotted scan-256 superflows. As this figure shows, the full superflow is the same size as a single flow record once padding is accounted for. Because a full scan-256 superflow substitutes for the at least 256 flows that make up a single /24 scan, this is a substantial reduction: a 32-byte footprint, as compared to the 8-kilobyte footprint of the flows for a full scan.
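The arithmetic behind these figures, assuming the 32-byte default flow record from §4.1 and exactly one flow per scanned address, is:

```latex
\underbrace{256 \times 32\ \text{bytes}}_{\text{flows for a full /24 scan}}
  = 8192\ \text{bytes} = 8\ \text{KiB},
\qquad
\frac{8192\ \text{bytes}}{32\ \text{bytes}} = 256 .
```

A full scan-256 superflow therefore occupies 1/256 of the space of the flows it replaces; scans that generate more than one flow per address would only increase this ratio.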

Figure 5: Flow Reduction for Full Scan-256 Superflow
Figure 6: Flow Reduction for Allotted Scan-256 Superflow

5 Discussion and Future Directions

We expect to implement superflows within the SiLK toolset by adding tools to construct the flows and then query them using the current rwcut and rwfilter tools.

Building new superflow class libraries.

In future work, we intend to develop a dictionary of different classes of superflows. We expect that this dictionary will include the superflow classes already discussed, as well as chat protocols, email and peer-to-peer traffic. Chat protocols, such as XMPP and Signal, along with VoIP protocols, have distinct behaviors, notably jittery packets smaller than the MTU. Other areas of interest include peer-to-peer protocols such as BitTorrent and mail protocols such as SMTP. Mail interactions are particularly important because, in addition to representing a significant fraction of Internet infrastructure, they require collating information across at least 3 different protocols: DNS, SMTP, and POP3 or IMAP, and optionally HTTP/S.

Vantage and confounders.

Also important are the issues of vantage and confounders in superflow generation. Vantage refers to the impact that sensor placement has on data collection. For example, modern websites, as discussed in §4, consist of multiple calls from a web client to multiple discrete servers; however, modern interactive sites may also involve requests made on the client’s behalf by servers within the website to other servers, such as a database authenticating the user’s identity. This means that flow data collected from a client’s vantage may differ from flow data collected from a server’s vantage. Tightly tied to the issue of vantage is the issue of confounders: middleboxes (such as NATs) that undermine assumptions about the identity of IP addresses across multiple flows. Superflows must assume that some elements (such as client IP addresses) remain consistent and distinguishable. We intend to extend the superflow hypothesis language to express network topologies and to automatically reconcile data from multiple sensors across the network.

Expanding the scope of superflow constructs.

The current superflow formalism is based on relational logic [14], and we provide linear-time algorithms for many superflows expressed in this formalism. As part of future work, we will expand the range of superflow hypotheses that can be expressed and develop algorithms that can more efficiently decompose flow streams. We will also explore the possibility of incorporating temporal patterns in superflow hypotheses and opportunities to automatically learn these relationships using techniques such as Granger Causality [20].

Another area of note is the ability to add post-processing data to superflows. As noted in our discussion of the web superflow in §4, CDNs make up a significant amount of modern website traffic, and the round-robin DNS allocation used by many CDNs can result in multiple IP addresses that point to identical content servers.

Finally, we need to further explore the need for superflows to describe alternative hypotheses within the superflow. As noted in the Scan-256 example in §4, the allotted scan-256 provides more flexibility and summaries in exchange for a small initial storage overhead. As the superflows are intended to improve operational response, including annotations about exceptional behavior (such as failed connections in a web superflow) can improve analyst efficiency at a small overhead cost.

References

  • [1] Corsaro Software Suite. https://catalog.caida.org/software/corsaro. Accessed: 2023-6-29.
  • [2] S. Alcock. Flowtuples iv: Reality strikes back.
  • [3] k. Claffy. Internet Traffic Characterization. PhD thesis, University of California at San Diego, USA, 1994.
  • [4] B. Claise, B. Trammell, and P. Aitken. Specification of the IP Flow Information Export (IPFIX) protocol for the exchange of flow information. RFC 7011, IETF, 2013.
  • [5] P. Clifford and I. Cosma. A simple sketching algorithm for entropy estimation over streaming data. In C. M. Carvalho and P. Ravikumar, editors, Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics, volume 31 of Proceedings of Machine Learning Research, pages 196–206, Scottsdale, Arizona, USA, 29 Apr–01 May 2013. PMLR.
  • [6] M. P. Collins and M. K. Reiter. Finding peer-to-peer file-sharing using coarse network behaviors. In Proceedings of the 11th European Conference on Research in Computer Security, ESORICS’06, page 1–17, Berlin, Heidelberg, 2006. Springer-Verlag.
  • [7] M. P. Collins and M. K. Reiter. On the limits of payload-oblivious network attack detection. pages 251–270. Springer Berlin Heidelberg, 2008.
  • [8] D. Daniels. NetFlow/IPFIX generation from AWS clouds. Blog Post, Gigamon Networks, 2020.
  • [9] M. Fullmer and S. Romig. The OSU flow-tools package and CISCO NetFlow logs. In 14th Systems Administration Conference (LISA 2000), New Orleans, LA, Dec. 2000. USENIX Association.
  • [10] A. Gall. Scalable and cost-effective generation of unsampled NetFlow. In Presented at 2020 GÉANT Telemetry and Big Data Workshop, 2020.
  • [11] A. Gall. Snabbflow: A scalable IPFIX exporter. In Presented at 2023 FOSDEM Workshop, 2023.
  • [12] C. Gates, M. Collins, M. Duggan, A. Kompanek, and M. Thomas. More netflow tools for performance and security. In Proceedings of the 18th USENIX Conference on System Administration, LISA ’04, page 121–132, USA, 2004. USENIX Association.
  • [13] C. M. Inacio and B. Trammell. YAF: Yet another flowmeter. In 24th Large Installation System Administration Conference (LISA 10), San Jose, CA, Nov. 2010. USENIX Association.
  • [14] D. Jackson. Software Abstractions: Logic, Language, and Analysis. MIT Press, 2016.
  • [15] B. Krishnamurthy, S. Sen, Y. Zhang, and Y. Chen. Sketch-based change detection: Methods, evaluation, and applications. In Proceedings of the 3rd ACM SIGCOMM Conference on Internet Measurement, IMC ’03, page 234–247, New York, NY, USA, 2003. Association for Computing Machinery.
  • [16] Z. Liu, A. Manousis, G. Vorsanger, V. Sekar, and V. Braverman. One sketch to rule them all: Rethinking network flow monitoring with univmon. In Proceedings of the 2016 ACM SIGCOMM Conference, SIGCOMM ’16, page 101–114, New York, NY, USA, 2016. Association for Computing Machinery.
  • [17] J. McHugh. Sets, bags, and rock and roll: Analyzing large data sets of network data. In Proceedings of the 2004 ESORICS Conference, volume 3193, pages 407–422, 01 2004.
  • [18] H. Namkung, Z. Liu, D. Kim, V. Sekar, and P. Steenkiste. SketchLib: Enabling efficient sketch-based monitoring on programmable switches. In 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22), pages 743–759, Renton, WA, Apr. 2022. USENIX Association.
  • [19] I. Paredes-Oliva, X. Dimitropoulos, M. Molina, P. Barlet-Ros, and D. Brauckhoff. Automating root-cause analysis of network anomalies using frequent itemset mining. SIGCOMM Comput. Commun. Rev., 40(4):467–468, aug 2010.
  • [20] A. Shojaie and E. B. Fox. Granger causality: A review and recent advances. Annual Review of Statistics and Its Application, 9(1):289–319, 2022.
  • [21] V. Sivaraman, S. Narayana, O. Rottenstreich, S. Muthukrishnan, and J. Rexford. Heavy-hitter detection entirely in the data plane. In Proceedings of the Symposium on SDN Research, SOSR ’17, page 164–176, New York, NY, USA, 2017. Association for Computing Machinery.
  • [22] C. Systems. Cisco secure - meeting the dni nittf maturity framework white paper. CISCO White paper, 2022.
  • [23] M. Thomas, L. Metcalf, J. Spring, P. Krystosek, and K. Prevost. Silk: A tool suite for unsampled network flow analysis at scale. Technical Report CERTCC-2014-24, CERT/CC, 2014.
  • [24] T.-F. Yen and M. Reiter. Are your hosts trading or plotting? telling p2p file-sharing and bots apart. IEEE, 2010.
  • [25] J. Zhang, R. Perdisci, W. Lee, X. Luo, and U. Sarfraz. Building a scalable system for stealthy p2p-botnet detection. IEEE Transactions on Information Forensics and Security, 2014.