Domain separation

Domain separation refers to separating objects into well-defined, distinct domains. One example of this is metadata and plaintext: during cryptographic processing, it is crucial that we treat metadata and plaintext as two distinct types of strings, otherwise attacks may become possible.

Simple example

Consider this simple authenticated encryption scheme:

What if we wanted to extend the capabilities of this mode to allow for optional metadata/plaintext? That is, what if we wish to save on the cost of absorbing a blank string by adding some conditional logic?

This mode may seem secure at first glance, but it is actually broken due to an ambiguity between string types. There are multiple ways of arriving at an identical internal state.

Both of the above produce an identical authentication tag, which means that an adversary could swap the strings around in transit and the receiver would not detect a problem. This is due to the fact that both the metadata and ciphertext strings are being treated identically when they are input into the Farfalle instance. The solution to this problem is to use domain separation. One simple way to achieve this is to mark the strings with a frame bit to indicate whether it is a metadata string or a ciphertext string:

Now if the adversary swaps the fields in the message, the authentication tag will not match on the receiving end thanks to the frame bits.

Subtle example

Now that we have solved the optional metadata/plaintext problem, let's now extend the mode to support sessions:

Obviously this mode is secure since we are domain separating the metadata and ciphertext string types, right? Actually, it is broken, but in a subtle way. We have failed to domain separate (metadata,ciphertext) pairs from one another. Once again, there are multiple ways of arriving at an identical internal state.

Now it is clear that there is nothing telling us whether a pair of strings goes together (e.g. (A,P)) or whether they are part of separate messages (e.g. (A,"") and ("",P)). This means that an active adversary could intercept two messages and merge them into a new message that was never intended, and the receiver does not detect the tampering because of a domain separation ambiguity. We need a way to group the strings together; we can add another frame bit to resolve this ambiguity. There are multiple ways of approaching this problem:

Indicate whether a string is the first one in the tuple of strings
Indicate whether a string is the last one in the tuple of strings
Indicate whether a tuple of strings is "even" or "odd" relative to other tuples

For example, here is the even/odd parity approach:

The ambiguity is resolved thanks to the parity bit being different between adjacent messages.

The domain separation of messages exchanged as part of a session can be visualized as a linked list. Consider the following list of messages exchanged between a sender and a receiver:

(A,C)
(A)
(A)
(C)
(C)
(A,C)

Visualizing the internal state as a linked list, we have:

Domain Separation

Note that without the parity information, the above is ambiguous in that there is not a single unique way to recover the message stream:

(A,C) (A) (A,C) (C) (A,C)
(A) (C) (A) (A,C) (C) (A,C)
(A,C) (A) (A) (C) (C) (A,C)
...

We can use a frame bit to encode whether a string is the first one in a message:

Domain Separation

Now the session has only one unique interpretation; all ambiguity is removed:

(A,C)
(A)
(A)
(C)
(C)
(A,C)

Alternatively, we can use a frame bit to encode whether a string is the last one in a message:

Domain Separation

Another approach is to encode the parity of the tuple:

Domain Separation

The first/last string bit and parity bit methods are intimately related: the parity bit approach can be thought of as the mathematical integral of the first/last string bit approach. One interesting thing to note is that in the "first string" encoding, we can apply a mode-specific optimization to arrive at yet another valid encoding. The metadata string's "first string" indicator is always 1; when we rearrange the type/indicator bits and strip out the redundant indicator bit from the metadata string, we end up with an encoding that indicates whether a ciphertext string is the first string:

Domain Separation

Let's take a look at the algorithm for the "first ciphertext string" approach:

The ambiguity is resolved thanks to the indicator bit on the ciphertext string type. If two metadata strings are adjacent, then we immediately know they are members of two separate messages. If a metadata and ciphertext string are adjacent, then the indicator bit on the ciphertext string tells us whether it begins a new message or is part of the existing one.

A similar specialization could be applied to the "last string" encoding. I tend to prefer the "first string" encoding due to its nice properties:

No global state required, in contrast with the parity bit encoding
Streaming and factorization friendly: the current string never depends on the next one, in contrast with the "last string" encoding