binarylang

DSL invocation

Two macros are exported:

  • struct which is used to produce a product parser
  • union which is used to produce a sum parser

Both of these macros generate a type declaration and a tuple[get: proc, put: proc]:

  • get returns an object with each parsed field
  • put writes an object to a stream

Each statement corresponds to 1 field. The general syntax is:

type: name (...)
  • For the name you may use _ to discard the field
  • Fields are public by default
  • You may append {.private.} to a field to make it private

Parser options

Each specified option must be in the form option = value:

  • endian: sets the default byte endianness for the whole parser
    • default: big endian
    • b: big endian
    • l: little endian
    • c: cpu endian
  • bitEndian: sets the default bit endianness for the whole parser
    • default: left -> right
    • n: left -> right (normal)
    • r: left <- right (reverse)
  • reference: configures whether the associated type will be a ref or not
    • default: no
    • y: yes
    • n: no
  • plugins: enable additional codegen features (value is a set)
    • converters: generate from and to procs for converting from/to string
struct(data, plugins = {converters}):
  8: x

var fileContent = readFile("data/plugins.hex")
let data = fileContent.toData
assert data.x == 0x41

let reparsed = data.fromData
assert reparsed == "A"
  • visibility: for parser, discr field and symbols generated by plugins
    • default: public
    • public
    • private

Parser parameters

Each parameter must be in the form symbol: type. The generated get/put procs will then have this additional parameter appended.

The only exception is the discriminator field for sum parsers which is always named disc implicitly; and therefore, only the type must be provided -instead of an expression-colon-expression-.

Types

Primitive types

The kind, endianness and size are encoded in a identifier made up of:

  • 1 optional letter specifying the kind:
    • default: signed integer
    • u: unsigned integer
    • f: float
    • s: string
  • 1 optional letter specifying byte endianness:
    • default: big endian
    • b: big endian
    • l: little endian
  • 1 optional letter specifying bit endianness:
    • default: left -> right
    • n: left -> right (normal)
    • r: left <- right (reverse)
  • 1 number specifying size in bits:
    • for a string it refers to the size of each individual character and defaults to 8
    • for an integer the allowed values are 1 .. 64
    • for a float the allowed values are 32 and 64

You can order options however you want, but size must come last (e.g. lru16 and url16 are valid but not 16lru).

Assertion can also be used in a special manner to terminate the previous field if it's a string or a sequence indicated as magic-terminated. This is discussed in later sections.

Product type

A parser is of type product if it is created with the struct macro or by hand, as explained in a later section. To call a product parser you must use * followed by the name of the parser. If your parser requires arguments, you must provide them using standard call syntax.

Example:

struct(inner):
  32: a
  32: b

struct(innerWithArgs, size: int32):
  32: a
  32: b[size]

struct(outer):
  *inner: x
  *innerWithArgs(x.a): y

Sum type

A parser is of type sum if it is created with the union macro or by hand, as explained in a later section. A sum parser has a special field called the discriminator which determines which branch will be activated at run-time -similarly to object variants-.

To call a sum parser you must use + followed by a call-syntaxed expression. The callee is the name of the parser and the first argument is the value of the discriminator field. If the parser requires additional arguments, they also have to be provided. The first argument is treated in a special manner. Unlike other arguments, this one is only evaluated during parsing, whereas during serialization the value stored in the disc field is used.

Example:

union(inner, byte):
  (0): 8: a
  (1): 16: b
  _: nil

struct(outer):
  +inner(0): x

Features

Alignment

If any of the following is violated, BinaryLang should generate an exception:

  • Byte endianness can only be used with byte-multiple integers
  • Bit endianness must be uniform between byte boundaries
  • Spec must finish on a byte boundary
struct(parser, bitEndian = n):
  b9: a # error: cannot apply byte endianness
  r6: b # error: shares bits with previous byte
  10: c # error: spec does not finish on a byte boundary

Moreover, unaligned reads for strings are not supported:

struct(parser):
  6: x
  s: y # invalid, generates an exception

Assertion

Use = expr for producing an exception if the parsed value doesn't match expr:

s: x = "BinaryLang is awesome"
8: y[5] = @[0, 1, 2, 3, 4]

Repetition

There are 3 ways to produce a seq of your type:

  • for: append [expr] to the name for repeating expr times
  • until: append {expr} to the name for repeating until expr is evaluated to true
  • magic: enclose name with {} and use assertion with your next field
8: a[5] # reads 5 8-bit integers
8: b{_ == 103 or i > 9} # reads until it finds the value 103 or
                        # completes 10th iteration
8: {c} # reads 8-bit integers until next field is matches
16: _ = 0xABCD
u8: {d[5]} # reads byte sequences each of length 5 until next field
           # matches
s: _ = "END"

Also, the following symbols are defined implicitly:

  • i: current iteration index
  • _: last element read

These can be leveraged even in other expressions than the expression for repetition itself; for instance you can use them to parameterize a parser:

struct(inner, size: int):
  8: x[size]
struct(outer):
  32: amount
  32: sizes[amount]
  *inner(sizes[i]): aux[amount]

With the above trick you can get a sequence of variable-length sequences.

Due to current limitations of the underlying bitstream implementation, to perform magic, your stream must be aligned and all the reads involved must also be aligned. This will be fixed in the future.

Substreams

Call syntax forces the creation of a substream:

struct(aux, size: int):
  8: x[size]
struct(parser):
  8: x = 4
  8: limit = 8
  *aux(x): fixed(limit)

In the above example, limit bytes (8 in this case) will be read from the main BitStream. Then, a substream will be created out of them, which will then be used as the stream for parsing fixed. Since fixed will only use 4 of them, the remaining 4 will effectively be discarded.

Note that unlike in the type, here size is counted in bytes. It is implied that you cannot create a substream if your bitstream is unaligned.

This feature is not implemented for repetition because it would increase complexity with little benefits. The following syntax is invalid and instead you should use the technique with the auxiliary parser shown above:

struct(parser):
  u8: a[4](6) # does substream refer to each individual element or the
              # whole sequence?

Strings

Strings are special because they don't have a fixed size. Therefore, you must provide enough information regarding their termination. This can be achieved with one of the following:

  • Use of substream
  • Assertion
  • Magic
s: a # null/eos-terminated (because next field doesn't use assertion)
s: b(5) # reads a string from a substream of 5 bytes until null/eos
s: c = "ABC" # reads a string of length 3 that must match "ABC"
s: d # reads a string until next field matches
s: _ = "MAGIC"
s: e[5] # reads 5 null-terminated strings
s: {f} # reads null-terminated strings until next field matches
8: term = 0xff # terminator of the above sequence
s: {g[5]} # sequence of 5-length sequences of null-terminated strings
s: _ = "END_NESTED"

Rules:

  • Strings are null/eos-terminated unless assertion is used on the same field or on the next field
  • When using repetition, each string element is null-terminated

Extensions

Custom parser API

Since a BinaryLang parser is just a tuple[get: proc, put: proc], you can write parsers by hand that are compatible with the DSL. Just be sure that get and put have proper signatures, and there is a type with the same name as your parser but capitalized:

type Parser = SomeType
proc get(s: BitStream): Parser
proc put(s: BitStream, input: Parser)
let parser = (get: get, put: put)

If you want your custom parser to be parametric, simply append more parameters to your procs. These extra parameters must be identical and in the same order in the two procs:

type Parser = SomeType
proc get(s: BitStream, x: int, y: float): Parser
proc put(s: BitStream, input: Parser, x: int, y: float)
let parser = (get: get, put: put)

Operations

Operations can be applied to fields with the following syntax:

type {op(arg)}: name

Operations act on data after the parsing and before the encoding respectively.

An operation is nothing more than a pair of templates which follow a specific pattern:

  • The names of the templates must follow the pattern: <operation>get and <operation>put
  • They must have at least 3 untyped parameters (you can name them as you
    wish):
    • parameter #1: parsing/encoding statements
    • parameter #2: variable previously parsed/encoded
    • parameter #3: output
template increaseGet(parse, parsed, output, num: untyped) =
  parse
  output = parsed + num
template increasePut(encode, encoded, output, num: untyped) =
  output = encoded - num
  encode
struct(myParser):
  64: x
  16 {increase(x)}: y

You can apply more than one operations on one field, in which case they are chained in the specified order, and only the first operation really does any parsing/encoding to the stream. The rest just operate on the value produced by the operation directly before them.

parse fills in the parsed variable. It is a seperate statement because it potentially operates on the stream (this happens always and only for the first operation). Similarly, encode passes on the value in output variable. Passes means the value is potentially written to the stream.

template condGet(parse, parsed, output, cond: untyped) =
  if cond:
    parse
    output = parsed
template condPut(encode, encoded, output, cond: untyped) =
  if cond:
    output = encoded
    encode
template increaseGet(parse, parsed, output, num: untyped) =
  parse
  output = parsed + num
template increasePut(encode, encoded, output, num: untyped) =
  output = encoded - num
  encode
struct(myParser):
  8: shouldParse
  64: x
  16 {cond(shouldParse.bool), increase(x)}: y

It is impossible for BinaryLang to infer the type of the altered value, that is, if your operation changes it. By default it is assumed that the new field value is of the same type as the previous one (for the first operation, this is the type produced according to the field type annotation). Therefore, if your operation alters the type, then you must provide the new type in square brackets:

template asciiNumGet(parse, parsed, output: untyped) =
  parse
  output = char(parsed - '0')
template asciiNumPut(encode, encoded, output: untyped) =
  output = int8(encoded + '0')
  encode
struct(myParser):
  8 {asciiNum[char]}: x

The actual type of the field changes to the type annotated in the last operation. if you annotate the type for some of the operations, then for the ones you did not, the type of the operation directly previous to it is assumed.

Special notes

  • Nim expressions may contain:
    • a previously defined field
    • a parser parameter
    • the _ symbol for subject element (its meaning varies)
    • the i symbol for current index in a repetition
    • the s symbol for accessing the bitstream

i and s might conflict with your variables or fields, so you should consider them reserved keywords and not use them for something else.

Macros

macro struct(name: untyped; rest: varargs[untyped]): untyped
Input:
  • name: Name of the parser tuple to create (must be lowercase)
  • rest: Optional parser options and parameters
  • rest (last): Block of the format described above

Output:

  • Object type declaration with name tnamecapitalizeAscii(name)
  • Reader proc that returns an object of the type tname
  • Writer proc that accepts an object of type tname
  • A tuple named name with the fields get and put

The procs are of the following form:

proc get(s: BitStream): `tname`
proc put(s: BitStream, input: `tname`)
macro union(name, disc: untyped; rest: varargs[untyped]): untyped
Input:
  • name: The name of the parser tuple to create (must be lowercase)
  • disc: The definition of the discriminator field (name: type)
  • rest: Optional parser options and parameters
  • rest (last): Block of the format described above

Output:

  • Variant object type declaration with discriminator disc and name tnamecapitalizeAscii(name)
  • Reader proc that returns an object of the type tname
  • Writer proc that accepts an object of type tname
  • A tuple named name with the fields get and put

The procs are of the following form:

proc get(s: BitStream): `tname`
proc put(s: BitStream, input: `tname`)

The body is similar to that of struct macro, but the fields are partitioned in branches. Each branch starts with one or more possible value of the discriminator in parenthesis, seperated by comma.

For covering the rest of the cases use the _ symbol (without parenthesis).

If you don't want a field for some branch, use nil on the right side.

Example:

union(fooBar, int):
  (0): *foo: a
  (1, 3): u32: *b
  (2): nil
  (4):
    u8: c
    *bar: d
  _: u32: e