Performance
Why FlexUnits is fast
FlexUnits.jl borrows techniques from Unitful.jl, which focuses on "static units", to achieve near-zero overhead when units can be resolved at parse time. However, FlexUnits can fall back to the "dynamic units" techniques of DynamicQuantities.jl when units cannot be inferred at parse time. This is done through promotion rules that convert static-dimension quantities to dynamic-dimension quantities when different dimensions are present. As a result, FlexUnits retains the high-performance behaviour of Unitful.jl when units are known at compile time, and falls back to the performance of DynamicQuantities.jl when they can't be inferred. In the first set of benchmarks, we see that FlexUnits.jl and DynamicQuantities.jl vastly outperform Unitful.jl (by more than 100x) when units cannot be inferred.
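The promotion mechanism can be illustrated with toy types. This is a minimal sketch of the idea, not FlexUnits internals: `StaticQ` and `DynQ` are hypothetical stand-ins for quantities whose dimensions live in the type versus at runtime.

```julia
# Hypothetical stand-ins for the real quantity types (not FlexUnits API):
struct StaticQ{D}          # dimension D is baked into the type
    val::Float64
end
struct DynQ                # dimension is carried as a runtime value
    val::Float64
    dims::Symbol
end

# When two *different* static dimensions meet, promote to the dynamic type:
Base.promote_rule(::Type{StaticQ{D1}}, ::Type{StaticQ{D2}}) where {D1,D2} = DynQ
Base.convert(::Type{DynQ}, q::StaticQ{D}) where {D} = DynQ(q.val, D)

# A homogeneous container stays static (the fast, fully-inferred path)...
vs = [StaticQ{:m_per_s}(1.0), StaticQ{:m_per_s}(2.0)]   # Vector{StaticQ{:m_per_s}}
# ...but mixing dimensions falls back to the dynamic representation:
vd = [StaticQ{:m_per_s}(1.0), StaticQ{:J_per_kg}(2.0)]  # Vector{DynQ}
```

Either way the container has a single concrete element type, which is what keeps the dynamic fallback type-stable.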
using FlexUnits
using .UnitRegistry
import DynamicQuantities
import Unitful
using BenchmarkTools
v1uni = [1.0*Unitful.u"m/s", 1.0*Unitful.u"J/kg", 1.0*Unitful.u"A/V"]
v1dyn = [1.0*DynamicQuantities.u"m/s", 1.0*DynamicQuantities.u"J/kg", 1.0*DynamicQuantities.u"A/V"]
v1flex = [1.0u"m/s", 1.0u"J/kg", 1.0u"A/V"]
@btime sum(x->x^0.0, $v1uni)
8.100 μs (86 allocations: 3.92 KiB)
@btime sum(x->x^0.0, $v1dyn)
41.717 ns (0 allocations: 0 bytes)
@btime sum(x->x^0.0, $v1flex)
5.300 ns (0 allocations: 0 bytes)
In the second example, we see that FlexUnits.jl and Unitful.jl outperform DynamicQuantities.jl when units can be inferred by the compiler.
t1uni = [1.0*Unitful.u"m/s", 1.0*Unitful.u"m/s", 1.0*Unitful.u"m/s"]
t1dyn = [1.0*DynamicQuantities.u"m/s", 1.0*DynamicQuantities.u"m/s", 1.0*DynamicQuantities.u"m/s"]
t1flex = [1.0u"m/s", 1.0u"m/s", 1.0u"m/s"]
@btime sum(x->x^2, $t1uni)
3.000 ns (0 allocations: 0 bytes)
@btime sum(x->x^2, $t1dyn)
7.407 ns (0 allocations: 0 bytes)
@btime sum(x->x^2, $t1flex)
3.000 ns (0 allocations: 0 bytes)
While this performance boost over DynamicQuantities.jl isn't as dramatic as the previous boost over Unitful.jl, it is still significant. In most benchmarks (examples can be found in the test folder of the FlexUnits repo), FlexUnits matches the best-performing alternative (DynamicQuantities or Unitful). There are two notable exceptions:
- FlexUnits.jl is slightly slower than Unitful.jl at uconvert, but still much faster than DynamicQuantities.jl
- FlexUnits.jl can be significantly faster than both Unitful.jl and DynamicQuantities.jl for iterative statically-inferred algorithms where unitful values are reassigned (Unitful tends to overspecialize, but the FlexUnits design avoids this)
Why FlexUnits is efficient at linear algebra
In order for matrix multiplication and other linear algebra operations to be possible, the units must be consistent: the matrix must be constructed to take a vector with units u_in and produce a vector with units u_out. To do this, the units of an entire N×M matrix can be described by the following factorization:
- A scalar dimension factor u_fact
- An (M-1)-element vector of input units u_in
- An (N-1)-element vector of output units u_out
This means a matrix is described by N+M-1 unit values instead of N×M, which can amount to significant compression. Additionally, these factored units are stored separately from the matrix of numerical values. This factorization and separation provides two main benefits:
- Efficient, well-tested, pure-numerical matrix operations can be performed on the raw numerical values
- Unit dimensions can be solved using efficient O(N) methods on DimsMap objects which only store (M+N+1) values instead of (M×N) values (DimsMap stores two more values than the bare-minimum for validation purposes).
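To make the factorization concrete, here is a plain-Julia sketch (not the actual DimsMap implementation) that represents each unit as a tuple of base-dimension exponents, so that multiplying units adds exponents. For simplicity it stores 1 + M + N values rather than the minimal N+M-1 (the scalar factor could be folded into one of the vectors):

```julia
# Units as exponent tuples over two base dimensions (e.g. mass, time);
# multiplying units adds exponents, dividing subtracts them.
unit_mul(a, b) = a .+ b
unit_div(a, b) = a .- b

# Factored description of the units of a 2x3 matrix:
u_fact = (1, 0)                       # scalar dimension factor
u_in   = [(0, 1), (1, 0), (0, 0)]     # one unit per input column (M = 3)
u_out  = [(0, 0), (1, -1)]            # one unit per output row (N = 2)

# The unit of element (i, j) is reconstructed on demand:
element_unit(i, j) = unit_div(unit_mul(u_out[i], u_fact), u_in[j])

# 1 + 3 + 2 = 6 stored units describe all 2x3 element units here;
# for a 200x4 matrix the same scheme stores 205 units instead of 800.
full = [element_unit(i, j) for i in 1:2, j in 1:3]
```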
These efficiency gains mean that even if units must be dynamic (because the array contains different units), the overhead of resolving those units can be minimal. Let's compare mixed-unit matrix multiplication across packages.
Matrix Multiplication Benchmarks
The first example consists of multiplying a 200×4 matrix by a 4×4 matrix.
#Use unitless matrices as a benchmark
Nr = 200
X = randn(Nr, 4)
M = rand(4,4)
#Construct unitful matrices
uu = [Unitful.u"kg/s", Unitful.u"kW", Unitful.u"rad/s", Unitful.u"N/m"]
ut = reshape(uu, 1, :)
Xu = X.*ut
Mu = inv.(uu) .* M .* inv.(ut)
#Construct DynamicQuantity matrices
udq = [DynamicQuantities.u"kg/s", DynamicQuantities.u"kW", DynamicQuantities.u"rad/s", DynamicQuantities.u"N/m"]
udqt = reshape(udq, 1, :)
Xdq = X.*udqt
Mdq = inv.(udq) .* M .* inv.(udqt)
#Construct LinmapQuant matrices
ufq = [UnitRegistry.u"kg/s", UnitRegistry.u"kW", UnitRegistry.u"rad/s", UnitRegistry.u"N/m"]
Xfq = LinmapQuant(X, UnitMap(u_out = UnitRegistry.u"", u_in = inv.(ufq)))
Mfq = LinmapQuant(M, UnitMap(u_out = inv.(ufq), u_in=ufq))
julia> @btime X*M #No units
700.000 ns (3 allocations: 6.35 KiB)
julia> @btime Xu*Mu #Unitful, more than 500x slower
395.700 μs (5603 allocations: 93.83 KiB)
julia> @btime Xdq*Mdq #DynamicQuantities, about 8x slower
5.700 μs (3 allocations: 31.34 KiB)
julia> @btime Xfq*Mfq #LinmapQuant, almost no overhead
710.000 ns (4 allocations: 6.41 KiB)
The main reason FlexUnits.jl has nearly no overhead is that only the inner product of the units between matrices is considered. Only the first 4-element row of X and the first column of M need to be compared. Unit inference does not touch the other 199 rows of X or the other 3 columns of M.
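This inner-product check can be sketched with a toy model (hypothetical, not the actual DimsMap logic): composing two factored unit maps only requires comparing the K units along the shared inner dimension, independent of the outer sizes.

```julia
# Toy factored unit map: one placeholder unit per output row and input column.
struct ToyUnitMap
    u_out::Vector{Symbol}
    u_in::Vector{Symbol}
end

# For A*B, only the inner units need to agree; the result reuses A's output
# units and B's input units untouched, so the check is O(K) in the inner size.
function compose_units(A::ToyUnitMap, B::ToyUnitMap)
    A.u_in == B.u_out || error("incompatible inner units")
    return ToyUnitMap(A.u_out, B.u_in)
end

A = ToyUnitMap([:u1, :u2], [:a, :b, :c])   # 2x3 map
B = ToyUnitMap([:a, :b, :c], [:v1, :v2])   # 3x2 map
C = compose_units(A, B)                    # compares only the 3 inner units
```

However many output rows A has, the comparison cost stays fixed by the inner dimension, which is why the 200×4 example above pays almost nothing for unit inference.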
Linear Regression Benchmarks
Linear regression benchmarks can only be compared between FlexUnits and the raw numerical methods. Neither Unitful.jl nor DynamicQuantities.jl can handle matrix inversions.
julia> Mu/Mu #Unitful fails at 'oneunit'
ERROR: MethodError: no method matching oneunit(::Type{Any})
This error has been manually thrown, explicitly, so the method may exist but be intentionally marked as unimplemented.
julia> Mdq/Mdq #DynamicQuantities also fails at 'oneunit'
ERROR: Cannot create a dimensionful 1 from `Type{DynamicQuantities.Quantity}` without knowing the dimensions. Please use `oneunit(::DynamicQuantities.Quantity)` instead.
julia> collect(Mfq)/collect(Mfq) #Matrices of FlexUnit quantities fails further down because LU-factorization logic eventually compares quantities of different units
ERROR: DimensionError: (s²/kg², s⁴/(m² kg²)) have incompatible dimensions
julia> Mfq/Mfq #FlexUnits LinmapQuant matrices actually work
4×4 LinmapQuant{Float64, Dimensions{FixRat32}, Matrix{Float64}, DimsMap{Dimensions{FixRat32}, Vector{Dimensions{FixRat32}}, Vector{Dimensions{FixRat32}}}}:
1.0 0.0 m²/s² 0.0 1/kg -0.0 1/s
0.0 s²/m² 1.0 -3.29362e-21 s²/(m² kg) -2.75549e-20 s/m²
0.0 kg 0.0 (m² kg)/s² 1.0 -0.0 kg/s
6.32944e-16 s -6.10423e-13 m²/s 2.87226e-16 s/kg 1.0
We can use linear algebra to complete the linear regression as follows:
Nr = 2000
XY = randn(Nr, 6) * rand(6, 6)
X = [XY[:, begin:4] ones(Nr)]
Y = XY[:, 5:end]
Xu = LinmapQuant(X, UnitMap(u_out=UnitRegistry.u"", u_in=inv.([UnitRegistry.u"kg/s", UnitRegistry.u"kW", UnitRegistry.u"rad/s", UnitRegistry.u"N/m", UnitRegistry.u""])))
Yu = LinmapQuant(Y, UnitMap(u_out=UnitRegistry.u"", u_in=inv.([UnitRegistry.u"K", UnitRegistry.u"kPa"])))
julia> @btime B = ($X'*$X)\($X'*$Y) #No units
17.300 μs (10 allocations: 960 bytes)
julia> @btime Bu = ($Xu'*$Xu)\($Xu'*$Yu) #Nearly the same speed
17.300 μs (14 allocations: 1.34 KiB)
For this operation, the overhead of adding units was also negligible. Even though Xu'*Xu is a long multiplication that compares the 2000 columns of units in Xu' against the 2000 rows of units in Xu, those units are uniform and static, which greatly accelerates inference for both Xu'*Xu and Xu'*Yu. So while dynamic unit inference is about 5x slower than the numerical multiplication, static unit inference along one dimension bypasses most of this overhead. If we used dynamic units instead, the overhead would become noticeable.
Xu = LinmapQuant(X, UnitMap(u_out=UnitRegistry.ud"", u_in=inv.([UnitRegistry.u"kg/s", UnitRegistry.u"kW", UnitRegistry.u"rad/s", UnitRegistry.u"N/m", UnitRegistry.u""])))
Yu = LinmapQuant(Y, UnitMap(u_out=UnitRegistry.ud"", u_in=inv.([UnitRegistry.u"K", UnitRegistry.u"kPa"])))
julia> @btime Bu = ($Xu'*$Xu)\($Xu'*$Yu) #Nearly the same speed
21.900 μs (14 allocations: 1.34 KiB)
While the performance penalty is noticeable (roughly 1.3x), it's not that large. Even with dynamic inference being ~5x slower than the numeric computation, the 2000 rows of units only need to be inferred once for Xu'*Xu and once for Xu'*Yu. Meanwhile, the numeric calculations over the 2000 rows must be done for each element of the denominator matrix (5×5) plus each element of the numerator matrix (5×2), for a grand total of 35 times. This yields a predicted performance penalty of about (35 + 2*5)/35 ≈ 1.286, which is very close to what we observed.
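The back-of-the-envelope estimate above can be checked directly:

```julia
numeric_passes = 5*5 + 5*2   # elements of X'X (5×5) plus X'Y (5×2), each a pass over 2000 rows
unit_passes    = 2 * 5       # two dynamic unit inferences, each ~5x the cost of a numeric pass

predicted = (numeric_passes + unit_passes) / numeric_passes   # ≈ 1.286
observed  = 21.9 / 17.3                                       # ≈ 1.266 from the timings above
```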
Performance Tips
While FlexUnits is generally fast, there are a few things one may need to watch out for to get the most out of this package.
1. Avoid containers with mixed static-unit types, use dynamic units instead
A great deal of the work in this package was devoted to building a performant, type-stable dynamic unit/dimension system that can represent many different unit types, along with promotion rules that avoid mixed-type containers. While promotion rules help, some Julia functions don't apply conversion (collect and map among them, whereas vcat(...) reliably promotes). If such mixed-type containers occur, use udynamic to explicitly convert static units to dynamic ones.
2. Use dynamic units for high-level code
Dynamic quantities are always type-stable and are less likely to result in accidental performance-killing dynamic dispatch calls. This is increasingly important if you want to produce small static binaries with Julia because dynamic dispatch can inhibit this. Using dynamic units can also prevent long compile times as it reduces specialization.
3. Use LinmapQuant/VectorQuant for linear algebra
These representations allow the use of optimized numerical methods for linear algebra and employ shortcuts to solve the units of the matrix separately. This is especially important for large matrices. If the input values are already numerical matrices, the LinmapQuant constructors are an efficient way to attach units to them, since only N+M+1 dimension values are stored instead of the full M×N.
4. Pay attention to the output types of linear algebra operations
If a shortcut method is implemented, the output should be either a LinmapQuant or a VectorQuant. Otherwise, a slower fallback method has been used. If this happens where it is not expected, please submit a bug report; it is likely that a method has been overlooked.
5. Use dconvert to transition from dynamic units to static units in low-level code
Dynamic units are great for achieving effortless type stability, but static units really shine in performance-sensitive low-level code where there's a small number of variables with known dimensions. In such cases, one can simply use dconvert(u"...", q) to convert q to a quantity with the same dimensions as u. Because most of the calculation is done using dimensional quantities, no conversion math needs to take place; one simply needs to verify that the units match, so dconvert has very little overhead.
6. Use vcat or some other promoting method to transition from low-level code to high-level code
As mentioned before, some functions like collect and map don't use promote. However, explicit vector construction (v = [x, y, z]) and vcat on splatted tuples do trigger promotion. When returning containers with multiple units, make sure they are properly promoted to dynamic units; otherwise overspecialization and dynamic dispatch may leak into other parts of your code.
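The difference can be demonstrated with plain numeric types standing in for mixed static-unit quantities (a Base-Julia illustration, not FlexUnits-specific):

```julia
xs = (1, 2.0f0, 3.0)   # mixed concrete types, a stand-in for mixed static-unit quantities

v_good = [xs...]       # vector construction promotes to one concrete eltype (Float64)
v_bad  = collect(xs)   # collect keeps the joined, abstract eltype instead

# v_good has a concrete element type; v_bad does not, so every access to
# v_bad risks dynamic dispatch in downstream code.
```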