A collection of thoughts, code and snippets.

Omarchy - Integrating screen brightness via key binds

Posted on Wednesday, 10 September 2025

Controlling External Monitor Brightness (DDC/CI) in Hyprland with a Real OSD

As a follow-up to the previous entry on adjusting the brightness with ddcutil, I asked OpenAI's Codex to wire it into my keyboard bindings. Once it had succeeded and had the OSD wired up, I asked it to document the process. Here is the Codex-generated documentation.

As a side note, I also created a small executable script to increase or decrease the brightness from the command line.

#!/bin/bash
# brightness up/down script using ddcutil
STEP=10
case "$1" in
  # ddcutil expects the sign of a relative change as a separate argument
  up)   ddcutil setvcp 10 + "$STEP" ;;
  down) ddcutil setvcp 10 - "$STEP" ;;
  get)  ddcutil getvcp 10 ;;
  *)    echo "Usage: $0 {up|down|get}" >&2; exit 1 ;;
esac

OpenAI Codex steps and explanations

  • Environment: Hyprland (Omarchy on Arch), SwayOSD, ddcutil
  • Goal: Make hardware brightness keys and Alt+F1/F2 control HDMI monitor brightness via DDC/CI, with a correct on-screen display (OSD).

The Problem

  • Omarchy’s default media bindings show the OSD and call brightnessctl, which targets laptop backlights—not external HDMI displays.
  • My script ~/bin/hdmi-brightness already adjusts HDMI brightness using ddcutil, but Hyprland wasn’t calling it from brightness keys.
  • Bonus ask: show an OSD reflecting the real HDMI brightness level.

Solution Summary

  • Unbind default brightness keys.
  • Bind brightness keys and Alt+F1/F2 to the hdmi-brightness script.
  • After each adjustment, read the real brightness via ddcutil getvcp 10 and display an OSD using SwayOSD’s custom-progress mode.

Keybindings

  • File: ~/.config/hypr/bindings.conf
  • Unbind defaults:
    • unbind = , XF86MonBrightnessUp
    • unbind = , XF86MonBrightnessDown
    • unbind = ALT, XF86MonBrightnessUp
    • unbind = ALT, XF86MonBrightnessDown
  • Bind to DDC/CI script + OSD:
    • bindeld = , XF86MonBrightnessUp, HDMI Brightness up, exec, bash -lc "~/bin/hdmi-brightness raise; read P R <<< $(ddcutil getvcp 10 2>/dev/null | awk 'BEGIN{FS=\"[=,]\"} /current value/ {cv=$2+0; mv=$4+0; if(mv<=0){p=0;r=0}else{p=int((cv*100+mv/2)/mv); r=cv/mv}; if(r<0) r=0; if(r>1) r=1; printf(\"%d %.2f\", p, r)}'); $osdclient --custom-icon display-brightness --custom-progress-text \"Brightness: $P%\" --custom-progress \"$R\""
    • bindeld = , XF86MonBrightnessDown, HDMI Brightness down, exec, bash -lc "~/bin/hdmi-brightness lower; read P R <<< $(ddcutil getvcp 10 2>/dev/null | awk 'BEGIN{FS=\"[=,]\"} /current value/ {cv=$2+0; mv=$4+0; if(mv<=0){p=0;r=0}else{p=int((cv*100+mv/2)/mv); r=cv/mv}; if(r<0) r=0; if(r>1) r=1; printf(\"%d %.2f\", p, r)}'); $osdclient --custom-icon display-brightness --custom-progress-text \"Brightness: $P%\" --custom-progress \"$R\""
    • bindeld = ALT, F2, HDMI Brightness up, exec, [same as above with raise]
    • bindeld = ALT, F1, HDMI Brightness down, exec, [same as above with lower]
  • $osdclient comes from Omarchy’s defaults and shows OSD on the focused monitor:
    • swayosd-client --monitor "$(hyprctl monitors -j | jq -r '.[] | select(.focused == true).name')"

OSD Integration

  • ddcutil getvcp 10 returns a line containing “current value” and “max value”.
  • We parse it to get:
    • P (percent, e.g., 72)
    • R (ratio 0.00–1.00 for a progress bar)
  • SwayOSD’s --custom-progress shows a clean bar without changing any backlight devices:
    • --custom-progress "$R"
    • --custom-progress-text "Brightness: $P%"
    • --custom-icon display-brightness
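
The parsing step above can also be sketched outside of awk. The following is a minimal Python illustration, assuming the output line has the same shape as the ddcutil getvcp 10 sample shown in the companion post; the function name parse_brightness is hypothetical and is not part of the bindings.

```python
import re

def parse_brightness(line):
    """Return (percent, ratio) from one line of `ddcutil getvcp 10` output."""
    match = re.search(r"current value\s*=\s*(\d+),\s*max value\s*=\s*(\d+)", line)
    if match is None:
        raise ValueError(f"unexpected ddcutil output: {line!r}")
    current, maximum = int(match.group(1)), int(match.group(2))
    if maximum <= 0:
        return 0, 0.0
    # Round to the nearest percent, mirroring the awk expression.
    percent = int((current * 100 + maximum / 2) / maximum)
    # Clamp the ratio to 0..1 so the progress bar never overflows.
    ratio = min(max(current / maximum, 0.0), 1.0)
    return percent, round(ratio, 2)

sample = "VCP code 0x10 (Brightness                    ): current value =    40, max value =   100"
print(parse_brightness(sample))  # (40, 0.4)
```

The percent feeds the OSD text and the ratio feeds the progress bar, which is the same split as P and R above.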

Testing

  • Ensure the script is executable: chmod +x ~/bin/hdmi-brightness
  • Reload Hyprland: hyprctl reload
  • Press brightness keys and Alt+F1/F2:
    • External monitor brightness changes (DDC/CI)
    • OSD shows an accurate bar and percentage
  • If no OSD appears:
    • systemctl --user enable --now swayosd
    • Keep a window focused on the monitor where you want the OSD ($osdclient targets the focused monitor by default)

Troubleshooting

  • No brightness change: confirm ddcutil works (ddcutil detect, ddcutil getvcp 10), user in i2c group, and i2c-dev is loaded.
  • Different keycodes: use wev to check actual keysyms and update the binds accordingly.
  • OSD on the wrong output: pin $osdclient to a specific monitor name (e.g., --monitor "HDMI-A-1").

Why This Works

  • It replaces backlight-centric controls with DDC/CI, which external monitors use.
  • The OSD is decoupled from any system backlight and is driven by the value read back over DDC/CI, so it reflects what the monitor itself reports.

Omarchy - Setting screen brightness over HDMI by terminal

Posted on Wednesday, 10 September 2025

Omarchy

With the end of Windows 10 approaching, I installed Omarchy on one of my Beelink MiniS12 N95 machines, fully expecting just to play with it and then revert to Windows 11. Windows 11 was slow on that machine, though it made a decent cheap desktop to keep connected to a screen. Omarchy, even on this tiny hardware, has been absolutely flying, to the point that it's now my main desktop for everything.

One thing I needed was a way to control the screen brightness from the command line. I discovered I could use ddcutil, which I installed using the package manager on Omarchy.

sudo ddcutil detect
Display 1
   I2C bus:  /dev/i2c-0
   DRM_connector:           card1-HDMI-A-2
   EDID synopsis:
      Mfg id:               BNQ - UNK
      Model:                BenQ EL2870U
      Product code:         31049  (0x7949)
      Serial number:        58M02252SL0
      Binary serial number: 21573 (0x00005445)
      Manufacture year:     2021,  Week: 34
   VCP version:         2.2

My screen is connected by HDMI, so I was able to get the information on it.

~ ❯ sudo usermod -aG i2c $USER

I didn't want to use sudo every time I query or change the display, so I added my user to the i2c group, which grants access to the I2C bus. This could weaken security, and there are other ways to do it, but for my case it's fine.

~ ❯ ddcutil detect
Display 1
   I2C bus:  /dev/i2c-0
   DRM_connector:           card1-HDMI-A-2
   EDID synopsis:
      Mfg id:               BNQ - UNK
      Model:                BenQ EL2870U
      Product code:         31049  (0x7949)
      Serial number:        58M02252SL0
      Binary serial number: 21573 (0x00005445)
      Manufacture year:     2021,  Week: 34
   VCP version:         2.2

Getting and setting the brightness is then done with ddcutil getvcp 10 and ddcutil setvcp 10 20, where 10 is the VCP feature code for brightness (0x10) and 20 is the new value:

~ ❯ ddcutil getvcp 10
VCP code 0x10 (Brightness                    ): current value =    40, max value =   100

~ ❯ ddcutil setvcp 10 20

To get a listing of what codes the screen supports, you can use ddcutil capabilities.

ddcutil capabilities
Model: EL2870U
MCCS version: 2.2
Commands:
   Op Code: 01 (VCP Request)
   Op Code: 02 (VCP Response)
   Op Code: 03 (VCP Set)
   Op Code: 07 (Timing Request)
   Op Code: 0C (Save Settings)
   Op Code: E3 (Capabilities Reply)
   Op Code: F3 (Capabilities Request)
VCP Features:
   Feature: 02 (New control value)
   Feature: 04 (Restore factory defaults)
   Feature: 05 (Restore factory brightness/contrast defaults)
   Feature: 08 (Restore color defaults)
   Feature: 0B (Color temperature increment)
   Feature: 0C (Color temperature request)
   Feature: 10 (Brightness)
   Feature: 12 (Contrast)
   Feature: 14 (Select color preset)
      Values:
         04: 5000 K
         05: 6500 K
         08: 9300 K
         0b: User 1
   Feature: 16 (Video gain: Red)
   Feature: 18 (Video gain: Green)
   Feature: 1A (Video gain: Blue)
   Feature: 52 (Active control)
   Feature: 60 (Input Source)
      Values:
         0f: DisplayPort-1
         11: HDMI-1
         12: HDMI-2
   Feature: 62 (Audio speaker volume)
   Feature: 72 (Gamma)
      Invalid gamma descriptor: 50 64 78 8c a0
   Feature: 7D (Unrecognized feature)
      Values: 00 01 02 (interpretation unavailable)
   Feature: 7E (Trapezoid)
      Values: 03 0F 10 11 12 (interpretation unavailable)
   Feature: 7F (Unrecognized feature)
   Feature: 80 (Keystone)
      Values: 01 02 03 (interpretation unavailable)
   Feature: 86 (Display Scaling)
      Values:
         01: No scaling
         02: Max image, no aspect ration distortion
         05: Max vertical image with aspect ratio distortion
         0c: Unrecognized value
         10: Unrecognized value
         11: Unrecognized value
         13: Unrecognized value
         14: Unrecognized value
         15: Unrecognized value
         16: Unrecognized value
         17: Unrecognized value
   Feature: 87 (Sharpness)
   Feature: 8D (Audio mute/Screen blank)
      Values: 01 02 (interpretation unavailable)
   Feature: AC (Horizontal frequency)
   Feature: AE (Vertical frequency)
   Feature: B2 (Flat panel sub-pixel layout)
   Feature: B6 (Display technology type)
   Feature: C0 (Display usage time)
   Feature: C6 (Application enable key)
   Feature: C8 (Display controller type)
   Feature: C9 (Display firmware level)
   Feature: CA (OSD/Button Control)
      Values:
         01: OSD disabled, button events enabled
         02: OSD enabled, button events enabled
   Feature: CC (OSD Language)
      Values:
         01: Chinese (traditional, Hantai)
         02: English
         03: French
         04: German
         05: Italian
         06: Japanese
         07: Korean
         09: Russian
         0a: Spanish
         0b: Swedish
         0d: Chinese (simplified / Kantai)
         0e: Portuguese (Brazil)
         0f: Arabic
         12: Czech
         14: Dutch
         1a: Hungarian
         1e: Polish
         1f: Romanian
   Feature: D6 (Power mode)
      Values:
         01: DPM: On,  DPMS: Off
         05: Write only value to turn off display
   Feature: DA (Scan mode)
      Values:
         00: Normal operation
         02: Overscan
   Feature: DC (Display Mode)
      Values:
         04: User defined
         05: Games
         0b: Unrecognized value
         0c: Unrecognized value
         0e: Unrecognized value
         0f: Unrecognized value
         12: Unrecognized value
         13: Unrecognized value
         21: Unrecognized value
   Feature: DF (VCP Version)

This is what my BenQ exposes.

Ease of use

I then discovered that ddcutil also accepts relative changes, with the sign passed as a separate argument:

~ ❯ ddcutil setvcp 10 + 10
~ ❯ ddcutil setvcp 10 - 10
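
For scripting, the relative form can be assembled as an argument list. This is a minimal sketch, assuming a helper named setvcp_args (hypothetical, not part of ddcutil); it only builds the arguments and does not run anything.

```python
def setvcp_args(direction, step=10, feature="10"):
    """Build the argv list for a relative ddcutil brightness change."""
    signs = {"up": "+", "down": "-"}
    if direction not in signs:
        raise ValueError("direction must be 'up' or 'down'")
    if step <= 0:
        raise ValueError("step must be positive")
    # The sign is its own argument, matching `ddcutil setvcp 10 + 10`.
    return ["ddcutil", "setvcp", feature, signs[direction], str(step)]

print(" ".join(setvcp_args("up")))            # ddcutil setvcp 10 + 10
print(" ".join(setvcp_args("down", step=5)))  # ddcutil setvcp 10 - 5
```

On a machine with ddcutil installed, the list could be passed to subprocess.run(args, check=True); keeping the sign as a separate list element avoids any shell quoting questions.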

Next steps

I need to figure out how to wire up these commands to the brightness up / down key commands on Omarchy - so I can control the brightness on the keyboard. Still not sure how to get that configuration working, since it doesn't work out of the box with my screen with the default tooling.


It’s a proportional allocation - how hard can it be? Going from Water filling to QP

Posted on Wednesday, 16 April 2025

It’s a proportional allocation - how hard can it be? MIQP

Alice, Bob and Charlie buy a pizza and each pays part of the price: 50%, 30% and 20% respectively. The pizza arrives, they slice it up according to those shares and eat. But Alice is full after eating 40% of the pizza, leaving 10% of her 50% share untouched. How do we allocate that remaining 10% to Bob and Charlie? It can be cut into five 2% slices, and in proportion to their shares we give 3 to Bob (total 36%) and 2 to Charlie (total 24%).

Now take 100 MW of power to allocate across three products, with an ideal allocation ratio of (0.5, 0.3, 0.2) and a maximum of 40 MW per product.

These two problems are the same.
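
The pizza arithmetic can be checked with a few lines of Python. This is only an illustrative sketch; the function name redistribute and the use of exact fractions are my assumptions, chosen to avoid rounding noise.

```python
from fractions import Fraction

def redistribute(shares, leftover_from, leftover):
    """Split `leftover` among the other people in proportion to their shares."""
    others = {name: s for name, s in shares.items() if name != leftover_from}
    total_other = sum(others.values())
    result = {leftover_from: shares[leftover_from] - leftover}
    for name, share in others.items():
        result[name] = share + leftover * share / total_other
    return result

shares = {"Alice": Fraction(50), "Bob": Fraction(30), "Charlie": Fraction(20)}
final = redistribute(shares, "Alice", Fraction(10))
print({name: int(value) for name, value in final.items()})
# {'Alice': 40, 'Bob': 36, 'Charlie': 24}
```

Replacing the percentages with MW and Alice's appetite with a 40 MW cap gives the power allocation problem with the same numbers.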

Waterfilling algorithm

There exists a well known algorithm for solving this problem, the water filling algorithm.

In essence, we are looking for a common water level across three containers: every container that is not yet full is filled to the same level relative to its weight, and a full container stays at its maximum.

Definitions:

  • \(w_i\) is the weight from product i
  • \(w'_i\) is the updated weight when we have fewer products due to saturation
  • \(M_i\) is the maximum of product i
  • \(x'_i\) is the initial / ideal allocation in case the products are not saturated
  • \(x_i\) is the current allocation to the product
  • \(T\) is total allocation amount to spread over the products
  • \(T_r\) is the remaining allocation to spread over products after the saturated products are removed from T
  • \(saturated\) : means that the allocation is >= max on the product.
  • \(unsaturated\) : means that the allocation is < max on the product

Water-Filling (Iterative) Algorithm:

  • Step 1: Compute the ideal allocations \(x'_i = w_i \cdot T\)
  • Step 2: For any product i for which \(x'_i \geq M_i\) (saturated), set \(x_i = M_i\), otherwise \(x_i = x'_i\).
  • Step 3: Compute the remaining capacity by removing the capacity of saturated \(x_i\): \(T_r = T - \sum_{i_{\text{saturated}}} M_i\).
  • Step 4: For the remaining (unsaturated) products, redistribute \(T_r\) proportionally based on their weights normalized over the unsaturated set \(x_i = w^{\prime}_i \cdot T_r\) where \(w^{\prime}_i = \frac{w_i}{\sum_j w_j}\) with \(j\) representing the unsaturated products.
  • Step 5: Repeat the process if additional products get saturated during the redistribution, i.e. from Step 2.

This terminates because every pass either saturates at least one more product, which is then removed from the unsaturated set, or saturates none, in which case the allocation is final. With a finite number of products, the loop runs at most once per product plus a final pass.

import numpy as np

def iterative_waterfilling(T, weights, max_allocations):
    n = len(weights)
    allocations = np.zeros(n)
    unsaturated = np.array([True] * n)
    remaining_T = T

    while True:
        # Calculate proportional weights for unsaturated products
        current_weights = np.array(weights) * unsaturated
        total_current_weight = np.sum(current_weights)
        if total_current_weight == 0:
            # Every product is saturated; nothing is left to distribute.
            break
        
        # Calculate ideal allocation for unsaturated products
        ideal_allocations = (current_weights / total_current_weight) * remaining_T

        # Check for saturation
        newly_saturated = ideal_allocations >= max_allocations
        
        # Update allocations and saturation status
        if not np.any(newly_saturated & unsaturated):
            allocations[unsaturated] = ideal_allocations[unsaturated]
            break
        
        for i in range(n):
            if unsaturated[i] and newly_saturated[i]:
                allocations[i] = max_allocations[i]
                unsaturated[i] = False
                remaining_T -= allocations[i]

    return allocations

T = 100
weights = [0.5, 0.3, 0.2]
max_allocations = [40, 40, 40]

allocations_result = iterative_waterfilling(T, weights, max_allocations)
print("Iterative Waterfilling Result:", allocations_result)

Examples

Example 1 - total 100, allocation (0.5,0.3,0.2), maximum (40,40,40), result (40,36,24)

Example 2 - total 100, allocation (0.5,0.3,0.2), maximum (40,34,40), result (40,34,26)
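
To see how Example 2 reaches its result, it helps to print each round. Below is a small tracing sketch, separate from the implementation above; the function name waterfill_rounds is hypothetical, and weights are given as integer percentages with exact fractions so the intermediate values carry no floating point noise.

```python
from fractions import Fraction

def waterfill_rounds(total, weights, maxima):
    """Return the final allocation and one (capped, remaining) entry per round."""
    alloc = [Fraction(0)] * len(weights)
    active = set(range(len(weights)))
    remaining = Fraction(total)
    rounds = []
    while active:
        # Common level for all products that are still unsaturated.
        level = remaining / sum(weights[i] for i in active)
        capped = sorted(i for i in active if weights[i] * level >= maxima[i])
        if not capped:
            for i in active:
                alloc[i] = weights[i] * level
            break
        for i in capped:
            alloc[i] = Fraction(maxima[i])
            remaining -= maxima[i]
            active.discard(i)
        rounds.append((capped, remaining))
    return alloc, rounds

alloc, rounds = waterfill_rounds(100, [50, 30, 20], [40, 34, 40])
for number, (capped, remaining) in enumerate(rounds, start=1):
    print(f"round {number}: capped products {[i + 1 for i in capped]}, remaining {remaining}")
print("allocation:", [int(a) for a in alloc])
```

Product 1 is capped in the first round (60 remaining), product 2 in the second (26 remaining), and product 3 takes what is left, which gives (40, 34, 26).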

The problem with this algorithm is that, while it works wonderfully for simple proportional problems, it breaks down as soon as you add more constraints (such as minimums) or relations between the allocations.

So how do we solve this as an optimization problem? We want to find a solution that when unconstrained (unsaturated) falls back to the proportional allocation and when constrained (saturated maximum) falls back to the waterfilling algorithm.

From Waterfilling to QP : Mixed integer quadratic programming

While the iterative waterfilling algorithm effectively solves basic proportional allocation problems, it struggles under more complex scenarios involving additional constraints, such as minimum allocation limits or relational constraints between allocations. To robustly handle these real-world complexities, we leverage Mixed Integer Quadratic Programming (MIQP). MIQP elegantly generalizes the waterfilling logic into an optimization framework, allowing precise specification of constraints and objectives. By translating allocation decisions into a mathematical optimization problem, we ensure optimal, constraint-respecting allocations, making it suitable for applications demanding reliability and flexibility.

Definitions:

  • \(T\): total
  • \(w_i\): weight
  • \(M_i\): Maximum of allocation
  • \(d_i \in \lbrace 0,1 \rbrace\): binary saturation marker, 0 unsaturated, 1 saturated
  • \(x_i\): allocation amount
  • \(v_i\): target water level for unsaturated product (shared identity)
  • \(U\): upper large bound

Objective:

\(\min \sum_i (x_i - w_i T)^2\)

Constraints:

(I) Basic allocation limits

\(\forall i \quad x_i \geq 0 \quad (a)\)

\(\forall i \quad x_i \leq M_i \quad (b)\)

(II) Total allocation constraint

\(\sum_i x_i = T\)

(III) Saturation constraints \(\forall i\)

\(x_i = w_i \cdot v_i + M_i \cdot d_i \quad (a)\)

\(v_i \leq U \cdot (1 - d_i) \quad (b) \quad \text{with } v_i = 0 \text{ when saturated}\)

\(v_i \geq 0 \quad (c)\)

\(v_i \leq \frac{M_i}{w_i} \cdot (1 - d_i) + U \cdot d_i \quad (d)\)

(IV) Water level equality constraints across unsaturated products

\(\forall i (1 \rightarrow n), \quad \forall j (i+1 \rightarrow n)\)

\(v_i - v_j \leq U \cdot (d_i + d_j)\)

\(v_j - v_i \leq U \cdot (d_i + d_j)\)

If both are unsaturated, it means \(d_i = d_j = 0\), thus forcing \(v_i = v_j\).
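
The pairwise big-M constraints can be checked by hand for a candidate solution. This is a minimal sketch, assuming a helper named level_constraints_hold (hypothetical) and the same U = 500 as in the implementation; it only evaluates constraints (IV) and is not a solver.

```python
def level_constraints_hold(v, d, big_u, tol=1e-9):
    """Check constraints (IV): |v_i - v_j| <= U * (d_i + d_j) for every pair."""
    n = len(v)
    for i in range(n):
        for j in range(i + 1, n):
            slack = big_u * (d[i] + d[j])  # zero only when both are unsaturated
            if abs(v[i] - v[j]) > slack + tol:
                return False
    return True

U = 500.0
# Product 1 saturated (d = 1, v = 0); products 2 and 3 share the level 120.
print(level_constraints_hold([0.0, 120.0, 120.0], [1, 0, 0], U))  # True
# Products 2 and 3 unsaturated but at different levels: rejected.
print(level_constraints_hold([0.0, 120.0, 100.0], [1, 0, 0], U))  # False
```

Any pair involving a saturated product gets a slack of at least U, so the constraint is inactive there as long as U exceeds the largest possible level difference.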

Explanation of \(v_i\)

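This set of equations sets up \(v_i\) as a shared value \(z\) across all unsaturated products. In the example, product 1 is saturated at 40, so the total allocation constraint for the unsaturated products 2 and 3 becomes:

\(40 + (w_2 + w_3) \cdot z = 100\)

\((0.3 + 0.2) \cdot z = 60\)

\(z = 120\)

\(x_2 = 0.3 \cdot 120 = 36\)

\(x_3 = 0.2 \cdot 120 = 24\)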

Implementation in Python

#pip install numpy
#pip install cvxpy
#pip install ecos

import cvxpy as cp
import numpy as np

# Problem parameters
T = 100.0
w = [0.5, 0.3, 0.2]          # target weights for products 1, 2, and 3
M_max = [40.0, 40.0, 40.0]     # maximum allocation for each product

# Number of products
n = len(w)

# A sufficiently large constant U (big-M) for enforcing water-level equality.
U = 500.0

# Decision variables:
# x: allocation amounts
x = cp.Variable(n)
# v: auxiliary "water-level" variables for unsaturated products
v = cp.Variable(n)
# delta: binary variables; delta[i] = 1 means product i is saturated (x[i] = M_max[i])
delta = cp.Variable(n, boolean=True)

constraints = []

# For each product, define the allocation as the sum of the unsaturated part and the saturation term.
for i in range(n):
    # If not saturated (delta[i] = 0) then x[i] = w[i]*v[i].
    # If saturated (delta[i] = 1) then x[i] = M_max[i].
    constraints.append(x[i] == w[i] * v[i] + M_max[i] * delta[i])
    
    # Force v[i] = 0 when saturated, by bounding v[i] to 0 when delta[i]=1.
    constraints.append(v[i] <= U * (1 - delta[i]))
    # Ensure nonnegativity of v.
    constraints.append(v[i] >= 0)
    # When unsaturated (delta[i]=0), we must have x[i] = w[i]*v[i] ≤ M_max[i]. 
    # Thought: why not T here instead of M_max[i]? It would also be correct, but M_max[i] is a tighter bound, which helps convergence.
    constraints.append(v[i] <= (M_max[i] / w[i]) * (1 - delta[i]) + U * delta[i])

# Enforce that all unsaturated products share the same water-level.
# For every pair (i,j), if both are unsaturated (delta[i] = delta[j] = 0) then v[i] must equal v[j].
for i in range(n):
    for j in range(i+1, n):
        constraints.append(v[i] - v[j] <= U * (delta[i] + delta[j]))
        constraints.append(v[j] - v[i] <= U * (delta[i] + delta[j]))

# Total allocation constraint: the sum of all allocations must equal the available capacity.
constraints.append(cp.sum(x) == T)

# Ensure each x[i] does not exceed its maximum. (These could be built in via the definition of x.)
for i in range(n):
    constraints.append(x[i] >= 0)
    constraints.append(x[i] <= M_max[i])

# Define the target allocation for each product (unconstrained ideals)
target = np.array([w_i * T for w_i in w])

# Objective: minimize squared deviation from the ideal allocation.
objective = cp.Minimize(cp.sum_squares(x - target))

# Define and solve the MIQP.
# Use a MIQP-capable solver such as GUROBI, CPLEX, ECOS_BB, etc.
prob = cp.Problem(objective, constraints)
result = prob.solve(solver=cp.ECOS_BB)


print("Status:", prob.status)
print("Optimal value:", result)
print("Optimal allocations:")
for i in range(n):
    print(f"  x[{i+1}] = {x.value[i]:.4f}")
print("Water-level (v) values:")
for i in range(n):
    print(f"  v[{i+1}] = {v.value[i]:.4f}")
print("Saturation indicators (delta) values:")
for i in range(n):
    print(f"  delta[{i+1}] = {delta.value[i]:.4f}")

# Expected behavior for this example (products numbered from 1, as in the printed output):
# - For product 1, the unconstrained target is 50 but M_max[0]=40, so we expect it to be saturated (delta[0]=1, x[0]=40).
# - For products 2 and 3 (unsaturated, delta = 0), they share the same water-level z.
#   The total allocation constraint becomes: 40 + (w[1] + w[2]) * z = 100  -->  (0.3+0.2)*z = 60, so z = 120.
#   Hence, the printed x[2] = 0.3 * 120 = 36 and the printed x[3] = 0.2 * 120 = 24.

Summary

In business applications, we often see allocations that should be "preference based" if unconstrained, but then with different constraints it becomes complicated to tease out the preferences in an optimal way. This QP application shows a way to have an allocation identical to the water filling, but with the flexibility of QP.

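If you simplify the problem by removing the water-level constraints (IV), the output becomes (40, 35, 25), since that minimises the objective function: the 10 MW that product 1 cannot take is then split equally between the other two products rather than proportionally.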
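
The difference between the two outcomes can be reproduced with a few lines of arithmetic. This sketch assumes product 1 is already fixed at its 40 MW cap and that the other caps are not binding; the function names are hypothetical.

```python
def least_squares_split(remaining, targets):
    """Minimise sum((x_i - t_i)^2) subject to sum(x_i) == remaining.

    With only the sum constraint active, the optimum adds the same
    shift to every target.
    """
    shift = (remaining - sum(targets)) / len(targets)
    return [t + shift for t in targets]

def proportional_split(remaining, weights):
    """Split `remaining` in proportion to the weights (the water-filling result)."""
    total_weight = sum(weights)
    return [remaining * w / total_weight for w in weights]

# Products 2 and 3: ideal targets 30 and 20, with 60 MW left after product 1.
print(least_squares_split(60, [30, 20]))  # [35.0, 25.0]
print(proportional_split(60, [30, 20]))   # [36.0, 24.0]
```

The plain least-squares objective spreads the surplus evenly, while the water-level constraints force it to follow the weights, which is why constraints (IV) are needed to match water filling.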

This example is an abstraction of a practical problem: allocating ancillary service capacity to a series of contracts, according to trader preferences. This chapter is part of my book "Energy - From Asset to Cashflow" (not yet published).


In-memory software design 2025 ? - from 40GB/s to 55GB/s

Posted on Sunday, 16 February 2025

In-memory software design in 2025 - from 40GB/s to 55GB/s

In the last post, we looked at techniques for speeding up partial sums: to reduce the number of trades scanned, we used in-memory indexes built on sets.

But what if we simply want to go faster for the complete aggregation?

Up to now, we've been programming for the programmer, making the program easy to write and understand. But what if we make the program easy for the CPU to run?

How do we do that? By improving memory locality and structures.

Memory locality

The initial implementation uses a std::vector<Trade> trades; with each trade maintaining a std::vector<DailyDelivery> dailyDeliveries; that contains the two vectors of power and value std::array<int, 100> power; std::array<int, 100> value;.

TradeInMemory

While std::vector does a great job of keeping its own memory contiguous in C++, when you nest vectors in vectors and allocate the sub-vectors separately, you create fragmented memory. This means the CPU's memory controller has to jump around a lot to fetch the next block, and there is a high probability that adjacent data will not already be in the cache when a core needs it.

The best architecture for locality always depends on the access patterns to the data. In our case, we are going to optimize for maximum speed when doing complete aggregations.

TradeCompressed

This is done by creating a dedicated area in which the Power and Value data of every delivery day is stored. Each trade points to the beginning and end of its data in that area. To move inside the area, you only need the trade's start delivery date and the size of a delivery day vector (100 ints), and you can index directly into the large array.
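
The index arithmetic is easier to see in isolation. Here is a small sketch in Python rather than C++, assuming each trade stores an offset and a count of delivery days (the field names mirror deliveriesOffset and numDeliveries from the C++ loop; the FlatTrade class and helper functions are hypothetical).

```python
from dataclasses import dataclass

SLOTS_PER_DAY = 100  # ints per delivery day, as in the post

@dataclass
class FlatTrade:
    deliveries_offset: int  # index of the trade's first delivery day
    num_deliveries: int     # number of consecutive delivery days

def trade_slice(trade):
    """Slice of the flat array covering all delivery days of a trade."""
    start = trade.deliveries_offset * SLOTS_PER_DAY
    return slice(start, start + trade.num_deliveries * SLOTS_PER_DAY)

def day_slice(trade, day):
    """Slice covering one delivery day, counted from the trade's start."""
    if not 0 <= day < trade.num_deliveries:
        raise IndexError("day outside the trade's delivery range")
    start = (trade.deliveries_offset + day) * SLOTS_PER_DAY
    return slice(start, start + SLOTS_PER_DAY)

trades = [FlatTrade(0, 2), FlatTrade(2, 3)]
flat_power = list(range(5 * SLOTS_PER_DAY))  # stand-in data for 5 delivery days
print(sum(flat_power[trade_slice(trades[1])]))
print(day_slice(trades[1], 1))
```

Because every trade's data is one contiguous range, a full aggregation walks the flat array front to back, which is the access pattern the hardware prefetcher handles best.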

Now, if you are dealing with a small number of trades (~50'000 with an average of 120 days of delivery, roughly 4.9 GB of RAM for 600 million quarter hours), you can get away with a single contiguous block of RAM for your delivery vector (see the Limits section below for a more practical production discussion):

    // Each daily delivery contributes 100 ints for power and 100 ints for value.
    std::vector<int> flatPower(totalDailyDeliveries * 100);
    std::vector<int> flatValue(totalDailyDeliveries * 100);

Now with this information, we can directly index into the area based on the trades data:

#ifdef _OPENMP
#pragma omp parallel for reduction(+: all_totalPower, all_totalValue, \
                                     traderX_totalPower, traderX_totalValue, \
                                     traderX_area1_totalPower, traderX_area1_totalValue, \
                                     traderX_area2_totalPower, traderX_area2_totalValue)
#endif
    for (std::size_t i = 0; i < flatTrades.size(); ++i)
    {
        const auto& ft = flatTrades[i];
        // Each trade's deliveries are stored in a contiguous block.
        size_t startIndex = ft.deliveriesOffset * 100;
        size_t numInts = ft.numDeliveries * 100;
        long long sumPower = 0;
        long long sumValue = 0;
        for (size_t j = 0; j < numInts; ++j)
        {
            sumPower += flatPower[startIndex + j];
            sumValue += flatValue[startIndex + j];
        }
        // (a) All trades
        all_totalPower += sumPower;
        all_totalValue += sumValue;
        // ... the per-trader and per-area totals from the reduction clause are omitted in this excerpt
    }

When running with this configuration, my time to aggregate on the laptop improves from 241ms to 178ms, roughly a 35% increase in throughput (241/178 ≈ 1.35), which gets us to about 55 GB/s on a commodity laptop (using OpenMP of course).

Limits

But as you scale to larger and larger numbers of trades and delivery days, you will find that the vector allocator cannot provide all of that in a single block.

At that point, we start managing our own block memory via a custom allocator. By creating blocks of 64 to 256MB that we allocate as needed, and indexing our trade deliveries into them, we can scale up to use most of the machine's memory.

Two good references on that are Custom Allocators in C++: High Performance Memory Management and CppCon 2017: Bob Steagall “How to Write a Custom Allocator”.
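
The block-based layout described above boils down to mapping a global index to a block and an offset. The following is a toy Python sketch of that mapping, not a C++ allocator; the BlockStore class and its tiny block length are assumptions made for illustration.

```python
class BlockStore:
    """Fixed-size blocks allocated on demand; a global index maps to (block, offset)."""

    def __init__(self, block_len):
        self.block_len = block_len
        self.blocks = []
        self.length = 0

    def append_run(self, values):
        """Append values and return the global index of the first one."""
        first = self.length
        for value in values:
            block, offset = divmod(self.length, self.block_len)
            if block == len(self.blocks):
                self.blocks.append([0] * self.block_len)  # allocate a new block
            self.blocks[block][offset] = value
            self.length += 1
        return first

    def get(self, index):
        if not 0 <= index < self.length:
            raise IndexError(index)
        block, offset = divmod(index, self.block_len)
        return self.blocks[block][offset]

store = BlockStore(block_len=4)
first = store.append_run(range(10))
print(len(store.blocks), store.get(9))  # 3 9
```

In a real implementation the block length would correspond to 64 to 256MB, and choosing it as a multiple of the 100-int delivery day would keep a single day from straddling two blocks.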

Next steps?

Going from a simple data structure to a memory adapted structure allowed us to go from 40GB/s (42% of laptop bandwidth) to 55GB/s (57% of laptop's 96 GB/s bandwidth).

If you still need more performance, you must further adapt your data structures and locality to the access patterns. Then start looking in depth at where the stalls are in the execution trace. There are other approaches, such as looking at AVX instructions in more detail, loop unrolling, and so on. Get a real C++ expert to consult on it.

Or just get a machine with more bandwidth! An example of that is the Max and Ultra family of chips from Apple, with memory bandwidth of up to around 800GB/s on the Ultra parts, over 8x what my laptop has. Or just run on a server: as noted in the first article, Azure now has a machine with 6'900 GB/s of bandwidth.