HMM trellis microkernel - experiment log

  • Kernel: smile/progs/sci/hmm_step.c
  • Commit: 1b783a3
  • Host: Mac M1, riscv64-unknown-elf-gcc 13.2.0 (rv32i_zicsr, ilp32)
  • Build + run commands: see the run entry below.
  • Toy workload: 3-mer HMM, M = 64 states, N = 26 events, NUM_PATH = 21 transitions/state.

Each run:

  • builds hmm_step.c into a bare-metal RV32 ELF starting at _start at address 0x0;
  • converts to a flat binary prog.bin;
  • runs on Tile1 via tb_tile1;
  • reports the ECALL exit code and a single stats line:
    [STATS] inst=... alu=... loads=... stores=... branches=... taken=...
    

The kernel itself:

  • computes the forward trellis (Viterbi-style) with column-wise normalization,
  • takes a checksum of the final column + end state,
  • writes it to 0x0100,
  • exits with result & 0xff as the exit code.
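
As a reading aid, here is a minimal C sketch of that control flow. The helper names (init_first_column, trellis_step, normalize_column, checksum_final) are hypothetical; only the array names, the 0x0100 result address, and the ECALL exit convention come from this log:

#define M 64
#define N 26

static int log_post[M], cur_log_post[M], last_col[M];

/* Hypothetical helpers standing in for the real kernel code. */
void init_first_column(int *col);                     /* event 0 emissions        */
void trellis_step(int i, const int *prev, int *cur);  /* Viterbi-style max        */
void normalize_column(const int *cur, int *next);     /* column-wise rescale      */
int  checksum_final(const int *col, int *last);       /* final column + end state */

static void ecall_exit(int code)
{
    register int a0 __asm__("a0") = code;
    __asm__ volatile ("ecall" : : "r"(a0));   /* reported as the exit code */
    for (;;) ;
}

void _start(void)
{
    init_first_column(log_post);
    for (int i = 1; i < N; i++) {
        trellis_step(i, log_post, cur_log_post);
        normalize_column(cur_log_post, log_post);
    }
    int result = checksum_final(log_post, last_col);
    *(volatile int *)0x0100 = result;         /* checksum word at 0x0100 */
    ecall_exit(result & 0xff);
}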

Legend: DP vs scaffolding arrays

  • DP (dynamic-programming) arrays – core trellis state / model data:
    • log_post[M], cur_log_post[M], last_col[M]
    • event_over_stdv[N], mu_over_stdv[M]
    • neg_log_ptau[NUM_PATH]
  • Scaffolding arrays – implementation mechanics around the recurrence:
    • tran_prev[NUM_PATH] – indices of the 21 predecessor states
    • temp[NUM_PATH] – candidate transition scores for a given state/event
    • pointers[M][N - 1] – backpointer indices (0..20) per state/event
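
To make the DP/scaffolding distinction concrete, here is a sketch of one inner trellis cell. This is a reconstruction from the array names above, not a quote of hmm_step.c (the sign conventions and emission form are assumptions), and the real kernel's per-cell access counts differ in detail (see the per-cell table later in this log):

#define M 64
#define N 26
#define NUM_PATH 21

extern int log_post[M], cur_log_post[M];
extern int event_over_stdv[N], mu_over_stdv[M], neg_log_ptau[NUM_PATH];
extern int tran_prev[NUM_PATH], temp[NUM_PATH];
extern unsigned char pointers[M][N - 1];

/* One cell: state j, event i (i >= 1). */
void trellis_cell(int i, int j)
{
    int best_v = 0;
    for (int v = 0; v < NUM_PATH; v++) {
        /* DP loads (log_post, neg_log_ptau) plus scaffolding traffic
         * (tran_prev load, temp store/loads): score 21 predecessors. */
        temp[v] = log_post[tran_prev[v]] - neg_log_ptau[v];
        if (temp[v] > temp[best_v])
            best_v = v;
    }
    int dist = event_over_stdv[i] - mu_over_stdv[j];   /* DP loads          */
    cur_log_post[j] = temp[best_v] - dist * dist;      /* the one DP store  */
    pointers[j][i - 1] = (unsigned char)best_v;        /* scaffolding store */
}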

Run 2026-02-04 – baseline hmm_step.c (M = 64, N = 26)

  • Kernel: smile/progs/sci/hmm_step.c
  • Host: Mac M1, riscv64-unknown-elf-gcc 13.2.0
  • Build + run commands:
# From smarc/smile/progs:
# Build the program as RV32I (software multiply via __mulsi3)…
riscv64-unknown-elf-gcc -Os -march=rv32i_zicsr -mabi=ilp32 \
  -ffreestanding -nostdlib -nostartfiles \
  -Wl,-T link_rv32.ld -Wl,-e,_start -Wl,--no-relax \
  sci/hmm_step.c -o prog.elf

# Or, build as RV32IM (hardware M extension)…
riscv64-unknown-elf-gcc -Os -march=rv32im_zicsr -mabi=ilp32 \
  -ffreestanding -nostdlib -nostartfiles \
  -Wl,-T link_rv32.ld -Wl,-e,_start -Wl,--no-relax \
  sci/hmm_step.c -o prog.elf

# Convert to flat binary
riscv64-unknown-elf-objcopy -O binary prog.elf prog.bin

# From smarc:
./build/smile/tb_tile1 -prog=smile/progs/prog.bin -load_addr=0x0 -start_pc=0x0 -steps=2000000 -sw_threads=1
  • Program exit:
[EXIT] Program exited with code 253
  • Tile1 stats (two builds of the same C code):
# RV32I (no M extension: software mul via __mulsi3)
[STATS] inst=909545 alu=542381 add=215546 mul=0 loads=145906 stores=73767 branches=130897 taken=114354

# RV32IM (with M extension: mul/div/rem in hardware)
[STATS] inst=751816 alu=419769 add=210009 mul=1664 loads=145906 stores=75342 branches=107545 taken=98203
  • Quick per-event metrics (N = 26 events, M = 64 states):

    • RV32I:
      • inst/event ≈ 35.0k; inst/cell ≈ 0.54k
      • alu/event ≈ 20.9k; alu/cell ≈ 0.33k
      • add/event ≈ 8.3k; add/cell ≈ 0.13k
      • mul/event = 0 (all multiplies implemented in software via __mulsi3)
      • loads/event ≈ 5.6k; loads/cell ≈ 0.09k
      • stores/event ≈ 2.8k; stores/cell ≈ 0.04k
      • bytes/event ≈ 33.6k; bytes/cell ≈ 0.52k
      • branches/event ≈ 5.0k
      • branch taken ratio ≈ 114,354 / 130,897 ≈ 0.87

      intensity metrics:

      • alu/cell = 330
      • ops/cell = 130 (counting add + mul)
      • alu/byte = 0.33k/0.52k ≈ 0.63 ALU ops per byte
      • add/byte = 0.13k/0.52k ≈ 0.25 adds per byte
    • RV32IM:
      • inst/event ≈ 28.9k; inst/cell ≈ 0.46k
      • alu/event ≈ 16.1k; alu/cell ≈ 0.25k
      • add/event ≈ 8.1k; add/cell ≈ 0.13k
      • mul/event ≈ 0.064k; mul/cell ≈ 0.001k
      • ops/event ≈ 8.14k; ops/cell ≈ 0.13k
      • loads/event ≈ 5.6k; loads/cell ≈ 0.09k
      • stores/event ≈ 2.9k; stores/cell ≈ 0.04k
      • bytes/event ≈ 34.0k; bytes/cell ≈ 0.53k
      • branches/event ≈ 4.1k
      • branch taken ratio ≈ 98,203 / 107,545 ≈ 0.91

      intensity metrics:

      • alu/cell = 250
      • ops/cell = 130 (counting add + mul)
      • alu/byte = 0.25k/0.53k ≈ 0.47 ALU ops per byte
      • add/byte = 0.13k/0.53k ≈ 0.25 adds per byte
      • ops/byte = 0.13k/0.53k ≈ 0.25 ops per byte (counting add + mul)
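
      For reference, the bytes figures above are consistent with counting 4 bytes per (word-sized) load and store. That accounting is an inference on my part, but it reproduces the RV32IM number:

\[
\text{bytes/event} = \frac{4\,(\text{loads} + \text{stores})}{N}
                   = \frac{4 \times (145{,}906 + 75{,}342)}{26} \approx 34{,}038 \approx 34.0\text{k}
\]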

      Hand calculations show about 24 add/sub ops + 1 mul per cell. Those 25 ops/cell account for about 25/130 ≈ 19% of the counted add+mul ops/cell; the remainder (and the larger ALU count) is likely loop control, address calculation, and other overhead.

  • For loads and stores:

In what follows, “DP arrays” and “scaffolding arrays” are used exactly as defined in the legend above.

We can decompose the memory traffic into four phases and then separate “DP-array” traffic (the actual trellis state and model/input data) from “scaffolding” traffic (helper arrays and backpointers).

Analytical breakdown by phase (approximate counts)

Here M = 64, N = 26, so there are (N − 1) × M = 25 × 64 = 1600 inner trellis cells.

Phase                    | Description                                       | Loads (approx) | Stores (approx)
Initial column           | Event 0 emission into log_post                    | 128            | 64
Trellis cells            | Events 1…N−1, all states (1600 cells)             | 139,200        | 70,400
Per-event normalization  | 25 columns, normalize cur_log_post into log_post  | 3,225          | 1,600
Final column + checksum  | Snapshot to last_col, end-state + checksum        | 192            | 64
Total (analytic)         |                                                   | 142,745        | 72,128
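
These rows factor cleanly in terms of M = 64 and the per-cell counts derived below (87 loads and 44 stores per inner cell). Reading the normalization as roughly 2M + 1 loads and M stores per column, and the initial/final columns as 2M and 3M loads, is my interpretation of the totals:

\[
\begin{aligned}
\text{trellis loads}       &= 1600 \times 87 = 139{,}200, & \text{trellis stores}       &= 1600 \times 44 = 70{,}400,\\
\text{normalization loads} &= 25 \times (2M + 1) = 3{,}225, & \text{normalization stores} &= 25 \times M = 1{,}600,\\
\text{initial-column loads} &= 2M = 128, & \text{final-column loads} &= 3M = 192.
\end{aligned}
\]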

Comparing with the measured Tile1 stats (RV32IM run: loads=145,906, stores=75,342) gives per-cell numbers:

  • measured loads/cell ≈ 145,906 / 1600 ≈ 91.2
  • measured stores/cell ≈ 75,342 / 1600 ≈ 47.1

versus the analytic estimate:

  • estimated loads/cell ≈ 142,745 / 1600 ≈ 89.2
  • estimated stores/cell ≈ 72,128 / 1600 ≈ 45.1

The small gap is expected (prologue/epilogue, exit glue, a bit of extra compiler scaffolding), but the match is close enough to validate the hand-count.
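
Quantifying that gap (simple subtraction of the totals above):

\[
145{,}906 - 142{,}745 = 3{,}161 \text{ extra loads}, \qquad
75{,}342 - 72{,}128 = 3{,}214 \text{ extra stores},
\]

i.e. about 2 unaccounted loads and 2 unaccounted stores per cell.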

Per-cell split: DP arrays vs. scaffolding

Within each trellis cell, the memory ops break down roughly as follows:

  • DP arrays (true algorithmic state/data):
    • log_post[j], cur_log_post[j], last_col[j]
    • event_over_stdv[i], mu_over_stdv[j]
    • neg_log_ptau[v]
  • Scaffolding:
    • tran_prev[v] (21 predecessor indices)
    • temp[v] (21 candidate transition scores)
    • pointers[j][i-1] (backpointer index 0…20)

Ignoring the initial column, normalization, and final checksum, the inner trellis cell loads/stores look like this:

Category         | Arrays                                                 | Loads/cell | Stores/cell
DP arrays        | log_post, event_over_stdv, mu_over_stdv, neg_log_ptau  | 44         | 1
Scaffolding      | tran_prev, temp, pointers                              | 43         | 43
Total (per cell) |                                                        | 87         | 44

Over all 1600 cells:

  • DP loads ≈ 44 × 1600 = 70,400
  • scaffolding loads ≈ 43 × 1600 = 68,800
  • DP stores ≈ 1 × 1600 = 1,600
  • scaffolding stores ≈ 43 × 1600 = 68,800

Adding the initial column, per-column normalization, and final-column checksum (all of which touch only DP arrays) gives the earlier phase totals:

  • DP loads ≈ 73,945, scaffolding loads ≈ 68,800 → total ≈ 142,745
  • DP stores ≈ 3,328, scaffolding stores ≈ 68,800 → total ≈ 72,128

So the hand-count says that loads are split roughly 50/50 between DP state and scaffolding, while almost all stores go into scaffolding arrays (mainly tran_prev, temp, and pointers). This matches the intuition that the DP state itself is relatively compact: a large chunk of the memory traffic comes from how we implement the trellis mechanics in C rather than from the bare Viterbi recurrence.

  • Notes:

    • The only intentional difference between the two builds is how log_em() computes dist * dist:
      • RV32I: GCC lowers the multiply to a call to __mulsi3, which is implemented here as a shift–add loop in C (see the generic sketch at the end of this entry).
      • RV32IM: GCC emits a single mul instruction per call (plus surrounding code).
    • The delta between the two runs (now measured with -sw_threads=1) is:
      • Δinst ≈ 1.58×10^5 fewer instructions when using hardware mul
      • Δalu ≈ 1.23×10^5 fewer ALU-category ops
      • Δbranches ≈ 2.34×10^4 fewer branches
    • Loads are identical across both builds, which is a useful sanity check that the DP structure and memory access pattern are unchanged. Stores differ slightly because of small scheduling/code-gen changes. The explicit add/mul counters show:
      • RV32I: add=215,546, mul=0 (all multiplies implemented in software via __mulsi3)
      • RV32IM: add=210,009, mul=1,664 (each trellis multiply maps to a single mul instruction)
    • This pair of runs serves as a small but realistic data point for “software vs hardware multiply” in an HMM-style basecalling kernel. Future experiments can scale M/N or modify the transition structure while reusing the same measurement setup.

    • Back-of-the-envelope per-multiply cost:

      • Each trellis cell does one emission multiply: dist * dist.
      • There are N - 1 = 25 non-initial events, and M = 64 states, so:
        • ~25 × 64 = 1600 multiplies in the main loop.
        • Including extra multiplies for the initial column (and any checksum glue), total multiplies ≈ 1664 = 64 × 26, i.e. one per state per event.
      • Using the measured deltas between RV32I (software mul) and RV32IM (hardware mul) runs:

        • Δinst ≈ 157,729 / 1664 ≈ 95 instructions per multiply
        • Δalu ≈ 122,612 / 1664 ≈ 74 ALU ops per multiply
        • Δbranches ≈ 23,352 / 1664 ≈ 14 branches per multiply

        • The measured mul=1664 in the RV32IM run matches the analytical estimate for trellis multiplies (one emission multiply per state/event, plus a small number of extras for initialization/checksum), which reinforces that this kernel is behaving as intended.
      • This is very much a back-of-the-envelope estimate:
        • The code evolved slightly between runs.
        • GCC may schedule or inline things differently across builds.
      • Still, it gives a good ballpark intuition: each software multiply cost on the order of a hundred instructions (≈95 by the deltas above); with the M extension, each multiply collapses to a single mul plus minimal surrounding overhead.
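
For reference, a generic shift–add __mulsi3 has the shape sketched below (illustrative only, not the exact routine linked into this build). A loop of this shape runs up to 32 iterations of a handful of instructions each, which is the right order of magnitude for the measured ≈95 instructions per multiply:

/* Generic shift-add 32-bit multiply - the usual shape of a software
 * __mulsi3. Illustrative sketch, not the routine used here. */
unsigned int mulsi3_sketch(unsigned int a, unsigned int b)
{
    unsigned int r = 0;
    while (b != 0) {        /* up to 32 iterations */
        if (b & 1)          /* conditional accumulate */
            r += a;
        a <<= 1;            /* shift multiplicand up */
        b >>= 1;            /* consume one multiplier bit */
    }
    return r;
}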