Systems Engineering Linux Kernel · x86 · Arm · Accelerators

First Silicon Is Not the Finish Line

From power-on to production ML — how new boards and accelerators become real Linux platforms: firmware, ACPI, Device Tree, drivers, DMA, memory management, compiler lowering, graph capture, collectives, and debugging.

ACPI · Device Tree · Platform Drivers · IOMMU / DMA · Graph Capture · Collectives · Profiling
TL;DR

Bring-up is a stack of contracts. Every layer — firmware, kernel, runtime, compiler — must make and keep a promise to the layer above it. Until all contracts are sound, the silicon is not a platform.

Firmware → kernel handoff · DMA correctness · Runtime orchestration · Compiler lowering · Collective comms

The uncomfortable truth: a chip is not a platform

A new chip can tape out successfully, its board can power on, and the platform can still be months away from being useful to anyone. Silicon readiness and software readiness are different milestones, and the gap between them is where teams get hurt.

"First silicon" is usually the beginning of platform bring-up, not the end of development. The board still needs firmware, boot flow, clock and reset sequencing, interrupt routing, memory maps, IOMMU setup, device enumeration, kernel drivers, runtime libraries, compiler support, graph capture, collective communication, profiling hooks, and failure diagnostics.

The silicon may be ready, but the software stack decides whether the silicon is usable, debuggable, schedulable, and fast.

That is why "new chip = 2× faster" is rarely true on day one. Performance arrives only when the kernel and runtime can keep the device fed without drowning it in synchronization, page faults, inefficient memory movement, or poorly shaped execution graphs.

The board bring-up mental model

Adding Linux support for a new board is not one patch. It is a stack of contracts. Firmware describes the hardware. The kernel enumerates it. Drivers bind to devices. The memory subsystem makes buffers visible to the right agents. Runtime software converts user intent into kernel-visible operations. Debug tools make the inevitable failures observable.

LAYER 4   Applications / Frameworks: PyTorch, TensorFlow, JAX, inference servers
LAYER 3   Compiler + Runtime: lowering, graph capture, kernels, collective comms
LAYER 2   Linux Kernel: drivers, DMA, memory allocator, scheduler, tracing
LAYER 1a  x86: UEFI + ACPI + PCIe
LAYER 1b  Arm: U-Boot / UEFI + DT / ACPI
LAYER 0   Silicon + Board: SoC, accelerator, DRAM / HBM, PCIe / CXL, NIC, PMIC

Practical rule: if a board needs custom clocks, reset lines, GPIOs, regulators, mailbox firmware, IOMMU windows, reserved memory, or non-standard interrupt routing, the kernel cannot "just discover it." It needs a correct hardware description and drivers that interpret that description correctly.

x86 and Arm solve discovery differently

The first architectural split is how Linux learns what hardware exists. On conventional x86 systems, enumeration is driven by UEFI, ACPI, PCI/PCIe discovery, SMBIOS, and chipset conventions. On Arm boards — especially embedded and SoC-style systems — Device Tree is still the common way to describe non-discoverable hardware; Arm server systems may use UEFI and ACPI.

| Area | x86 board path | Arm board path | What can go wrong |
|---|---|---|---|
| Boot firmware | UEFI/BIOS initializes platform state and publishes ACPI tables. | Boot ROM → SPL/TF-A → U-Boot or UEFI; passes a Device Tree blob or ACPI tables. | Wrong memory map, missing reserved regions, broken PSCI/SMC interface, invalid ACPI/DT nodes. |
| Hardware description | ACPI namespace, PCI config space, and tables such as MADT and DMAR (Intel) or IVRS (AMD). | Device Tree nodes for SoC blocks, board wiring, clocks, resets, regulators; ACPI (including IORT) for server-class Arm. | Driver never probes, resources are wrong, interrupt line is wrong, DMA mask is wrong. |
| Device model | Mostly discoverable PCIe devices plus platform devices described through ACPI. | Many non-discoverable platform devices described by Device Tree. | Probe deferral loops, missing regulators/clocks, compatible strings that do not match the driver's match table. |
| Accelerator integration | PCIe card with BARs, MSI-X, DMA/IOMMU, user runtime, firmware loader. | PCIe card or integrated SoC accelerator with reserved memory, mailbox, SMMU, power domains. | Good microbenchmarks but poor end-to-end performance due to memory movement and runtime stalls. |

Arm Device Tree: describe the board, not just the SoC

Device Tree is a data structure passed to the kernel that describes hardware layout: CPUs, memory, buses, interrupt controllers, clocks, GPIOs, regulators, reserved memory, and devices that cannot be discovered automatically. The board file wires what the SoC file defines — it decides whether a node is enabled, which memory region it uses, and how it is physically connected.

arch/arm64/boot/dts/example/example-ai-board-reva.dts
/dts-v1/;
#include "example-ai-soc.dtsi"

/ {
    model = "Example AI Board RevA";
    compatible = "example,ai-board-reva", "example,ai-soc";

    chosen {
        stdout-path = "serial0:115200n8";
        bootargs = "earlycon console=ttyS0,115200";
    };

    memory@80000000 {
        device_type = "memory";
        reg = <0x0 0x80000000 0x0 0x20000000>;   /* 512 MiB DRAM, ends just below the MMIO window at 0xa0000000 */
    };

    reserved-memory {
        #address-cells = <2>;
        #size-cells = <2>;
        ranges;

        /* 64 MiB reserved for firmware command buffers; no-map keeps
           the kernel from mapping or otherwise touching this region */
        accel_reserved: accel-buffer@90000000 {
            compatible = "shared-dma-pool";
            reg = <0x0 0x90000000 0x0 0x04000000>;
            no-map;
        };
    };
};

/* Board-level override: enable the accelerator defined in the .dtsi */
&ai_accel0 {
    compatible = "example,ai-accelerator-v1";
    reg = <0x0 0xa0000000 0x0 0x10000>;         /* 64 KiB MMIO window */
    interrupts = <GIC_SPI 83 IRQ_TYPE_LEVEL_HIGH>;
    clocks = <&clk 12>;
    resets = <&rst 4>;
    memory-region = <&accel_reserved>;
    dma-coherent;
    status = "okay";
};
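
A board file like this is easy to get subtly wrong: a reserved region that falls outside DRAM, or an MMIO window that lands inside it. The following is a minimal userspace sketch of the two range checks worth running over any memory map before boot. The `struct region` type and both helper names are hypothetical, and the sketch assumes `base + size` does not wrap around 64 bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Half-open physical address range [base, base + size).
 * Assumption: base + size does not overflow uint64_t. */
struct region {
    uint64_t base;
    uint64_t size;
};

/* True when the two ranges share at least one byte. */
static bool region_overlaps(struct region a, struct region b)
{
    assert(a.size > 0 && b.size > 0);
    return a.base < b.base + b.size && b.base < a.base + a.size;
}

/* True when `inner` lies entirely inside `outer`. */
static bool region_contains(struct region outer, struct region inner)
{
    assert(outer.size > 0 && inner.size > 0);
    return inner.base >= outer.base &&
           inner.base + inner.size <= outer.base + outer.size;
}
```

Feeding it the `memory`, `reserved-memory`, and device `reg` ranges from a board file gives two invariants to assert: every reserved region is contained in DRAM, and no MMIO window overlaps DRAM.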

x86 ACPI: firmware publishes platform contracts

On x86, board support often looks invisible because the platform description lives in firmware tables — but the work still exists. The ACPI namespace must describe devices, resources, power states, interrupts, and device-specific properties. The kernel driver matches a Hardware ID, then reads everything else from ACPI.

Illustrative DSDT/SSDT ASL fragment — memory-mapped accelerator
Device (XAI0)
{
    Name (_HID, "EXAI0001")          /* Hardware ID matched by Linux driver   */
    Name (_UID, 0)
    Name (_CCA, One)                 /* Cache-coherent DMA (platform-specific) */

    Method (_CRS, 0, NotSerialized)
    {
        Name (RBUF, ResourceTemplate () {
            Memory32Fixed (ReadWrite, 0xA0000000, 0x00010000)   /* 64 KiB MMIO window */
            Interrupt (ResourceConsumer, Level, ActiveHigh, Exclusive) { 83 }
        })
        Return (RBUF)
    }

    /* _DSD: device-specific data — treat this like a stable firmware ABI */
    Name (_DSD, Package () {
        ToUUID ("daffd814-6eba-4d8c-8a91-bc9bbf4aa301"),
        Package () {
            Package () { "firmware-name",    "example/xai0.bin" },
            Package () { "num-queues",       8                  },
            Package () { "dma-address-bits", 48                 }
        }
    })
}
Key difference: x86 does not mean "no board support." It often means the board support is split between firmware, ACPI tables, PCIe enumeration, vendor drivers, and runtime libraries rather than a visible board-specific .dts file. The work is the same; the file is different.

What actually gets added to the Linux kernel?

For a new board, kernel work falls into eight buckets: boot and firmware handoff, hardware description, driver binding, interrupts, DMA and IOMMU, power and reset, userspace ABI, and tooling.

1. Boot and firmware handoff

Confirm the kernel receives a correct memory map, CPU topology, interrupt controller state, boot arguments, and firmware interfaces.

2. Hardware description

Add Device Tree bindings and DTS files, or ACPI IDs and properties for devices that firmware exposes.

3. Driver binding

Make sure drivers match compatible strings, ACPI IDs, or PCI IDs, then request resources and initialize hardware safely.

4. Interrupts

Wire GIC / APIC / MSI / MSI-X correctly and verify affinity, storm behavior, and completion latency.

5. DMA and IOMMU

Set DMA masks, map buffers, handle cache coherency, and validate scatter-gather behavior under pressure.

6. Power and reset

Handle regulators, clocks, resets, runtime PM, suspend/resume, thermal throttling, and device-level recovery.

7. Userspace ABI

Expose stable interfaces through char devices, sysfs, debugfs, ioctl, mmap, dma-buf, or existing subsystem APIs.

8. Tooling

Add tracepoints, counters, error registers, firmware logs, and test hooks before customers need them.

Minimal kernel patch shape

AI and compute accelerator drivers live under drivers/accel/ — a dedicated kernel subsystem introduced specifically to separate AI/compute hardware from legacy DRM/graphics drivers. Using accel/ rather than misc/ or char/ signals to reviewers that your driver participates in the standard compute-accelerator ABI and DRM infrastructure (memory management, file operations, context isolation) without the full display stack overhead. A realistic upstreamable board-support series for an Arm SoC accelerator might look like:

[PATCH 0/8] Add support for Example AI Board RevA
[PATCH 1/8] dt-bindings: vendor: document Example AI accelerator
[PATCH 2/8] arm64: dts: example: add AI Board RevA base device tree
[PATCH 3/8] clk: example: add accelerator clock gate
[PATCH 4/8] reset: example: add accelerator reset line
[PATCH 5/8] mailbox: example: add firmware mailbox transport
[PATCH 6/8] dmaengine: example: add DMA engine support
[PATCH 7/8] accel/example: add initial accelerator driver
[PATCH 8/8] MAINTAINERS: add Example AI Board entries

For an x86 PCIe accelerator, the series is more driver-centric — board description lives in firmware, so the patch set focuses on the driver subsystem:

[PATCH 0/6] Add Example PCIe accelerator driver
[PATCH 1/6] accel/example: add PCI device IDs and BAR mapping
[PATCH 2/6] accel/example: add firmware load and device reset
[PATCH 3/6] accel/example: add DMA queue submission path
[PATCH 4/6] accel/example: add mmap and user command ABI
[PATCH 5/6] accel/example: add tracepoints and debugfs counters
[PATCH 6/6] MAINTAINERS: add Example accelerator driver

Modern accelerators need more than a driver

A driver can make a device visible. It does not automatically make the device fast. A modern AI accelerator needs an entire software stack that converts high-level model execution into low-latency, memory-efficient device work.

MODEL GRAPH: ops, tensors, shapes
COMPILER: lowering, fusion, tiling
RUNTIME: graph capture, queues
DRIVER: DMA, IRQ, mmap
Hardware Execution Fabric: command queues · DMA engines · SRAM/HBM/DRAM · collectives · NIC/fabric · firmware scheduler

The accelerator platform checklist

| Layer | What it does | Failure mode if immature |
|---|---|---|
| Kernel driver | BAR mapping, queues, interrupts, reset, firmware load, mmap, DMA mapping, security checks. | Device exists but crashes, stalls, leaks memory, or requires fragile userspace workarounds. |
| Scheduler | Submits work, prioritizes contexts, handles preemption or fairness, isolates tenants. | One model monopolizes the device; latency-sensitive inference waits behind batch jobs. |
| Memory allocator | Manages device memory, host pinned memory, huge pages, dma-buf, IOVA space, and fragmentation. | Page faults and allocator churn dominate runtime; large graphs fail after hours of use. |
| DMA / runtime orchestration | Moves weights, activations, KV cache, gradients, and command descriptors with overlap. | Compute units wait for memory; PCIe/CXL/HBM bandwidth is wasted. |
| Compiler lowering | Maps high-level ops to kernels, tiles, vector modes, SRAM reuse, and memory schedules. | The hardware peak is unreachable because generated kernels do not match the architecture. |
| Graph capture | Turns repeated dynamic execution into reusable static command graphs. | CPU submission overhead and synchronization points eat latency at every inference call. |
| Collective libraries | Enable multi-device all-reduce, all-gather, reduce-scatter, broadcast, and topology-aware routing. | Scale-out performance collapses even if single-chip performance looks good. |
| Profiling / debug tools | Expose queue depth, DMA stalls, page faults, IRQ latency, kernel occupancy, fabric congestion. | Teams guess instead of measuring; bring-up becomes folklore passed between engineers. |
Performance killer: a board can pass smoke tests and still spend half its life stalled behind sync points, page faults, cache maintenance, IOMMU misses, CPU submission overhead, or inefficient memory placement. The checklist exists to close each of those gaps deliberately.
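
To put a number on "compute units wait for memory", here is a rough back-of-the-envelope model of what DMA overlap buys. It assumes a two-stage pipeline (copy, then compute) with double buffering and fixed per-chunk times; real devices have more stages and more jitter. Function and parameter names are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

/* Serial: every chunk is copied, then computed, with no overlap. */
static uint64_t serial_time_us(uint64_t chunks, uint64_t copy_us,
                               uint64_t compute_us)
{
    return chunks * (copy_us + compute_us);
}

/* Double-buffered: the copy of chunk i+1 overlaps the compute of chunk i.
 * The first copy and the last compute cannot be hidden; in between, the
 * slower of the two stages sets the pace. */
static uint64_t overlapped_time_us(uint64_t chunks, uint64_t copy_us,
                                   uint64_t compute_us)
{
    uint64_t slower = copy_us > compute_us ? copy_us : compute_us;

    assert(chunks > 0);
    return copy_us + (chunks - 1) * slower + compute_us;
}
```

With eight chunks at 100 µs of copy and 100 µs of compute each, the serial schedule takes 1600 µs and the overlapped one 900 µs. When copy time dominates, the overlapped total approaches pure copy time, which is the signature of a device starved by memory movement rather than by compute.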

Core code patterns

Note: the examples below are intentionally small and illustrative. Real upstream code needs error unwinding, locking, runtime PM, security review, stable userspace ABI design, and subsystem maintainer feedback.

0. Device Tree binding YAML: the contract before the DTS

For Arm boards, the DTS should not be a random hardware dump. The binding document is the formal contract: which properties are required, which clocks/resets/interrupts exist, whether DMA is coherent, and what the compatible string promises.

Documentation/devicetree/bindings/accel/example,ai-accelerator.yaml
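# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
%YAML 1.2
---
$id: http://devicetree.org/schemas/accel/example,ai-accelerator.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#

title: Example AI Accelerator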

maintainers:
  - Example Maintainer <maintainer@example.com>

properties:
  compatible:
    enum:
      - example,ai-accelerator-v1

  reg:
    maxItems: 1                    # Single MMIO window

  interrupts:
    maxItems: 1

  clocks:
    maxItems: 1

  resets:
    maxItems: 1

  memory-region:
    maxItems: 1
    description: Reserved firmware or command-buffer region.

  dma-coherent: true

required:
  - compatible
  - reg
  - interrupts
  - clocks
  - resets

additionalProperties: false

examples:
  - |
    accelerator@a0000000 {
        compatible = "example,ai-accelerator-v1";
        reg = <0xa0000000 0x10000>;   /* binding examples are built with one address cell and one size cell */
        interrupts = <0 83 4>;
        clocks = <&clk 12>;
        resets = <&rst 4>;
        dma-coherent;
    };

1. One driver, two discovery paths: OF + ACPI

A well-structured platform driver shares most code between Arm Device Tree and x86/Arm-server ACPI. Keep discovery-specific parsing thin; keep hardware implementation common. device_property_read_string works transparently for both _DSD and DT properties.

drivers/accel/example/example-accel.c — match tables
static const struct of_device_id example_accel_of_match[] = {
    { .compatible = "example,ai-accelerator-v1" },
    { }
};
MODULE_DEVICE_TABLE(of, example_accel_of_match);

static const struct acpi_device_id example_accel_acpi_match[] = {
    { "EXAI0001", 0 },
    { }
};
MODULE_DEVICE_TABLE(acpi, example_accel_acpi_match);

/* device_property_read_string works for both DT and ACPI _DSD */
static int example_read_fw_name(struct device *dev, const char **fw_name)
{
    int ret = device_property_read_string(dev, "firmware-name", fw_name);
    if (ret)
        *fw_name = "example/xai0.bin";   /* safe default */
    return 0;
}

static struct platform_driver example_accel_driver = {
    .probe      = example_accel_probe,
    .remove     = example_accel_remove,   /* void-returning since v6.11; earlier kernels used .remove_new for this */
    .driver = {
        .name             = "example-ai-accelerator",
        .of_match_table   = example_accel_of_match,
        .acpi_match_table = ACPI_PTR(example_accel_acpi_match),
    },
};
module_platform_driver(example_accel_driver);
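
For intuition about what the match tables above are used for, here is a simplified userspace model of compatible-string matching: walk the device's compatible list from most to least specific and return the first table entry that matches. This is only a sketch. The kernel's real OF matching also scores other fields, and `find_match`, `struct match_entry`, and the `-v2` string in the usage note are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct match_entry {
    const char *compatible;   /* NULL marks the sentinel that ends the table */
};

static const struct match_entry example_match[] = {
    { "example,ai-accelerator-v1" },
    { NULL }
};

/*
 * Return the first table entry matching any of the device's compatible
 * strings, trying the device's strings in order (most specific first).
 * Returns NULL when nothing matches, in which case the driver never probes.
 */
static const struct match_entry *
find_match(const struct match_entry *table,
           const char *const *dev_compat, size_t n)
{
    assert(table != NULL);
    for (size_t i = 0; i < n; i++)
        for (const struct match_entry *e = table; e->compatible; e++)
            if (strcmp(e->compatible, dev_compat[i]) == 0)
                return e;
    return NULL;
}
```

A device node listing `"example,ai-accelerator-v2", "example,ai-accelerator-v1"` would still bind through its fallback string, while a one-character typo in either the DTS or the table yields NULL, which on a real board shows up as a driver that silently never probes.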

2. Platform driver probe: resources, IRQ, DMA mask

The probe function is where the driver claims ownership of hardware. Get memory, IRQ, and DMA mask in that order. Use devm_* helpers so cleanup on error is automatic.

Platform driver probe — SoC / Device Tree path
static int example_accel_probe(struct platform_device *pdev)
{
    struct device *dev = &pdev->dev;
    struct example_accel *accel;
    int irq, ret;

    accel = devm_kzalloc(dev, sizeof(*accel), GFP_KERNEL);
    if (!accel)
        return -ENOMEM;
    accel->dev = dev;   /* example_accel_hw_init() reads accel->dev */

    /* Map the MMIO window; devm unmaps it on probe failure or driver unbind */
    accel->regs = devm_platform_ioremap_resource(pdev, 0);
    if (IS_ERR(accel->regs))
        return PTR_ERR(accel->regs);

    irq = platform_get_irq(pdev, 0);
    if (irq < 0)
        return irq;

    ret = devm_request_irq(dev, irq, example_accel_irq, 0,
                           dev_name(dev), accel);
    if (ret)
        return ret;

    /* Set DMA addressing limit — call before any dma_alloc_* */
    ret = dma_set_mask_and_coherent(dev, DMA_BIT_MASK(48));
    if (ret)
        return dev_err_probe(dev, ret, "failed to set DMA mask\n");

    platform_set_drvdata(pdev, accel);
    return example_accel_hw_init(accel);
}
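
The reason devm_* helpers make error paths short is that each acquisition registers a release action, and the driver core runs those actions in reverse order when probe fails or the driver unbinds. The following userspace sketch models only that idea; it is not the kernel's devres implementation, and every name in it is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_ACTIONS 8

typedef void (*release_fn)(void *data);

struct cleanup_stack {
    release_fn fn[MAX_ACTIONS];
    void *data[MAX_ACTIONS];
    size_t count;
};

/* Register a release action; returns 0, or -1 when the stack is full. */
static int cleanup_push(struct cleanup_stack *s, release_fn fn, void *data)
{
    assert(s != NULL && fn != NULL);
    if (s->count == MAX_ACTIONS)
        return -1;
    s->fn[s->count] = fn;
    s->data[s->count] = data;
    s->count++;
    return 0;
}

/* Run every registered action, newest first, leaving the stack empty. */
static void cleanup_release_all(struct cleanup_stack *s)
{
    while (s->count > 0) {
        s->count--;
        s->fn[s->count](s->data[s->count]);
    }
}

/* Demo release action: append the resource's one-letter tag to a log. */
static char release_log[MAX_ACTIONS + 1];

static void log_release(void *data)
{
    size_t len = strlen(release_log);

    assert(len < MAX_ACTIONS);
    release_log[len] = *(const char *)data;
    release_log[len + 1] = '\0';
}
```

Pushing tags for the MMIO mapping, the IRQ, and a DMA buffer in that order and then releasing produces the log "DIM": the DMA buffer goes first and the MMIO mapping last, so nothing is torn down while a later resource still depends on it.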

3. PCIe accelerator skeleton

PCI probe — x86 discrete accelerator card
static int example_pci_probe(struct pci_dev *pdev,
                             const struct pci_device_id *id)
{
    struct example_accel *accel;
    int ret;

    ret = pcim_enable_device(pdev);
    if (ret)
        return ret;

    /* Map BAR 0 and register it with devres — auto-unmapped on remove */
    ret = pcim_iomap_regions(pdev, BIT(0), "example-accel");
    if (ret)
        return ret;

    pci_set_master(pdev);

    ret = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(48));
    if (ret)
        return ret;

    /* Request up to 32 MSI-X vectors; fall back to MSI */
    ret = pci_alloc_irq_vectors(pdev, 1, 32, PCI_IRQ_MSIX | PCI_IRQ_MSI);
    if (ret < 0)
        return ret;

    accel = devm_kzalloc(&pdev->dev, sizeof(*accel), GFP_KERNEL);
    if (!accel)
        return -ENOMEM;

    accel->dev = &pdev->dev;
    accel->regs = pcim_iomap_table(pdev)[0];   /* same field the platform path uses */
    pci_set_drvdata(pdev, accel);

    return example_accel_init_queues(accel);
}

static const struct pci_device_id example_pci_ids[] = {
    { PCI_DEVICE(0x1abc, 0x0001) },
    { }
};
MODULE_DEVICE_TABLE(pci, example_pci_ids);
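
Because pci_alloc_irq_vectors() may grant anywhere between the minimum and maximum requested, the driver has to cope with fewer vectors than queues. One simple policy, sketched here in plain C, is to spread queues round-robin over whatever was granted. Both helper names are hypothetical; in a real driver the resulting index would be passed to pci_irq_vector() to obtain the Linux IRQ number.

```c
#include <assert.h>

/* Vector index that services a given queue when only nvec were granted. */
static unsigned int queue_to_vector(unsigned int queue, unsigned int nvec)
{
    assert(nvec > 0);
    return queue % nvec;
}

/* Worst-case number of queues sharing one vector (ceiling division). */
static unsigned int max_queues_per_vector(unsigned int num_queues,
                                          unsigned int nvec)
{
    assert(nvec > 0);
    return (num_queues + nvec - 1) / nvec;
}
```

With 8 queues and a full grant of 32 vectors each queue gets its own interrupt; if the platform grants only 3, some vectors carry three queues and the handler must check each of them, which is worth knowing before reading completion-latency numbers.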

4. DMA mapping: the easy place to create hard bugs

DMA correctness is one of the most important bring-up boundaries. The driver must use the kernel DMA API, respect device addressing limits, handle cache-coherency assumptions, and unmap every buffer it maps.

Coherent allocation + streaming map pattern
dma_addr_t dma, src;
void *cpu_addr;

/* Coherent command ring: CPU and device see each other's writes without
   explicit cache maintenance. Use this for descriptor rings and completion
   status, not for bulk data. */
cpu_addr = dma_alloc_coherent(dev, ring_size, &dma, GFP_KERNEL);
if (!cpu_addr)
    return -ENOMEM;

writel(lower_32_bits(dma), accel->regs + RING_BASE_LO);
writel(upper_32_bits(dma), accel->regs + RING_BASE_HI);

/* Streaming map for a one-shot transfer. kbuf must be a kernel linear-map
   address (for example from kmalloc), never a user pointer or vmalloc memory.
   The direction must match the transfer and the later unmap call. */
src = dma_map_single(dev, kbuf, len, DMA_TO_DEVICE);
if (dma_mapping_error(dev, src)) {
    dma_free_coherent(dev, ring_size, cpu_addr, dma);
    return -ENOMEM;
}

example_program_dma(accel, src, len);
example_wait_dma_done(accel);   /* hypothetical helper: block until the device signals completion */

/* Unmap only after the device has finished; the CPU may touch kbuf again after this */
dma_unmap_single(dev, src, len, DMA_TO_DEVICE);
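
Two pieces of arithmetic sit underneath the listing above: whether a buffer is addressable under the device's DMA mask, and how a 64-bit bus address is split across two 32-bit registers. The userspace sketch below mirrors the kernel's `DMA_BIT_MASK`, `lower_32_bits`, and `upper_32_bits` macros for experimentation; the function names are mine, not kernel API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace equivalent of the kernel's DMA_BIT_MASK(n), for 1 <= n <= 64. */
static uint64_t dma_bit_mask(unsigned int bits)
{
    assert(bits >= 1 && bits <= 64);
    return bits == 64 ? UINT64_MAX : (UINT64_C(1) << bits) - 1;
}

/* True when every byte of [addr, addr + len) is addressable under the mask.
 * Written so that addr + len is never computed and cannot overflow. */
static bool dma_range_fits(uint64_t addr, uint64_t len, unsigned int bits)
{
    uint64_t mask = dma_bit_mask(bits);

    assert(len > 0);
    return addr <= mask && len - 1 <= mask - addr;
}

/* Halves of a 64-bit address as written to a LO/HI register pair. */
static uint32_t lo32(uint64_t v) { return (uint32_t)v; }
static uint32_t hi32(uint64_t v) { return (uint32_t)(v >> 32); }
```

A 4 KiB buffer ending exactly at the top of a 48-bit space fits; one byte more does not. Recombining `hi32` and `lo32` after a register readback is also a cheap way to catch swapped LO/HI offsets.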

5. Runtime PM, clocks, reset, firmware load

Real boards need reset sequencing, clock enablement, firmware loading, and runtime power management. Many first-silicon failures hide here because the register map looks correct but the hardware is still held in reset.

Hardware init — clock, reset, firmware, runtime PM
static int example_accel_hw_init(struct example_accel *accel)
{
    struct device *dev = accel->dev;
    const struct firmware *fw;
    const char *fw_name;
    int ret;

    example_read_fw_name(dev, &fw_name);

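    /* Optional lookups: ACPI platforms usually manage clocks and resets in
       firmware, so both return NULL there and the calls below become no-ops.
       On DT platforms the binding schema makes both properties mandatory. */
    accel->clk = devm_clk_get_optional(dev, NULL);
    if (IS_ERR(accel->clk))
        return dev_err_probe(dev, PTR_ERR(accel->clk), "failed to get clock\n");

    accel->rst = devm_reset_control_get_optional_exclusive(dev, NULL);
    if (IS_ERR(accel->rst))
        return dev_err_probe(dev, PTR_ERR(accel->rst), "failed to get reset\n");

    ret = clk_prepare_enable(accel->clk);
    if (ret)
        return ret;

    ret = reset_control_deassert(accel->rst);  /* take the hardware out of reset */
    if (ret)
        goto err_power;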

    ret = request_firmware(&fw, fw_name, dev);
    if (ret) {
        dev_err(dev, "firmware %s not available: %d\n", fw_name, ret);
        goto err_power;
    }

    ret = example_upload_firmware(accel, fw->data, fw->size);
    release_firmware(fw);
    if (ret)
        goto err_power;

    pm_runtime_set_active(dev);
    pm_runtime_enable(dev);
    return 0;

err_power:
    reset_control_assert(accel->rst);
    clk_disable_unprepare(accel->clk);
    return ret;
}

6. Char device ABI: ioctl for submit, mmap for rings

Many accelerators expose a char device for command submission and memory mapping. Validate all user pointers, lengths, queue IDs, flags, and device state — this is where security matters most.

ioctl + mmap userspace interface
#define EXAMPLE_IOC_MAGIC      'x'
#define EXAMPLE_IOC_SUBMIT     _IOW(EXAMPLE_IOC_MAGIC, 1, struct example_submit)

struct example_submit {
    __u64 user_ptr;
    __u32 len;
    __u32 queue_id;
    __u64 flags;
};

static long example_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
{
    struct example_file *ctx = filp->private_data;
    struct example_submit submit;

    if (cmd != EXAMPLE_IOC_SUBMIT)
        return -ENOTTY;

    if (copy_from_user(&submit, (void __user *)arg, sizeof(submit)))
        return -EFAULT;

    /* Validate all fields; never trust userspace */
    if (submit.len == 0 || submit.len > EXAMPLE_MAX_SUBMIT_SIZE)
        return -EINVAL;

    if (submit.queue_id >= ctx->accel->num_queues)
        return -EINVAL;

    /* Reject unknown flag bits so they can be given meaning later.
     * EXAMPLE_SUBMIT_VALID_FLAGS is a hypothetical mask of the defined bits. */
    if (submit.flags & ~EXAMPLE_SUBMIT_VALID_FLAGS)
        return -EINVAL;

    /* submit.user_ptr is intentionally not dereferenced here.
     * example_submit_user_work() is responsible for safe access:
     * either copy_from_user() for small descriptors, or
     * pin_user_pages_fast() with FOLL_LONGTERM for DMA-pinned buffers,
     * released later with unpin_user_pages().
     * Never pass a raw __u64 user pointer directly to the hardware. */
    return example_submit_user_work(ctx, &submit);
}

static int example_mmap(struct file *filp, struct vm_area_struct *vma)
{
    struct example_file *ctx = filp->private_data;
    unsigned long size = vma->vm_end - vma->vm_start;

    if (size > ctx->ring_size)
        return -EINVAL;

    /* Map the coherent ring into userspace. The last argument is the size of
       the original allocation; dma_mmap_coherent() checks the VMA against it. */
    return dma_mmap_coherent(ctx->accel->dev, vma,
                             ctx->ring_cpu, ctx->ring_dma, ctx->ring_size);
}

7. Tracepoints and debugfs: make stalls visible early

A new accelerator should ship with measurement hooks from day one. You want to know whether latency is in userspace submission, kernel queueing, DMA programming, interrupt completion, wakeup, or graph replay — before customers ask.

include/trace/events/example_accel.h
#undef TRACE_SYSTEM
#define TRACE_SYSTEM example_accel

#if !defined(_TRACE_EXAMPLE_ACCEL_H) || defined(TRACE_HEADER_MULTI_READ)
#define _TRACE_EXAMPLE_ACCEL_H

#include <linux/tracepoint.h>

TRACE_EVENT(example_submit,
    TP_PROTO(u32 queue_id, u32 bytes),
    TP_ARGS(queue_id, bytes),
    TP_STRUCT__entry(
        __field(u32, queue_id)
        __field(u32, bytes)
    ),
    TP_fast_assign(
        __entry->queue_id = queue_id;
        __entry->bytes    = bytes;
    ),
    TP_printk("queue=%u bytes=%u", __entry->queue_id, __entry->bytes)
);

#endif
#include <trace/define_trace.h>
debugfs counters — expose in probe, read from userspace
static int example_debugfs_init(struct example_accel *accel)
{
    accel->debugfs_dir = debugfs_create_dir("example_accel", NULL);

    debugfs_create_u64("submitted_jobs", 0444,
                       accel->debugfs_dir, &accel->submitted_jobs);
    debugfs_create_u64("dma_faults",     0444,
                       accel->debugfs_dir, &accel->dma_faults);
    debugfs_create_u32("queue_depth",    0444,
                       accel->debugfs_dir, &accel->queue_depth);
    return 0;
}
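
Once the tracepoint exists, even a tiny consumer turns raw trace text into an answer. The sketch below accumulates bytes per queue from lines carrying the `queue=%u bytes=%u` payload printed by the `example_submit` event above. The surrounding line layout (task, CPU, timestamp) varies by kernel and trace options, so the parser keys only on the event name; `account_trace_line` and the queue count of 8 are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NUM_QUEUES 8   /* assumed queue count for this sketch */

/* Add one trace line to the per-queue byte totals.
 * Returns 1 if the line was an example_submit event and was counted,
 * 0 if it was some other event, malformed, or named an unknown queue. */
static int account_trace_line(const char *line,
                              uint64_t bytes_per_queue[NUM_QUEUES])
{
    static const char tag[] = "example_submit: ";
    const char *p;
    unsigned int queue, bytes;

    assert(line != NULL && bytes_per_queue != NULL);
    p = strstr(line, tag);
    if (!p)
        return 0;
    if (sscanf(p + strlen(tag), "queue=%u bytes=%u", &queue, &bytes) != 2)
        return 0;
    if (queue >= NUM_QUEUES)
        return 0;
    bytes_per_queue[queue] += bytes;
    return 1;
}
```

Run over a captured trace, the totals show at a glance whether one queue carries nearly all the traffic while the others sit idle, which is a scheduling problem rather than a hardware one.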

8. Kconfig and Makefile

drivers/accel/example/Kconfig + Makefile
# Kconfig
config EXAMPLE_AI_ACCEL
    tristate "Example AI accelerator support"
    depends on PCI || OF
    select DMA_SHARED_BUFFER
    help
      Enables support for the Example AI accelerator device.
      Works on both x86 PCIe and Arm SoC platforms.

# Makefile: the module is example-ai-accel.ko, built from example-accel.c
obj-$(CONFIG_EXAMPLE_AI_ACCEL) += example-ai-accel.o
example-ai-accel-y := example-accel.o

Bring-up debugging: measure before guessing

A board team needs observability from day one. Without instrumentation, every failure becomes "firmware maybe," "kernel maybe," "driver maybe," "runtime maybe," or "compiler maybe." Good platform support includes counters and trace hooks early — not after the first customer escalation.

Kernel-level signals

  • dmesg probe and firmware logs
  • lspci -vv, setpci, ACPI table dumps
  • /proc/interrupts, IRQ affinity, MSI-X vector use
  • debugfs queue depth and error counters
  • ftrace, tracepoints, perf sched/irq analysis

Runtime-level signals

  • Kernel launch latency and graph replay latency
  • DMA bytes, DMA overlap, and copy engine utilization
  • Page fault count and pinned-memory failures
  • HBM/DRAM bandwidth and fragmentation
  • Collective latency, topology, and retry behavior
Shell — IRQ visibility, ftrace, PCIe state
# Show IRQ lines claimed by the driver
grep example /proc/interrupts

# List available tracepoints
grep example /sys/kernel/debug/tracing/available_events

# Function-graph trace around the probe path (the module must already be
# loaded so the symbol is known; unbind and rebind the device to re-run probe)
echo function_graph > /sys/kernel/debug/tracing/current_tracer
echo example_accel_probe > /sys/kernel/debug/tracing/set_graph_function
cat /sys/kernel/debug/tracing/trace

# PCIe device state and capability dump
lspci -nnvv -d 1abc:
dmesg -T | grep -i "example\|iommu\|dma\|msi"
Bring-up maturity test: when a workload is slow, the team should be able to answer within minutes — is the stall in compiler lowering, graph capture, CPU submission, scheduler queueing, DMA, page faults, IRQ completion, collective communication, or actual compute? If you cannot answer that question, you do not yet have enough instrumentation.
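
One way to make that question answerable in minutes is to timestamp each job at every stage boundary and let a small helper name the slowest stage. The sketch below assumes five stages and six monotonic timestamps per job; the stage split and all names are illustrative, not an existing tool.

```c
#include <assert.h>
#include <stdint.h>

enum stage {
    STAGE_SUBMIT,     /* userspace call until the kernel accepts the job   */
    STAGE_QUEUE,      /* waiting in the scheduler queue                    */
    STAGE_DMA,        /* input transfer to the device                      */
    STAGE_COMPUTE,    /* device execution                                  */
    STAGE_IRQ_WAKE,   /* completion interrupt until the waiter runs again  */
    STAGE_COUNT
};

/* ts[i] is the time (ns) at which stage i began; ts[STAGE_COUNT] is the end
 * of the last stage. Timestamps must be non-decreasing. Returns the stage
 * with the largest duration (the earliest one on a tie). */
static enum stage slowest_stage(const uint64_t ts[STAGE_COUNT + 1])
{
    enum stage worst = STAGE_SUBMIT;
    uint64_t worst_ns = 0;

    for (int i = 0; i < STAGE_COUNT; i++) {
        assert(ts[i + 1] >= ts[i]);
        uint64_t d = ts[i + 1] - ts[i];
        if (d > worst_ns) {
            worst_ns = d;
            worst = (enum stage)i;
        }
    }
    return worst;
}
```

For a job with boundaries at 0, 5, 40, 900, 1000, and 1010 ns the helper reports `STAGE_DMA`: 860 of the 1010 ns went to moving data, so tuning kernels would be wasted effort.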

When it doesn't boot: a first-responder checklist

Bring-up lives and dies on the first few minutes after power-on. Most early failures fall into a predictable set of causes. Work through these in order before diving into driver code.

Go-to commands when the board won't cooperate
1. Get a serial console first. Add earlycon console=ttyS0,115200 (or the Arm equivalent earlycon=pl011,...) to the kernel bootargs. Without console output, everything else is guessing. If nothing prints, suspect that firmware never enabled the UART clock.
2. Enable initcall debug. Append initcall_debug to bootargs. Every initcall entry and exit is logged along with its duration. A hang shows up as a "calling" line with no matching "returned" line, which names the function that is stuck waiting for something (a clock, an IRQ, a firmware response).
3. Check driver probe. Run dmesg | grep -E "example|probe|IOMMU|dma|deferred". Deferred probe is normal when clocks, regulators, or reset controllers have not registered yet: the driver core retries the consumer once new providers appear. A device that stays deferred means a provider never probed, usually because its driver is not built or its DT node is disabled or wrong. With debugfs mounted, /sys/kernel/debug/devices_deferred lists the devices still waiting. Fix: make sure the DT dependencies are correct and the clock/reset provider drivers are enabled.
4. Trace DMA and IOMMU faults. Boot with trace_event=dma_fence,iommu and watch for IOMMU faults in dmesg. An IOMMU fault means the device accessed an IOVA with no valid mapping: the buffer was never mapped, was already unmapped, or the wrong address was programmed into the device. Also check that dma_set_mask_and_coherent is called before any dma_alloc_* and that the DMA mask matches the hardware capability.
5. Check interrupt routing. After probe, trigger any device operation, then run cat /proc/interrupts and confirm the driver's IRQ line shows a non-zero count. A zero count means the IRQ is misconfigured in DT/ACPI, the MSI/MSI-X allocation failed, or the device firmware never raised the interrupt. The per-CPU columns show which CPU handled it; compare them with /proc/irq/<N>/smp_affinity_list to confirm the interrupt reaches the expected CPU.
6. Validate firmware load. Use dmesg | grep firmware to confirm the firmware file was found and accepted. If firmware upload silently fails, the device MMIO registers will return wrong values or bus errors. Add a sentinel register read after firmware boot and compare it against a known-good value documented in the hardware spec.
Common pattern: when a driver probes successfully but the device produces no output, the firmware usually uploaded correctly and a command ring or doorbell register is wrong. Read back the ring-base address registers after writing them and compare against what you wrote. If they differ, you have a BAR offset bug or an endianness issue.
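
Step 5 is easy to automate in a bring-up test script. The sketch below sums the per-CPU counters on a single /proc/interrupts line so a test can fail loudly when the total is still zero after a device operation. It assumes the usual layout (IRQ label, colon, one counter per CPU, then the controller name); `irq_line_total` and the sample lines in the usage note are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Sum the per-CPU counters on one /proc/interrupts line.
 * Returns the total, or -1 if the line has no "label:" prefix or no counters.
 * Parsing stops at the first token that is not a plain decimal number,
 * which is normally the interrupt controller name. */
static long long irq_line_total(const char *line)
{
    const char *p;
    long long total = 0;
    int fields = 0;

    assert(line != NULL);
    p = strchr(line, ':');
    if (!p)
        return -1;
    p++;
    for (;;) {
        char *end;
        unsigned long long v;

        while (*p == ' ' || *p == '\t')
            p++;
        if (!isdigit((unsigned char)*p))
            break;
        v = strtoull(p, &end, 10);
        if (*end != ' ' && *end != '\t' && *end != '\n' && *end != '\0')
            break;               /* e.g. "524288-edge": not a counter */
        total += (long long)v;
        fields++;
        p = end;
    }
    return fields ? total : -1;
}
```

Given a line such as ` 83:       1042          7     GICv3 115 Level     example-ai-accelerator` the helper returns 1049; a result of 0 after submitting work is the cue to go back to the DT/ACPI interrupt description before touching driver code.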

A practical board-support checklist

  • Boot: kernel reaches userspace reliably on serial console; panic logs are recoverable.
  • Hardware description: DT bindings or ACPI properties are reviewed and versioned; no board-specific hacks hidden in drivers.
  • Memory map: DRAM, reserved memory, firmware regions, MMIO ranges, and IOMMU windows are correct and tested.
  • Interrupts: IRQ routing, affinity, MSI/MSI-X, and interrupt storm handling are verified.
  • DMA: coherent and streaming mappings are correct; DMA mask and cache-coherency assumptions are tested.
  • Power: clocks, resets, regulators, runtime PM, suspend/resume, and thermal throttling are tested.
  • Driver ABI: userspace interface is stable, documented, permissioned, and fuzz-tested.
  • Runtime: memory allocator, queue submission, graph capture, and synchronization model are integrated.
  • Compiler: lowering path produces kernels that match the memory hierarchy and execution fabric.
  • Collectives: multi-device communication works before marketing claims scale-out numbers.
  • Profiling: tracepoints, counters, error paths, and debug tools exist before customer deployment.
  • Upstream path: patches follow subsystem boundaries, coding style, binding conventions, maintainer feedback, and test evidence.

First silicon proves the chip can exist.
Board bring-up proves the platform can survive real software.

What part of bring-up bit you hardest? DMA bugs? Deferred probe loops? A firmware handoff that silently corrupted memory? Share in the comments — the war stories are where the real knowledge lives.

References and further reading

  • Linux kernel documentation: Linux and the Devicetree — hardware description model for non-discoverable devices.
  • Linux kernel documentation: ACPI on Arm systems — expectations when Arm systems use ACPI rather than Device Tree.
  • Linux kernel documentation: Platform Devices and Drivers — platform bus and driver binding model.
  • Linux kernel documentation: Dynamic DMA mapping guide and generic DMA API documentation.
  • Linux kernel documentation: kbuild and external module build process.
  • Linux kernel documentation: ftrace function tracer for kernel debugging and latency analysis.
  • Linux kernel process documentation: submitting patches and patch submission checklist.
  • Linux kernel documentation: Writing ACPI AML for device-specific data (_DSD usage).
  • Devicetree specification: schema validation with dt-schema and yamllint.