

#### Instrumenting and Debugging FireSim-Simulated Designs

https://fires.im

**MICRO 2024 Tutorial** 

Speaker: Abe Gonzalez





#### Agenda

- FPGA-Accelerated Deep-Simulation Debugging
  - Debugging Using Integrated Logic Analyzers
  - Trace-based Debugging
  - Synthesizable Assertions/Prints
  - Synthesizable Counters
    - Hands-on example
- Debugging Co-Simulation
  - FireSim Debugging Using Software Simulation







#### When SW RTL Simulation is Not Enough...

"Everything looks OK in SW simulation, but there is still a bug somewhere"

"My bug only appears after hours of running Linux on my simulated HW"







#### FPGA-Based Debugging Features

- High simulation speed in FPGA-based simulation enables advanced debugging and profiling tools.
- Reach "deep" into simulation time, and obtain high coverage and large amounts of data
- Examples:
  - ILAs
  - TracerV
  - AutoCounter
  - Synthesizable assertions, prints








#### Debugging Using Integrated Logic Analyzers

Integrated Logic Analyzers (ILAs)

- Common debugging feature provided by FPGA vendors
- Continuous recording of a sampling window
  - Up to 1024 cycles by default.
  - Stores recorded samples in BRAM.
- Realtime trigger-based sampled output of probed signals
  - Multiple probe ports can be combined into a single trigger
  - The trigger can be placed anywhere within the sampling window
- On AWS F1 instances, the ILA is accessed through a debug bridge and server

```
// Integrated Logic Analyzers (ILA)
ila_0 CL_ILA_0 (
    .clk    (clk_main_a0),
    .probe0 (sh_ocl_awvalid_q),
    .probe1 (sh_ocl_awaddr_q),
    .probe2 (ocl_sh_awready_q),
    .probe3 (sh_ocl_arvalid_q),
    .probe4 (sh_ocl_araddr_q),
    .probe5 (ocl_sh_arready_q)
);

// Debug Bridge
cl_debug_bridge CL_DEBUG_BRIDGE (
    .clk                (clk_main_a0),
    .S_BSCAN_drck       (drck),
    .S_BSCAN_shift      (shift),
    .S_BSCAN_tdi        (tdi),
    .S_BSCAN_update     (update),
    .S_BSCAN_sel        (sel),
    .S_BSCAN_tdo        (tdo),
    .S_BSCAN_tms        (tms),
    .S_BSCAN_tck        (tck),
    .S_BSCAN_runtest    (runtest),
    .S_BSCAN_reset      (reset),
    .S_BSCAN_capture    (capture),
    .S_BSCAN_bscanid_en (bscanid_en)
);
```

From: aws-fpga cl\_hello\_world example





#### Debugging Using Integrated Logic Analyzers

AutoILA – Automation of ILA integration with FireSim

- Annotate requested signals and bundles in the Chisel source code
- Automatic configuration and generation of the ILA IP in the FPGA toolchain
- Automatic expansion and wiring of annotated signals to the top level of a design using a FIRRTL transform.
- Remote waveform and trigger setup from the manager instance





#### BOOM Example

- Debugging an out-of-order processor is hard
  - Throughout this talk, we'll have examples of FPGA debugging used in BOOM.
- Example from boom/src/main/scala/lsu/dcache.scala
- Debugging a non-blocking data cache hanging after Linux boots





#### Debugging Using Integrated Logic Analyzers

#### Pros:

- No emulated parts: what you see is what's running on the FPGA
- FPGA simulation speed of O(MHz), compared to O(kHz) in software simulation
- Real-time trigger-based

#### Cons:

- Requires a full build to modify visible signals/triggers (takes several hours)
- Limited sampling window size
- Consumes FPGA resources



#### TracerV

- Out-of-band full instruction execution trace
- Bridge connected to target trace ports
- By default, large amount of info wired out of Rocket/BOOM, per-hart, per-cycle:
  - Instruction Address
  - Instruction
  - Privilege Level
  - Exception/Interrupt Status, Cause
- TracerV can rapidly generate several TB of data.





#### TracerV

- Out-of-Band: profiling does not perturb execution
- Useful for kernel- and hypervisor-level cycle-sensitive profiling
- Examples:
  - Co-Optimization of NIC and Network Driver
  - Keystone Secure Enclave Project
  - High-performance hardware-specific code (supercomputing?)
- Requires large-scale analytics for insightful profiling and optimization.







#### Trigger Mechanisms

- Full trace files can be very large (100s of GB to TBs)
- We are usually interested only in a specific region of execution
- TracerV can be enabled based on in-band and out-of-band triggers
  - Program counter
  - Unique instruction
  - Cycle count
- Can use the same trigger for some other simulation outputs
  - AutoCounter perf. counters

config\_runtime.yaml

```
tracing:
    enable: no
    # 0 = no trigger
    # 1 = cycle count trigger
    # 2 = program counter trigger
    # 3 = instruction trigger
    selector: 1
    startcycle: 0
    endcycle: -1
```
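To make the cycle-count trigger concrete, here is a minimal Python sketch of how a start/end window selects part of a trace. It is a model of the idea, not FireSim's implementation: the `(cycle, pc)` record shape and the `cycle_window` helper are hypothetical, and it assumes that both bounds are inclusive and that an `endcycle` of -1 means "no upper bound".

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical record shape for illustration: (cycle, program counter).
TraceRecord = Tuple[int, int]


def cycle_window(
    records: Iterable[TraceRecord], startcycle: int, endcycle: int
) -> Iterator[TraceRecord]:
    """Yield the records whose cycle lies in [startcycle, endcycle].

    Assumptions: records arrive in cycle order, both bounds are inclusive,
    and endcycle == -1 means there is no upper bound.
    """
    for cycle, pc in records:
        if cycle < startcycle:
            continue
        if endcycle != -1 and cycle > endcycle:
            break  # past the window; nothing later can match
        yield (cycle, pc)


# A made-up ten-cycle trace with one instruction per cycle.
trace = [(c, 0x80000000 + 4 * c) for c in range(10)]

windowed = list(cycle_window(trace, 3, 5))
unbounded = list(cycle_window(trace, 0, -1))
print([cycle for cycle, _ in windowed])
print(len(unbounded))
```

Narrowing the window this way is what keeps trace files manageable: only the region of interest is ever written out.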




#### Integration with Flame Graphs

- Flame Graph: an open-source profiling visualization tool
- Direct integration with TracerV traces
  - Automated stack unwinding (kernel space)
  - Automated Flame-graph generation
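Flame graph tooling commonly consumes "folded" stacks: one line per unique call stack, with frames joined by semicolons and followed by a sample count. The sketch below shows only that folding step in Python; it is not the TracerV stack unwinder, and the `fold_stacks` helper and the function names in the samples are illustrative assumptions.

```python
from collections import Counter
from typing import Iterable, List, Sequence


def fold_stacks(stacks: Iterable[Sequence[str]]) -> List[str]:
    """Collapse call stacks (root first) into 'a;b;c count' lines."""
    counts = Counter(";".join(stack) for stack in stacks)
    return [f"{stack} {count}" for stack, count in sorted(counts.items())]


# Made-up samples; each list is one observed call stack, root first.
samples = [
    ["start_kernel", "rest_init", "cpu_idle"],
    ["start_kernel", "rest_init", "cpu_idle"],
    ["start_kernel", "vfs_mount"],
]

folded = fold_stacks(samples)
for line in folded:
    print(line)
```

Identical stacks collapse into a single line, so the width of each box in the rendered graph is proportional to how often that stack was observed.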





#### TracerV

#### Pros:

- Out-of-Band (no impact on workload execution)
- SW-centric method
- Large amounts of data

#### Cons:

- Slower simulation performance (40 MHz)
- No HW visibility
- Large amounts of data

#### AutoCounter

- Automated out-of-band counter insertion
- Based on ad-hoc annotations and existing cover-points
  - No invasive RTL change
- Runtime-configurable read rate

```
io.send.req.ready := state === s_idle
io.alloc.valid := helper.fire(io.alloc.ready, canSend)
io.alloc.bits.id := xactId
io.alloc.bits.count := (1.U << (reqSize - byteAddrBits.U))
tl.a.valid := helper.fire(tl.a.ready, canSend)
tl.a.bits := edge.Get(
  fromSource = xactId,
  toAddress = sendaddr,
  lgSize = reqSize)._2

cover((state === s_read) && xactBusy.andR && tl.a.ready, "NIC_SEND_XACT_ALL_BUSY", "nic send blocked by lack of transactions")
cover((state === s_read) && !io.alloc.ready && tl.a.ready, "NIC_SEND_BUF_FULL", "nic send blocked by full buffer")
cover(tl.a.valid && !tl.a.ready, "NIC_SEND_MEM_BUSY", "nic send blocked by memory bandwidth")
```



#### AutoCounter Example

- Example ad-hoc performance counters in the L2 cache

```
class SinkA(params: InclusiveCacheParameters) extends Module
{
    val io = new Bundle {
        val req = Decoupled(new FullRequest(params))
        val a = Decoupled(new TLBundleA(params.inner.bundle)).flip
        val pb_pop = Decoupled(new PutBufferPop(params)).flip
        val pb_beat = new PutBufferAEntry(params)
    }
    PerfCounter(io.a.fire(), "l2_requests", "Number of requests to the first bank of the L2")
    // ... remainder of the module elided
}
```

- Simple configuration (config\_runtime.yaml)
  - Read rate: trade off visibility/detail against performance
  - TracerV trigger: collect results from a single region of interest

```
autocounter:
    read_rate: 1000000
```
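A quick way to reason about the read-rate trade-off is to estimate how many samples a run will produce. This Python sketch assumes one sample is taken every `read_rate` target cycles; the `expected_samples` helper is hypothetical, and the run length is taken from the hands-on example later in this tutorial.

```python
def expected_samples(total_cycles: int, read_rate: int) -> int:
    """Estimate the number of counter samples, assuming one per read_rate cycles."""
    if read_rate <= 0:
        raise ValueError("read_rate must be a positive number of cycles")
    return total_cycles // read_rate


run_cycles = 2_320_000_000  # roughly the length of the ResNet50 run shown later
for rate in (1_000_000, 10_000_000, 100_000_000):
    print(f"read rate {rate:>11,}: {expected_samples(run_cycles, rate):>5} samples")
```

A smaller read rate gives finer-grained data but more host reads, which is where the simulation slowdown comes from.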





#### AutoCounter Output CSV Schema

| Version              | Version Number    |                  |                  |         |   |
|----------------------|-------------------|------------------|------------------|---------|---|
| Clock Domain Name    | Domain Name       | Multiplier       | X                | Divisor | Y |
| Labels               | local_clock       | Label0           | Label1           |         |   |
| Description          | local clock cycle | Desc0            | Desc1            |         |   |
| Event Width          | 1                 | Width0           | Width1           |         |   |
| Accumulator Width    | 64                | 64               | 64               |         |   |
| Type                 | Increment         | Type0            | Type1            |         |   |
| N                    | Cycle @ time N    | Value0 @ time N  | Value1 @ time N  |         |   |
| ...                  | ...               | ...              | ...              |         |   |
| kN                   | Cycle @ time kN   | Value0 @ time kN | Value1 @ time kN |         |   |





- More counters extend the table to the right
- More samples extend the table downward

#### Automated Performance Counters

#### Pros:

- Macro view of execution behavior
- Trigger integration
- Pre-configured cover points, no RTL interference
- SW-controlled granularity (trade off simulation speed against read rate)

#### Cons:

- New counters require new FPGA images
- Simulation performance degradation depending on read rate and number of counters





#### Synthesizable Assertions

- Assertions: rapid error checking embedded in HW source code
  - Commonly used in SW Simulation
  - Halts the simulation upon a triggered assertion. Represented as a "stop" statement in FIRRTL
  - By default, emitted as non-synthesizable SV functions (\$fatal)



From: BROOM: An open-source Out-of-Order processor with resilient low-voltage operation in 28nm CMOS, Christopher Celio, Pi-Feng Chiu, Krste Asanović, David Patterson and Borivoje Nikolić. Hot Chips 30, 2018

```
class Count extends Module {
  val io = IO(new Bundle {
    val en = Input(Bool())
    val done = Output(Bool())
    val cntr = Output(UInt(4.W))
  })
  // count until 10 when `io.en' is high
  val (cntr, done) = Counter(io.en, 10)
  io.cntr := cntr
  io.done := done
  // assertion for software simulation
  // `cntr' should be less than 10
  assert(cntr < 10.U)
}
```

From: Trillion-Cycle Bug Finding Using FPGA-Accelerated Simulation, Donggyu Kim, Christopher Celio, Sagar Karandikar, David Biancolin, Jonathan Bachrach, Krste Asanović. ADEPT Winter Retreat 2018



#### Synthesizable Assertions

- Synthesizable assertions on FPGA
  - Transform FIRRTL `stop` statements into synthesizable logic
  - Insert combinational logic and signals for the `stop` condition arguments
  - Insert encodings for each assertion (for matching error statements in SW)
  - Wire the assertion logic output to the top level
  - Generate timing tokens for cycle-exact assertions
  - The assertion checker records the cycle and halts simulation when an assertion is triggered





#### BOOM Example

- Example from boom/src/main/scala/exu/rob.scala
- Assert if the ROB is behaving unexpectedly
  - Overwriting a valid entry

```
assert (rob_val(rob_tail) === false.B, "[rob] overwriting a valid entry.")
assert ((io.enq_uops(w).rob_idx >> log2Ceil(coreWidth)) === rob_tail)
assert (!(io.wb_resps(i).valid && MatchBank(GetBankIdx(rob_idx)) &&
          !rob_val(GetRowIdx(rob_idx))),
        "[rob] writeback (" + i + ") occurred to an invalid ROB entry.")
```





#### BOOM Example

- How it looks in the UART output (while Linux is booting):

```
[ 0.008000] VFS: Mounted root (ext2 filesystem) on device 253:0.
[ 0.008000] devtmpfs: mounted
[ 0.008000] Freeing unused kernel memory: 148K
[ 0.008000] This architecture does not have kernel memory protection.
mount: mounting sysfs on /sys failed: No such device
Starting syslogd: OK
Starting klogd: OK
Starting mdev
mdev: /sys/dev: No such file or directory
[id: 1840, module: Rob, path: FireBoom.boom_tile_1.core.rob]
Assertion failed: [rob] writeback (0) occurred to an invalid ROB entry.
    at rob.scala:504 assert (!(io.wb_resps(i).valid && MatchBank(GetBankIdx(rob_idx)) &&
    at cycle: 1112250469
*** FAILED *** (code = 1841) after 1112250485 cycles
time elapsed: 307.8 s, simulation speed = 3.61 MHz
FPGA-Cycles-to-Model-Cycles Ratio (FMR): 2.77
Beats available: 2165
Runs 1112250485 cycles
[FAIL] FireBoom Test
SEED: 1569631756 at cycle 4294967295
```

It would take ~62 hours to hit this assertion in SW RTL simulation (at a 5 kHz sim rate), vs. just a few minutes in FireSim.
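The ~62 hour estimate follows directly from the cycle count in the log and the two simulation rates. A small Python check (the helper name `wall_clock_seconds` is only for illustration):

```python
def wall_clock_seconds(cycles: int, rate_hz: float) -> float:
    """Wall-clock time needed to simulate `cycles` target cycles at `rate_hz`."""
    return cycles / rate_hz


cycles = 1_112_250_485  # cycle count reported when the assertion fired

sw_hours = wall_clock_seconds(cycles, 5e3) / 3600       # SW RTL simulation at 5 kHz
fpga_minutes = wall_clock_seconds(cycles, 3.61e6) / 60  # FireSim at 3.61 MHz

print(f"SW RTL simulation at 5 kHz: {sw_hours:.1f} hours")
print(f"FireSim at 3.61 MHz:        {fpga_minutes:.1f} minutes")
```

The FireSim figure comes out at about five minutes, consistent with the 307.8 s elapsed time reported in the log.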




#### Synthesizable printf

- Research feature presented in DESSERT [1] (together with assertions)
- Enable "software-style" debugging using printf statements
- Convert Chisel printf statements to synthesizable blocks
  - Appropriate parsing in simulation bridge
  - Including signal values
- Impact on simulation performance depends on the frequency of printfs.
- Output includes the exact cycle of the printf event
  - Helps measure cycle counts between events



https://www.deviantart.com/stym0r/art/Bart-Simpson-Programmer-134362686





#### BOOM Example

- Example from boom/src/main/scala/lsu/lsu.scala
- Print a trace of all loads and stores, for verifying memory consistency.

```
if (MEMTRACE_PRINTF) {
  when (commit_store || commit_load) {
    val uop    = Mux(commit_store, stq(idx).bits.uop, ldq(idx).bits.uop)
    val addr   = Mux(commit_store, stq(idx).bits.addr.bits, ldq(idx).bits.addr.bits)
    val stdata = Mux(commit_store, stq(idx).bits.data.bits, 0.U)
    val wbdata = Mux(commit_store, stq(idx).bits.debug_wb_data, ldq(idx).bits.debug_wb_data)
    printf(midas.targetutils.SynthesizePrintf("MT %x %x %x %x %x %x %x\n",
      io.core.tsc_reg, uop.uopc, uop.mem_cmd, uop.mem_size, addr, stdata, wbdata))
  }
}
```
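Once the memory trace has been collected, each `MT` line can be turned back into named fields for offline checking. The Python sketch below follows the seven `%x` arguments of the printf above. The `parse_memtrace` helper and the sample lines are hypothetical, and any prefix the simulator adds before `MT` (such as a cycle stamp) is simply skipped rather than interpreted.

```python
import re
from typing import Dict, Iterable, Iterator

# Field order follows the printf arguments in the Chisel source.
FIELDS = ("tsc", "uopc", "mem_cmd", "mem_size", "addr", "stdata", "wbdata")

# "MT" followed by exactly seven space-separated hex values at the end of the line.
MT_RE = re.compile(r"\bMT((?: [0-9a-fA-F]+){7})\s*$")


def parse_memtrace(lines: Iterable[str]) -> Iterator[Dict[str, int]]:
    """Yield one dict per MT line; lines that are not MT records are ignored."""
    for line in lines:
        match = MT_RE.search(line)
        if match is None:
            continue
        values = [int(token, 16) for token in match.group(1).split()]
        yield dict(zip(FIELDS, values))


# Made-up sample output for illustration only.
sample = [
    "MT 1f4 1 0 3 80001000 0 2a",
    "some unrelated output",
    "MT 1f8 2 1 3 80001008 2a 0",
]

records = list(parse_memtrace(sample))
for record in records:
    print(hex(record["addr"]), record["stdata"], record["wbdata"])
```

With the records in this form, a memory consistency checker can replay the loads and stores in `tsc` order.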



#### Synthesizable printf/Assertions

#### Pros:

- FPGA simulation speed
- Real-time trigger-based
- Consumes small amount of FPGA resources (compared to ILA)
- Key signals have pre-written assertions in re-usable components/libraries

#### Cons:

- Low visibility: No waveform/state
- Assertions are best added while writing source RTL rather than during "investigative" debugging
- Large numbers of printfs can slow down simulation



#### Spike Co-Simulation

- Spike: golden reference RISC-V functional simulator
- Can be used to debug BOOM in FireSim through functional co-simulation and comparison
- Find functional bugs billions of cycles into simulations
  - Find divergence against functional golden model
  - Dump waveforms for affected signals

```
[error] Spike PC ffffffe001055d84, DUT PC ffffffe001055d84
[error] Spike INSN 14102973, DUT INSN 14102973
[error] Spike WDATA 000220d6, DUT WDATA 000220d4
[error] Spike MSTATUS a000000a0, DUT MSTATUS 00000000
[error] DUT pending exception -1 pending interrupt -1
[ERROR] Cospike: Errored during simulation tick with 8191
*** FAILED *** (code = 8191) after 2,356,509,311 cycles
time elapsed: 2740.8 s, simulation speed = 859.79 KHz
FPGA-Cycles-to-Model-Cycles Ratio (FMR): 8.14
Runs 2356509311 cycles
[FAIL] FireSim Test
```

A divergence more than 2 billion cycles into simulation, where receiving an interrupt during mis-speculation affects architectural state (EPC)
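The core of this kind of co-simulation is a lockstep comparison of committed instructions against the golden model. The Python sketch below illustrates that comparison only; it is not how the FireSim bridge is implemented. The `first_divergence` helper and the dict-based commit records are hypothetical, the second commit reuses the PC, INSN and WDATA values from the log, and the first commit is made up.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Commit = Dict[str, int]  # hypothetical record: one committed instruction


def first_divergence(
    golden: Sequence[Commit],
    dut: Sequence[Commit],
    fields: Sequence[str] = ("pc", "insn", "wdata"),
) -> Optional[Tuple[int, List[str]]]:
    """Return (commit index, mismatching fields) for the first divergence, or None."""
    for index, (expected, actual) in enumerate(zip(golden, dut)):
        mismatched = [f for f in fields if expected[f] != actual[f]]
        if mismatched:
            return index, mismatched
    return None


golden = [
    {"pc": 0xFFFFFFE001055D80, "insn": 0x00000013, "wdata": 0x0},  # made-up commit
    {"pc": 0xFFFFFFE001055D84, "insn": 0x14102973, "wdata": 0x000220D6},
]
dut = [
    {"pc": 0xFFFFFFE001055D80, "insn": 0x00000013, "wdata": 0x0},  # made-up commit
    {"pc": 0xFFFFFFE001055D84, "insn": 0x14102973, "wdata": 0x000220D4},
]

result = first_divergence(golden, dut)
print(result)
```

As in the log, the PC and instruction agree and only the written-back data differs, which points at architectural state rather than control flow.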



#### Spike Co-Simulation

#### Pros:

- FPGA simulation speed
- Verify against golden model
- Out-of-Band (no impact on workload execution)

#### Cons:

- Slower simulation performance (40 MHz)
- No uarch verification



#### Hands-on with AutoCounter






- We would like to observe some statistics about when Gemmini stalls
- \$ACYDIR/generators/gemmini/src/main/scala/gemmini/DMA.scala
  - Lines 324 and 637

```
PerfCounter(tl.a.ready && translate_q.io.deq.valid && io.tlb.resp.miss, "rdma_tlb_wait_cycles",
  "cycles during which the read dma is stalling as it waits for a TLB response")
PerfCounter(tl.a.valid && !tl.a.ready, "rdma_tl_wait_cycles",
  "cycles during which the read dma is stalling as it waits for the TileLink port to be available")

PerfCounter(tl.a.ready && translate_q.io.deq.valid && io.tlb.resp.miss, "wdma_tlb_wait_cycles",
  "cycles during which the write dma is stalling as it waits for a TLB response")
PerfCounter(tl.a.valid && !tl.a.ready, "wdma_tl_wait_cycles",
  "cycles during which the write dma is stalling … for the TileLink port to be available")
```





- For reference, the build recipe for this FPGA image (in \$FDIR/deploy/config\_build\_recipes.yaml) is:

```
firesim_gemmini_rocket_singlecore_no_nic:
    DESIGN: FireSim
    TARGET_CONFIG: FireSimLeanGemminiRocketConfig
    PLATFORM_CONFIG: WithAutoCounter_BaseF1Config
    deploy_triplet: null
    platform_config_args:
        fpga_frequency: 10
        build_strategy: TIMING
    post_build_hook: null
    metasim_customruntimeconfig: null
    bit_builder_recipe: bit-builder-recipes/f1.yaml
```



#### Update our workload to copy back the AutoCounter output file:

- vim \$FDIR/deploy/workloads/resnet50-baremetal.json
- Add the AUTOCOUNTERFILE\*.csv to our simulation output

```
{
   "benchmark_name": "resnet50-baremetal",
   "common_simulation_outputs": [
       "uartlog", "AUTOCOUNTERFILE*.csv"
   ],
   "common_bootbinary": "...",
   "common_rootfs": "..."
}
```

Make sure to avoid adding an extra comma!
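A stray trailing comma is easy to catch before launching a run, because strict JSON parsers reject it. This is a minimal Python sketch using only the standard `json` module; the `json_error` helper is hypothetical, and the two strings are cut-down stand-ins for the workload file.

```python
import json
from typing import Optional


def json_error(text: str) -> Optional[str]:
    """Return None if `text` is valid JSON, else a short description of the first error."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return f"line {err.lineno}, column {err.colno}: {err.msg}"
    return None


good = '{"common_simulation_outputs": ["uartlog", "AUTOCOUNTERFILE*.csv"]}'
bad = '{"common_simulation_outputs": ["uartlog", "AUTOCOUNTERFILE*.csv",]}'

print("good:", json_error(good))
print("bad: ", json_error(bad))
```

The same check is available from the shell by running `python3 -m json.tool` on the workload file.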





- Set up config\_runtime.yaml

\$ vim \$FDIR/deploy/config\_runtime.yaml

- Select the AGFI that was synthesized with counters
- Select the baremetal ResNet50 workload
- Set the counter read rate to enable the counters
- Boot the simulation by running the following sequence of commands:

\$ firesim infrasetup

- This should take about 3 minutes

\$ firesim runworkload







#### While this is running...









#### Target-Level Simulation

- Software Simulation
- Target Design Untransformed
- No Host-FPGA interfaces

#### Metasimulation

- Software Simulation
- Target Design Transformed by Golden Gate
- Host-FPGA interfaces/shell emulated using abstract models

#### FPGA-Level Simulation

- Software Simulation
- Target Design Transformed by Golden Gate
- Host-FPGA interfaces/shell simulated by the FPGA tools





















| Level  | Waves | VCS    | Verilator | XSIM    |
|--------|-------|--------|-----------|---------|
| Target | Off   | ~5 kHz | ~5 kHz    | N/A     |
| Target | On    | ~1 kHz | ~5 kHz    | N/A     |
| Meta   | Off   | ~4 kHz | ~2 kHz    | N/A     |
| Meta   | On    | ~3 kHz | ~1 kHz    | N/A     |
| FPGA   | On    | ~2 Hz  | N/A       | ~0.5 Hz |





#### Back to our hands-on example





#### Viewing the Simulation

Look for the run instance's IP address in the status:

```
FireSim Simulation Status @ 2022-06-18 00:17:10.188191
This workload's output is located in:
/home/centos/chipyard/sims/firesim/deploy/results-workload/2022-06-1800-16-00-mobilenet-baremetal/
This run's log is located in:
/home/centos/chipyard/sims/firesim/deploy/logs/2022-06-1800-16-00-runworkload-NEZCRUKBA2M44B9M.log
This status will update every 10s.
Instances
Hostname/IP: 192.168.3.52   Terminated: False
Simulated Switches
Simulated Nodes/Jobs
Hostname/IP: 192.168.3.52   Job: resnet50-baremetal0   Sim running: True
Summary
1/1 instances are still running.
1/1 simulations are still running.
```



#### Viewing the Simulation

- On the *manager* instance, ssh into the run farm instance:

\$ ssh 192.168.3.52



Readme: /home/centos/src/README.md AMI Release Notes: /home/centos/src/RELEASE\_NOTES.md GUI/Cluster setup: https://github.com/aws/aws-fpga/blob/master/developer\_resources

- Then look at the stream of prints (or, if the run is complete, look at the output results)

\$ tail -f sim\_slot\_0/AUTOCOUNTER\*





#### Output file in

\$FDIR/deploy/results-workload/<timestamp>-resnet50-baremetal/resnet50-baremetal0/AUTOCOUNTER0.csv

```
version,1
Clock Domain Name, ...
label, local_cycle, ...rdma_tlb_wait_cycles, ...wdma_tl_wait_cycles, ...wdma_tlb_wait_cycles, ...rdma_tl_wait_cycles
"description", ...
type, Accumulate, Accumulate, Accumulate
event width, 1, 1, 1, 1
accumulator width, 64, 64, 64, 64
...
50000000, 50000000, 37, 245382, 2175, 287878
...
170000000,170000000,5416,5073118,56093,10989706
...
2320000000,2320000000,24953,15317004,364054,25548250
...
```
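Because the counters accumulate, the interesting number is usually the difference between two samples rather than the raw value. The Python sketch below reads the label row and the sample rows and computes per-interval deltas. It makes several assumptions: the helper names are hypothetical, the label row starts with `label` (or `Labels`), sample rows start with an integer, and the column names are written without their elided prefixes in an assumed order. The two data rows are taken from the output above; they are not adjacent samples in the real file.

```python
import csv
import io
from typing import Dict, List

# Two sample rows from the run above, with assumed full column names.
EXAMPLE = """\
label,local_cycle,rdma_tlb_wait_cycles,wdma_tl_wait_cycles,wdma_tlb_wait_cycles,rdma_tl_wait_cycles
50000000,50000000,37,245382,2175,287878
170000000,170000000,5416,5073118,56093,10989706
"""


def read_samples(text: str) -> List[Dict[str, int]]:
    """Return one {label: value} dict per sample row; other header rows are skipped."""
    labels: List[str] = []
    samples: List[Dict[str, int]] = []
    for row in csv.reader(io.StringIO(text)):
        cells = [cell.strip() for cell in row]
        if not cells or not cells[0]:
            continue
        if cells[0].lower() in ("label", "labels"):
            labels = cells[1:]
        elif cells[0].isdigit() and labels:
            samples.append(dict(zip(labels, (int(c) for c in cells[1:]))))
    return samples


def interval_deltas(samples: List[Dict[str, int]]) -> List[Dict[str, int]]:
    """Difference between each pair of consecutive samples."""
    return [
        {name: cur[name] - prev[name] for name in cur}
        for prev, cur in zip(samples, samples[1:])
    ]


samples = read_samples(EXAMPLE)
deltas = interval_deltas(samples)
for name, value in deltas[0].items():
    print(f"{name:>22}: {value}")
```

Between these two samples the TileLink wait counters grow by millions of cycles while the TLB wait counters grow by thousands, so the DMA is mostly waiting on memory bandwidth.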

#### ... let's view this as a table



| label      | local_cycle | rdma_tlb_wait_cycles | wdma_tl_wait_cycles | wdma_tlb_wait_cycles | rdma_tl_wait_cycles |
|------------|-------------|----------------------|---------------------|----------------------|---------------------|
| 50000000   | 50000000    | 37                   | 245382              | 2175                 | 287878              |
| 170000000  | 170000000   | 5416                 | 5073118             | 56093                | 10989706            |
| 2320000000 | 2320000000  | 24953                | 15317004            | 364054               | 25548250            |

In the sample at cycle 170M, waiting for a memory response (TileLink) is much higher than waiting for a TLB response.





If tail'ing the results:

- Exit tail with Ctrl-c
- Then exit the simulation instance with Ctrl-d to return to the manager instance



Don't forget to terminate your runfarms (otherwise, we are going to pay for a lot of FPGA time)

\$ firesim terminaterunfarm

Type yes at the prompt to confirm





#### The FireSim Vision: Speed and Visibility

- High-performance simulation
- Full application workloads
- Tunable visibility & resolution
- Unique data-based insights



#### Summary

- Debugging Using Integrated Logic Analyzers (docs)
- Advanced Debugging and Profiling Features
  - TracerV (docs)
  - AutoCounter (docs)
  - Assertion and Print Synthesis (docs)
- Debugging Using Software Simulation (docs)
  - Target-Level
  - Metasimulation
  - FPGA-Level
- FireSim Debugging and Profiling Future Vision

Check out https://docs.fires.im/ for more usage details