Hardware-Software Codesign

0. Overview

  • Specification and Models of Computation
    • State-Charts
    • Kahn Process Networks
  • Design Space Exploration
    • Mapping
    • Partitioning
    • Multi-Criteria Optimization
  • Performance Estimation
    • Simulation-based Methods
    • Worst-Case Execution Time Analysis
    • Performance Analysis of Distributed Systems
    • Thermal-aware Design

1. Introduction

System Design

  • Specifications and Models
  • Design
  • Estimation

Embedded Systems

  • Information processing system embedded in a larger product
  • Interface to outside world
    • external process
    • Sensors, actors
    • Human interface
  • Technical environment does not forgive errors
  • Stricter requirements (usually update not possible)
Difference Embedded Systems ↔ General Purpose Computing
Embedded SystemsGeneral Purpose Computing
Few Applications, known at design-timeBroad class of applications
Not programmable by end userProgrammable by end user
Fixed run-time requirementFaster is better
Levels of abstraction (from higher to lower)
  • Transistors
  • Gates
  • Processor (Memory, Register, ALU, …)
  • Architecture (CPU, ASIC, FPGA, LAN, …)
  • Distributed System
Design
  • Design aspects look similar in every level
  • Design goes usually from higher abstraction level to lower level
  • Abstraction: leave out (unimportant) details
  • System level design:
    • SW Synthesis (compile)
    • HW Synthesis
    • Interfaces
Estimation
  • Reason about solution
  • Different solutions for a specific task
  • Precision of estimation is different on each level
    • Higher level estimation usually less precise than lower level
Mapping
  • Mapping parts of specification to different parts of the system
    • Similar to scheduling but bigger design space (including HW)
Specification
  • Specification: in embedded systems higher requirements to correctness
  • Specific languages:
    1. Developer is restricted (lesser errors)
    2. Better code generator/synthesizer

Doing things right is difficult!

  • Observer Pattern is not save!
    • Race conditions
    • Dead-Locks
    • Use non-locking algorithms!

2. Models of Computation

Model of computation = Components + Communication

  • Hierarchy
    • Behavioral
    • Structural
  • Timing
  • State oriented behavior (micro controller, FPGA)
  • Dataflow oriented behavior (DSP)

State Charts

  • Classical automata (Moore/Mealy): FSM

  • State: Information needed to get (determine) the output form the input

  • State Charts introduce hierarchy

    • Combining States with Sub-States in Super-States
    • Active States
    • Basic States (no Sub-States)
    • Super-States (ancestor states): OR-Super-States (hierarchy)
    • Default State, History State (can be combined)
    • AND-Super-States (concurrency)
  • 2 distinct features form FSM

    • Hierarchy
    • Difference of Control-Path and Data-Path
  • Communication between Control- and Data-Path

    • Control-Path to Data-Path: action
    • Data-Path to Control-Path: condition
  • Events, Conditions, Actions

event[condtition]/action\underrightarrow{event [condtition] / action}
  • No storage of events (only available for next step)
  • Events are global
Evaluation of State Charts
  • Evaluation is not trivial!
  • Three phases
    1. Event is emitted
    2. Transition is selected
    3. Simultaneously make transition and apply the actions
      • Execute the right hand side of action simultaneously and assign then after

Specification and Description Language (SDL)

  • Targeted at unambiguous specification and description of systems
  • Asynchronous message passing
  • Appropriate also for distributed systems
  • Communication between FSMs (or processes)
    • message passing with FIFO queues
    • FIFOs can be indefinitely large
  • Each process fetches entry from FIFO
    • if input enables transition it takes place
    • otherwise discard input
  • All orders of events are legal: different correct behavior in simulators (not deterministic)

Dataflow Languages

  • Imperative language style (program counter)

  • Movement of data is priority

  • Scheduling: responsibility of the system (not the programmer)

  • Basic characteristics:

    • all processes run ‘simultaneously’
    • Processes can be described with imperative code
    • Processes can only communicate through buffers
    • Sequence of read tokens is same as of read tokens
  • Useful for applications that deal with streams of data

    • Concurrent: maps to parallel hardware
    • Perfect for block-diagram specifications (control systems, signal processing)
Kahn Process Networks
  • General-purpose scheme for parallel programming
    • read: destructive and blocking (reading empty channel blocks until data is available)
    • write: non-blocking
    • FIFO: infinite size
  • Unique attribute: determinate
Determinacy
  • Random: knowing about system and inputs is not sufficient to determine output
  • Determinate: the histories of all channels depend only on the histories of the input channels
  • Importance
    • Functional behavior is independent of timing (scheduling, communication time, execution time of processes)
    • Separation of functional properties and timing
Adding Non-Determinacy
  • Several possibilities
    • Allow processes to test for empty queues
    • Allow multiple processes to write to or read from one channel
    • Allow processes to share a variable (memory)
Synchronous Dataflow (SDF)
  • Restriction of Kahn Networks to allow compile-time scheduling
  • Each process reads and writes a fixed number of tokens each time it fires
  • Firing is an atomic process
  • Schedule can be determined completely at compile-time
  • Steps:
    1. Establish relative execution rates (solving a system of linear balancing equations)
    2. Determine periodic schedule by simulating system for a single round

3. Mapping Application to Architecture

System synthesis from specification

  • Allocation: Select components (HW)
  • Binding: Bind SW (Application) to HW (Components)
  • Scheduling: Executing code

Mapping = Binding + Scheduling

Specification Examples

Data-Flow Graph (DFG)
x = *3a + B*b - c;
y = a + b*x;
z = b - c*(a + b);
  • no loops
  • parallelism

Data Flow Graph on Wikipedia

Control-Flow Graph (CFG)
what_is_this {
    read(a, b);
    done = FALSE;
    repeat {
        if (a>b)
            a = a-b;
        else if (b>a)
            b = b-a;
        else done = TRUE;
    } until done;
    write(a);
}
  • State machine
  • Conditional branches

Control Flow Graph on Wikipedia

Architecture Specification

  • Reflects structure and properties of underlying platform
  • Can be done at different abstraction levels

Mapping Specification

Mapping: application and architecture specification

  • binds processes to processors
  • binds communication between processes to architecture communication paths
  • specifies resource sharing and scheduling

Partitioning and Mapping

4. System Partitioning

Partitioning: Given a set of objects: Partition this set in subsets.

Mapping: Subset is mapped to specific resource.

Mapping at different Levels

Low Level
  • Register Transfer Level (RTL) or Netlist Level
  • Split a digital circuit and map it to several devices (FPGAs, ASICs)
  • System parameters (e.g., area, delay) relatively easy to determine
High Level
  • System Level
  • Comparison of design alternatives for optimality (design space exploration)
  • System parameters are unknown and difficult to determine
    • to be estimated via analysis, simulation, (rapid) prototyping

Cost Functions

  • system cost (C[$])
  • latency (L[sec])
  • power consumption (P[W])
Estimation

Estimation is required to find C, L, P values, for each design point

i.e

f(C,L,P)=k1hC(C,Cmax)+k2hL(L,Lmax)+k3hP(P,Pmax)f(C,L,P)= k_1 \cdot h_C(C,C_{max})+ k_2 \cdot h_L(L,L_{max})+ k_3 \cdot h_P(P,P_{max})
  • hCh_C, hLh_L, hPh_P: denote how strong CC, LL, PP violate design constraints CmaxC_{max}, LmaxL_{max}, PmaxP_{max}
  • k1k_1, k2k_2, k3k_3: weighting and normalization
The Formal Partitioning Problem

Assign nn objects O={o1,,on}O= \{ o_1, \cdots , o_n \} to mm blocks (also called partitions) P={p1,,pm}P= \{ p_1, \cdots , p_m \}, such that

  • p1p2pm=Op_1 \cup p_2 \cup \cdots \cup p_m = O (all objects are assigned - mapped)
  • pipj={}  i,j:ijp_i \cap p_j = \{ \}\; \forall i,j:i\neq j (an object is not assigned or mapped twice)
  • and costs c(P)c(P) are minimized
Partitioning Methods

Exact vs. Heuristic Methods:

  • Exact provides optimal solution (or set of solutions)
  • Heuristic provides a ‘good’ solution but not best

Overview:

  • Exact Methods
    • Enumeration
    • Integer linear programs (ILP)
  • Heuristic
    • Constructive Methods
      • Random mapping
      • Hierarchical clustering
    • Iterative methods
      • Kerninghan-Lin algorithm
      • Simulated annealing
      • Evolutionary algorithm
Hierarchical Clustering
  • Combine always 2 nodes
  • Then recalculate
  • Repeat until goal achieved
  • Steps can be visualized in a tree diagram
Integer Linear Programming (ILP)
  • Exact method
  • Objective function
  • Constraints

Objective:

C=xiXaixi  with  aiR,xiNC = \sum_{x_i\in X}a_ix_i\; with\; a_i \in \mathbb{R}, x_i \in \mathbb{N}

Constraints:

jJ:xiXbi,jxicj  with  bi,j,cjR\forall j \in J: \sum_{x_i \in X}b_{i,j}x_i \geq c_j \; with \; b_{i,j}, c_j \in \mathbb{R}

Integer programming problem:

  • Minimize objective function subject to constraints
Example for Partitioning
  • optimize for a load balanced system

Scheduling:

Taskt0t1t2t3
Processor01100
Processor10011

Run Times:

Run timet0t1t2t3
Processor05151030
Processor110201010

Cost ci,kc_{i,k}:

c0,0=5c0,1=10}0i30k1\left.\begin{matrix} c_{0,0} = 5 \\ c_{0,1} = 10 \\ \cdots \end{matrix}\right\} \begin{matrix} 0 \leq i \leq 3 \\ 0 \leq k \leq 1 \end{matrix}

Binary variables xi,kx_{i,k}:

x0,0+x0,1=1x1,0+x1,1=1}k=01xk,i=1    0i3\left.\begin{matrix} x_{0,0} + x_{0,1} = 1 \\ x_{1,0} + x_{1,1} = 1 \\ \cdots \end{matrix}\right\} \sum_{k=0}^1 x_{k,i} = 1\; \forall\; 0 \leq i \leq 3
Approaches
  1. Approach:
min{i=03k=01ci,kxi,k}min\left \{ \sum_{i=0}^3 \sum_{k=0}^1 c_{i,k} x_{i,k} \right \}

2. Approach:

min{i=03ci,0xi,0i=03ci,1xi,1}min\left \{ \left | \sum_{i=0}^3 c_{i,0} x_{i,0} - \sum_{i=0}^3 c_{i,1} x_{i,1} \right |\right \}

This is not a linear problem!

Solutions

1. Solution:

Solving 2 linear problems: split approach 2 into 2 linear problems:

min{i=03ci,0xi,0l1i=03ci,1xi,1l2}min \left \{ \left | \underbrace{\sum_{i=0}^3 c_{i,0} x_{i,0}}_{l_1} - \underbrace{\sum_{i=0}^3 c_{i,1} x_{i,1}}_{l_2} \right | \right \}

Becomes:

l0l1:min{l0l1}l1l0:min{l1l0}\begin{matrix} l_0 \geq l_1: min\left \{ l_0 - l_1 \right \} \\ l_1 \geq l_0: min\left \{ l_1 - l_0 \right \} \end{matrix}

2. Solution:

Empirical approach: Run code and measure execution time and try to minimize it.

Move tasks to different processors.

l0=(i)ci,0xi,0l1=(i)ci,1xi,1\begin{matrix} l_0 = \sum_{(i)} c_{i,0} x_{i,0}\\ l_1 = \sum_{(i)} c_{i,1} x_{i,1} \end{matrix}
min{B}min \{ B \}
Bl0Bl1\begin{matrix} B \geq l_0 \\ B \geq l_1 \end{matrix}
Iterative Methods

Often used principle:

  • Start with some initial configuration (partitioning)
  • search neighborhood (similar partitions)
    • select a neighbor as candidate
  • evaluate fitness (cost) function of candidate
    • accept candidate using acceptance rule
    • if not, select another neighbor
  • Stop if:
    • Quality is sufficient or
    • no improvement can be found or
    • after a fixed time

Ingredients:

  • initial configuration
  • function to find a neighbor as candidate
  • cost function
  • acceptance rule
  • stop criterion

Disadvantages:

  • local optimum as best result
  • local optimum depends on initial configuration
  • generally no upper bound on iteration length
Kernighan-Lin
  • as long as a better partition is found
    • from all possible pairs of objects: virtually re-group the ‘best’ (lowest cost of resulting partition)
    • from the remaining (not yet touched) objects: virtually re-group the ‘best’ pair
    • continue until all objects have been re-grouped
    • from these n/2 partitions, take the one with smallest cost and actually perform the corresponding re-group operation
Simulated Annealing
  • inspired by physical process
  • replace existing solutions by (random) new feasible solutions form neighborhood
  • always accept better solutions but allow for a guided acceptance for worse neighbors
  • gradually cooling: gradually decrease the probability of accepting worse solutions

Advantage:

  • Allowance for ‘uphill’ moves potentially avoids local optima

Possible implementation:

temp = temp_start
cost = c(P)
while Frozen() == False:
    while Equilibrium() == False:
        P_new = RandomMove(P)
        cost_new = c(P_new)
        deltacost = cost_new - cost
        if Accept(deltacost, temp) > random(0,1)
            P = P_new
            cost = cost_new
        temp = DecreaseTemp(temp)

Functions:

RandomMove(P):

  • Choose a random solution in the neighborhood of P

DecreaseTemp(), Frozen():

  • Cooling down. There are different choices. For example:
    • initially: temp := 1.0
    • in any iteration: temp := alpha * temp (typically: 0.8 \leq alpha \leq 0.99)
  • Frozen
    • after a given time or
    • if there is not further improvement

Equilibrium():

  • Usually after a defined number of iterations

Accept(deltacost, temp):

  • edeltacostktempe^{- \frac{deltacost}{k \cdot temp}}

Complexity:

  • From exponential to constant depending on

Equilibrium(), DecreaseTemp() and Frozen().

5. Multi-Criteria Optimization

Part of system synthesis.

Exploration of different solutions.

Example: Network Processors

Network processor: High-performance, programmable device designed to efficiently execute communication workloads

  • Given:
    1. Specification of the task structure (task model): tasks to be executed for each flow
    2. Different usage scenarios (flow model): sets of flows
  • Sought:
    • Network processor implementation: architecture + task mapping + scheduling
  • Objectives:
    1. Maximize performance
    2. Minimize cost
  • Subject to:
    1. Memory constraint
    2. Delay constraint

Basic Definition

  • We intend to minimize a vector-valued objective function
f=(f1;f2;...;fn):XRnf= (f_1; f_2; ... ; f_n): X \rightarrow R^n
  • XX denotes the decision space. i.e the feasible set of alternatives for the optimization problem
  • The image of the decision space XX using the objective function ff is denoted as the objective space ZRnZ \subset R^n with
Z={f(x)xX}Z = \{f(x) |x \in X \}
  • A single alternative xXx \in X is (sometimes) named solution and the corresponding object value z=f(x)Zz = f(x) \in Z is named objective vector

decision space → objective space

Basic question:

How do we define the minimum of a vector-valued function?

Pareto Dominance

Definition:

A solution aXa \in X weakly Pareto-dominates a solution bXb \in X, denoted as aba \preceq b, if it is at least as good in all objectives, i.e:

fi(a)fi(b)  1inf_i(a) \leq f_i(b)\; \forall 1 \leq i \leq n
  • Dominance is transitive
  • A solution is named Pareto-optimal, if it’s not Pareto-dominated by any other solution in XX
Optimization Alternatives
  • Classical single objective optimization methods
    • Simulated annealing
    • Integer linear program
    • other constructive or iterative heuristic methods

→ Decision making is done before the optimization

  • Population based optimization methods
    • Evolutionary / genetic algorithms

→ Decision making is done after the optimization

Evolutionary Algorithms

Basic scheme (many variations exist):

  1. A set of initial solution (initial population is chosen (usually at random). This set is called parent set.
  2. Solutions form parent set are selected (mating selection)
  3. Solutions from mating selection are changed using neighborhood operators (cross-over and mutation operators). The resulting set is called children set.
  4. Determine union of the parent and the children sets
  5. Solutions of the set from 4. are selected based on their merit to construct the new parent set (environmental selection).
  6. continue at 2.

Evolutionary Algorithm

  • Cross-over operator: Take best parts of two solutions
  • Mutation operator: same as in simulated annealing (make small variations)
  • How to choose solutions that should be removed from the population
    • Solutions should be ‘close’ to the (unknown) Pareto-optimal front (optimality)
    • Solutions should cover large parts of the objective space (diversity)
Hypervolume Indicator
  • Environmental selection:
    • Select subset of solutions that maximizes hypervolume indicator

Given a set of solutions AXA \subseteq X and a set of reference points RRnR \subset \mathbb{R}^n. Then the hypervolume indicator IH(A,R)I_H(A,R) of AA with respect to RR is defined as

IH(A,R)=zH(A,R)dzI_H(A,R) = \int_{z \in H(A,R)} dz

where H(A,R)H(A,R) id the dominated space of AA regarding RR:

H(A,R)={zRnaA:rR:(f(a)zr)}H(A,R) = \{ z \in \mathbb{R}^n | \exists a \in A : \exists r \in R : (f(a) \leq z \leq r) \}

zz: Objective

  • The hypervolume indicator leads to diversity and optimality
Representation and Neighborhood
  • A representation corresponds to an abstract data that encodes a solution
  • Neighborhood operators work on representations

Issues:

  • Each solution has an encoding (completeness)
  • All solutions are represented equally often (uniformity)
  • Each encoding maps to a feasible solution (feasibility)

Encoding: i.e order of tasks executing to explore scheduling.

Tree Representation
  • Solutions can be represented as tree structures
  • Mutation:
    • Grow subtree
    • Shrink subtree
    • Switch two subtrees
    • Replace subtree
    • Subtrees can be single nodes
Handling Constraints

Constraint: g(x)0g(x) \geq 0

  • Feasible: g0g \geq 0
  • Infeasible: g<0g < 0

Approaches:

  1. Representation is chosen such that decoding always yields a feasible solution
  2. Construct initialization and neighborhood operators such that Infeasible solutions are not generated
  3. Add only feasible solutions to children population
  4. Preferably select feasible solutions in environmental selection
  5. Penalty function:
    • Calculate constraint violation g(x)g(x) and incorporate it into objective function
    • penalty(x)>0  ifg(x)<0,  penalty(x)=0  ifg(x)0penalty(x) > 0\; if g(x) < 0,\; penalty(x) = 0\; if g(x) \geq 0
    • i.e add penalty function to every objective (increase ‘price’ for Infeasible solutions)
  6. Include the constraints as new objectives

6. System Simulation

Estimation with simulation.

Estimation: find out if a solution is good or bad.

System and Model

  • System: Combination of components that act together
  • Model: Formal description of the system (covers selected information)

State

  • Contains all information of the system at t0t_0 that is necessary to determine the output (for tt0t \geq t_0) from the input (at (at tt0t \geq t_0).

Input + State = Output

  • The set XX of all possible states of a system is called it’s state space

Discrete/Continuous Systems

Examples:

  • Continuous state systems:
    • Physical processes (usually)
    • Electrical networks
    • Mechanical systems
  • Discrete state systems
    • Finite state machines
    • Queuing systems
    • Computer systems
  • Continuous time systems
    • Physical processes (usually)
    • Electrical circuits
    • Asynchronous systems
  • Discrete time systems
    • Digital clocked system
    • Equidistant sampling (z-transform)
    • Synchronous system models
Discrete Event Systems (DES)
  • Driven by events

  • Event: e=(v,t)e = (v, t) is a touple of a value vv and a tag tt

    • Tags are usually totally ordered
    • If the tag denotes time, then the event is a timed event
    • If the tag (only) denotes a sequence number, the event is an untimed event
  • DES is an event-driven system

    • Depends entirely on occurrence of discrete events
    • Does not depend on evolution of time
  • Can be defined in continuous of discrete time

  • The state space XX can be either discrete or continuous

  • Signals or streams represent ordering/timing of events

  • Processes are represented as functions acting on signals or streams

Simulation
  • Discrete-time, time-driven

    • Simulated time is partitioned into equidistant time intervals
    • Length of time intervals determined by
      • The simulated system (i.e clock period)
      • The intended precision (dicretization loss)
      • The simulation effort
    • A simulation step is performed even if nothing happens
  • Event-driven simulation

    • State change only at events
    • Analysis and simulation possible in discrete or continuous time
Discrete-Event Modeling and Simulation
  • Concurrent processes are modeled using modules
    • Behavior described as logic/algebraic expressions
    • State described using persistent variables (inside module)
    • Communication between modules through ports, via signals
    • Synchronization through abstract events
  • Modules can be hierarchical
  • Event-Driven
Components of Discrete-Event Simulation
  • Simulation time
    • Simulation clock represents the current value of simulation time
    • In discrete-event simulations the clock advances to the next event start-time during simulation
  • Event list
    • Events are processed in order
    • Event list is typically organized as priority queue
    • Event lists may include simulation times when events will happen
  • System modules
    • Model subsystems of simulated system
    • Are called by simulation engine if a relevant event is scheduled
Simulation Engine
  1. Initialization routine
    • Initialize the simulation model;
      • Set initial states of subsystem modules
      • Fill the event queue with initial events
  2. Timing routine
    • Determine next event from the event queue
    • Advance the simulation clock to the time when the event occurs
  3. Event routine
    • Update system state when event occurs
  • In a simulation cycle ‘cause’ and ‘effect’ can share the same time of occurrence
    • Use zero duration virtual time interval called delta cycle (δ\delta)

SystemC

  • System-level modeling language
    • Several levels of abstractions
      • from purely functional
      • to cycle-accurate/pin-accurate
    • Special attention to systems with embedded software
  • C++ library for modeling concurrent systems
    • HW-oriented data types
    • Communication mechanisms
    • Concurrency modeling
  • Event-driven simulation kernel for executing models
  • Free (open source)
  • For Windows, Linux, OS X…
Language Architecture
  • Core language
    • Minimal set of modeling constructs
      • structural description
      • concurrency
      • communication
      • synchronization
  • Data types (separated from core language)
  • On top of core language
    • Communication mechanism (signals, FIFO…)
    • Models of Computation (MoCs)
  • Layer (tier) architecture
Processes

Processes are the basic units of functionality.

  • SC_THREADs
    • Typically called once, run forever (in a while(true) loop)
    • Can be suspended by calling wait()
    • Keep the state of execution implicitly
  • SC_METHODs
    • Execute repeatedly from the beginning to end
    • Simulate faster
    • Do not keep the state of execution implicitly
  • Processes must be contained in a module
    • Bun not every member function is a process
Modules
  • Building blocks of models
    • Hierarchy
    • Abstraction
    • IP reuse
Inter Process Communication
  • Processes can communicate directly trough signals
  • Events: Flexible, low-level synchronization primitive
  • Channel:
    • Container for communication and synchronization
      • can have state (private data), transport data, transport events
    • Implement one or more interfaces
  • Interface:
    • Set of access methods to the channel
    • Interface methods need to be implemented
Wait and Notify
  • wait: halt process until an extent is raised
    • wait() with arguments → dynamic sensitivity
      • wait(sc_event)
      • wait(time)
      • wait(timeout, sc_event)
  • notify: rise an event
    • notify() with arguments → delayed notification
      • my_event.notify();: notify immediately
      • my_event.notify(SC_ZERO_TIME);: notify next delta cycle
      • my_event.notify(time);: notify after ‘time’

Levels of Abstractions

Simulation can be done on different levels of abstraction.

  • Functional level (untimed/timed)
    • Use: Model (un-)timed functionality
    • Communication: shared variables, messages
    • Typical languages: C/C++, Matlab
  • Transaction level
    • Use:
      • MPSoC architecture analysis
      • Early SW development
      • Timing estimation
    • Communication: method calls to channels
    • Typical languages: SystemC
  • Register transfer level / Pin level
    • Use: HW design and verification
    • Communication: wires and registers
    • Typical languages: Verilog, VHDL

7. Design Space Exploration

Example: Network Processor (NP)

  • Different processing units: ARM9, DSP, …
  • Streams: ordered sequence of tasks
Optimal Design of Network Processor

TT: Tasks ww: Requests MM: Mapping RR: Resources cc: Cost

  • Mappings: MT×RM \subseteq T \times R all possible bindings of tasks
    • i.e if (t,r)M(t,r) \in M then task tt could be executed on resource rr
  • Request w(r,t)0w(r,t) \geq 0
    • i.e execution of one packet in tt would use ww computing units of rr
  • Resource allocation cost c(r)0c(r) \geq 0
  • Binding ZZ of tasks to resources ZMZ \subseteq M
    • leading to actual implementation
    • Subset of mappings MM s.t every task tTt \in T is bound to exactly one allocated resource
      • eRalloc(r)=1e \in R alloc(r) = 1 and
      • r=bind(t)r = bind(t)
Design constraints
  • Delay constraints: e.g max time a packet is processed within network processor
  • Throughput maximization: e.g packets per second
  • Cost minimization: implementing with small amount of resources (processing units, memory, networks…)
  • And conflicting usage scenarios …
NP Design Space Exploration
  • Allocation: determine HW components
  • Binding: For each SW process choose allocated HW unit
  • Scheduling: For tasks on specific resources choose scheduling policy

Performance Estimation

Global Picture

Metric
  • Time
  • Power
  • Area (integration on silicon)
  • Cost
  • Quality
  • SNR
Method
  • Analytic
  • Simulation
  • Statistic
Abstraction Level
  • Low-level: Register Transfer Level, Instruction Set Architecture
  • Intermediary-level: Transaction Level-Model, Operating System, already linked to HW
  • High-level: Functional, High-level language, independent of HW
Subsystem to Analyze
  • HW subsystem
  • CPU subsystem
  • Interconnect subsystem
  • Multiprocessor System-on-Chip (MPSoC)

Performance Estimation in Design Flow

  • High-level
    • Advantages:
      • Short estimation time (fast)
      • Implementation details not necessary
    • Drawbacks:
      • Limited accuracy
      • e.g no information about timing
  • Low-level
    • Advantages:
      • Higher accuracy
    • Drawbacks:
      • Long estimation time (slow)
      • Many implementation details need to be known

Multi-Processor System-on-Chip

Simulation should be divided in to independent modules. So they can be exchanged.

Composable in terms of:

  • Subsystems
  • Interactions
  • Computation
  • Communication
  • Memory
  • Scheduling

Estimation should cover different metrics e.g:

  • Delays
  • Throughput
  • Memory consumption
  • Power
  • Energy
  • Temperature
  • Cost
Communication
  • Bus
  • Network
  • Hierarchical Bus
Computation and Memory
  • DRAM
  • SDRAM
  • DSP
  • RISC
  • FPGA
  • muC
  • HW
Scheduling and Arbitration
  • TDMA
  • EDF
  • FCFS
  • proportional share
  • fixed priority
  • static
Difficulty

Why is MPSoC performance estimation difficult?

  • Computation, communication and memory
    • Non-deterministic computation in processing nodes
    • Non-deterministic communication delays
    • Non-deterministic memory access
    • Complex resource interaction via scheduling/arbitration
  • Cyclic timing dependencies
    • Internal data streams interact on computation and communication resources
    • Interaction determines stream characteristics
  • Uncertain environment
    • Different load scenarios
    • Unknown, worst case inputs

Making one component faster can make the complete system slower!

If a component is replaced with a faster one it could block shared resources (e.g bus) for other components.

Abstraction

  • Technology: transistors, layouts
  • Signal: gate, schematic, RTL
  • Transaction: SW, HW systems
  • Tokens: SW tasks, comm. backbones, IPs
  • Simulators:
    • SPICE
    • VHDL
    • SystemC / Instruction Set Simulator

Performance Estimation Methods

How to evaluate a system?

  • Measurements
    • Use existing instance of the system to measure performance
  • Simulation
    • Develop a program which implements a model of the system and evaluate performance by running the program
  • Statistics
    • Develop a statistical abstraction of the system and derive statistic performance via analysis or simulation
  • Formal analysis
    • Develop a mathematical abstraction of the system and compute formulas which describe the system performance
Static Analysis Models
  • Describe computing, communication and memory resources by algebraic equations
  • Describe system properties by parameters (e.g data rate)
  • Combine relations
comm_delay=#wordsburst_sizecomm_timecomm\_delay= \left \lceil \frac{\# words}{burst\_size} \right \rceil \cdot comm\_time
  • Fast and simple estimation
  • But generally inaccurate modeling (e.g resource sharing not modeled)
Dynamic Analytic Models
  • Combination between
    • Static models possibly extended by non-determinism in run-time and event processing
    • Dynamic models for describing e.g resource sharing mechanism (scheduling and arbitration)
  • Existing approaches
    • Classical real-time scheduling theory
    • Stochastic queuing theory (statistical bounds)
    • Analytic (non-deterministic) queuing theory (worst-/best-case bounds)
Stochastic Queuing Systems
  • A stochastic model for queuing systems is described by probability density functions (distributions) of
    • Arrival rates
    • Service mechanisms
    • Queuing disciplines
  • Performance measures are stochastic values (functions)
    • Average delays in queue
    • Time-average number of customers in queue
    • Proportion of time servers is busy

Markovian (exponential) distribution

Worst-Case/Best-Case Queuing Systems
  • A worst-/best-case queuing system is described by worst-/best-case bounds on system properties
    • worst-/best-case bounds on arrival times
    • worst-/best-case on server behavior
    • resource interaction
  • Performance measures
    • worst-/best-case delay in queue
    • worst-/best-case number of customers in queue
    • worst-/best-case system delay
Simulation
  • Model
    • Program implementing a model of the system (application, hardware, platform, mapping)
    • Performance is evaluated by running the program
  • Simulation
    • Considers HW platform and mapping of application on that platform (virtual platform)
    • Combines functional simulation and performance data
    • Evaluates behavior for one simulation scenario
  • Typically complex set-up and extensive run-times
  • Accurate results and good debugging possibilities
Trace-Based Simulation
  • Abstract system-level simulation (without timing)
    • Faster than low-level simulation
    • But still based on a single input trace
  • Abstraction
    • Application: abstract execution traces
      • → Graph of events: read, write and execute
    • Architecture: ‘virtual machines’ and ‘virtual channels’
      • → Calibrated with non-functional properties (timing, power, energy)
  • Trace-based simulation steps
    • Build application abstract model
      • → execution trace determined by functional application simulation
    • Extending abstract model with architecture and mapping
      • → Event graph extended by non-functional properties of virtual architecture elements
    • Simulation of extended model

9. Worst Case Execution Time Analysis

  • Calculation not simulation
  • Measurement of running task
  • Simulation would need accurate model
  • Worst case formal analysis
  • Program path analysis

Hard Real-Time Systems

  • Often in safety-critical applications
    • Aeronautics
    • Automotive
    • Train industries
    • Manufacturing control
  • Embedded controllers are expected to finish tasks reliably within time bounds
  • Task scheduling must be performed
  • Execution time bounds
    • Worst-Case Execution Time (WCET)
    • Best-Case Execution Time (BCET)
  • Works if:
    • Worst-case input can be determined or
    • Exhaustive measurement is performed
  • Otherwise:
    • Determine upper bound from execution times of instructions

Industry’s Best Practice

  • Measurements: determine execution times directly by observing the execution of a simulation on a set of inputs
    • Does not guarantee an upper bound to all executions in general
    • Exhaustive execution in general not possible
      • Too large space: ‘Input Domain’ ×\times ‘Initial execution states’
  • Compute upper bounds along the structure of the program
    • Programs are hierarchically structured
    • Statements are nested inside statements
    • So compute upper bound of a statements from the upper bounds of its constituents
Calculating Upper Bounds

ub(s)ub(s): Upper bound for statement ss

Sequence of Statements
AA1;A2A \equiv A1;A2

Constituents of AA: A1A1 and A2A2

Upper bound for AA is the sum of the upper bounds for A1A1 and A2A2

ub(A)=ub(A1)+ub(A2)ub(A) = ub(A1) + ub(A2)
Conditional Statement
A  if  Bthen  A1else  A2\begin{align*} A \equiv \; & if \; B \\ & then \; A1 \\ & else \; A2 \end{align*}

Constituents of AA:

  1. Condition BB
  2. Statements A1A1 and A2A2
ub(A)=ub(b)+max(ub(A1),ub(A2))ub(A) = ub(b) + \mathbf{max}(ub(A1), ub(A2))
Loops
Afor  i1  to  100  doA1\begin{align*} A \equiv for\; & i \leftarrow 1\; to\; 100\; do \\ & A1 \end{align*}

Precondition: Number of loops needs to be known (e.g 100)

ub(A)=ub(i)+100×(ub(i100)+ub(A1))+ub(i100)ub(A) = ub(i \leftarrow) + 100 \times (ub(i \leq 100) + ub(A1))+ ub(i\leq100)
Modern Hardware Features
  • Modern processors increase performance by using
    • Caches
    • Pipelines
    • Branch prediction
    • Speculation
  • These features make WCET computation difficult (execution times of instructions vary widely)
    • Best case: everything goes right: no cache miss, operands ready, resources free, branch predicted correctly
    • Worst case: everything goes wrong: all loads miss cache, resources occupied operands not ready
  • Span may be several hundred (even thousand) cycles
Program Path Analysis
  • Which sequence of instructions is executed in the worst-case (longest run time)?
  • Problem: The number of possible program paths grow with the length of the program
  • Number of loop iteration must be bounded
  • Must be done at machine code instruction level
Basic Blocks

A basic block is a sequence of instruction where the control flow enters at the beginning and exits at the end, without stopping in-between or branching (except at the end).

The branch instruction at the end of a basic block belongs to that block!

t1 := c - d
t2 := e * t1
t3 := b * t1
t4 := t2 + t3
if t4 < 10 goto L

Determine basic blocks of a program

  1. Determine the first instructions of blocks:
    • The first instruction
    • Targets of jumps (conditional/unconditional)
    • Instructions that follow jumps (conditional/unconditional)
  2. Determine the basic blocks
    • There is a basic block for each block beginning
    • The basic block consists of the block beginning and runs until the next block beginning (exclusive) or until the end of the program
Calculation of the WCET

A program consists of NN basic blocks, where each basic block BiB_i has a worst-case execution time cic_i and is executed for exactly xix_i times:

WCET=i=1NcixiWCET = \sum_{i=1}^N c_i \cdot x_i
  • The cic_i values are determined using the static analysis
  • Determine xix_i
    • Structural constraint given by the program structure
    • Additional constraints provided by the programmer (e.g bounds for loop counters…)

WCET Calculation

This image is taken from the lecture slides provided by Lothar Thiele.

Flow Equations (sum of incoming edges equals sum of outgoing edges)

d1=d2=x1d2+d8=d3+d9=x2d3=d4+d5=x3d4=d6=x4d5=d7=x5d6+d7=d8=x6d9=d10=x7\begin{align*} d1 = d2 = x_1 \\ d2 + d8 = d3 + d9 = x_2 \\ d3 = d4 + d5 = x_3 \\ d4 = d6 = x_4 \\ d5 = d7 = x_5 \\ d6 + d7 = d8 = x_6 \\ d9 = d10 = x_7 \end{align*}

Additional Constraints

  • Execute the program only once: d1=1d1 = 1
  • loop is executed at most 10 times: x310x1x_3 \leq 10 \cdot x_1
  • Block 5 is executed at most one time: x5<leq1x1x_5 <leq 1 \cdot x_1

The more information the better the result

Integer Linear Program for WCET Calculation

ILP with structural and additional constraints

WCET=max{i=1Ncixid1=1program is executed oncejin(Bi)dj=kout(Bi)dk=xi,i=1Nstructural constraints    additional constraints}WCET = max\left \{ \sum_{i=1}^N c_i \cdot x_i | \underbrace{d_1=1}_{\text{program is executed once}} \land \underbrace{\sum_{j\in in(B_i)} d_j = \sum_{k\in out(B_i)}d_k=x_i, i=1\ldots N}_{\text{structural constraints}} \;\land \;\text{additional constraints} \right \}
Abstract Interpretation (AI)
  • Semantics-based method for static program analysis
  • Basic idea
    • Perform programs computations using value descriptions (or abstract values)
    • Start with the description of all possible inputs
  • AI supports correctness proofs

Ingredients:

  • Abstract domain: related to concrete domain (by abstraction and concretization functions)
    • e.g LIntervalsL \rightarrow \text{Intervals}
    • where: Intervals=LB×UB,LB=UB=Int{,}\text{Intervals} = LB \times UB, LB = UB = Int \cup \{-\infty, \infty\}
    • instead of: LIntL \rightarrow Int
  • Abstract transfer functions: for each statement type (abstract versions of their semantics)
    • e.g +:Intervals×IntervalsIntervals+ : Intervals \times Intervals \rightarrow Intervals
    • where: [a,b]+[c,d]=[a+c,b+d][a,b] + [c,d] = [a+c,b+d]
    • with: ++ extends to ,-\infty, \infty
  • A join function combining abstract values from different control-flow paths
    • e.g :Interval×IntervalInterval\cup : Interval \times Interval \rightarrow Interval
    • where: [a,b][c,d]=[min(a,c),max(b,d)][a,b] \cup [c,d] = [min(a,c),max(b,d)]
Value Analysis

Motivation:

  • Provide access information to data-cache/pipeline analysis
  • Detect Infeasible paths
  • Derive loop bounds

Method:

Calculate intervals at all program points. Lower and upper bounds for the set of possible values occurring in the program (addresses, register contents, local and global variables).

Caches
  • Main memory to slow

  • Speed gap between CPU and memory too large (and increasing)

  • Caches work well in the average case:

    • Programs access data locally (many hits)
    • Programs reuse items (instructions, data)
    • Access patterns are distributed evenly across the cache
  • Caches: fast, small, expensive

  • Memory: relatively slow (bottle neck), large, cheap

  • Access cycles:

    • CPU ↔ Cache: ~1 cycle
    • Cache ↔ Memory: ~100 cycles

How caches work:

  • CPU wants to access (read/write) at memory address a
  • Caches:
    • Block m containing a is in the cache (hit):
      • Request for a is served in next cycle
    • Block m is not in the cache(miss):
      • m is transferred from main memory to the cache
      • m may replace some block in the cache
      • Request for a is served as soon as possible (while transfer still continues)
    • Several replacement strategies:
      • LRU: Least Recently Used
      • PLRU: Pseudo-LRU
      • FIFO
Cache Analysis

Statically precompute cache contents:

  • Must Analysis
    • For each program point (and calling context) find out which blocks are in the cache
    • Determines safe information about cache hits
    • Predicted cache hit reduces WCET
  • May Analysis
    • For each program point (and calling context) find out which blocks may be in the cache
    • Complement says what is not in the cache
    • Determines safe information about cache misses
    • Each predicted cache miss increases BCET
Must Cache
  • Only variables that are in all caches

  • Order: worst age (oldest)

  • Number of variables in cache \leq numbers of places in cache

  • Join

    • Intersection + maximal age
    • Interpretation:
      • Memory block a is definitely in the cache
      • Always a hit
May Cache
  • All variables that could be in cache
  • Minimal age
  • Join
    • Union
    • Minimal age
    • Interpretation:
      • All block may be in cache
      • None is definitely not in cache
Cache Contribution to WCET

Reference to variable s in code:

  • if s is in must-cache:
    • tWCET=thitt_{WCET} = t_{hit}
  • otherwise
    • tWCET=tmisst_{WCET} = t_{miss}
  • if s is in may-cache:
    • tBCET=thitt_{BCET} = t_{hit}
  • otherwise
    • tBCET=tmisst_{BCET} = t_{miss}

Reference to variable s in loop body (max nn iterations):

  • First time miss, then always hit

  • Loop unfolding

  • within loop

    • ntmissn \cdot t_{miss}
    • nthitn \cdot t_{hit}
    • tmiss+(n1)thitt_{miss} + (n - 1) \cdot t_{hit}
    • thit+(n1)tmisst_{hit} + (n - 1) \cdot t_{miss}
Pipelines
  • Different stages in parallel
    • Fetch
    • Decode
    • Execute
  • Pipeline hazards
    • Data hazard: Operands not yet available
    • Resource hazard: Consecutive instructions use same resource
    • Control hazard: Conditional branch
    • Instruction-cache hazard: Instruction fetch causes cache miss
  • CPU as State Machine: Abstract Pipeline for basic blocks

10 Performance Analysis of Distributed Embedded Systems

Embedded System = Computation + Communication + Resource Interaction

  • Analysis ↔ Design

    • Analysis: infer system properties from subsystem properties
    • Design: build system from subsystems while meeting requirements
  • Abstract model

Concrete instance (real world)Abstract representation (model)
Input streamLoad model
ProcessorService model
Task(s) running on processorProcessing model
  • System view
    • Modular Performance Analysis (MPA)
  • Mathematical view
    • Real-Time Calculus (RTC)
    • Min-Plus Calculus, Max-Plus Calculus

Real-Time Calculus

  • Real-Time Calculus can be regarded as a worst-/best-case variant of classical queuing theory
  • It’s a formal method for the analysis of distributed real-time (embedded) systems
Algebraic Structure
  • Set of elements SS

    • One ore more operators defined on elements of this set
  • Algebraic structures with two operators: \boxplus, \boxdot

    • plus-times:
      • (S,,)=(R,+,×)(S, \boxplus, \boxdot) = (R, +, \times)
      • Our ‘normal’ algebra
    • min-plus:
      • (S,,)=(R{+},inf,+)(S, \boxplus, \boxdot) = (R \cup \{ + \infty \}, inf, +)
      • Used for Real-Time calculus
  • Infimum

    • The infimum of a subset (of some set) is the greatest element (not necessary in the subset) that is less than or equal to all other elements of the subset
    • S1Sinf(S1)=maxi{iS:is  sS1}\begin{matrix}S_1 \subseteq S \\ inf(S_1) = \underset{i}{max}\{i \in S : i \leq s\; \forall s \in S_1\}\end{matrix}
    • Examples:
      • inf{[3,4]}=3inf\{[3,4]\}=3 and min{[3,4]}=3min\{[3,4]\}=3
      • inf{(3,4]}=3inf\{(3,4]\}=3 but min{(3,4]}not definedmin\{(3,4]\} \text{not defined}
  • Joint properties of \boxdot:

    • Closure: abSa \boxdot b \in S
    • Associativity: a(bc)=(ab)ca \boxdot (b \boxdot c) = (a \boxdot b) \boxdot c
    • Commutativity ab=baa \boxdot b = b \boxdot a
    • Existence of identity element: v:av=a\exists v: a \boxdot v = a
    • Existence of negative element: a1:aa1=v\exists a^{-1}: a \boxdot a^{-1} = v
    • Identity element of \boxplus absorbing for \boxdot: aϵ=ϵa\boxdot \epsilon = \epsilon
    • Distributivity of \boxdot w.r.t. \boxplus: a(bc)=(ab)(ac)a \boxdot (b \boxplus c) = (a\boxdot b) \boxplus (a \boxdot c)
  • Joint properties of \boxplus:

    • Closure: abSa \boxplus b \in S
    • Associativity: a(bc)=(ab)ca \boxplus (b \boxplus c) = (a \boxplus b) \boxplus c
    • Commutativity: ab=baa \boxplus b = b \boxplus a
    • Existence of identity element: ϵ:aϵ=a\exists \epsilon : a \boxplus \epsilon = a
  • Differences \boxplus:

    • plus-times: Existence of negative element for \boxplus: (a):a(a)=ϵ\exists (-a): a \boxplus (-a) = \epsilon
    • min-plus: Idempotency: aa=aa \boxplus a = a
  • Example:

    • plus-times: a×(b+c)=a×b+a×ca \times (b + c) = a \times b + a \times c
    • min-plus: a+inf{b,c}=inf{a+b,a+c}a + inf\{b,c\} = inf\{a+b,a+c\}
System Theories
  • Plus-times system theory:
    • signals, impulse response, convolution, time-domain

    • f(t)f(t): Signals

    • g(t)g(t): Impulse response

    • (fg)(f*g): Convolution

    • tt: Time domain

Plus-times calculus

h(t)=(fg)convolution(t)time-domain=0tf(ts)g(s)  dsh(t) = \underbrace{(f*g)}_{\text{convolution}}\underbrace{(t)}_{\text{time-domain}}= \int_0^t f(t-s)\cdot g(s) \;ds
  • Min-plus system theory
    • streams, variability curves, time-interval domain, convolution

    • R(t)R(t): Streams

    • g(Δ)g(\Delta): Variability curves

    • Δ\Delta: Time-interval domain

    • (Rg)(R\otimes g): Convolution

Min-plus calculus

R(t)(Rg)convolution(t)=inf0λt{R(tλ)+g(λ)}R'(t)\geq \underbrace{(R\otimes g)}_{\text{convolution}}(t)= \underset{0\leq \lambda\leq t}{inf}\{R(t-\lambda)+g(\lambda) \}
From Streams to Cumulative Functions
  • Event streams → Arrival Curves (tΔt \rightarrow \Delta)
    • Event stream: number of events in t=[0..T]t=[0 .. T]
    • Arrival Curves α=[α,αu]\alpha = [\alpha', \alpha^u]: maximum and minimum arriving events in any interval of length tt
  • Resources → Service Curves (tΔt \rightarrow \Delta)
    • Resource Availability: available service in t=[0..T]t=[0 .. T]
    • Service Curves β=[β,βu]\beta = [\beta', \beta^u]: maximum and minimum available service in any interval of length tt
  • Arrival and Service Curves work on time intervals as domain (Δ\Delta)

Overview of Methods and Algorithms

DomainNotes
Hierarchical ClusteringPartitioning
Integer Linear ProgrammingPartitioning
Simulated AnnealingPartitioning
Trace Based SimulationEstimation
Evolutionary AlgorithmsPartitioningMulti Objective Optimization
Tom Parks AlgorithmKahn Process Networks
Kerninghan LinPartitioning


  • Category

  • Programming

  • Tags

  • Computer Science
    ETH

  • Created

  • 15. November 2015


  • Modified

  • 3. June 2023