Checkpoints: Process View
Resource Utilization
- Can there exist "race conditions" between processes? Do processes compete for
critical resources? What happens if they cannot get them?
- What happens when I/O queues or buffers are full?
- Does the system monitor itself (capacity threshold, critical performance threshold,
resource exhaustion)? What actions does it take? (A minimal sketch of such self-monitoring
follows this list.)
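The sketch below is one way the last two questions might be answered in code, assuming a
hypothetical bounded work queue with an 80% capacity alarm; the names CAPACITY, HIGH_WATER,
raise_alarm, and submit are illustrative, not taken from any particular system.

```python
import queue

CAPACITY = 1000                      # assumed buffer size
HIGH_WATER = int(0.8 * CAPACITY)     # capacity threshold for self-monitoring

work_queue = queue.Queue(maxsize=CAPACITY)

def raise_alarm(message: str) -> None:
    # Stand-in for the system's real alarm mechanism.
    print(f"ALARM: {message}")

def submit(item) -> bool:
    """Enqueue an item, with explicit behavior for the full-buffer case."""
    if work_queue.qsize() >= HIGH_WATER:
        raise_alarm("work queue above 80% capacity")    # self-monitoring action
    try:
        work_queue.put_nowait(item)                     # never block the producer
        return True
    except queue.Full:
        raise_alarm("work queue full; rejecting item")  # defined full-buffer behavior
        return False
```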
Performance
- What are the response time requirements for each message? Is there a diagnostic mode for
the system that allows message response times to be measured (see the timing sketch after
this list)?
- Have you specified the nominal and maximal performance thresholds? Are there test sets
that represent them? How will testing assess that the requirements have been met?
- Is there a performance model (back of the envelope, queuing model, discrete-event
simulation) to predict whether the performance requirements will be met (see the
queuing-model sketch after this list)? Is the model fed with realistic or measured data?
- Are the tests and the performance model taking care of only the steady state mode, or do
they also take into account startup and major failures?
- Where are the performance bottlenecks (and there always are some)? Every system has points at
which performance will drop precipitously if any more workload is added; it's better to
know where these are in advance. Clues to look for include:
- Use of some finite shared resource such as (but not limited to) semaphores, file
handles, locks, latches, shared memory, etc.
- Excessive inter-process communication. Communication across process boundaries is always
more expensive than in-process communication.
- Excessive inter-processor communication. Communication across processor boundaries is
always more expensive than inter-process communication on the same processor.
- The point at which the system runs out of physical memory and starts using virtual
memory is a point at which performance usually drops precipitously. Avoid using virtual
memory if at all possible.
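One way to provide the diagnostic mode asked about above is a lightweight timing wrapper
around each message handler. This is only a sketch, assuming handlers are plain Python
callables and that a diagnostics flag can be toggled at run time; DIAGNOSTICS_ON,
timed_handler, and process_order are hypothetical names.

```python
import time
from functools import wraps

DIAGNOSTICS_ON = False  # toggled by an operator command in a real system

def timed_handler(name):
    """Decorator that records a handler's response time when diagnostics are on."""
    def decorate(handler):
        @wraps(handler)
        def wrapper(*args, **kwargs):
            if not DIAGNOSTICS_ON:
                return handler(*args, **kwargs)
            start = time.perf_counter()
            try:
                return handler(*args, **kwargs)
            finally:
                elapsed_ms = (time.perf_counter() - start) * 1000.0
                print(f"{name}: {elapsed_ms:.2f} ms")   # or feed a histogram
        return wrapper
    return decorate

@timed_handler("process_order")
def process_order(message):
    ...  # real work here
```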
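For the back-of-the-envelope queuing model mentioned above, a single-server M/M/1
approximation is often enough to sanity-check a response-time requirement. The arrival and
service rates below are made-up figures, not measurements.

```python
def mm1_response_time(arrival_rate, service_rate):
    """Mean response time (waiting + service) of an M/M/1 queue, in the same
    time unit as the rates. Only valid when utilization is below 1."""
    utilization = arrival_rate / service_rate
    if utilization >= 1.0:
        raise ValueError("queue is unstable: arrival rate >= service rate")
    return 1.0 / (service_rate - arrival_rate)

# Example: 80 messages/s arriving at a server that can handle 100 messages/s
# gives 1 / (100 - 80) = 0.05 s = 50 ms mean response time, at 80% utilization.
print(mm1_response_time(80, 100))   # 0.05
```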
Fault Tolerance
- If you have a redundant system, with both primary and backup processes, can two
or more processes "think" that they are primary? What happens then? How is the
situation resolved? Can there be a point in time at which no process is primary?
- Are there external processes or programs that can clean up when things are left in an
inconsistent state?
- Is the system tolerant of errors and exceptions? When an error or exception occurs, can
the system revert to a consistent state?
- Can you run diagnostic routines on a running system if necessary?
- Can the system be upgraded while running? Does it need to be?
- Where do alarms go? Is there a single alarm mechanism? Can you "tune" it to
prevent false or redundant alarms? Can the users determine which alarms they want to
monitor?
- Can some tracing facility be turned on or off to help with troubleshooting (see the
tracing sketch after this list)? What is the added overhead? Does the facility require
special tools or training?
- How much "head room" (free memory/free CPU cycles) is allowed in the CPU
utilization? How is it assessed?
- Are the load or performance requirements reasonable? (e.g. can a user really enter X bytes
per minute? Does the user really need to see the result in less than Y
milliseconds?)
- Are there memory budgets? How do you detect or prevent memory leaks (see the
memory-tracking sketch after this list)? How do you use the virtual memory system? Monitor
it? Tune it?
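A tracing facility that can be switched on and off at run time can be as simple as a logger
whose level is changed dynamically. This sketch uses the standard logging module; the
set_tracing and handle_request names are hypothetical.

```python
import logging

trace_log = logging.getLogger("trace")
trace_log.addHandler(logging.StreamHandler())
trace_log.setLevel(logging.WARNING)          # tracing off by default

def set_tracing(enabled: bool) -> None:
    """Turn detailed tracing on or off without restarting the process."""
    trace_log.setLevel(logging.DEBUG if enabled else logging.WARNING)

def handle_request(request_id: str) -> None:
    # isEnabledFor avoids the cost of formatting when tracing is off,
    # which keeps the added overhead small.
    if trace_log.isEnabledFor(logging.DEBUG):
        trace_log.debug("handling request %s", request_id)
    ...  # real work here
```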
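For the memory-budget question, the standard tracemalloc module can compare heap snapshots
taken before and after a workload and point at the allocation sites that grew. This is a
sketch; run_workload stands in for whatever the system actually does.

```python
import tracemalloc

def run_workload():
    ...  # stand-in for the system's real processing loop

tracemalloc.start()
before = tracemalloc.take_snapshot()
run_workload()
after = tracemalloc.take_snapshot()

# Allocation sites whose retained memory grew the most; a site that keeps
# growing across repeated runs is a likely leak.
for stat in after.compare_to(before, "lineno")[:10]:
    print(stat)
```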
Modularity
- Are the processes sufficiently independent of one another that they can be easily
distributed across processors or nodes? Do the throughput or response time requirements
virtually dictate that certain processes remain co-located? Does the inter-process
communication mechanism (e.g. semaphores or shared memory) virtually require the processes
to be co-located?
- Can certain messages be made asynchronous, so that they can be processed when resources
are more readily available (see the sketch after this list)?
- Can the system be scaled up by adding processes and nodes?
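Making a message asynchronous can be as simple as queueing it for a background worker
instead of handling it in the caller's thread. A minimal sketch, assuming the work really
can be deferred; the handle function and the generate_report message are hypothetical.

```python
import queue
import threading

deferred = queue.Queue()

def handle(message):
    ...  # real processing here

def worker():
    """Drain deferred messages when the process has spare cycles."""
    while True:
        message = deferred.get()
        handle(message)
        deferred.task_done()

threading.Thread(target=worker, daemon=True).start()

# The caller returns immediately; the report is produced later.
deferred.put({"type": "generate_report", "period": "daily"})
```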