SE 477: Software and Systems Project Management
Assignment 4, Due: February 22, 2017
Assignment: Risk Management
- Description – Perform a risk management assessment and develop
a risk mitigation plan for the computing and software tools
infrastructure for the following project.
- Motivation – Use a real-world example to assess risk management.
- Summary – Develop a risk management plan for
the software development infrastructure of a project
(identify risks; estimate risk probability and impact;
identify potential for risk mitigation; identify potential
risk responses).
1. Reading
2. Files
3. Assignment: Risk Management
Problem Statement: Perform a risk management assessment and
develop a risk mitigation plan for the computing and software
tools infrastructure for the following project. We are
interested only in the software development and computing
environment!
Develop a risk management plan for the software development
infrastructure of a project (identify risks; estimate risk
probability and impact; identify potential for risk mitigation;
identify potential risk responses).
Description: This is a major development project with over 100
developers. It is being performed on a cluster of 120 Sun
workstations and Sun servers running UNIX and a distributed file
system over a large LAN. There are 10 servers doing builds and
several others used for file systems, for development tools such
as a CASE tool, a UML tool, compilers and linking tools, and for
a change management system. The cluster also supports a web
server and an email server, and provides the working home
directories for all desktop systems.
Areas of concern:
- Integrity of the cluster
- Integrity of the change management database (holds all
code)
- Integrity of software tools
- Maximum availability of system, tools and code
Context:
- The computer center has been built on an ad hoc basis and is
managed by a team of internal people, not by the normal CIO/IT
group.
- The project is under delivery pressure and people are
already working 10-12 hours a day.
- The builds are being done constantly using the 10 build
servers. [Multiple branches/versions of code.]
- There has been a steady departure of people from the
software tools, system administration and load building
teams. Two-thirds of the software tools team has retired.
- The senior system administrator has just transferred out
of the group because of the turmoil and the interference
with the cluster configurations by the department chief and
technical managers.
- There are now two very junior systems administrators and
four load builders left.
Hint: What could possibly go wrong?
- Power loss
- Computer(s) crash or hardware failures
- Corrupted disks
  - Backups?
- Loss of essential configuration files
  - What are they?
  - Do we have copies?
- Network problems
- Cluster configuration problems
3. Deliverables:
- Layout and format. The layout and format for the
assignment are defined in the Risk Register MS Word document
template.
- Perform a risk assessment on such a complex system and
suggest a mitigation plan. Estimate the probability of each
event occurring and the impact.
- Executive summary. Provide an assessment
of the current project [the computing environment] and areas of
concern. Document the most serious risks. Describe the areas
of most concern based on the information above and the
probable events that might occur. Do a risk audit and
discuss the potential problems. You should add a summary
assessment of the current state of the project vs. the ideal
state and make recommendations.
- Risk Register. Use the Risk Register template to define the
risks for this project. Copy and paste the table in the
template in order to have a risk register entry for each
identified risk. The items in the risk register entry include:
- Risk number. A unique number assigned
to each risk register entry. Use any suitable numeric or
alphanumeric format.
- Risk rating. The risk rating for the
risk entry from the PMBOK Guide Probability
and Impact Matrix.
- Risk owner. The owner for the risk, the
project team member charged with monitoring the risk and
implementing the risk response plan should the risk event
occur. It is not necessary to enter a person’s name—the
owner’s role in the project will suffice.
- Description. A brief description of the
risk.
- Project objectives impacted. Project
objective—cost, time, scope, or quality—impacted by this
risk. If the risk impacts more than one objective, provide
a risk register entry for only the highest-impact
objective.
- Risk probability. The probability, pR, that the risk event will
occur, where 0.0 ≤ pR ≤ 1.0.
- Risk impact. The impact value of the
risk, from the table on the ‘Quantifying Impact’ slide.
- Potential triggers or precursors. List
any identified triggers or precursors for the risk event.
- Potential mitigation. List any ways
that the likelihood of the risk can be reduced or its
impact on the project reduced.
- Potential responses. List any risk
event responses identified. These need not be detailed
risk response plans, but should be a description of what
would be done should the risk event occur.
- Root causes. If it is possible to
identify root causes for the risk, list them here, each
with a brief description.
- Cause-and-effect (fishbone or Ishikawa) Diagram. For your
risk with the highest risk rating (the second item in the risk
register entry above), produce a cause-and-effect diagram.
See Lecture 6, Slides 50-52 for examples. It is
rumored that Visio has a stencil for this diagram. [Visio is
available for download through CDM.] This diagram MUST be
embedded in the risk register document. The last page of the
Risk Register template is a landscape-orientation page and is
reserved for the diagram.
- How many risks should we identify? Six or seven major risks.
- The submission should have your name, date and assignment
title on the front page. Use headers and footers, and have
your name and title on each page.
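
As a rough illustration of how the probability and impact entries
of a risk register can be combined and ranked, the sketch below
multiplies the two values and sorts the entries by the result. The
scoring rule, the risk numbers, and all probability and impact
values are hypothetical placeholders; they are not taken from the
Risk Register template or the 'Quantifying Impact' slide, so
substitute the values from those sources.

```python
from dataclasses import dataclass


@dataclass
class RiskEntry:
    number: str         # unique risk number, any suitable format
    description: str
    probability: float  # pR, must satisfy 0.0 <= pR <= 1.0
    impact: float       # impact value; scale assumed, use the course table

    def __post_init__(self):
        if not 0.0 <= self.probability <= 1.0:
            raise ValueError(f"{self.number}: probability must be in [0, 1]")

    @property
    def score(self) -> float:
        # Assumed scoring rule: probability times impact.
        return self.probability * self.impact


def rank(risks):
    """Return the risks sorted from highest to lowest score."""
    return sorted(risks, key=lambda r: r.score, reverse=True)


# Hypothetical entries, for illustration only.
risks = [
    RiskEntry("R-01", "Backups cannot be restored", 0.5, 0.8),
    RiskEntry("R-02", "Power loss outlasts the UPS", 0.1, 0.4),
    RiskEntry("R-03", "Configuration/license server fails", 0.3, 0.8),
]

for r in rank(risks):
    print(f"{r.number}  score={r.score:.2f}  {r.description}")
```

The entry at the top of the ranked list would be the candidate for
the cause-and-effect diagram.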
4. Submission Requirements:
- All assignments must be submitted electronically through
Desire2Learn (D2L) and are due at 11:59 pm CT on the due date.
- The documents may be in Microsoft Word (.doc) format, or
Adobe PDF.
Comments
It seems as if many students have no real-life experience with
software development. So ...
Typically a software development effort has one machine per
developer. In Windows environments, there is at least one server
holding the applications used in development. There may well be
additional machines providing change management (e.g. Perforce)
and test machines. All are connected on a local area network
(LAN). The files on the server(s) are shared on the developers'
machines using a Windows share capability.
Our system is similar except everything was made by Sun
Microsystems and ran a version of UNIX. The cluster of
developers' machines is much larger (100), there are more
servers (10), and there are dedicated machines to perform
builds (8). A build machine is much the same as a developer
machine, except that it is faster and has more memory and more
CPUs. Builds take 4-6 hours each to compile, link and process
the code. With several parallel development efforts going on, it
is typical to have as many as eight builds running (one per
machine).
Each machine mounts the various file shares from the servers using
NFS (a Sun product). In fact the developer machines are all
identical. The users' files are also held on the servers and are
remotely mounted on the desktop machines.
One machine (xxxx1) holds the cluster's configuration files (Sun's
NIS) including the DNS information. It also runs the License
Manager with its magic files. [Much of the software is COTS
proprietary and comes with a license for N simultaneous users. If
there is no license available you cannot run the software. This
included the compilers, the UML modeling tool as well as the
documentation tool (FrameMaker was used instead of Word).]
One machine (xxxx2) holds the application software for developers
to use: compilers, modeling tools, debuggers, IDE, etc.
One machine (xxxx3) was the mail server and mailing list manager.
One machine (xxxx4) held the home (working) directories of all the
developers.
One machine (xxxx5) held the web server.
One machine (xxxx7) held the source configuration management
system (ClearCase).
Each machine (above) exports several directory structures to be
shared. All of the above machines mount (share) the exported
directories.
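
The server roles listed above can be written down as a small
table, which makes single points of failure easier to spot. This is
only a sketch: the host names are the placeholders used in these
comments, the `lost_if_down` helper is hypothetical, and it assumes
that no service has a standby copy on another machine.

```python
# Roles per server, transcribed from the description above.
SERVER_ROLES = {
    "xxxx1": ["NIS cluster configuration", "DNS information",
              "License Manager"],
    "xxxx2": ["development applications (compilers, modeling tools, "
              "debuggers, IDE)"],
    "xxxx3": ["mail server", "mailing list manager"],
    "xxxx4": ["developer home directories"],
    "xxxx5": ["web server"],
    "xxxx7": ["source configuration management (ClearCase)"],
}


def lost_if_down(host):
    """Return the services that become unavailable if `host` fails.

    Assumption: each service runs on exactly one machine, with no
    standby elsewhere.
    """
    return SERVER_ROLES.get(host, [])


for host in sorted(SERVER_ROLES):
    print(f"{host}: {', '.join(lost_if_down(host))}")
```

Because licensed tools such as the compilers cannot run without the
License Manager, a failure of xxxx1 would also stop work that
depends on the applications held on xxxx2.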
The build machines were configured slightly differently. Due to
the performance penalty in using NFS, the compiler files were
cloned to a local directory on each build machine. This was a
special step that needed to be done every time the vendor sent
updates. I am not sure the person who took over my duties really
understood this detail. [Needless to say, there was no step-by-step
procedure provided.]
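
In the absence of a written procedure, a check like the one below
could at least detect when a build machine's local compiler copy
has drifted from the copy on the application server. It is a sketch
under stated assumptions: the function names are hypothetical, the
two directories are supplied by the caller, and it compares file
contents by SHA-256 digest rather than reproducing whatever the
original cloning step did.

```python
import hashlib
from pathlib import Path


def tree_digests(root):
    """Map each file path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            rel = path.relative_to(root).as_posix()
            # Reads each file fully into memory; fine for a sketch.
            digests[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def compare_trees(master, clone):
    """Return (missing, extra, changed) paths of clone versus master."""
    m = tree_digests(master)
    c = tree_digests(clone)
    missing = sorted(set(m) - set(c))   # in master, not in clone
    extra = sorted(set(c) - set(m))     # in clone, not in master
    changed = sorted(p for p in set(m) & set(c) if m[p] != c[p])
    return missing, extra, changed
```

Calling `compare_trees` with the NFS-mounted compiler directory as
`master` and the local directory as `clone` after each vendor update
would return three empty lists when the clone is current.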
Each machine xxxx[1-7] had daily backups using a DLT tape machine
(sort of a cassette jukebox). The source was put on a RAID system.
All machines were kept in a single, locked, air-conditioned
room. There was a dual UPS with an hour of support time.
To my knowledge, the project did not have a formal risk management
program. The Department Head was an old-school developer and did
not really believe in such things (despite having an MS).
The software development environment was put together by an
office-mate of mine. This office-mate (VM) learned as he put it
together. There was no public document on the configuration of the
cluster. When he left, a new administrator (PW) came in from
another group. He had run that group's cluster, but it was not
quite as big. The procedures the system administrators performed
were those they had done on previous systems they had worked on.
There was no formal risk plan (to my knowledge). Hardware/software
support was by contract with Sun and the various software vendors.
The department had a pressure atmosphere. There was constant
pressure to work overtime and meet deliveries. Engineers left if
they could find another opening.
The senior system administrator left after the technical managers
and the department head started micro managing the cluster and
reorganizing it without his knowledge or input.
Does the system also include a
backup system?
There was a DLT jukebox for each
server. The ClearCase machine (SCM) had a RAID system as well.
There was some talk about using a journaling file system but
management did not want to spend the money. There was one spare
(test) machine.
Servers were backed up weekly with daily incrementals to tape
using CRON jobs. The jukebox changed tapes as needed. To my
knowledge there was no attempt to verify the backups were
successful (i.e. one could restore from them). There was no
storage of backups off-site. It is not clear if people checked
to see that the backups occurred as scheduled and completed
successfully.
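
To see why unverified tapes matter under a weekly-full,
daily-incremental scheme, the sketch below lists which backups a
restore would need for a given day. It assumes that the full backup
runs on Sunday and that each incremental holds only the changes
since the previous backup of any kind; neither detail is stated
above, and the `restore_chain` name is hypothetical.

```python
from datetime import date, timedelta


def restore_chain(target, full_weekday=6):
    """List the backups needed to restore the state as of `target`.

    Assumes one full backup per week on `full_weekday`
    (Monday=0 ... Sunday=6) and an incremental on every other day.
    """
    days_since_full = (target.weekday() - full_weekday) % 7
    full_day = target - timedelta(days=days_since_full)
    chain = [("full", full_day)]
    for offset in range(1, days_since_full + 1):
        chain.append(("incremental", full_day + timedelta(days=offset)))
    return chain


for kind, day in restore_chain(date(2017, 2, 22)):
    print(f"{day.isoformat()}  {kind}")
```

Every tape in the chain has to be readable, so a single bad
incremental can defeat a restore even when the weekly full backup is
good.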
Is there any documentation (or logs) about the backup available?
Might have been a log, not sure.
How many people were actually involved in developing the system
successfully in the better days of the company (good old days)?
The original team that built the
cluster consisted of VM and two others. Those eventually left
and then VM was forced out later and PW came in to take over.
The crew normally numbered three to four, but one of those also
supported the Managers' Windows boxes.
What do you think is the minimum number
of people it takes to develop and implement the new system?
For that size of a cluster a
minimum of three with at least one fully qualified (Sun
Certified Administrator). Additional people to support the
Windows boxes and the testing labs (not using Suns). At the time
of the disaster there were two, both very inexperienced (one
admitted it to me).
Is there a project plan
available for the new system development?
No, the department head did not
believe in that. They had a schedule for the project and that
was about it. There were formal procedures for the software
development, but no formal plan. Note that this project covered
five locations with five department heads; our department head
(JC) was in charge but spent much time clashing with the other
department heads on the project. As for the computer cluster, it
was managed more as a side effort, which happened because no one
wanted to turn control over to the corporate CIO/IT people. At
the time under discussion the computing environment was under a
novice tech manager (PI) who had previously worked in System
Integration. He (PI) had about five months' experience as Tech
Manager at the time.
What security measures are built
into the system?
General UNIX/Sun security. Inside
the corporate firewall. Inside a locked computer center. Decent
password security.
Who is in charge
of the change management system? Is it the system administrator?
Both the system admin and the
chief load builder. The system admin does the backup and general
support. The load builder manages the change management system
for the source code. There are other areas under change
management: the entire set of software development tools
(compilers, etc.) and the project documentation, as well as a
change request system. Each of those has someone who is supposed
to keep it in good order and up to date.
How many risks should we identify?
Six or seven major risks.
Builds are running constantly? What does this mean? Are people
constantly kicking off builds? Is there a nightly build process?
Does the build process also bring up the system they are
building? Or do the developers themselves bring up their
development environment?
The build team starts the builds
manually, usually two or three times a day on each of the
eight build servers. In addition, developers can perform builds
on their private workstations. The builds result in a binary
image that is "walked" over to the testing lab, loaded onto
flash memory and inserted into a test board for test. Remember
these are 64MB embedded systems.
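
The build figures quoted in these comments (4-6 hours per build,
two or three builds a day, eight build servers) give a rough idea
of how little slack the build servers have. The sketch below works
through that arithmetic. The last step assumes that the builds of a
failed server would be spread evenly over the remaining seven, which
the comments do not say.

```python
def daily_build_hours(builds_per_day, hours_per_build):
    """Hours per day that one build server spends building."""
    return builds_per_day * hours_per_build


BUILD_SERVERS = 8
low = daily_build_hours(2, 4)    # two 4-hour builds per day
high = daily_build_hours(3, 6)   # three 6-hour builds per day

print(f"Per server: {low} to {high} busy hours out of 24")
print(f"Cluster: {low * BUILD_SERVERS} to {high * BUILD_SERVERS} "
      f"of {24 * BUILD_SERVERS} build-hours per day")

# Assumed rescheduling: one server lost, load shared by the rest.
per_server_after_loss = high * BUILD_SERVERS / (BUILD_SERVERS - 1)
print(f"After losing one server: {per_server_after_loss:.1f} hours each")
```

At the high end each server is busy 18 of 24 hours, and under the
stated assumption the loss of one server leaves the other seven with
very little idle time.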