SE 477: Software and Systems Project Management: Assignment 4

SE 477: Software and Systems Project Management
Assignment 4, Due: February 22, 2017

Assignment: Risk Management

Description – Perform a risk management assessment and develop a risk mitigation plan for the computing and software tools infrastructure for the following project.

Motivation – Use a real-world example to practice risk management assessment.

Summary – Develop a risk management plan for the software development infrastructure of a project (identify risks; estimate risk probability and impact; identify potential for risk mitigation; identify potential risk responses).

1. Reading

Notes: Lecture 6

The following articles listed in reading list:

PMBOK-SWE Ch. 11 Intro & Ch. 11.1-11.6, PMBOK Ch. 11

PMP Study Guide: Chapter 6

Kerzner: Chapter 17

Taylor: Chapter 7

Taylor (Survival Guide): Chapter 13

Give three reason why backups are important? <http://wiki.answers.com/Q/Give_three_reason_why_backups_are_important>

2. Files

Risk Register MS Word document template.

3. Assignment: Risk Management

Problem Statement: Perform a risk management assessment and develop a risk mitigation plan for the computing and software tools infrastructure for the following project. We are interested only in the software development and computing environment! Develop a risk management plan for the software development infrastructure of a project (identify risks; estimate risk probability and impact; identify potential for risk mitigation; identify potential risk responses).

Description: This is a major development project with over 100 developers. It is being performed on a cluster of 120 Sun Workstations and Sun Servers running UNIX and a distributed file system over a large LAN. There are 10 servers doing builds and several used for file systems, for development tools such as a CASE tool, a UML tool, compilers and linking tools, and a change management system. The cluster also hosts a web server and an email server, and provides the working home directories for all desktop systems.

Areas of concern:

Integrity of the cluster

Integrity of the change management database (holds all code)

Integrity of software tools

Maximum availability of system, tools and code

Context:

The computer center has been built on an ad hoc basis and is managed by a team of internal people, not by the normal CIO/IT group.

The project is under delivery pressure and people are already working 10-12 hours a day.

The builds are being done constantly using the 10 build servers. [Multiple branches/versions of code.]

There has been a steady departure of people from the software tools, system administration and load building teams. Two-thirds of the software tools team has retired.

The senior system administrator has just transferred out of the group because of the turmoil and the interference with the cluster configurations by the department chief and technical managers.

There are now two very junior systems administrators and four load builders left.

Hint: what could possibly go wrong?

Power loss

Computer(s) crash or hardware failures

Corrupted disks

Backups?

Loss of essential configuration files

What are they?

Do we have copies?

Network problems

Cluster configuration problems

3. Deliverables:

Layout and format. The layout and format for the assignment are defined in the Risk Register MS Word document template.

Perform a risk assessment on such a complex system and suggest a mitigation plan. Estimate the probability of each event occurring and the impact.

Executive summary. Provide an assessment of the current project [the computing environment] and areas of concern. Document the most serious risks. Describe the areas of most concern based on the information above and the probable events that might occur. Do a risk audit and discuss the potential problems. You should add a summary assessment of the current state of the project vs. the ideal state and make recommendations.

Risk Register. Use the Risk Register template to define the risks for this project. Copy and paste the table in the template in order to have a risk register entry for each identified risk. The items in the risk register entry include:

Risk number. A unique number assigned to each risk register entry. Use any suitable numeric or alphanumeric format.

Risk rating. The risk rating for the risk entry from the PMBOK Guide Probability and Impact Matrix.

Risk owner. The owner for the risk, the project team member charged with monitoring the risk and implementing the risk response plan should the risk event occur. It is not necessary to enter a person’s name—the owner’s role in the project will suffice.

Description. A brief description of the risk.

Project objectives impacted. Project objective—cost, time, scope, or quality—impacted by this risk. If the risk impacts more than one objective, provide a risk register entry for only the highest-impact objective.

Risk probability. The probability, p_R, that the risk event will occur, where 0.0 ≤ p_R ≤ 1.0.

Risk impact. The impact value of the risk, from the table on the ‘Quantifying Impact’ slide.

Potential triggers or precursors. List any identified triggers or precursors for the risk event.

Potential mitigation. List any ways that the likelihood of the risk can be reduced or its impact on the project reduced.

Potential responses. List any risk event responses identified. These need not be detailed risk response plans, but should be a description of what would be done should the risk event occur.

Root causes. If it is possible to identify root causes for the risk, list them here, each with a brief description.
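One way to keep the register entries consistent is to compute each rating from its probability and impact. The sketch below multiplies p_R by the impact value and maps the product to a rating. The class name, the sample entries and the two thresholds are illustrative assumptions, not values from the course material; substitute the impact values from the 'Quantifying Impact' slide and the bands from the Probability and Impact Matrix you are using.

```python
from dataclasses import dataclass

# Illustrative thresholds only (assumption): replace them with the bands
# from the Probability and Impact Matrix used in the course.
HIGH_THRESHOLD = 0.18
MODERATE_THRESHOLD = 0.06


@dataclass
class RiskEntry:
    """A cut-down risk register entry (hypothetical structure)."""
    number: str         # unique risk number, e.g. "R-01"
    description: str    # brief description of the risk
    probability: float  # p_R, with 0.0 <= p_R <= 1.0
    impact: float       # impact value taken from the impact table

    def score(self) -> float:
        """Return probability times impact."""
        if not 0.0 <= self.probability <= 1.0:
            raise ValueError("probability must be between 0.0 and 1.0")
        return self.probability * self.impact

    def rating(self) -> str:
        """Map the score to a rating using the thresholds above."""
        s = self.score()
        if s >= HIGH_THRESHOLD:
            return "high"
        if s >= MODERATE_THRESHOLD:
            return "moderate"
        return "low"


def rank(risks):
    """Return the risks sorted from highest to lowest score."""
    return sorted(risks, key=lambda r: r.score(), reverse=True)


if __name__ == "__main__":
    # Sample probabilities and impacts are made up for illustration.
    register = [
        RiskEntry("R-01", "Backups cannot be restored", 0.5, 0.8),
        RiskEntry("R-02", "Power loss outlasts the UPS", 0.1, 0.4),
        RiskEntry("R-03", "Build server hardware failure", 0.3, 0.4),
    ]
    for risk in rank(register):
        print(f"{risk.number}  {risk.rating():8}  {risk.description}")
```

The first entry returned by rank() is the risk with the highest score, which makes it a natural candidate for the cause-and-effect diagram.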

Cause-and-effect (fishbone or Ishikawa) Diagram. For your risk with the highest risk rating (the second item in the risk register entry above), produce a cause-and-effect diagram. See Lecture 6, Slides 50-52 for examples. It is rumored that VISIO has a stencil for this diagram. [VISIO is available for download through CDM.] This diagram MUST be embedded in the risk register document. The last page of the Risk Register template is a landscape orientation page and is reserved for the diagram.

How many risks should we identify? Six or seven major risks.

The submission should have your name, date and assignment title on the front page. Use headers and footers, and have your name and title on each page.

4. Submission Requirements:

All assignments must be submitted electronically through Desire2Learn (D2L) and are due at 11:59 pm CT on the due date.

The documents may be in Microsoft Word (.doc) format, or Adobe PDF.

Comments

It seems as if many students have no real life experience with software development. So ...

Typically a software development effort has one machine per developer. In Windows environments, there is at least one server holding the applications used in development. There may well be additional machines providing change management (e.g. Perforce) and test machines. All are connected on a local area network (LAN). The files on the server(s) are shared on the developers' machines using a Windows share capability.

Our system is similar, except that everything was made by Sun Microsystems and ran a version of UNIX. The cluster of developers' machines is much larger (100), there are more servers (10), and there are dedicated machines to perform builds (8). A build machine is much the same as a developer machine, except that it is faster and has more memory and more CPUs. Builds take 4-6 hours each to compile, link and process the code. With several parallel development efforts going on, it is typical to have as many as eight builds running (one per machine).

Each machine mounts the various file shares from the servers using NFS (a Sun product). In fact the developer machines are all identical. The users' files are also held on the servers and remotely mounted on the desktop machines.

One machine (xxxx1) holds the cluster's configuration files (Sun's NIS) including the DNS information. It also runs the License Manager with its magic files. [Much of the software is COTS proprietary and comes with a license for N simultaneous users. If there is no license available you cannot run the software. This included the compilers, the UML modeling tool as well as the documentation tool (FrameMaker was used instead of Word).]

One machine (xxxx2) holds the application software for developers to use: compilers, modeling tools, debuggers, IDE, etc.
One machine (xxxx3) was the mail server and mailing list manager.
One machine (xxxx4) held the home (working) directories of all the developers.
One machine (xxxx5) held the web server.
One machine (xxxx7) held the source configuration management system (ClearCase).

Each machine (above) exports several directory structures to be shared. All of the above machines mount (share) the exported directories.

The build machines were configured slightly differently. Due to the performance penalty in using NFS, the compiler files were cloned to a local directory on each build machine. This was a special step that needed to be done every time the vendor sent updates. I am not sure the person who took over my duties really understood this detail. [Needless to say, there was no step-by-step procedure provided.]

Each machine xxxx[1-7] had daily backups using a DLT tape machine (sort of a cassette Jukebox). The source was put on a RAID system.

All machines were kept in a single, locked, air-conditioned room. There was a dual-UPS with an hour support time.

To my knowledge, the project did not have a formal risk management program. The Department Head was an old-school developer and did not really believe in such things (despite having an MS).

The software development environment was put together by an office-mate of mine. This office-mate (VM) learned as he put it together. There was no public document on the configuration of the cluster. When he left, a new administrator (PW) came in from another group. He had run that group's cluster, but it was not quite as big. The procedures the system administrators performed were those they had done on previous systems they had worked on. There was no formal risk plan (to my knowledge). Hardware/software support was by contract with Sun and the various software vendors.

The department had a high-pressure atmosphere. There was constant pressure to work overtime and meet deliveries. Engineers left if they could find another opening.

The senior system administrator left after the technical managers and the department head started micromanaging the cluster and reorganizing it without his knowledge or input.

Does the system also include a backup system?

There was a DLT jukebox for each server. The ClearCase machine (SCM) had a RAID system as well. There was some talk about using a journaling file system but management did not want to spend the money. There was one spare (test) machine.

Servers were backed up weekly with daily incrementals to tape using CRON jobs. The jukebox changed tapes as needed. To my knowledge there was no attempt to verify the backups were successful (i.e. one could restore from them). There was no storage of backups off-site. It is not clear if people checked to see that the backups occurred as scheduled and completed successfully.
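To make the missing verification step concrete, here is a minimal sketch of a restore check: restore a backup to a scratch directory, then compare checksums against the live files. The function names and the directory layout are hypothetical. It also assumes the live files have not changed since the backup was taken, which would not hold for active home directories without a snapshot. It is not a description of what the project actually ran.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(source_dir: Path, restored_dir: Path) -> list[str]:
    """Return a list of problems found in the restored copy.

    Each source file that is missing from the restored tree, or whose
    contents differ, produces one entry. An empty list means every
    source file was restored intact.
    """
    problems = []
    for src in sorted(source_dir.rglob("*")):
        if not src.is_file():
            continue  # skip directories
        rel = src.relative_to(source_dir)
        restored = restored_dir / rel
        if not restored.is_file():
            problems.append(f"missing: {rel.as_posix()}")
        elif sha256_of(src) != sha256_of(restored):
            problems.append(f"differs: {rel.as_posix()}")
    return problems
```

A scheduled job could call verify_restore() on a sample of directories after each backup and alert someone whenever the returned list is not empty.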

Is there any documentation (or logs) about the backup available?

Might have been a log, not sure.

How many people were actually involved in developing the system successfully in the better days of the company (good old days)?

The original team that built the cluster consisted of VM and two others. The other two eventually left, VM was later forced out, and PW came in to take over. The crew normally numbered three to four, but one of those also supported the Managers' Windows boxes.

What do you think is the minimum number of people it takes to develop and implement the new system?

For a cluster of that size, a minimum of three, with at least one fully qualified (Sun Certified Administrator). Additional people would be needed to support the Windows boxes and the testing labs (not using Suns). At the time of the disaster there were two, both very inexperienced (one admitted it to me).

Is there a project plan available for the new system development?

No, the department head did not believe in that. They had a schedule for the project and that was about it. There were formal procedures for the software development, but no formal plan. Note that this project covered five locations with five department heads; our department head (JC) was in charge but spent much time clashing with the other department heads on the project. As for the computer cluster, it was managed more as a side effort that happened because no one wanted to turn control over to the corporate CIO/IT people. At the time under discussion, the computing environment was under a novice tech manager (PI) who had previously worked in System Integration. He (PI) had about five months' experience as Tech Manager at the time.

What security measures are built into the system?

General UNIX/Sun security. Inside the corporate firewall. Inside a locked computer center. Decent password security.

Who is in charge of the change management system? Is it the system administrator?

Both the system admin and the chief load builder. The system admin does the backup and general support. The load builder manages the change management system for the source code. There are other areas under change management: the entire set of software development tools (compilers, etc.) and the project documentation, as well as a change request system. Each of those has someone who is supposed to keep it in good order and up to date.

How many risks should we identify?

Six or seven major risks.

Builds are running constantly? What does this mean? Are people constantly kicking off builds? Is there a nightly build process? Does the build process also bring up the system they are building? Or do the developers themselves bring up their development environment?

The build team starts the builds manually, usually two or three times a day on each of the eight build servers. In addition, developers can perform builds on their private workstations. The builds result in a binary image that is "walked" over to the testing lab, loaded onto flash memory and inserted into a test board for test. Remember that these are 64MB embedded systems.
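To put a number on "constantly", the figures given in these comments (two or three builds per day per server, 4-6 hours per build) can be turned into hours of build time per server per day. This is a back-of-the-envelope sketch that assumes builds on one server run one after another and ignores queueing and failed builds. The function names are hypothetical.

```python
HOURS_PER_DAY = 24


def busy_hours(builds_per_day: int, hours_per_build: float) -> float:
    """Hours per day one build server spends building."""
    return builds_per_day * hours_per_build


def utilization(builds_per_day: int, hours_per_build: float) -> float:
    """Fraction of the day one build server is busy."""
    return busy_hours(builds_per_day, hours_per_build) / HOURS_PER_DAY


if __name__ == "__main__":
    # Lightest case: 2 builds of 4 hours. Heaviest case: 3 builds of 6 hours.
    for builds, hours in [(2, 4), (3, 6)]:
        print(
            f"{builds} builds x {hours} h = {busy_hours(builds, hours)} h/day "
            f"({utilization(builds, hours):.0%} of the day)"
        )
```

Under these assumptions each build server is busy for roughly 8 to 18 hours a day, so an outage of even one server leaves little slack for rescheduling builds.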
