DePaul CDM

SE 477: Software and Systems Project Management
Assignment 4,  Due: February 22, 2017

Assignment: Risk Management

1. Reading

2. Files

3. Assignment: Risk Management

  • Problem Statement: Perform a risk management assessment and develop a risk mitigation plan for the computing and software tools infrastructure of the following project. We are interested only in the software development and computing environment! Develop a risk management plan for the software development infrastructure of a project (identify risks; estimate risk probability and impact; identify potential for risk mitigation; identify potential risk responses).

  • Description: This is a major development project with over 100 developers. It is being performed on a cluster of 120 Sun Workstations and Sun Servers running UNIX and a distributed file system over a large LAN. There are 10 servers doing builds and several used for file systems, for development tools such as a CASE tool, a UML tool, compilers and linking tools, and a change management system. The cluster also supports a web server and an email server, and provides the working home directories for all desktop systems.

  • Areas of concern:
  • Context:
  • Hint: what could possibly go wrong?
  • 3. Deliverables:

    4. Submission Requirements:


     Comments

    It seems as if many students have no real life experience with software development. So ...

    Typically a software development effort has one machine per developer. In Windows environments, there is at least one server holding the applications used in development. There may well be additional machines providing change management (e.g. Perforce) and test machines. All are connected on a local area network (LAN). The files on the server(s) are shared on the developers' machines using a Windows share capability.

    Our system is similar, except that everything was made by Sun Microsystems and ran a version of UNIX. The cluster of developers' machines is much larger (100), there are more servers (10), and there are dedicated machines to perform builds (8). A build machine is much the same as a developer machine, except that it is faster and has more memory and more CPUs. Builds take 4-6 hours each to compile, link and process the code. With several parallel development efforts going on, it is typical to have as many as eight builds running (one per machine).

    Each machine mounts the various file shares from the servers using NFS (a Sun product). In fact the developer machines are all identical. The users' files are also held on the servers and remotely mounted on the desktop machines.

    One machine (xxxx1) holds the cluster's configuration files (Sun's NIS) including the DNS information. It also runs the License Manager with its magic files. [Much of the software is COTS proprietary and comes with a license for N simultaneous users. If there is no license available you cannot run the software. This included the compilers, the UML modeling tool as well as the documentation tool (FrameMaker was used instead of Word).]

    One machine (xxxx2) holds the application software for developers to use: compilers, modeling tools, debuggers, IDE, etc.
    One machine (xxxx3) was the mail server and mailing list manager.
    One machine (xxxx4) held the home (working) directories of all the developers.
    One machine (xxxx5) held the web server.
    One machine (xxxx7) held the source configuration management system (ClearCase).

    Each machine (above) exports several directory structures to be shared. All of the above machines mount (share) the exported directories.
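
The server roles listed above can be written down as a small table to see what is lost when one host fails. The sketch below is only an illustration: the host names are the placeholders used in the text, and the function names (`services_lost`, `single_points_of_failure`) are hypothetical helpers, not part of any tool used on the project.

```python
# Host roles as described in the comments above.
# Host names are the placeholders from the text, not real machines.
HOST_ROLES = {
    "xxxx1": ["NIS configuration", "DNS information", "license manager"],
    "xxxx2": ["development tools (compilers, modeling tools, debuggers, IDE)"],
    "xxxx3": ["mail server", "mailing list manager"],
    "xxxx4": ["developer home directories"],
    "xxxx5": ["web server"],
    "xxxx7": ["source configuration management (ClearCase)"],
}


def services_lost(failed_host: str) -> list[str]:
    """Return the services that become unavailable if one host fails."""
    return HOST_ROLES.get(failed_host, [])


def single_points_of_failure() -> dict[str, str]:
    """Map each service to its host when exactly one host provides it."""
    providers: dict[str, list[str]] = {}
    for host, roles in HOST_ROLES.items():
        for role in roles:
            providers.setdefault(role, []).append(host)
    return {role: hosts[0] for role, hosts in providers.items() if len(hosts) == 1}


if __name__ == "__main__":
    for service, host in single_points_of_failure().items():
        print(f"{service}: only on {host}")
```

Because every role in the table appears on exactly one host, each service shows up as depending on a single machine, which is a useful starting point when asking what could go wrong.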

    The build machines were configured slightly differently. Due to the performance penalty in using NFS, the compiler files were cloned to a local directory on each build machine. This was a special step that needed to be done every time the vendor sent updates. I am not sure the person who took over my duties really understood this detail. [Needless to say there was no step by step procedure provided.]

    Each machine xxxx[1-7] had daily backups using a DLT tape machine (sort of a cassette Jukebox). The source was put on a RAID system.

    All machines were kept in a single, locked, air-conditioned room. There was a dual UPS with one hour of support time.

    To my knowledge, the project did not have a formal risk management program. The Department Head was an old-school developer and did not really believe in such things (despite having an MS).

    The software development environment was put together by an office-mate of mine. This office-mate (VM) learned as he put it together. There was no public document on the configuration of the cluster. When he left, a new administrator (PW) came in from another group. He had run that group's cluster, but it was not quite as big. The procedures the system administrators performed were those they had done on previous systems they had worked on. There was no formal risk plan (to my knowledge). Hardware/software support was by contract with Sun and the various software vendors.

    The department had a high-pressure atmosphere. There was constant pressure to work overtime and meet deliveries. Engineers left if they could find another opening.

    The senior system administrator left after the technical managers and the department head started micromanaging the cluster and reorganizing it without his knowledge or input.

    Does the system also include a backup system?

    There was a DLT jukebox for each server. The ClearCase machine (SCM) had a RAID system as well. There was some talk about using a journaling file system but management did not want to spend the money. There was one spare (test) machine.

    Servers were backed up weekly with daily incrementals to tape using CRON jobs. The jukebox changed tapes as needed. To my knowledge there was no attempt to verify the backups were successful (i.e. one could restore from them). There was no storage of backups off-site. It is not clear if people checked to see that the backups occurred as scheduled and completed successfully.
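
One way to check that a backup can actually be restored is to restore it to a scratch directory and compare checksums against the live files. The sketch below assumes that both trees are reachable as ordinary directories; the function names and the idea of a scratch restore area are illustrative assumptions, not a description of what the project did.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """SHA-256 of a file, read in chunks so large files do not fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(source_dir: Path, restored_dir: Path) -> list[str]:
    """Return relative paths that are missing or differ in the restored copy."""
    problems = []
    for source_file in sorted(source_dir.rglob("*")):
        if not source_file.is_file():
            continue
        relative = source_file.relative_to(source_dir)
        restored_file = restored_dir / relative
        if not restored_file.is_file():
            problems.append(f"missing: {relative}")
        elif file_digest(source_file) != file_digest(restored_file):
            problems.append(f"differs: {relative}")
    return problems
```

Files that changed after the backup was taken will also be reported as differing, so a check like this is most meaningful on directories that were quiet between the backup and the test restore.
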
    Is there any documentation (or logs) about the backup available?
    There might have been a log; I am not sure.
    How many people were actually involved in developing the system successfully in the better days of the company (good old days)?
    The original team that built the cluster consisted of VM and two others. Those eventually left and then VM was forced out later and PW came in to take over. The crew normally numbered three to four, but one of those also supported the Managers' Windows boxes.

    What do you think is the minimum of people it takes to develop and implement the new system?

    For that size of a cluster a minimum of three with at least one fully qualified (Sun Certified Administrator).  Additional people to support the Windows boxes and the testing labs (not using Suns). At the time of the disaster there were two, both very inexperienced (one admitted it to me).

    Is there a project plan available for the new system development?

    No, the department head did not believe in that. They had a schedule for the project and that was about it. There were formal procedures for the software development, but no formal plan. Note that this project covered five locations with five department heads; our department head (JC) was in charge but spent much time clashing with his peers on the project. As for the computer cluster, it was managed more as a side effort that happened because no one wanted to turn control over to the corporate CIO/IT people. At the time under discussion the computing environment was under a novice tech manager (PI) who had previously worked in System Integration. He (PI) had about five months' experience as Tech Manager at the time.

    What security measures are built into the system?

    General UNIX/Sun security. Inside the corporate firewall. Inside a locked computer center. Decent password security.
    Who is in charge of the change management system? Is it the system administrator?
    Both the system admin and the chief load builder. The system admin does the backup and general support. The load builder manages the change management system for the source code. There are other areas under change management: the entire set of software development tools (compilers, etc.) and the project documentation, as well as a change request system. Each of those has someone who is supposed to keep it in good order and up to date.

    How many risks should we identify?
    Six or seven major risks.
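
The problem statement asks for an estimate of probability and impact for each risk. A common way to rank a short list is by risk exposure, taken here as probability times impact. The sketch below is a minimal illustration of that arithmetic; the `Risk` class, the 1 to 10 impact scale, and all of the numbers are assumptions made for the example, not assessed values for this project.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    probability: float  # estimated chance of occurring, 0.0 to 1.0
    impact: int         # estimated severity, 1 (minor) to 10 (project-stopping)

    @property
    def exposure(self) -> float:
        """Risk exposure as probability times impact."""
        return self.probability * self.impact


def rank(risks: list[Risk]) -> list[Risk]:
    """Sort risks from highest to lowest exposure."""
    return sorted(risks, key=lambda r: r.exposure, reverse=True)


# Illustrative entries only; the numbers are placeholders.
risks = [
    Risk("Backups cannot be restored", 0.3, 9),
    Risk("License server unavailable", 0.2, 7),
    Risk("Administrator leaves without documentation", 0.5, 6),
]

if __name__ == "__main__":
    for r in rank(risks):
        print(f"{r.exposure:4.1f}  {r.name}")
```

The ranking only orders the list; mitigation and response still have to be argued for each risk separately.
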
    Builds are running constantly? What does this mean? Are people constantly kicking off builds? Is there a nightly build process? Does the build process also bring up the system they are building? Or do the developers themselves bring up their development environment?
    The build team starts the builds manually, usually two or three times a day on each of the eight build servers. In addition, developers can perform builds on their private workstations. The builds result in a binary image that is "walked" over to the testing lab, loaded onto flash memory and inserted into a test board for test. Remember these are 64MB embedded systems.
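
The build figures quoted in these comments (eight build servers, 4-6 hours per build, two or three builds a day per server) give a rough idea of how heavily the build machines are loaded. The sketch below works through that arithmetic, assuming builds on one machine run one after another rather than overlapping; the constant and function names are made up for the example.

```python
BUILD_MACHINES = 8       # dedicated build servers, from the comments above
BUILD_HOURS = (4, 6)     # shortest and longest build time, in hours
BUILDS_PER_DAY = (2, 3)  # builds started per server per day


def busy_hours(builds_per_day: int, hours_per_build: int) -> int:
    """Hours per day one build machine is occupied, assuming no overlap."""
    return builds_per_day * hours_per_build


low = busy_hours(BUILDS_PER_DAY[0], BUILD_HOURS[0])   # lightest day
high = busy_hours(BUILDS_PER_DAY[1], BUILD_HOURS[1])  # heaviest day

if __name__ == "__main__":
    print(f"Each build machine is busy {low} to {high} hours out of 24")
    print(f"Cluster-wide builds per day: "
          f"{BUILD_MACHINES * BUILDS_PER_DAY[0]} to {BUILD_MACHINES * BUILDS_PER_DAY[1]}")
```

Under these assumptions a build machine is occupied for between a third and three quarters of the day, which leaves limited slack if one of the eight is out of service.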