Cloud Computing & Governance: Week Eight

Week Eight
This week I decided to dive into incident management. I chose this topic because the company I work for takes it very seriously: every minute that operations are impacted means money lost. According to ITIL, an 'Incident' is any event which is not part of the standard operation of a service and which causes, or may cause, an interruption to, or a reduction in, the quality of that service. My workplace defines an incident the same way.

The objective of Incident Management is to restore normal operations as quickly as possible, with the least possible impact on either the business or the user, at a cost-effective price. For example, at my workplace, every server has a backup server. When the primary fails, we can move operations to the backup. Cost-effective? Perhaps not. But it minimizes the impact on the business and keeps that impact from reaching customers. Once operations have been moved off the problem server, it can be investigated and repaired with as much time as necessary. This reduces room for error and allows an in-depth "what happened" analysis to be completed.
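To make the primary/backup idea concrete, here is a minimal Python sketch of the failover decision. It is only an illustration, not my workplace's actual tooling: the `Server` class, the `pick_active` function, and the server names are all hypothetical, and a real setup would rely on health checks and load balancers rather than a boolean flag.

```python
from dataclasses import dataclass


@dataclass
class Server:
    """Hypothetical stand-in for a real server and its health check."""
    name: str
    healthy: bool = True


def pick_active(primary: Server, backup: Server) -> Server:
    """Return the server that should carry operations right now."""
    if primary.healthy:
        return primary
    if backup.healthy:
        # Primary is down: move operations to the backup.
        return backup
    # Both servers are down, so there is nowhere to fail over to.
    raise RuntimeError("no healthy server available")


primary = Server("app-primary")  # hypothetical names
backup = Server("app-backup")

primary.healthy = False  # simulate a failure on the primary
active = pick_active(primary, backup)
print(active.name)  # prints "app-backup"
```

While the backup carries the load, the primary stays out of rotation and can be examined without time pressure.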

The steps below show how my workplace handles incidents.

1. Triage - Contact the correct application teams and notify them of the problem. Start a phone bridge so that application support is all in one place.
2. Analysis - What is the issue? What caused the issue? Who is impacted?
3. Reaction - Do we fix the issue? Can we restore the application? Should we switch from primary to secondary locations?
4. Restore - Resolve the issue by the solution determined in step 3.
5. Post-Mortem - What happened? What could we have done differently? What can we do to prevent this from happening again?
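
The five steps above can be sketched as a simple state machine in Python. This is an illustration under my own assumptions, not a real ticketing system: the `Stage` and `Incident` names, the `log` and `advance` methods, and the sample notes are all hypothetical. The point it shows is that an incident moves through the stages in order, and that the notes taken along the way feed the post-mortem.

```python
from enum import Enum


class Stage(Enum):
    """The five steps from the list above, in order."""
    TRIAGE = 1
    ANALYSIS = 2
    REACTION = 3
    RESTORE = 4
    POST_MORTEM = 5


class Incident:
    """Hypothetical incident record that moves through the stages in order."""

    def __init__(self, summary):
        self.summary = summary
        self.stage = Stage.TRIAGE
        self.notes = []  # (stage, note) pairs, kept for the post-mortem

    def log(self, note):
        """Record a note against the current stage."""
        self.notes.append((self.stage, note))

    def advance(self):
        """Move to the next stage; the post-mortem is the last one."""
        if self.stage is Stage.POST_MORTEM:
            raise ValueError("incident is already in its final stage")
        self.stage = Stage(self.stage.value + 1)
        return self.stage


incident = Incident("primary server unresponsive")  # made-up example
incident.log("phone bridge opened with application teams")
incident.advance()
incident.log("identified cause and who is impacted")
print(incident.stage.name)  # prints "ANALYSIS"
```

Keeping the notes tied to a stage makes it easier to answer "what happened?" afterwards, because the timeline is already recorded.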

Sources:
http://www.itlibrary.org/?page=Incident_Management
http://www.cisco.com/web/about/security/intelligence/worm-mitigation-whitepaper.html

Friday, July 25, 2014
