For those who might be interested, the following is a copy of an internal post with our latest update on Problem Management. Any questions - just ask!
Development Problem Update
ADD Summary
Great progress, and the plan is set to continue and improve on it going forward! Problems are down 15% and have mostly fallen week on week, except last week, when the numbers rose by 15 alongside the rollout of WLR. This was also heavily impacted by a number of complicated P1s and a reduced development team due to holidays. The development team has now been ramped up to between 9 and 11 developers (11 in total, but some have commitments on NADs and project rollouts). Problems are expected to fall heavily in October.
Problem Out of Hours Push has been successful, but requires more support. This is being reviewed with the Development Managers as all out of hours support has provided fantastic results.
Detailed Report
- Problems reduced by 15% from 434 to 371
- 577 new problems raised
- A staggering 640 problems closed
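As a quick sanity check, the three figures above reconcile with each other. The sketch below is only an illustration: the variable names are made up here, and it assumes every raised and closed problem falls within the same reporting period.

```python
# Figures taken from the Detailed Report above.
# Variable names are illustrative, not from any internal tool.
opening = 434   # open problems at the start of the period
raised = 577    # new problems raised during the period
closed = 640    # problems closed during the period

# Closing backlog = opening backlog + new problems - closed problems
closing = opening + raised - closed

# Percentage reduction relative to the opening backlog
reduction_pct = (opening - closing) / opening * 100

print(closing)                  # 371
print(round(reduction_pct, 1))  # 14.5
```

So the backlog fell by 63 problems, which is where the rounded 15% figure comes from.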
Resolved: 25% less in-hour problem resource than planned
- A restricted resource of 6 developers per week, rather than the planned 8, had a major impact on our ability to drive the numbers down further. This has been resolved, with a team of between 9 and 11 now supporting the push
Out of Hours Push To be Better Supported
- On average, 5 developers per week have provided weekend problem push support, with an average of 6 problems closed per weekend and 6 progressed
- This is to be ramped up and a clear rota created by development managers
Processes fully Reviewed and Streamlined
- Hopper reintroduced and the daily hopper meeting reintroduced with fantastic turnout. The entire business has access to define the business need and ensure the success of the problem resolution effort
- Comms team now control the validation of problems and assignment into pools. QA still support the process and add technical advice where required
- A new tool entitled the WPMT and Content ‘Task Hopper’ has been created to ensure content rollouts are timely, centrally managed and visible
- LCS Team have clear and focused objectives, expectations and communication channels
Problem Resolution Graph
Problems dropped significantly with 8 developers. Numbers stagnated and then rose slightly with only 6 developers. The 9 to 11 developers now allocated are expected to have a major impact on the number of closed problems.
Network Problems – The Story So far
Summary
Within Networks, pre-July problems were out of control, causing major problems for CSC and our customer base. This caused the business to step back and undertake several major changes to the way networks operated within the business and interacted with CSC.
- Pre-July - over 50 problems regularly. Nagios Alerts 50+
- July-August – Problems down to 43. Nagios Alerts 30+
- September – Problems sub 20. Nagios alerts now Green with new alerts actioned immediately.
The Transformation Began
In July, the network Operations team adopted a new way of working. This entailed allocating 2 resources to work on network problems in rotating one-week shifts. This proved successful in that it brought the problem queue down from above the regular level of 50 to between 35 and 43. Whilst this was a marked improvement, networks were still prone to fire fighting, as problems were being raised quicker than the team could close them down.
Problems were gradually reducing due to:
- Increased training
- Improved awareness of internal systems - mentoring
- Defined handover process from development.
That said, networks were still struggling to maintain a stable platform and get the problem total down to a sustainable level. The underlying issue was that, with a one-week rotation, "hard" or "long term" problems were being overlooked in favor of problems that could be closed down within the weekly shift pattern. This cherry picking led to several problems lying in the problem pool with no progress.
In Life Evolution
In September, following a network reorganisation, further operational process change was undertaken to improve personal accountability and problem ownership. The new process makes nearly the entire team responsible for problems every morning, with the afternoons focused on operational improvements. Management are able to pull people for Ops work in the mornings if needed, and Problem Management are able to take people for P1s in the afternoons. This also encompassed improving housekeeping: doing the daily things that prevent and pre-empt issues occurring. Nagios is now being used as it should be, to identify and prevent problems before they impact our customers.
This has had a fantastic effect on the team, as it has instilled accountability and given individuals the opportunity to work a problem through from start to end, no matter how long it takes.
The team is now working smarter as a result of this, with team morale at a high as they now feel they are getting on top of the main issues that are affecting the business and, more importantly, our customers. So let's ride this high and get to the next level…
The results
- Problems are now down to around 20 and stable at this number, a reduction of approx 50%
- In the coming 3 weeks, we hope to get this down to a stable 15, with a projected target to get below 10 and remain at this level
- Nagios remains at green, and alerts are actioned as soon as they are raised. A 99% improvement
- 2 full-time resources are focused on network housekeeping at any one time
In Conclusion
A real improvement over the last couple of weeks, but a massive effort is still required. Tickets remain high in the CSC, so as a business we ALL need to pull together to fix our problems and reduce the ticket backlog.