Hi Guys,
I know it’s time we gave everyone an update on the progress with the mail platform, and where we are with the migration of data to the new platform. I’ve detailed our plans to resolve the ongoing mail issues here and build a platform that, unlike what we have today, will be able to handle anything that gets thrown at it.
The original decision to implement a new mail platform was taken because the existing NetApp mail storage platform was running out of space. We were running out of space because new procedures restricting what could be done to manage the mail storage platform had been introduced, which effectively meant that the platform was not being managed as it should have been. Those controls were imposed as a result of a single user error on the platform that resulted in the deletion of a single customer’s email. The mail was restored from the snapshot, but it was felt that the Networks team should not have the power to decide how the platform should be managed. That decision meant that the platform was filling up and that a new platform would have to be purchased to provide room for growth. The new Sun storage platform was selected and purchased.
We began the migration of customers’ mail to the new Sun Network Attached Storage (NAS) six weeks ago. We successfully moved all of the Force9, Free-Online and Metronet customer mail onto the storage and saw no discernible difference in performance between the Sun and the old NetApp platform (at least, no difference that could not be put down to monitoring anomalies).
At around this point, errors started to appear on the mirrored link between the sites where each half of the storage is installed. This could have caused the storage to get out of sync, so we took the decision to stop moving data to the new system until we got the issue sorted. We engaged Sun and, during the diagnosis, we experienced a disk failure at the master site. The RAID configuration in use means each site has 2 hot spares, and the system pulled in one of the hot spares as a replacement. Unfortunately, the process of doing this resulted in the performance of the platform deteriorating significantly, to the point where email collection for customers became impossible.
The issue of the mirror link has now been resolved and, having liaised with us over many weeks, Sun are stating that the hardware we bought should be capable of the performance we expect and require. However, there is little trust in the Networks team and across the business that this hardware will ever be in a position to run the platform. So we are left with the fact that the new mail platform, which I believed had been specified to be the right solution for our next generation mail storage, has a performance flaw that means it is not capable of doing the job we expect within our environment. As such, it is clear that another answer is needed – and quickly.
Firstly, and most importantly, we have changed the procedures governing how the mail storage can be managed by the team that has the most understanding of how to manage a mail storage platform, the Networks team. As a result, we have been doing a lot of work on freeing up space on the existing NetApp platform. As an example of the kind of progress we have made with managing this, we have found over 300GB of mail stored in mailboxes that have not been looked at by customers for 6 months or more. This mail is being archived off, and that will free up enough space to move all the non-PlusNet VISP mail back to the NetApp platform. This is being worked on as I type and is planned for completion within 2 weeks. Going forward, we will manage the mail storage to ensure that capacity is managed correctly and that the platform is scaled as appropriate.
With regard to scaling the platform, we have been speaking with most of the leading vendors in mail storage in the last few weeks, and have now got a very clear idea on what the right platform would be going forward. Once we are confident that we are managing the platform as it should be, we will have a clear idea of what we should be buying in terms of capacity.
So how did we end up buying a new platform that was not suitable for our needs? I am not going to play the blame game; we all signed off the solution we went for, and the fact is that we got it wrong.
The facts are that the system was specified about 8 months ago by a Senior Engineer, in consultation with Sun. The PlusNet engineer who designed and procured the system is no longer with us, and as such, he can’t talk us through the conversations had with Sun. The issue here is that there was a level of trust in both the engineer and in Sun that has proven to be inappropriate and misjudged. That has naturally been very uncomfortable for all the people involved, but it is an unavoidable truth.
Our purchasing policy has been overhauled as a direct result of the error we made here. Any major decision now requires sign-off from two senior engineers, a Senior Architect, the Technical Director and the Board. We are also now engaging our co-design team in all such decisions, which brings together expertise from other areas of the business, many of whom would not have been involved in networks purchasing decisions before.
Added to this (as you have already seen from previous announcements), personnel in several senior positions at PlusNet are no longer with us. The decision to change the procedure and policy for managing the mail storage was taken during their tenure, as was the decision to purchase the Sun, as was the decision to continue moving customers to the Tiscali LLU platform in the way we did, and the way we moved customers to the Max products. It was also during the same period that we allowed the ticket and problem backlog to reach the excessive levels that it did.
What this all boils down to, and I know you've been hearing this a lot recently, is that things have been very different at PlusNet in recent months. We lost our way in a big way. There is, however, now an undeniable feeling that things are being done differently. If you want an example of the change that has occurred since the structural changes discussed here:
http://www.plus.net/features/news/reorganise.shtml
I’ve included what is known as our internal RAG (Red Amber Green) document below. This shows areas of concern for us in networks and what we have fixed since we started tracking things in this way.