Author Topic: Email Storage Platform Update  (Read 5638 times)
Simon Day

Posts: 263


« on: November 02, 2006, 04:59:48 pm »

Hi Guys,

I know it’s time we gave everyone an update on the progress with the mail platform, and where we are with the migration of data to the new platform. I’ve detailed our plans to resolve the ongoing mail issues here and build a platform that, unlike what we have today, will be able to handle anything that gets thrown at it.

The original decision to implement a new mail platform was taken because the existing NetApp mail storage platform was running out of space. It was running out of space because new procedures governing what could be done to manage the mail storage platform had been introduced, and these effectively meant that the platform was not being managed as it should have been. Those controls were imposed after a single user error on the platform that resulted in the deletion of a single customer's email. The mail was restored from the snapshot, but it was felt that it was right for the Networks team not to have the power to decide how the platform should be managed. That decision meant that the platform was filling up and that a new platform should be purchased to provide room for growth. The new Sun storage platform was selected and purchased.

We began the migration of customers’ mail to the new Sun Network Attached Storage (NAS) six weeks ago. We successfully moved all of the Force9, Free-Online and Metronet customer mail on to the storage and saw no discernible difference in performance between the Sun and the old NetApp platform (at least, no difference that could not be put down to monitoring anomalies).

At around this point, errors started to appear on the mirrored link between the sites where each half of the storage is installed. This could have caused the storage to get out of sync, so we took the decision to stop moving data to the new system until we got the issue sorted. We engaged Sun, and during the diagnosis we experienced a disk failure at the master site. The RAID configuration in use means each site has 2 hot spares, and the system pulled in one of the hot spares as a replacement. Unfortunately, the process of doing this caused the performance of the platform to deteriorate significantly, to the point where email collection for customers became impossible.

The issue of the mirror link has now been resolved, and having liaised with Sun over many weeks, they are stating that the hardware we bought should be capable of the performance we expect and require. However, there is little trust in the Networks team and across the business that this hardware will ever be in a place to run the platform. So we are left with the fact that the new mail platform, which I believed had been specified to be the right solution for our next generation mail storage, has a performance flaw that means it is not capable of doing the job we expect within our environment. As such, it is clear that another answer is needed – and quickly.

Firstly, and most importantly, we have changed the procedures governing how the mail storage can be managed by the team that best understands how to manage a mail storage platform: the Networks team. As a result, we have been doing a lot of work on freeing up space on the existing NetApp platform. As an example of the kind of progress we have made, we have found over 300GB of mail stored in mailboxes that have not been looked at by customers for 6 months or more. This mail is being archived off, and that will free up enough space to move all the non-PlusNet VISP mail back to the NetApp platform. This is being worked on as I type and is planned for completion within 2 weeks. Going forward, we will manage the mail storage to ensure that capacity is managed correctly, and the platform will be scaled as appropriate.

With regard to scaling the platform, we have been speaking with most of the leading vendors in mail storage in the last few weeks, and we now have a very clear idea of what the right platform would be going forward. Once we are confident that we are managing the platform as it should be managed, we will have a clear idea of what we should be buying in terms of capacity.

So how did we end up buying a new platform that was not suitable for our needs? I am not going to play the blame game. We all signed off the solution we went for, and the fact is that we got it wrong.

The facts are that the system was specified about 8 months ago by a Senior Engineer, in consultation with Sun. The PlusNet engineer who designed and procured the system is no longer with us, and as such, he can’t talk us through the conversations had with Sun. The issue here is that there was a level of trust in both the engineer and in Sun that has proven to be inappropriate and misjudged. That has naturally been very uncomfortable for all the people involved, but it is an unavoidable truth.

Our purchasing policy has been overhauled as a direct result of the error we made here. Any major decision now requires sign-off from two senior engineers, a Senior Architect, the Technical Director and the Board. We are also now engaging our co-design team in all such decisions, which brings together expertise from other areas of the business, many of whom would not have been involved in networks purchasing decisions before.

Added to this (as you have already seen from previous announcements), several people in senior positions at PlusNet are no longer with us. The decision to change the procedure and policy for managing the mail storage was taken during their tenure, as was the decision to purchase the Sun platform, the decision to continue moving customers to the Tiscali LLU platform in the way we did, and the way we moved customers to the Max products. It was also during the same period that we allowed the ticket and problem backlog to reach the excessive levels that it did.

What this all boils down to, and I know you've been hearing this a lot recently, is that things have been very different at PlusNet in recent months. We lost our way in a big way. There is, however, now an undeniable feeling that things are being done differently. If you want an example of the change that has occurred since the structural changes discussed here:

http://www.plus.net/features/news/reorganise.shtml

I’ve included what is known as our internal RAG (Red Amber Green) document below. This shows areas of concern for us in networks and what we have fixed since we started tracking things in this way.




Simon Day
Network Improvement Consultant
PlusNet Plc
Oldjim

Posts: 937


« Reply #1 on: November 02, 2006, 06:14:01 pm »

Without wishing to put words into anyone's mouth, the scenario as described is one I have come across many times in my career, where a small number of staff at a senior level were given authority to make decisions that they did not have the knowledge and experience to fully understand.
This explanation and the actions now being taken give me a degree of confidence that things are going the right way.
The next thing which now needs to be announced is the ongoing capacity purchasing program, as it is very clear that the PlusNet infrastructure is creaking at the seams (some would say bursting) due to a lack of investment in the past to reflect the increase in customer numbers and the ongoing changes in the user demand pattern.

Jim PlusNet PAYG RIN (no longer)
Now changed to BBYW Option 2 and seeing what difference it makes
Wish I had changed earlier as I have seen very little difference
MauriceB
Administrator

Posts: 3733

« Reply #2 on: November 02, 2006, 06:41:33 pm »

I didn't find a reference to the USENET platform in the RAG spreadsheet Simon?  Oversight or unplanned?

Maurice
Simon Day

Posts: 263


« Reply #3 on: November 02, 2006, 07:44:57 pm »

I didn't find a reference to the USENET platform in the RAG spreadsheet Simon?  Oversight or unplanned?

Oversight, my bad! Will pick this up in the morning. Usenet is being worked on at the moment, so its not being included on here is no reflection of the effort being put into it.

Simon Day
Network Improvement Consultant
PlusNet Plc
Tam

Posts: 1188


100Mb via Enta.net :D

« Reply #4 on: November 02, 2006, 08:19:53 pm »

Capacity upgrades? Not seen any sign of when you're going to do that. Stick another 622 pipe in, that will fix most of your complaints straight away!


Funny how I see traffic management as "green" on that spreadsheet ... certainly not what I think most people will agree is the correct colour for it!
 

grumps

Posts: 3

« Reply #5 on: November 02, 2006, 08:33:57 pm »

I read what you're saying, Simon. However, being a simple-minded guy, I don't understand most of the techie stuff, although I do 'get' the principle of the thing.

Basically, however, all I know is that e-mails on my webmail timed as received at 16.47 on Tuesday 31.10.06 have only just arrived today, Thursday 02.11.06, timed as received at 16.43. Being a tidy boy, I delete mails from webmail after two days, so it's very possible I have lost some in these trying times.

I look forward to a quick and successful solution.
mikeb

Posts: 656


« Reply #6 on: November 04, 2006, 09:40:54 am »


We began the migration of customers’ mail to the new Sun Network Attached Storage (NAS) six weeks ago. We successfully moved all of the Force9, Free-Online and Metronet customer mail on to the storage and saw no discernible difference in performance between the Sun and the old NetApp platform (at least, no difference that could not be put down to monitoring anomalies).

However, there is little trust in the Networks team and across the business that this hardware will ever be in a place to run the platform. So we are left with the fact that the new mail platform, which I believed had been specified to be the right solution for our next generation mail storage, has a performance flaw that means it is not capable of doing the job we expect within our environment. As such, it is clear that another answer is needed – and quickly.

As a result, we have been doing a lot of work on freeing up space on the existing NetApp platform. As an example of the kind of progress we have made, we have found over 300GB of mail stored in mailboxes that have not been looked at by customers for 6 months or more. This mail is being archived off, and that will free up enough space to move all the non-PlusNet VISP mail back to the NetApp platform. This is being worked on as I type and is planned for completion within 2 weeks. Going forward, we will manage the mail storage to ensure that capacity is managed correctly, and the platform will be scaled as appropriate.

(heavily snipped)

Erhm, sorry, I know it's a bit early in the morning and all that, but I'm confused! How does this stack up with the recent service.status announcement, which seems to imply that stuff is being moved from the old platform to the new platform ... or is that something completely different (i.e. server as opposed to storage)?

Quote
Detailed description of work to be performed:-
We will be migrating off of our older incoming main platform, to a new hardware build based on Sun T1000 servers. This will see a long term improvement with respect to the recent problems encountered with delayed email.

--
WARNING: The e-mail address on my profile is not my usual address, all messages sent via this site have been redirected elsewhere for test purposes. This could result in messages not being received in a timely manner or potentially not being received at all.
wildmind
Guest
« Reply #7 on: November 04, 2006, 09:59:01 am »

There are two main elements to the mail platform AFAICT - MXCore is the processing platform and then there's the mail storage platform.

Simon's announcement is to do with Mail STORE - the SS was to do with MX CORE (i.e. the processing platform)
Simon Day

Posts: 263


« Reply #8 on: November 06, 2006, 12:59:45 pm »


Simon's announcement is to do with Mail STORE - the SS was to do with MX CORE (i.e. the processing platform)


Exactly right Mike.

Simon Day
Network Improvement Consultant
PlusNet Plc