Update on Network Issues - March 2006
Article posted by csogilvie on Monday, 13-Mar-2006 14:32
PlusNet have prepared the following email notice to be sent to all customers concerning the earlier reported network problems.
Sending has already started, but it will take 3-4 days for this email to reach all customers.
Overview of network issues
On behalf of the management team at PlusNet, I would like to sincerely apologise for the inconvenience caused by the problems experienced on our broadband network last week.
During last weekend and again later in the week, we saw intermittent issues with one of our network suppliers. This may have caused difficulties in accessing some Internet sites. Additionally, we experienced a major network failure on the morning of Tuesday 7th March. This caused loss of connectivity for customers over several hours and led to very high call volumes to our customer service centre, which left some customers unable to make contact with us.
Please accept our apologies and be assured that our team is working tirelessly to protect your customer experience. Through careful planning and investments in our network, we know that we can quickly resolve the problems experienced last week and ensure that they do not occur again. In the meantime, we have included below further information about what went wrong, how we fixed it and what we’re doing to minimise the chances of future related issues.
Head of Customer Services
PlusNet – The Smarter Way to Broadband
Service Details – Network issues 3rd – 10th March
Tuesday 7th March – RADIUS Problems
Early last Tuesday morning, we completed routine upgrades on our core network and started the process of re-authenticating all our customers onto our Broadband platform.
When we perform these types of software upgrades, we need to ensure our entire network behaves consistently following the work. This occasionally means we have no option but to affect all of our customers who are connected at that time. This is a procedure that has been carried out regularly in the past without issues.
The re-authentication process involves disconnecting all customers connected through each of our pipes in a staggered manner, waiting for the customers from one pipe to reconnect to the network, and then starting the next. Connection attempts are monitored and authorised by our RADIUS platform. The problem occurred during the re-authentication of customers from the final BT Central pipe, which is currently operating as part of a BT Wholesale trial and has more customers connected than all other similar pieces of equipment on our network.
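The staggered procedure described above can be sketched as follows. This is a minimal illustration only, not PlusNet's actual tooling: the function name, the `disconnect` and `reconnected_fraction` callbacks, the 95% threshold and the poll limit are all hypothetical.

```python
from typing import Callable, List, Sequence


def staggered_reauth(
    pipes: Sequence[str],
    disconnect: Callable[[str], None],
    reconnected_fraction: Callable[[str], float],
    threshold: float = 0.95,  # assumed "pipe has recovered" level
    max_polls: int = 100,     # assumed limit before giving up
) -> List[str]:
    """Disconnect one pipe at a time, waiting for it to recover before the next."""
    completed: List[str] = []
    for pipe in pipes:
        disconnect(pipe)
        # Poll until enough customers on this pipe have reconnected.
        # A real tool would sleep between polls; omitted to keep the sketch short.
        for _ in range(max_polls):
            if reconnected_fraction(pipe) >= threshold:
                break
        else:
            # Stop rather than add another pipe's load to a struggling platform.
            raise RuntimeError(f"{pipe} did not recover; completed so far: {completed}")
        completed.append(pipe)
    return completed
```

The point of the design is that a pipe which fails to recover halts the whole run, so no further pipes are disconnected while the authentication platform is still busy.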
We operate two distinct RADIUS platforms, each made up of numerous servers, some of which are dedicated to different tasks (Database, Front-end, Accounting etc). When we went to reconnect the customers on our most heavily subscribed Central pipe (approaching 30,000 live user sessions at that time in the morning), the authentication front-end server ‘core dumped’ (crashed) and stopped responding to new connection requests. Although the secondary RADIUS platform did not crash, the sheer volume of connection requests it then had to handle caused a severe performance issue. The problem was exacerbated by the way many customers’ routers are configured to retry constantly if a connection attempt fails. Ultimately it was these retries that kept the platform under severe load and prevented us from bringing the primary server back into service. That in turn produced more failed connection attempts and more retries, forming a loop that was difficult to break.
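The retry loop can be illustrated with a deliberately simple toy model. Only the figure of roughly 30,000 sessions comes from the account above; the capacity, the overload threshold and the all-or-nothing overload behaviour are assumptions chosen purely for illustration.

```python
def clients_still_offline(
    clients: int,
    capacity: int,      # assumed: logins the platform can complete per tick
    overload_at: int,   # assumed: offered load above which nothing completes
    ticks: int,
    retry_every: int = 1,  # 1 = every offline router retries on every tick
) -> int:
    """Return how many clients are still offline after `ticks` time steps."""
    waiting = clients
    for _ in range(ticks):
        # Requests offered this tick (integer ceiling of waiting / retry_every).
        offered = (waiting + retry_every - 1) // retry_every
        if offered > overload_at:
            served = 0  # platform overloaded: no login completes
        else:
            served = min(offered, capacity)
        waiting -= served
    return waiting


# Constant retries keep the offered load above the threshold indefinitely.
stuck = clients_still_offline(30_000, capacity=2_000, overload_at=5_000, ticks=50)
# Spreading the same retries out lets the backlog drain.
draining = clients_still_offline(
    30_000, capacity=2_000, overload_at=5_000, ticks=50, retry_every=10
)
```

In this model, `stuck` never falls below the starting 30,000 because every tick is an overload tick, while `draining` shrinks steadily once the offered load stays under the threshold.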
In order to resolve the problem, our engineers had to re-assign additional network resources from other areas of our network, and two new servers were built and permanently added to the RADIUS platform. As a result of the problem, around a third of our customers were unable to connect to the Internet until late morning, and a handful of customers were unable to connect until the early afternoon.
We do understand and acknowledge the obvious frustrations that this incident caused for many people, and we are now putting extra steps in place to ensure that an incident like this does not occur again. This will include further backup authentication servers to improve the resiliency of our network. Additionally, improved flow control for RADIUS packets will ensure that the RADIUS platform itself cannot become overloaded due to too many connection attempts at once.
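As an illustration of what flow control for connection attempts can look like, here is a minimal token bucket sketch. The notice does not say which mechanism PlusNet will use, so the token bucket itself, the class name and the rate and burst values are all assumptions.

```python
class TokenBucket:
    """Admit at most `rate` requests per second, with bursts of up to `burst`."""

    def __init__(self, rate: float, burst: int) -> None:
        self.rate = rate            # tokens added per second (assumed value)
        self.burst = burst          # maximum tokens held (assumed value)
        self.tokens = float(burst)  # start with a full bucket
        self.last = 0.0             # time of the previous call, in seconds

    def allow(self, now: float) -> bool:
        """Return True if a request arriving at time `now` may be processed."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(float(self.burst), self.tokens + elapsed * self.rate)
        self.last = max(self.last, now)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # over the limit: drop the request before it reaches the platform


bucket = TokenBucket(rate=2.0, burst=3)
first_second = [bucket.allow(0.0) for _ in range(5)]
next_second = [bucket.allow(1.0) for _ in range(3)]
```

Requests that are refused are dropped cheaply at the front of the platform, so a burst of retries cannot consume the capacity needed to authenticate the requests that are admitted.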
We take this type of incident very seriously, and work is now in progress to revise our maintenance strategy so that the chance of customer-impacting problems is kept to an absolute minimum. In this instance we did not provide the right level of customer experience, and as well as the technical issues, we recognise that the response from our customer support team also failed to meet our customers’ expectations.
Customer Support Centre Response
During the problems last Tuesday morning, we believe approximately 15,000 calls to our support centre were attempted. This is vastly more than we have ever received over such a short period of time. As a result of this very high volume, a number of these calls will have been met with engaged tones. We had already identified this as a limitation of our current telephone system, and can confirm that a new and much improved system is already being installed within our Customer Support Centre. There was also additional impact on support response times in the period that followed, because of the high number of customer fault tickets raised during the incident.
As well as reviewing our network resiliency since the event, we are now also carrying out a review of our Service Status communication mechanisms, including the positioning of recordings on our phone system and the functions of our web-based service status posting tool. We are also conscious of the need for more detailed planned maintenance notices that give ample information about the work taking place and any possible risks that might be involved.
Details of the intermittent issues when accessing some web sites
Since Saturday 4th March some customers have experienced intermittent issues when trying to access some Internet sites. This was caused by problems at Abovenet, one of our primary network connectivity suppliers. Their major network failure and subsequent performance issues were initially diagnosed as a DNS issue, but were later identified as routing failures within multiple core routers, and it was noted that this problem impacted general Internet routing.
The problem affected many UK and US sites, and other UK ISPs who are customers of Abovenet also reported problems. Although we were able to manually alter our routing tables in order to alleviate the issues caused to some customers, the intermittent nature of the Abovenet fault has continued to cause problems throughout the week. We are working with Abovenet and our other Internet connectivity suppliers to ensure this is fully resolved and that there is no potential within their network for similar issues to recur.
PlusNet Customer Support