The trouble with IT networks
IT networks are troublesome: there are so many potential points of failure, and so much monitoring, analysis and troubleshooting required. Ask anyone in business, not just IT professionals, and they’ll tell you that network downtime spells trouble.
To gauge just how much trouble, in 2011 the Ponemon Institute conducted a survey of 32 separate organisations representing 41 data centres in the US, with a view to determining the true cost of data centre outage. The study used a framework of nine core process-related activities that drive a range of expenditures associated with a company’s response to a data outage.
Specifically, data was gathered on costs around:
- detection, including all costs associated with the initial discovery and subsequent investigation of the partial or complete outage incident;
- containment expenses to enable reasonable prevention of an outage from spreading, worsening or causing greater disruption;
- recovery costs associated with bringing networks and core systems back to a state of readiness;
- ex-post response expenditures, which incorporate all after-the-fact incidentals;
- equipment including new purchases, repairs and refurbishments;
- IT productivity relating to IT personnel downtime;
- user productivity, which includes time and related expenses of end users; and
- third-party expenditures, which include contractors, consultants, auditors and other specialists engaged to help resolve unplanned outages.
In addition, the study factored in opportunity costs associated with data outage, including lost revenues from customers and potential customers because of an inability to access core systems, as well as business disruption expenditure, including reputational damages, customer churn and lost business.
The study determined that certain causes of outage are more expensive than others - IT equipment failure being the most expensive root cause and accidental/human error the least - and that the average cost per incident was US$505,500. Half a million bucks. No small number, and little wonder, because data networks can be complex beasts.
Who to blame when things go wrong
Due to that complexity, there is a tendency for ‘finger pointing’ in any downtime or failure scenario. If you’ve ever been one of multiple trades on a project when something has gone down, you know how this plays out. Everyone is so busy blaming everyone (or anyone) else that the problem generally takes even longer to be rectified.
The IT world is no different - if there are issues on the network, they could be coming from anywhere. Consider the following scenario: a remote user finds the network is slow. Many questions need to be asked before the problem can be isolated: is the problem occurring on-site, on the WAN or at the data centre? Is the firewall blocking something, or is the local site congested? Is it a carrier problem? Has QoS/CoS been misconfigured? Has a server been overloaded? Is the user load being properly distributed, and is authentication completing successfully? The questions never stop and, in many cases, the resolution is too long in coming.
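To make the isolation step concrete, here is a minimal first-pass sketch in Python. It is purely illustrative - the hostnames and ports are hypothetical - and simply measures TCP connect latency to one host on each segment (local gateway, WAN edge, data centre application) to suggest where the slowness lives:

```python
# First-pass fault isolation: time a TCP connect to a host on each
# network segment. All addresses below are hypothetical placeholders.
import socket
import time

PROBES = {
    "local gateway":   ("192.168.1.1", 80),
    "WAN edge":        ("wan-edge.example.com", 443),
    "data centre app": ("app.dc.example.com", 443),
}

def connect_latency(host, port, timeout=3.0):
    """Return TCP connect time in milliseconds, or None on failure."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.monotonic() - start) * 1000
    except OSError:
        return None  # unreachable, refused or timed out

for segment, (host, port) in PROBES.items():
    ms = connect_latency(host, port)
    status = f"{ms:.1f} ms" if ms is not None else "unreachable"
    print(f"{segment:15}  {host}:{port} -> {status}")
```

A jump in latency between two adjacent probes points at the segment in between, which at least narrows the list of questions above.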
Additional stresses
It’s not all about downtime, either; there’s pressure from other angles. Capacity management can be an issue: as in any distributed organisation, the cost of connectivity is always difficult to quantify. Traffic on the network tends to expand to fill the available bandwidth, whether or not that traffic is crucial to the business. Usage of business-critical applications is also difficult to measure. It may be within expected parameters 90% of the time, but no business wants to experience degraded performance for the remaining 10%.
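One common way to put a number on that 90/10 split is a percentile over sampled link utilisation. The sketch below is illustrative only - the samples are invented - and uses a simple nearest-rank percentile over figures that would, in practice, come from SNMP interface counters or flow records:

```python
# Nearest-rank percentile over sampled link utilisation (% of capacity).
# The sample values are invented for illustration.
samples = [42, 38, 55, 61, 47, 93, 88, 51, 44, 97]

def percentile(values, pct):
    """Nearest-rank percentile of a list of numeric samples."""
    ordered = sorted(values)
    rank = max(0, round(pct / 100 * len(ordered)) - 1)
    return ordered[rank]

print(f"90th percentile utilisation: {percentile(samples, 90)}%")
# If this figure approaches link capacity, the remaining 10% of
# intervals are where users feel the degradation.
```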
Performance requirements put constant pressure on IT to increase bandwidth in order to provide an effective service to the business, while cost management demands that bandwidth be capped, or even reduced. These are conflicting demands, to say the least.
Then there is voice over IP (VoIP), which puts additional strain on the network. While it may be an attractive option to management in terms of cost reduction, the issues with using a standard data network to transport voice calls are obvious from a traffic perspective: VoIP uses the User Datagram Protocol (UDP), which is inherently connectionless. This means that if a packet is lost, or delivery takes too long, the sender has no mechanism to resend the data or to adjust the rate at which it is sent.
When an environment adds a significant number of VoIP users, this can drive up utilisation of network segments, reducing both call quality and the speed at which standard Transmission Control Protocol (TCP) applications perform. TCP, being a connection-oriented protocol, times out and retransmits data that is not acknowledged in time - adding yet more traffic to an already congested segment.
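The difference is easy to see at the socket level. The following minimal Python sketch - addresses and ports are hypothetical - contrasts the two transports:

```python
import socket

# UDP (used by VoIP): connectionless. sendto() returns as soon as the
# datagram is handed to the network; if it is lost in transit, nothing
# is retransmitted - the voice frame simply never arrives.
udp = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
udp.sendto(b"20 ms voice frame", ("198.51.100.10", 5004))
udp.close()

# TCP: connection-oriented. The kernel tracks every byte sent and
# retransmits anything not acknowledged in time - exactly the extra
# traffic and delay described above on a congested segment.
tcp = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
tcp.settimeout(3.0)
try:
    tcp.connect(("198.51.100.10", 8080))
    tcp.sendall(b"application data")  # retried by the kernel until ACKed
except OSError:
    pass  # TCP failures surface as errors; UDP loss is silent
finally:
    tcp.close()
```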
So, how can a company manage all of these conflicting pressures on a network and ensure that downtime is kept to a minimum?
The Ponemon study attributed data centre outages to one of four drivers:
- Increasing IT demands/exceeding data centre capacity: Additional equipment deployed to meet increased IT demand can stress infrastructure, resulting in failure.
- Rising rack densities: Introduction of blade servers and other high-performance equipment creates more heat and requires precision cooling. Water incursion becomes a greater risk.
- Data centre efficiency: Power draw is often the target of facilities and operations staff. Trying to minimise energy consumption at the expense of availability is a risky business, particularly where critical data centres are concerned.
- Need for infrastructure management and control: This is the key driver, as all of the preceding items can be addressed through a system of management and control.
Performance monitoring for modern IT
While various stand-alone performance monitoring and troubleshooting options for networks and IT systems have been around for some time, Fluke Networks has recently unveiled a truly unified network and application performance troubleshooting appliance, which it has labelled Visual TruView.
TruView is a single appliance that draws on key data sources - packet, transaction, flow and SNMP - and correlates them into a single, easy-to-understand view of performance. These correlated views show how the infrastructure is transporting applications and how well those applications are performing in the context of the end-user experience.
In simple terms, individual IT teams no longer have to hunt for problems in isolation, as everyone has a complete view of performance across the entire application and network. The device provides enterprise-wide visibility, isolates problems down to the infrastructure device, interface, transaction or packets associated with any performance event, and delivers retrospective analysis with event reconstruction.
Analytics are time-correlated within a single dashboard, so the workflow to understand the problem domain is simplified - finger pointing between IT teams is virtually eliminated. Dashboards are customised from a library of views and measurements, meaning that the user only sees what they need to see.
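Time correlation itself is a generic technique rather than anything product-specific. As a hedged illustration (this is not TruView’s implementation), the sketch below buckets samples from two hypothetical sources - SNMP link utilisation and application response time - onto a shared timeline so that spikes can be read side by side:

```python
# Bucket measurements from different sources onto a shared timeline.
# Both series are invented for illustration: (timestamp_s, value) pairs.
from collections import defaultdict

snmp_util = [(0, 41), (30, 44), (60, 92), (90, 95)]      # link utilisation, %
app_resp  = [(5, 120), (35, 130), (65, 900), (95, 870)]  # response time, ms

def bucket(series, width=30):
    """Average samples into fixed-width time buckets."""
    acc = defaultdict(lambda: [0.0, 0])
    for ts, val in series:
        key = ts // width
        acc[key][0] += val
        acc[key][1] += 1
    return {k: total / n for k, (total, n) in acc.items()}

util, resp = bucket(snmp_util), bucket(app_resp)
for k in sorted(set(util) & set(resp)):
    print(f"t={k * 30:>3}s  util={util[k]:5.1f}%  response={resp[k]:6.1f} ms")
# Output pairs the utilisation spike at t=60s with the response-time
# spike in the same bucket - the kind of correlation a dashboard shows.
```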
TruView is fast: automated application discovery speeds set-up and ongoing management. It is intelligent, incorporating a self-learning baseline capability. And it is complete: from monitoring to troubleshooting, it offers insight into everything from traffic flows to individual client transactions.
The device also provides advanced VoIP analysis, giving a visual indication of each user’s call quality and the factors degrading it. Additionally, the in-depth level of analysis allows businesses to distinguish critical business usage from non-critical and recreational usage on the network, clearly identifying current bandwidth requirements and permitting accurate forecasting of future needs.
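Call-quality scores of this kind are typically derived from path metrics such as delay, jitter and packet loss. As a hedged sketch - a simplified E-model-style approximation, not TruView’s actual algorithm - a Mean Opinion Score (MOS) can be estimated like this:

```python
# Simplified E-model-style MOS estimate from path metrics.
# An approximation for illustration, not any vendor's algorithm.

def estimate_mos(latency_ms, jitter_ms, loss_pct):
    """Approximate Mean Opinion Score (~1.0 bad to ~4.4 excellent)."""
    # Treat jitter as extra effective delay absorbed by the jitter buffer.
    d = latency_ms + 2 * jitter_ms + 10.0
    # R-factor: start near the theoretical maximum, subtract impairments.
    r = 93.2
    r -= d / 40 if d < 160 else (d - 120) / 10
    r -= 2.5 * loss_pct          # each 1% loss costs ~2.5 R points
    r = max(0.0, min(100.0, r))
    # Standard R-to-MOS conversion curve.
    return 1.0 + 0.035 * r + 7e-6 * r * (r - 60) * (100 - r)

print(f"MOS: {estimate_mos(latency_ms=80, jitter_ms=15, loss_pct=1.0):.2f}")
```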
So, while it may not prevent human error quite yet, a truly integrated approach to data network performance management lessens the likelihood of costly outages and lost productivity, and negates the need for finger-pointing between departments. A simple solution to a complex and expensive problem, TruView delivers an extra degree of reassurance when it comes to keeping the heart of any business beating.