Find and fix issues on the data route

Introduction

In my work in the Application Performance Management business I often see very interesting things.

Some things are pretty annoying (the performance of ad services, for example) because they happen every day/hour/minute/second, and it seems pointless to fight for better quality here.
It seems as if the ad market is aware that it is responsible for keeping the internet as we know it up and running by sponsoring “free content”.

Some things are pretty rare but happen from time to time.
One of these was recently brought to my attention by a cool NOC guy who asked me for help to find out what was currently happening with the performance of some of their offerings – and why.

I dug a bit into the data, found the issue, and was able to propose a solution that partly resolved it. After seeing the performance improve, I felt inspired enough to share this knowledge with you #webperf guys.

Here is the case – and how it got fixed. Enjoy reading!

Enhanced Data Analysis – Finding Issues Outside of Your Data Center

You have a high-frequency monitor set up from various fixed locations (see the methodology section below).
With a GET request you fetch the base element of your internet offering (or more than one, but only one domain).
Let’s say: every 5 minutes you monitor your domain(s) from 8 key locations with GET requests (and only the base page!).
The monitoring agents are located in backbone locations of major network carriers. This excludes bandwidth as a limiting factor and gives a clean view of the system’s health status.

To put your own performance into perspective, you monitor one of your main competitors with the same method.

After baselining performance you can set up alerts and notifications, leave the monitor running as it is, and check the metrics only when an alert fires.
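A minimal sketch of what such a monitor boils down to, assuming Python with the requests library (the URL, interval and keyword below are placeholder assumptions; a commercial backbone monitoring service does the same thing from fixed carrier locations and stores every measurement for later analysis):

import time
import requests

URL = "http://www.example.com/"   # placeholder: base page of the monitored domain
INTERVAL_S = 300                  # every 5 minutes
MUST_CONTAIN = "Welcome"          # simple content check on the response body

while True:
    started = time.time()
    try:
        resp = requests.get(URL, timeout=30)
        elapsed = time.time() - started
        ok = resp.status_code == 200 and MUST_CONTAIN in resp.text
        print(f"{time.ctime()} status={resp.status_code} ok={ok} load_time={elapsed:.3f}s")
    except requests.RequestException as exc:
        print(f"{time.ctime()} FAILED after {time.time() - started:.3f}s: {exc}")
    time.sleep(INTERVAL_S)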

All of a sudden – without any changes anywhere – you get a daily alert. You check the monitor and it looks like this for the base elements of some of your monitored key pages.

Scatter Chart showing spikes during day time

This chart represents a week of monitoring, with every single measurement displayed as a dot. The two colors represent the two domains tested.
You can clearly see a performance decrease as a daily pattern, with spikes during that time span too.

It is also visible that the baseline (best performance) is not affected. So you are not really getting slower overall, but you do have some issues.

To drill down to the cause of the spikes you should remember your test setup. You might first want to know whether this is a general issue or a localized one.

Scatter Chart by monitoring location – an indicator of a local issue

Both tests are now shown in a different way: not by test, but by monitoring location.
We can clearly see a lot of spikes at every location, but the main interest should be on the measurements taken from the Munich node.
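This drill-down can also be done directly on the raw measurement export; a minimal sketch with pandas, assuming a CSV export with hypothetical columns location, domain and response_time_s (the column names are placeholders for whatever your monitoring tool exports):

import pandas as pd

# Hypothetical export of the backbone measurements; column names are assumptions.
df = pd.read_csv("backbone_measurements.csv")   # columns: timestamp, location, domain, response_time_s

# Compare the best case (baseline) against the slow tail per monitoring location.
per_location = df.groupby("location")["response_time_s"].agg(
    baseline="min",
    median="median",
    p95=lambda s: s.quantile(0.95),
)
print(per_location.sort_values("p95", ascending=False))   # the outlier node floats to the top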

Back to the first view – but filtered to only the measurements taken from the Munich monitoring location – it looks like this.

Scatter chart from the problem location only – clearly the baseline is lost.

This is a very nice chart, since it shows a very clear pattern – even if that pattern is a bad one. Losing the baseline completely and having performance degrade by more than 100% is not a good thing.
Even more interesting: this behavior is focused on only one monitoring location.
There are now three options:
1.    Something outside of my control is causing an issue
2.    Something is wrong with the monitoring location
3.    Something might cause only this location to hit a bad infrastructure node in your data center (e.g. a load balancer)

To check whether something is wrong with the monitoring location, it might be a good idea to simply chart the competitor’s site, which we also have in the monitor, from this monitoring location only.

Competitor chart – same location, same settings. It does not show the same performance behavior.

Clearly we do not see the same curve as for our own content. The problem at this one monitoring location is with us – not the backbone.

Before speculating any further about the cause, we should look at the other data the monitor provides to exclude issues with specific components of the network communication.

The breakdown by network component might look like this:

Chart of the network components showing spikes in every component – not a very common sight.

We see the decrease in performance in every single component. Most impacted are the content time and the DNS time. Higher latency and longer content times might be indicators of some kind of packet loss. But why does it only occur during the daytime and not at night? Maybe we are somehow impacted by general daily internet traffic. But why only at this one location and not everywhere (the same tests from other locations look consistently good and prove that the issue is a local one)? Why only us? There is only one proper answer to the question:

The problem is NOT located at the monitoring location itself but on the way our data takes to that location – on the route.

To prove this we can run a traceroute on port 80 (priority!) from the monitoring location to our data center (if the monitoring tool provides such a feature).
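If the tool does not offer such a trace, a rough equivalent can be run from any Linux host close to the monitoring location; a minimal sketch, assuming the standard traceroute utility with TCP support is installed (TCP mode usually requires root privileges, and the target hostname is a placeholder):

import subprocess

TARGET = "www.example.com"   # placeholder for the monitored domain

# -T = use TCP SYN probes, so the path is traced the way real HTTP traffic is routed
# -p = destination port (80 for plain HTTP)
result = subprocess.run(
    ["traceroute", "-T", "-p", "80", TARGET],
    capture_output=True, text=True, timeout=120,
)
print(result.stdout)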

The result of the traces confirms the suspicion. One route went nuts and takes a journey across many carrier networks.

Example of one good location:

Tracing route to 194.xx.xxx.xxx on port 80
Over a maximum of 30 hops.
1 2 ms 2 ms 2 ms 139.4.34.xxx [Dortmund.de.ALTER.NET]
2 2 ms 2 ms 2 ms 149.227.16.13 [so-2-3-0.CR1.DTM1.ALTER.NET]
3 6 ms 6 ms 15 ms 149.227.17.34 [so-1-3-0.XT2.FFT1.ALTER.NET]
4 6 ms 6 ms 6 ms 146.188.6.106 [xe-11-0-0.BR1.FFT1.ALTER.NET]
5 6 ms 6 ms 6 ms 4.68.63.77 [xe-9-2-2.edge4.frankfurt1.level3.net]
6 7 ms 6 ms 6 ms 4.69.154.135 [ae-3-80.edge3.Frankfurt1.Level3.net]
7 25 ms 6 ms 6 ms 212.xxx.xxx.xxx [domain-G.edge3.Frankfurt1.Level3.net]
8 11 ms 11 ms 11 ms 194.xx.xxx.xxx [final.destination.server.net]
9 Destination Reached in 12 ms. Connection established to 194.xx.xxx.xxx
Trace Complete.

The trace from the Munich monitoring location looks like this:

Tracing route to 194.xx.xxx.xxx on port 80
Over a maximum of 30 hops.
1 2 ms 4 ms 2 ms 213.20.169.162 [xmws-mnch-de01-vlan-209.nw.mediaways.net]
2 2 ms 2 ms 48 ms 195.71.233.250 [rmwc-mnch-de01-ge-0-3-0-0.nw.mediaways.net]
3 12 ms 14 ms 12 ms 195.71.212.185 [rmwc-brln-de01-so-4-0-0-0.nw.mediaways.net]
4 12 ms 12 ms 12 ms 62.53.167.230 [rmwc-brln-de02-xe-0-1-0-0.nw.mediaways.net]
5 16 ms 16 ms 16 ms 195.71.212.133 [rmwc-frnk-de02-chan-4-0.nw.mediaways.net]
6 16 ms 20 ms 51 ms 62.52.50.162 [xmws-frnk-de16-chan-0.nw.mediaways.net]
7 16 ms 20 ms 25 ms 78.xxx.xx.xx [tge4-4.fra02-1.de.domain.net]
8 23 ms 23 ms 25 ms 78.xxx.xx.xx
9 23 ms 21 ms 21 ms 78.xxx.xx.xx [6-16.core1.Amsterdam.domain.net]
10 31 ms 34 ms 29 ms 78.xxx.xx.xx [12-2.edge1.Amsterdam2.domain.net]
11 40 ms 29 ms 29 ms 78.xxx.xx.xx [5-76.edge1.Amsterdam5.domain.net]
12 22 ms 24 ms 23 ms 78.xxx.xx.xx
13 24 ms 24 ms 25 ms 194.xx.xxx.xxx [final.destination.server.net]
14 Destination Reached in 22 ms. Connection established to 91.xxx.xxx.xx
Trace Complete.

From Munich the request goes to Berlin, then via Frankfurt to Amsterdam, to one of the two data centers.
No question that this route must have an impact on the speed at which content is delivered. The timings between the hops speak for themselves.

With that knowledge you can take immediate action or do some further validation.
Is one monitoring location really an issue for real users? (Synthetic monitoring locations are never real users – even if some vendors try to tell that story.) An issue like this must be validated with regard to the real impact on your users,
whether with tests from various end-user locations or with real-user data (if real user monitoring is in place).

The delay we see is only 100 ms – and only for the root element. How big is the delay for all the files delivered from your DC to the client?
The more files delivered, the higher the impact the long route has on performance.
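A rough back-of-the-envelope illustration (the numbers below are assumptions for illustration, not measurements from this case):

extra_delay_per_object_s = 0.100   # roughly the extra delay seen on the root element
objects_from_dc = 15               # assumed number of objects served from your own data center
parallel_connections = 6           # assumed browser connections per host

# Crude bounds: fully serial downloads vs. perfectly parallel downloads
serial_penalty = extra_delay_per_object_s * objects_from_dc
parallel_penalty = serial_penalty / parallel_connections
print(f"Added delay per page view: roughly {parallel_penalty:.2f} s to {serial_penalty:.2f} s")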

To assess the severity of the issue we used synthetic tests run from real end-user locations (not real user monitoring).

Again we focus on the root object only, since this one is always delivered from our data center.

Average load time over one week of testing from an end-user perspective, using synthetic tests from end-user machines (Last Mile).

The root object has a pretty high failure rate.

Failure types and counts when delivering only the base element of the web page

Looking at the errors we see socket receive timeouts (the connection gets dropped because no response was received within the timeout limits defined by the server or client), socket connection resets (a bad thing) and a few timeouts.

By checking who is affected by the socket receive timeouts, it is easy to see that the vast majority are users on the Telefonica network – the provider that hosts the monitoring agent:

Further filtering shows that the domain is less available for users on the ISP Telefonica, and the load time of the base object is significantly slower.

All ISPs excluding Telefonica/Alice:
Load-Time = 0.557 Seconds

Telefonica / Alice only:
Load-Time = 0.815 Seconds

This supports the thesis that there is a good chance of an issue affecting almost all users of the ISP Telefonica (and local providers using their network).
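A comparison like this can be reproduced from the raw last-mile measurements with a simple filter; a minimal sketch, assuming a CSV export with hypothetical columns isp, load_time_s and error (column names are placeholders, not from the actual tool):

import pandas as pd

df = pd.read_csv("last_mile_measurements.csv")   # columns: isp, load_time_s, error

telefonica = df["isp"].str.contains("Telefonica|Alice", case=False, na=False)

for label, subset in (("Telefonica / Alice", df[telefonica]),
                      ("All other ISPs", df[~telefonica])):
    ok = subset["error"].isna()                  # successful measurements only
    print(f"{label}: avg load time {subset.loc[ok, 'load_time_s'].mean():.3f} s, "
          f"failure rate {100 * (1 - ok.mean()):.1f} %")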

Action is necessary.

What kind of action can be taken is described in an article published by O’Reilly in 2002 (maybe something has changed since then – but the principles should be clear):

http://oreilly.com/catalog/bgp/chapter/ch06.html

Result:
After setting different priorities for the routing of the outgoing traffic on November 6th, the performance was much more consistent.

Performance on the first day after changing the routing tables for outgoing traffic.

Methodology

Given that web performance today has an enormous impact on the success of a website, most readers of this blog entry know the various schools and rules for optimizing the speed of a web page.
Steve Souders, a well-known and respected leader in web performance, has published books on how to optimize web pages.
Mainly these optimizations take place in the area known as the presentation layer – or frontend.

Of course, before performance becomes important, the user must be able to reach your content in the first place.

To control availability and to back up your performance optimizations, you will have a monitoring tool in place that lets you see whether you are performing well and are up and running, say, 99.9% of the time for your customers.

Depending on the maturity of the web ops or performance team, you can have various tactics to monitor both requirements (availability and performance).

One common and valid strategy for monitoring availability is the “high-frequency request” to your web servers. Availability testing is mainly done with a synthetically triggered request to one or more specific domains or IP addresses.

The request can be done by “ping” or by “GET”. You should be aware of two important things when using a ping to check for availability:
1.    Ping packets get a lower priority in networks (a legacy of the ping-of-death days)
2.    Ping responses can’t tell you whether your application is running

If you want to know whether the response of your web server is correct, you must use at least a GET request to the base element of the web page (maybe without parsing it any further), with the option to check the server response for specific words (like: must exist: “Welcome”; must not exist: “Sorry, we have a problem”).

With the “GET http://” method we get a lot of answers:
1.    The server is up and running (with common, port-80-prioritized HTTP traffic)
2.    The base content is delivered correctly
3.    The time in which the base content is delivered completely

Based on the location and capabilities of your “synthetic monitor” you might get some more information by default (a sketch of how such timings can be captured follows the list below):
1.    Geographic issues with latency
2.    DNS time
3.    Connect time
4.    Time from connect to 1st Byte delivery
5.    Content download time
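Where the synthetic monitor exposes these component timings, they map roughly to what libcurl reports for a single request; a minimal sketch using the pycurl bindings (the URL is a placeholder):

import pycurl
from io import BytesIO

URL = "http://www.example.com/"   # placeholder for the monitored base page

buffer = BytesIO()
c = pycurl.Curl()
c.setopt(pycurl.URL, URL)
c.setopt(pycurl.WRITEDATA, buffer)
c.perform()

dns     = c.getinfo(pycurl.NAMELOOKUP_TIME)                                      # DNS time
connect = c.getinfo(pycurl.CONNECT_TIME) - dns                                   # TCP connect time
ttfb    = c.getinfo(pycurl.STARTTRANSFER_TIME) - c.getinfo(pycurl.CONNECT_TIME)  # connect to first byte
total   = c.getinfo(pycurl.TOTAL_TIME)
content = total - c.getinfo(pycurl.STARTTRANSFER_TIME)                           # content download time
c.close()

print(f"dns={dns:.3f}s connect={connect:.3f}s ttfb={ttfb:.3f}s content={content:.3f}s total={total:.3f}s")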

So with a high-frequency monitor set up to GET the root element of your web page on port 80, we are in pretty good shape for “monitoring our offering’s availability”.
Caution: this does not apply to Ajax- or RIA-driven web applications. This methodology also will not tell us whether the web page as seen by the user loaded completely – or how long it took to download all of the content.

With this kind of monitoring you will be able to see spikes in your data center performance – you might be able to find patterns and/or temporary spikes. You will be able to determine the impact of changes in data center architecture, and much more. But be aware: this only covers the root element of the page. It is more than nothing – and should be done by everyone.

If you use tools like Nagios placed in your own data center, this monitoring method might not tell you the full truth. The same applies if you use an external monitoring tool with limited network locations.

If you monitor from inside your data center, you might not be able to test your ISP’s capability to deliver content. If you have more than one uplink to the internet from your ISP, you will not be able to see whether all uplinks are working well.
With external monitoring tools that have a small network, you might miss the performance timings in your key traffic geographies as well as the uplink information.