Browser Statistics

Which is the fastest browser? Which is the most used browser?
Attached is a report with browser data from 178 portals – cross-industry, worldwide.

The averages reflect more than 4,400,000,000 page views.

LoadTimeBrowsers (XLS Sheet)

The data was captured using Real User Monitoring.
Time frame: 20th June 2013 – 18th July 2013

Learn more about the technology used here.

Explainer:
Page Views: Number of pages viewed by real users with that specific web client
Redirect Time: Not really relevant for browser stats, but interesting that redirects seem to be used often by page providers
DNS Lookup Time: Time it takes to find an IP address for a domain name
Connection Time: Time it takes to connect to the server hosting the page
1st Byte Time: Time it takes to deliver the first byte of the HTML code of the page
PR Time: Perceived render time = time it takes to render the "above the fold" area in the client
Load Time: Time until the onload event fires
Response Time: Time from clicking a link until the target page's onload event fires
Apdex: An index of user satisfaction with page load time. Explained here
Abandonment rate: Percentage of page views where the user leaves the page before the onload event has fired*

*Not necessarily lost to the competition… they might have found the link they wanted to click, or logged in, or… A small sketch of how Apdex and abandonment rate can be computed follows below.
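To make the two derived metrics concrete, here is a minimal Python sketch with made-up sample data (the Apdex threshold T and all numbers are assumptions for illustration, not values from the report):

def apdex(load_times_s, t=2.0):
    # Standard Apdex: satisfied <= T, tolerating <= 4T, everything else frustrated.
    satisfied = sum(1 for x in load_times_s if x <= t)
    tolerating = sum(1 for x in load_times_s if t < x <= 4 * t)
    return (satisfied + tolerating / 2) / len(load_times_s)

def abandonment_rate(page_views, onload_events):
    # Share of page views where the onload event never fired.
    return 1 - onload_events / page_views

# Hypothetical numbers, for illustration only
samples = [0.8, 1.4, 2.6, 3.1, 9.5, 1.1, 2.2]
print(f"Apdex (T=2s): {apdex(samples):.2f}")
print(f"Abandonment rate: {abandonment_rate(page_views=1000, onload_events=930):.1%}")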

Find and fix issues on the data route

Introduction

In my work in the Application Performance Management business I often see very interesting things.

Some things are pretty annoying (the performance of ad services) because they happen every day/hour/minute/second, and it seems pointless to fight for better quality here.
It seems as if the ad market is aware that it is responsible for keeping the internet as we know it up and running by sponsoring "free content".

Some things are pretty rare but happen from time to time.
One of these was recently brought to my attention by a cool NOC guy who asked me for help to find out what was currently happening with the performance of some of their offerings – and why.

I dug a bit into the data, found the issue and was able to propose a solution which partly resolved it. After seeing that performance got better I felt inspired enough to share this knowledge with you #webperf folks.

Here is the case – and how it got fixed. Enjoy reading.

Enhanced Data Analysis – finding Issues outside of your data center

You have a high-frequency monitor set up from various fixed locations (see the methodology below).
With a GET request you fetch the base element of your internet offering (or more, but only from one domain).
Let's say: every 5 minutes you monitor your domain(s) from 8 key locations with GET requests (and only the base page!).
The monitoring agents are located at backbone locations of major network carriers. This excludes bandwidth as a limiting factor and gives a clean view of the system's health.

To put your own performance into perspective you also monitor one of your main competitors with the same method.

After baselining performance you can set up alerts and notifications, leave the monitor running as it is, and check metrics only in case of alerts.
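A minimal sketch of such a setup in Python – the domains and the location label are placeholders, and a real agent would of course run at the backbone location itself and feed a time-series store instead of stdout:

import time
import requests

DOMAINS = ["https://www.example.com/", "https://www.competitor-example.com/"]  # hypothetical
LOCATION = "munich-backbone-1"  # hypothetical agent label
INTERVAL_S = 300  # every 5 minutes

def measure(url):
    # Fetch only the base page and return (status, elapsed seconds).
    start = time.monotonic()
    try:
        resp = requests.get(url, timeout=30)
        return resp.status_code, time.monotonic() - start
    except requests.RequestException as exc:
        return f"error: {exc.__class__.__name__}", time.monotonic() - start

while True:
    for url in DOMAINS:
        status, seconds = measure(url)
        print(f"{time.strftime('%Y-%m-%dT%H:%M:%S')} {LOCATION} {url} {status} {seconds:.3f}s")
    time.sleep(INTERVAL_S)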

All of a sudden – without any changes anywhere – you get a daily alert. You check the monitor and it looks like this for some of your monitored key pages' base elements.

Figure: Scatter chart showing spikes during daytime

This chart represents a week of monitoring, with every single measurement displayed as a dot. The two colors are for the two domains tested.
You can clearly see a daily pattern of performance decrease, and spikes during that time span too.

It is also visible that the baseline (best performance) is not affected. So you are not really getting slower overall, but you do have some issues.

To drill down to the cause of the spikes you should remember your test setup. You might first want to know whether this is a general issue or a localized one.

Figure: Scatter chart by monitoring location – an indicator of a local issue

Both tests are now shown in a different way: not by test, but by monitoring location.
We can clearly see a lot of spikes at every location, but the main interest should be the measurements taken from the Munich node.

Back to the first view – but filtered to only the measurements taken from the Munich monitoring location – it looks like this.

Figure: Scatter chart from the problematic location only – the baseline is clearly lost

This is a very nice chart, since it shows a very clear pattern – even if that pattern is a bad one. Losing the baseline completely and seeing performance degrade by more than 100% is not a good thing.
Even more interesting: this behavior is focused on only one monitoring location.
There are now three options:
1.    Something outside my control is causing an issue
2.    Something is wrong with the monitoring location
3.    Something might cause only this location to hit a bad infrastructure node in your data center (e.g. a load balancer)

To check whether something is wrong with the monitoring location, it might be a good idea to simply chart the competitor's site, which we also have in monitoring, from the same location only.

Figure: Competitor chart – same location, same settings – not showing the same performance behavior

Clearly we do not see the same curve as for our own content. The problem at this one monitoring location is with us – not the backbone.

Before speculating further about the cause, we should have a look at the other data the monitor provides to exclude issues with specific components of the network communication.

The breakdown by network component might look like this:

Figure: Chart of the network components showing spikes in every component – not a very common sight

We see the decrease of performance in every single component. Most impacted are the content time and the DNS time. Higher latency and longer content times might indicate some kind of packet loss. But why does it only occur during daytime and not at night? Maybe we are somehow impacted by general daily internet traffic. But why only at this one location and not everywhere (the same tests from a different location look good and consistent and prove that the issue is a local one)? Why only us? There is only one proper answer to the question:

The problem is NOT located at the location itself but on the way our data takes to that location – on the route.

To prove this we can run a traceroute on port 80 (so the probes get HTTP priority!) from the monitoring location to our data center (if the monitoring tool provides a feature like that).
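If the tool does not offer it, a hedged do-it-yourself variant is to call the Linux traceroute binary with TCP SYN probes against port 80 (the -T option usually needs root privileges; the hostname below is a placeholder):

import subprocess

TARGET = "www.example.com"  # placeholder for the data-center host

# TCP traceroute to port 80 so the probes are treated like HTTP traffic, not ICMP.
result = subprocess.run(
    ["traceroute", "-T", "-p", "80", TARGET],
    capture_output=True, text=True, timeout=120,
)
print(result.stdout)
if result.returncode != 0:
    print(result.stderr)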

The result of the traces confirms the suspicion. One route went nuts and takes a journey across many carrier networks.

Example of one good location:

Tracing route to 194.xx.xxx.xxx on port 80
Over a maximum of 30 hops.
1 2 ms 2 ms 2 ms 139.4.34.xxx [Dortmund.de.ALTER.NET]
2 2 ms 2 ms 2 ms 149.227.16.13 [so-2-3-0.CR1.DTM1.ALTER.NET]
3 6 ms 6 ms 15 ms 149.227.17.34 [so-1-3-0.XT2.FFT1.ALTER.NET]
4 6 ms 6 ms 6 ms 146.188.6.106 [xe-11-0-0.BR1.FFT1.ALTER.NET]
5 6 ms 6 ms 6 ms 4.68.63.77 [xe-9-2-2.edge4.frankfurt1.level3.net]
6 7 ms 6 ms 6 ms 4.69.154.135 [ae-3-80.edge3.Frankfurt1.Level3.net]
7 25 ms 6 ms 6 ms 212.xxx.xxx.xxx [domain-G.edge3.Frankfurt1.Level3.net]
8 11 ms 11 ms 11 ms 194.xx.xxx.xxx [final.destination.server.net]
9 Destination Reached in 12 ms. Connection established to 194.xx.xxx.xxx
Trace Complete.

The Trace from the Munich monitoring location looks like this.

Tracing route to 194.xx.xxx.xxx on port 80
Over a maximum of 30 hops.
1 2 ms 4 ms 2 ms 213.20.169.162 [xmws-mnch-de01-vlan-209.nw.mediaways.net]
2 2 ms 2 ms 48 ms 195.71.233.250 [rmwc-mnch-de01-ge-0-3-0-0.nw.mediaways.net]
3 12 ms 14 ms 12 ms 195.71.212.185 [rmwc-brln-de01-so-4-0-0-0.nw.mediaways.net]
4 12 ms 12 ms 12 ms 62.53.167.230 [rmwc-brln-de02-xe-0-1-0-0.nw.mediaways.net]
5 16 ms 16 ms 16 ms 195.71.212.133 [rmwc-frnk-de02-chan-4-0.nw.mediaways.net]
6 16 ms 20 ms 51 ms 62.52.50.162 [xmws-frnk-de16-chan-0.nw.mediaways.net]
7 16 ms 20 ms 25 ms 78.xxx.xx.xx [tge4-4.fra02-1.de.domain.net]
8 23 ms 23 ms 25 ms 78.xxx.xx.xx
9 23 ms 21 ms 21 ms 78.xxx.xx.xx [6-16.core1.Amsterdam..domain.net]
10 31 ms 34 ms 29 ms 78.xxx.xx.xx [12-2.edge1.Amsterdam2. domain.net]
11 40 ms 29 ms 29 ms 78.xxx.xx.xx [5-76.edge1.Amsterdam5. domain.net]
12 22 ms 24 ms 23 ms 78.xxx.xx.xx
13 24 ms 24 ms 25 ms 194.xx.xxx.xxx [final.destination.server.net]
14 Destination Reached in 22 ms. Connection established to 91.xxx.xxx.xx
Trace Complete.

From Munich the request goes to Berlin, then via Frankfurt to Amsterdam, to one of the two data centers.
No question that this route must have an impact on the speed at which content is delivered. The timings between the hops speak for themselves.

With that knowledge you can take immediate action or do some further validation.
Is one monitoring location really an issue for real users? (Synthetic monitoring locations are never real users – even if some vendors try to tell that story.) An issue like this must be validated with regard to the real impact on your users –
whether with tests from various end-user locations or with real user data (if real user monitoring is in place).

The delay we see is only 100 ms – and only for the root element. How big is the delay for all the files delivered from your data center to the client?
The more files delivered, the higher the impact the long route has on performance.
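A quick back-of-the-envelope sketch with purely hypothetical numbers shows why this multiplies – every round trip to the data center now carries the extra route latency:

extra_rtt_s = 0.012            # extra latency per round trip (hypothetical, ~12 ms)
requests_from_dc = 40          # objects served from your own data center (hypothetical)
round_trips_per_request = 2    # handshake + GET, assuming limited connection reuse

extra_per_page = extra_rtt_s * requests_from_dc * round_trips_per_request
print(f"Extra load time caused by the detour: ~{extra_per_page:.2f} s per page view")
# -> ~0.96 s on top of the 100 ms seen for the root element alone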

To assess the severity of the issue we used synthetic tests run from real end-user locations (not real user monitoring).

Again we focus on the root object only, since this one is always delivered from our data center.

Figure: Average load time over one week of testing from an end-user perspective, using synthetic tests from end-user machines (Last Mile)

The root object has a pretty high failure rate.

Figure: Failure types and counts when delivering only the base element of the web page

Looking at the errors we see socket receive timeouts (the connection gets dropped since no response was received within the timeout limits defined by server or client), socket connection resets (a bad thing) and a few timeouts.

By checking who is affected by the socket receive timeouts it is easy to see that the vast majority are users from the Telefonica network – the provider where the monitoring agent is hosted:

Further filtering shows that the domain is less available for users on the ISP Telefonica and that the load time of the base object is significantly slower.

All ISPs excluding Telefonica/Alice:
Load time = 0.557 seconds

Telefonica/Alice only:
Load time = 0.815 seconds

This supports the thesis that there is a good chance of an issue which affects almost all users on the ISP Telefonica (and the local providers using their network).
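The filtering itself is simple; a minimal sketch (with illustrative records, not the real measurement data) that groups last-mile samples into Telefonica/Alice versus the rest and compares the medians might look like this:

from statistics import median
from collections import defaultdict

# Each record: (isp, load_time_seconds) - illustrative values only
measurements = [
    ("Telefonica", 0.81), ("Telefonica", 0.84), ("Alice", 0.79),
    ("DTAG", 0.55), ("Vodafone", 0.57), ("DTAG", 0.54),
]

by_group = defaultdict(list)
for isp, t in measurements:
    key = "Telefonica/Alice" if isp in ("Telefonica", "Alice") else "Other ISPs"
    by_group[key].append(t)

for group, times in by_group.items():
    print(f"{group}: median load time {median(times):.3f} s over {len(times)} samples")
# With the real numbers above: 0.815 s vs. 0.557 s, i.e. roughly 46% slower.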

Action is necessary.

What kind of action you can take is described in an article published by O'Reilly in 2002 (maybe something has changed since then, but the principles should be clear):

http://oreilly.com/catalog/bgp/chapter/ch06.html

Result:
After changing the priorities for the routing of outgoing traffic on November 6th, performance was much more consistent.

Figure: Performance on the first day after changing the routing tables for outgoing traffic

Methodology

Given that web performance today has an enormous impact on the success of a website, most readers of this blog entry know the various schools and rules for optimizing the speed of a web page.
Steve Souders, a well-known and respected leader in web performance, has published books on how to optimize web pages.
Mainly these optimizations take place in the area known as the presentation layer – or frontend.

Of course, before performance becomes important, the user must be able to reach your content in the first place.

To control availability and to back up your performance optimizations you will have a monitoring tool in place that lets you see whether you are performing well and whether you are up and running, say, 99.9% of the time for your customers.

Depending on the maturity of the web ops or performance team, there are various tactics to monitor both requirements (availability and performance).

One common and valid strategy for monitoring availability is the "high-frequency request" to your web servers. Availability testing is mainly done by synthetically triggered requests to one or more specific domains or IP addresses.

The request can be done by "ping" or by "GET". You should be aware of two important things when using a ping to check for availability:
1.    Ping packets get a lower priority in networks (a legacy of the ping-of-death days)
2.    Ping responses can't tell you whether your application is running

If you want to know whether the response of your web server is correct you must use at least a GET request to the base element of the web page (maybe without parsing it any further), with the option to check the server response for matches of specific words (like: must exist: "Welcome", must not exist: "Sorry, we have a problem").
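A hedged sketch of such a check in Python – the URL and keywords are placeholders, and a real monitor would of course run this on a schedule from several locations:

import time
import requests

URL = "http://www.example.com/"       # base element only, hypothetical
MUST_EXIST = "Welcome"
MUST_NOT_EXIST = "Sorry, we have a problem"

start = time.monotonic()
try:
    resp = requests.get(URL, timeout=30)
    elapsed = time.monotonic() - start
    ok = (
        resp.status_code == 200
        and MUST_EXIST in resp.text
        and MUST_NOT_EXIST not in resp.text
    )
    print(f"available={ok} status={resp.status_code} time={elapsed:.3f}s")
except requests.RequestException as exc:
    print(f"available=False error={exc.__class__.__name__} time={time.monotonic() - start:.3f}s")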

The "GET http://" method gives us a lot of answers:
1.    The server is up and running (with common, port-80-prioritized HTTP traffic)
2.    The base content is delivered correctly
3.    The time in which the base content is delivered completely

Based on the location and capabilities of your “synthetic monitor” you might get some more information by default:
1.    Geographic issues with latency
2.    DNS time
3.    Connect time
4.    Time from connect to 1st Byte delivery
5.    Content download time

So with a high-frequency monitor set up doing a GET on port 80 for the root element of your web page, we are pretty good at "monitoring our offering's availability".
Caution: this does not apply to Ajax- or RIA-driven web applications. This methodology will also not tell us whether the web page as seen by the user loaded completely – or how long it took to download all the content.

With this kind of monitoring you will be able to see spikes in your data center performance – you might be able to find patterns and/or temporary spikes. You will be able to determine the impact of changes in data center architecture and much more. But be aware: only for the root element of the page. This is more than nothing – and should be done by everyone.

If you use tools like Nagios placed in your own data center you might not get the full truth from this monitoring method. The same applies if you use an external monitoring tool with a limited set of network locations.

If you monitor from inside your data center you might not be able to test your ISP's capability to deliver content. If you have more than one uplink to the internet via your ISP you will not be able to see whether all uplinks are working well.
With an external monitoring tool that has a small network you might miss the performance timings in your key traffic geographies as well as the uplink information.

An Ad is an Ad is an Ad is an Ad

Did you ever wonder why your ranking in Google is bad while your engineers tell you that all systems run stably with light-speed performance?

And did you ever wonder why your total page views differ so much from your ad impressions?

Yes? Then you might be a victim of being "overADded". Really. Haha. You might have too many rubbish ad-delivery instances serving content to your page. Or far too many redirects, or your marketer doesn't care about the quality of the ad networks they work with.

Being overADded is a very bad thing. It makes all the effort and energy you put into a fast page pointless. Your SEM and SEO programs become equal to throwing money out of the window. It makes users leave the page before all the page logic has run (and the guys with the freaking good affiliate program get really pissed when the logic that affects them isn't working).

Am I overADded?

What are the symptoms of being overADded?

Diagnosing overADism is pretty simple. To exclude really bad issues (like your DevOps doing nothing but playing table soccer) you should baseline your staff's work.

Let's start with your page without ads (and any other third-party content) on it, to get a baseline of what "speed it could be":

Figure: Page statistics with the base ad script blocked

So – this is a pretty common statistic (using various host names for image or script servers).

In this specific case I simply blocked one single piece of content: the script of the marketer that handles the ads placed in the page's ad spaces.
Fair enough, the page is a bit heavyweight, but – honestly – the page is built for a business which needs nice and fancy graphics, so one can turn a blind eye to that fatness.

Now the same statistics for the same page including the base ad script (loaded under exactly the same conditions):

Figure: In total, 38 more hosts are requested for a single page when the base ad script is included

There is a freakishly high number of hosts apparently initiated by the base ad script: redirects to ad networks redirecting to ad bidders redirecting to ad geolocators redirecting to yield managers redirecting to (extend at your own discretion)… the ad itself. And all this – spiced with rocket science – in real time…

OK, OK. Not in real time… there is a gap of 12 seconds. And 12 seconds is an aeon from a web performance perspective.

One might argue: this was just one measurement from your local perspective! And I have to admit that the argument is pretty valid.

But let's check the ongoing difference in page load time, served to average bandwidths in the main business country of this page. Will it look different from the momentary snapshot above, or will it look worse?

Figure: Every point represents 20 measurements

The measurements were taken from all over one country, on various real desktop machines (or laptops), over a time frame of one week. The yellow line represents the measurements excluding the base ad script, the blue line the overADded page. And – what a shame – the average says ~12 seconds of delay is the "common" time lost on ads.

But I bet some of you notice the spikes, and with them that the difference in load time sometimes spikes for hours, and that the load time within those hours is way over 12 seconds. (Luckily it is sometimes much faster, so let's hope the search engine bots are active at those times.)

Unfortunately the spikes appear during your main business hours – when most users hit your page.

You lose business exactly when most traffic is running on your page. Maybe users even get bored waiting for your page (and you know you are not the only one fighting in this area of interest).

But you don't care – it has always run this way. Your staff doesn't care ("not my business"). You do not respond to emails. You are too rich – are you? If you don't care, might your stakeholders want to know how much money they lose?

Don't be arrogant – take care of your business. It is your staff who get blamed when a page loads slowly. It is your money thrown out of the window on "make us faster" projects.

Monitor your business – even more so when your business is based on other people's ability to serve. Where the heck are the Business Ops?

The Risks within Publishers' Business

In the following I will try to explain the risk which publishers' business is confronted with – or at least the risk of having no influence over that business. Different from other business models (such as Software as a Service, classifieds and classic retail e-business), publishers' business mainly does not depend on an application they own themselves.

Well, let me start with a myth of the informative internet: the internet is free of cost.
That is far from true. Whether you want to publish information about your life on whatever platform or you want to consume information – everything has its price.

The places where your published opinions are consumed are very expensive to keep alive – whether it is Twitter, Facebook or Google Plus. Even if it is not your intention to make money, it is the intention of the places where you post your "content" to make money, and posting something takes time, and time = money (OK, OK – sometimes this time can also be fun).

Usually the "hosters" make money with advertising. Advertising also drives the business of "free" consumable information, such as that provided by news portals. Not only is it very expensive to host such news content, all the journalists like to get some bread between their teeth too. And maybe the chief of the news portal would like some excellent wine in the evening.

Making money mainly with advertising means, in today's publishing space, putting all the power of the core business into the hands of other companies (3rd parties). Nearly 0% of the ad-driven business is in the publisher's own hands – from the perspective of "controlling" it.
The same applies, vice versa, to those who want to advertise something, as it does to those who bring both parties together (marketers).

Advertising is a multi-billion-dollar market and one of – if not THE – main drivers of the internet.

North of 90% of Google's money comes from the advertising market. And where there is a multi-billion-dollar market there is usually some room for mud.

What do I mean by "control" (not to be confused with "configure")?

If you are a publisher – can you currently control whether your business is running as expected, and whether you get the income from your web page as it is "configured"? Hard to say, isn't it? Especially when everything – including the configuration – happens on 3rd-party services. You might still answer: yes – sure I can.

To be honest with you – the only thing I agree with is that one can control the part of the content that one produces and which is served by one's own excellent IT services. This content needs to be reachable and perform well. Without this root requirement one will clearly do no online business at all. Shame on those who have no control over this either. But this is the part of your business which costs money, with the intention of getting more money in return.

Most publishers cannot control their business – because they do not own the business. The business is driven – from the publisher's perspective – entirely by 3rd parties. (In cases where you have powerful monitoring in place I am wrong, of course.)

3rd Party:

Big news portals and other publishing web pages need some parameters to estimate how much money to expect from the published content. Long-term trends are used to build a strong strategy. Without these estimates, future investments cannot be planned, nor can they plan how to leverage the reach of their offerings. Simply because reach is the key to money for publishers.

To be able to do some trending, the offerings usually contain one or more tracking pixels or analytics tools telling them how many page impressions/unique visitors/visits they currently have or had in the past. Nearly all of these "tracking methods" are 3rd party.

In addition, in some countries there are (private) institutions which additionally count publishers' reach and publish these numbers for public use (in Germany it is the IVW; in the US, correct me if I am wrong, the IAB). These tracking instances are 3rd party as well. Advertisers check web page ranking portals or these institutional sources to find space where it is worth spending money to run campaigns.

So what do we have so far: some content and various 3rd-party tracking sources (mainly 3 to 4 different analytics tools). And 3rd-party sources for the numbers.

You might not be astonished to hear that it is common knowledge that all of these numbers differ – maybe you know it from your own experience. There is a good reason for that: different tools = different counting technology. (You might want to read more about this here.) All tracking methods more or less fail to reflect reality. Things like wrong positioning of a tracking pixel can result in a large percentage of untracked users.

As mentioned above, publishers leave their business in the hands of others, which is good. There are specialists around who are able to manage and tie together those who want to spend money (advertisers) with you (publishers). Marketers are the specialists doing this kind of management. The main work of a marketer – for publishers – is to get lots of ad space rented by companies who want their campaign seen by your audience (generally speaking – there are some other factors at play here). This sounds easy, but there are so many extra rules which require absolutely dedicated knowledge.

Renting publishers' ad space costs advertisers money. If you are a publisher – we are now talking about your income.

Business Metrics:

Depending on the offerings of the marketers, the rent for your ad space is calculated based on:

  • Reach (size of the audience)
  • Frequency (number of views per individual)
  • Gross Rating Points (GRPs), a combination of the two factors above
  • Target Rating Points (TRPs), a combination of the two factors above plus other target-specific criteria
  • Impressions: how often an ad has been shown (currently – to lower costs – these metrics are being refined to prove that an ad at the bottom of a page has really been seen)
  • Cost per thousand (CPM)
  • Cost per percentage point (CPP)

To make it more complex, the CPM and CPP can be bundled into a "one-time price" with a "guaranteed" CPM or CPP. Like: we guarantee 800,000 unique visitors per day – and this costs you xx.xxx of whatever currency. Some "click-through" participation is mainly driven by affiliate business. I leave that out – otherwise it gets too complex for me.

So far no secret has been told. We all know the metrics, and we also know what makes publishers earn more money: it is the number of "counted" impressions. The higher this number, or the guaranteed impressions, the more money advertisers are willing to spend and the more money publishers produce with their web page.
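A tiny worked example with made-up numbers shows how directly counted impressions translate into money under a CPM deal, and what an undercount does to it:

cpm_rate = 5.0               # price per 1,000 counted impressions (hypothetical)
real_impressions = 10_000_000
undercount = 0.20            # 20% of impressions never reach the counter (hypothetical)

revenue_if_fully_counted = real_impressions / 1000 * cpm_rate
revenue_actually_billed = real_impressions * (1 - undercount) / 1000 * cpm_rate

print(f"Fully counted:          {revenue_if_fully_counted:,.0f}")
print(f"With a 20% undercount:  {revenue_actually_billed:,.0f}")
print(f"Lost to counting alone: {revenue_if_fully_counted - revenue_actually_billed:,.0f}")
# 50,000 vs. 40,000 -> 10,000 of whatever currency lost, purely to counting.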

Risk:

Up to this point, every metric named is still not under your control – even worse, all measurements are taken by others (3rd, 4th, 5th parties).

Who is the keeper of the metrics for "ad impressions", "page views", "reach"? All this data – which defines your money – is collected by others.

Usually these metrics are collected by the marketer or by the ad trackers, which are requested alongside an ad request (yes, two independent requests) or an ad delivery (again two independent requests). Besides that, special trackers are served with the ad itself (comScore, AdClear…). By the way, none of the trackers can prove that an ad has really been delivered – but that's a different story. As long as it is counted as delivered, the publisher should be fine.

These 4th- and 5th-party metrics – all collected independently – always differ from your other 3rd-party metrics (3rd-party tracking).

This is common knowledge in the online marketing space. Because of all these metric variations, marketers often add a margin of 10% on top of the publisher's earnings to avoid discussions about the "3-7%" discrepancy between tracking pixels. 10% can be a lot of money – depending on your coverage.

BUT what if the difference is north of 20%?

What if your ad impressions (counted by one 3rd party) are way below the "page impressions" counted by a different 3rd party? And the marketer says: sorry – you can no longer deliver the "guaranteed" traffic – your ad space now has to be offered much cheaper – because metric xyz says so.

OUCH! This is not good! Business at risk? No, some business is already lost – irreversibly. Next month the cards will be shuffled again.

Believe me when I tell you that I have seen very annoying things along the ad delivery chain – far more often than expected (I called this mud earlier).

It might be worth spending some thought on this and reconsidering whether it is worth putting an instance in place to control that business. Simply because all the tracking instances calculate the business (yours) based on requests and responses. They do not reflect their own availability or performance at all, for whatever reason.

Unavailability and performance issues happen every single day. At this very moment, while I write these lines, one or two ad servers are not delivering ads, two or three ad trackers are rejecting connection attempts, or FF 13.0.1 is throwing a JavaScript exception instead of requesting the JavaScript redirect of an ad broker. This is the current state.
How often have you heard about high-profile service outages lately (Amazon Cloud or Facebook, for example)? One had better not believe or trust that the companies serving spare parts of the ad delivery chain do a better job. And there are a freakish number of companies trying to get their piece of the multi-billion-dollar cake (geo-targeters, affiliates, real-time bidders, re-targeters, behavior targeters, ad storage, ad networks, ad exchanges, yield optimization…), even if it is only some new real-time something.

Without finger-pointing – this is fine as it is – where there is money there is movement. This all keeps the internet "free" for users.

Marketers in charge of the publisher's ad space can only work with metrics delivered by their own ad servers. There is no way to control the other parts of the delivery chain.

Now, after all that has been written: do you still think publishers are able to control their business?

What is necessary to control the business? Or at least to tighten estimates and be aware of business at risk before it is too late?

First of all: it is only possible to control a business when it is measured. Therefore you need an instance which allows you to:

  • Measure the ability of the ad trackers to get called by EVERY browser (often a JavaScript dysfunction does not fire the count request to the ad counters)
  • Measure the ability of the ad trackers to receive requests AND to send a sensible response (503, 410 or 500 HTTP returns are no good – or, even worse, the SSL certificate is outdated)
  • Measure the complete ad delivery chain – starting with the reachability of your marketer's base JavaScript (in which the rules for handling the defined ad space are commonly set), to the ad servers where campaigns are stored, to the ad networks where other factors are added, to the final ad storage where the ad content lives
  • Measure the difference between successful ad deliveries and ad counts
  • It might also be interesting to know whether your tracking pixel – or the public ranking pixel – is delivered well; often these rankings are the first address for advertisers to check for reach
  • Measure the users affected by 3rd-party issues – that should help you stay "real"
  • Use a monitoring instance which supports the dynamic ad delivery chain (the closer you monitor to the end-user perspective the better) to reflect geo targets, content delivery networks of ad storage, etc.

On one single day of monitoring one of the most famous German news portals I counted 250 different hosts delivering content – excluding the host of the news portal itself. In total 85 different domains were requested, redirected to, or delivering content.

85 third-party domains – doing nothing but content enrichment and showing ads. And this is not an exceptional case.
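Counting this yourself is not hard; a minimal sketch that reads a HAR export of a page load (file name and first-party domain are placeholders) and counts the third-party hosts and domains:

import json
from urllib.parse import urlparse

HAR_FILE = "newsportal.har"                 # hypothetical export from dev tools or an agent
OWN_HOST_SUFFIX = "newsportal-example.de"   # hypothetical first-party domain

with open(HAR_FILE, encoding="utf-8") as f:
    har = json.load(f)

hosts = {urlparse(entry["request"]["url"]).hostname
         for entry in har["log"]["entries"]}
third_party_hosts = {h for h in hosts if h and not h.endswith(OWN_HOST_SUFFIX)}
# Rough "domain" = last two labels; good enough for a quick count.
third_party_domains = {".".join(h.split(".")[-2:]) for h in third_party_hosts}

print(f"{len(third_party_hosts)} third-party hosts, "
      f"{len(third_party_domains)} third-party domains")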

I think it is worthwhile and important for publishers to start getting control over their 3rd-party-delivered business.

Real Life Example (Bigpoint):

I have a real-life example of how a 3rd party's inability to serve content (a tracking pixel) led to bad press and bad statistics – and, even worse, decreasing ad revenue.

Bigpoint – one of the world's biggest browser game producers and hosters – makes part of its money from advertisements (multiple millions of dollars per year).

A very famous media news portal (Meedia.de) exclusively reported that Bigpoint had 22% fewer visits in May 2012 compared to April 2012! 22% fewer (sh.t, that's 12,000,000 visits less). These numbers are based on the "official" data published for public use by the IVW. The IVW is the number-one address in Germany to check the "worldwide coverage" of German publishers, tracked by a pixel which every single publisher in Germany knows as the "IVW pixel". These publicly available numbers are THE numbers for advertisers to plan and sell their campaigns – they are a trusted and respected source.

But let's check whether the 22% reflects the truth. Is there a chance that this number is simply wrong and based on wrong data?

I monitored the German portal of Bigpoint over the last 3 days from the "end-user perspective" – a true last-mile monitoring. A mass of monitoring instances (about 526 different locations within 3 days) called the de.bigpoint.com portal with a Firefox browser. All instances calling the page used exactly the same technology. And I was really a bit shocked by what I saw.

We can see up to an 800% increase in the response time of the IVW pixel in the evening hours – the hours when the gamers are online (explicitly filtered by host name). This is what I would call "drastic" (Google, for example, loads within 2 seconds measured from the last mile).

Figure: The graph shows averages – one point reflects 20 measurements. Very inconsistent performance, especially during the main traffic time of the Bigpoint portal. Source: Gomez APM

So the first observation is: part of the 22% might be due to the slow response time of the pixel. Users might have clicked on a link and left the page before the pixel was loaded. Just an idea – to verify it I would have to check its position in the waterfall chart or in Firebug. But let's check the second important metric, the availability of the IVW pixel.

Figure: Availability decreased down to 80% during the key traffic times. Hmm… Source: Gomez APM

In other words: the pixel currently loads much slower when there are huge numbers of users, PLUS a huge number of users are not counted at all because the IVW pixel is not reachable and has not been delivered (in this case the request timed out).

For sure this will result in a decrease of users in the publicly available statistics – and for sure the data will not reflect the truth.
I think this makes one look differently at the 22% decrease in traffic – right?

It is no secret that the IVW numbers are the numbers in Germany that set ad space prices. Bad numbers – less money to make. Business at risk? No – business lost, simply because of bad press and questionable online statistics from which your guaranteed traffic and reach are read.

Exactly this is the very current case the headline is about. A publisher's (Bigpoint's) business is at risk if they trust false data instead of verifying its trustworthiness.

Weighting Incident Impact using Web Statistics = Failure

Introduction:

This blog entry from the fabulous dynaTrace blog inspired me to write about an experience I had just a few days ago with a customer. Luckily I was able to guide the customer in the right direction to calculate the user impact of an incident correctly.

I bet you read this blog because you are a web performance professional and you take care of your own or someone else's web page performance and availability. This means you somehow measure performance.

Hopefully you use a tool that allows you to drill down to the object level, because it is necessary to see which piece of content on your web page is slowing down performance or making transactions fail during monitoring. This is pretty important – otherwise you might fail at weighting the user and business impact of incidents, because you are missing helpful base knowledge.

Why is that?

Given that you are using a count pixel for your web statistics, you must know when and how the statistics tool receives the appropriate information from the client. Statistics tools are usually hosted services (uh oh… cloud…) and send user information via a GET request shortly before onload or with/after the onload event.

To learn the count pixel logic and the load order of your web page I would recommend a tool like HttpWatch, Firebug, dynaTrace Ajax Edition – or, as mentioned above, the waterfall diagram from your monitoring tool.

Now you can easily figure out various important zones:

Figure: Be aware that the order of the zones might differ between vendors

  • Zone where incidents usually happen: the area of loading all objects necessary to get the "onload" event fired in the browser. If one of the objects fails to load in an appropriate time, the onload event will be delayed by that time.
  • Zone where you get counted: the area where your statistics tool receives the user information. Usually this happens via a parameterized GET request to the host of the statistics server. This can happen shortly before or shortly after the onload event.
  • Zone of things happening after onload: not much to say here if you have read the dynaTrace blog mentioned above.

Incident:

You become aware that an incident is happening on your web page. In the best case you get the information very early through an alert sent by your external synthetic monitoring tool. The other excellent option is to get the alert early from your internal passive monitoring tool.

If a very critical third party is causing the issue, your internal tools might not become aware of the incident at all.

There are of course several kinds of incidents – some impacting business more, some less. Some of them might be caused by your application or network environment and some might happen because of third-party tools.

The incidents I mainly see at Compuware customers are of the kind where the web page speed goes down. The time-consuming content is mostly located in the zone of incident. Unavailability happens too, but far less often.

The incidents are mainly provoked by connectivity issues (connections not being established at the first attempt, taking 3, 9 or 21 seconds) or by a long time to first byte – which means the content is requested by the client but the requested servers need some time to send it.

Another issue that we see more and more often is the time for the SSL handshake. This is caused by current browsers checking certificate status by obtaining the revocation status of the SSL key at its origin (this is called OCSP).

Whatever it is – the web page is or was slow and you want to find out what this means for the business.

Like in my customer’s case:

Figure: Incident – the page was available but load time exploded

The area where impact weighting fails:

Usually – after or during the incident – you would now check your statistics tool for page impressions and user behavior. The intention is to weight the incident by user and business impact:

  • Was my conversion rate affected?
  • How many users in total were affected?
  • Was the total number of conversions affected?

Unfortunately your statistics are not telling the truth at this moment. When you use these numbers, you are using numbers which are wrong and do not reflect the real user numbers.

The longer the page takes to load, the fewer users get recognized by the statistics tool.

The reason is pretty simple. During load-time incidents users leave the page well before the browser fires the onload event. Many users never reach the "zone of being counted". Or, depending on the statistics technology, they were counted without being able to use the page – this happens when the zone of being counted is located before the zone of incident. In both cases your statistics are simply wrong, whether false negative or false positive.

Figure: Monitoring pointed to a global problem. If you check your page only from one location (like your own workstation) you cannot tell whether the problem is local or global.

In the case I analyzed with my customer we monitored the problem shown above. A 3rd party was not able to deliver a piece of content which was absolutely necessary to load the page. It was awful, because that content blocked the rest of the page from loading.

Ask yourself – would you have waited 41 seconds on a half-loaded page where you were not able to do what you came to do? I bet the answer would be: no.

In most cases you would leave the page or abandon the load.

Leaving and abandonment explained:

  • The page issue was so bad that the application was not usable and the user closed the browser or went to a different page
  • The page was usable and the users found what they were looking for (clicked on a link, entered a search term and pressed search) – before the onload event was fired
  • The user hit the reload button or the F5 key because the page looked broken and they hoped to get the complete page by simply reloading it

Another, less severe, problem with the statistics is the fact that they fail to count every user anyway – whether because the GET request to the statistics server never happens or because users use tools to avoid being counted.

Do you know your users' behavior on your web page?

I guess you do – partly. What I mainly mean is: do you know whether you have fast- or slow-acting users?

Have you ever compared your server log files with the page impression numbers from your statistics tool? I bet you will find a gap of some size.

When I talk about server log files I mean the number of GET requests for the base page object or root object (jsp, php, asp, html and so forth).

The common user (like you and me) leaves a page as soon as they have found what they are looking for.

What I am trying to point out is: you must know the share of users who stay on the page until they have reached the zone of being recognized by the statistics.

Not only during incidents – you need to know this for the normal, everyday state.

If you have a knowledge gap here, your baseline for any calculation is wrong.

How to calculate the real impact?

The customer we did the incident analysis with uses browser-based real user monitoring that enables him to directly see the "abandonments" on his web page. The normal state is an abandonment rate of around 7% on average.

During the incident the load time exploded, and therefore far fewer users reached the zone of being counted – because they abandoned the rest of the load (for the reasons named above).

Figure: We can see the normal abandonment rate of 7% and the explosion of abandonment when the incident started

The customer's real-user monitoring in place is Gomez Real-User Monitoring – Browser. Technically it is a JavaScript tag implemented in the web page which grabs all metrics of the real user's experience directly from the browser.

The page load time reflects the moment the "onload" event happens. The abandonment rate is calculated from the number of users visiting the page versus the number of counted onload events.

Knowing that the load order of the web page is -> zone of incident -> zone of counting, we know that it is very likely that many users have not been counted in the statistics.

A feature of Real-User Monitoring – Browser allows us to directly compare the number of users opening the page with the number of calls to the counter pixel (eTracker).

Figure: Difference between page views and tracking pixel successes

During "normal" times we see a difference of 12% between these requests, which should be equal.

The reason: users abandon the page load before the request is sent to eTracker, plus a number of errors while requesting the eTracker host.

During the incident we can see the decrease in calls to the statistics tool – at times dropping to 30% fewer calls.
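With the baseline gap known, the raw pixel count during an incident can be put into perspective. A small sketch with illustrative numbers (not the customer's actual figures):

rum_pageviews = 100_000        # page views counted by the RUM tag (hypothetical)
pixel_calls = 70_000           # calls that reached the tracking pixel during the incident (hypothetical)
baseline_gap = 0.12            # normal difference between the two counts

gap = 1 - pixel_calls / rum_pageviews
extra_undercount = gap - baseline_gap
print(f"Gap during the incident: {gap:.0%} (baseline {baseline_gap:.0%})")
print(f"Page views missing from the statistics beyond the usual gap: "
      f"{rum_pageviews * extra_undercount:,.0f}")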

Conclusion:

It is absolutely necessary to know your real users' normal behavior and their behavior during incidents. Otherwise you will not be able to make a statement about user impact if your analysis relies on your web statistics tool.

You must know where in the load order users get counted, to make sure all users are counted and counted correctly.

You must know when users abandon the page load and how many of them do so.

You must know that the longer a page takes to load, the fewer users reach the onload event. This means everything happening in the "after onload" zone is at risk of never running. If you do business logic and business in that zone – the business is affected.

Figure: Page views compared with page load abandonments

Most of all: you need to know very early that your web page is not loading properly and in what way this impacts your business.

Oh no, not again a 3rd Party performance blog..

I have noticed over the past 3 months that the topic of "3rd party performance" is getting more and more visibility in the field.
Not that no one ever paid attention to it before – they did, but in the past few months there has been a steady stream of blog posts and tweets, and the performance companies (those able to monitor 3rd parties) have been inviting people to "3rd party monitoring" webinars.

And now here is the next entry about 3rd party performance. Just because I really believe this is fucking important.

Honestly! I think that 3rd party monitoring is absolutely necessary if your business model relies on others to deliver content – whether it is a CDN you use and have signed strict SLAs with, or whether your web page is marketed. The marketing aspect is the topic of this post.

I lately noticed on a web page that lives solely from being marketed that some advertisement areas were not filled when I opened the page (which I do very often).

There could be three reasons for not having the space filled:

  1. no advertiser booked the space
  2. the ad was not delivered because of some issue in the ad delivery chain
  3. the webmaster programmed bullshit

Knowing that the page is very "successful" in Germany, it would be very astonishing if there were times when no one had booked the available ad space.

Therefore I decided to put the page into monitoring and found the following (monitored with a Firefox synthetic monitoring agent – 1 test per hour from each of 6 locations in Germany – agents located in major carrier networks).

Figure: 2 days of measurements show huge gaps in 3rd party availability

It is no secret that some of the Facebook servers do not deliver as expected, and we can also see that in some cases availability is below 80%. Yes – not just below 99.8%; check the numbers for yourself.

I have to admit that I filtered the chart down to those 3rd parties which were available less than 99.8% of the time.

What does availability mean in this case? Well, the content could not be delivered due to the reasons below (a minimal check sketch follows the list):

  • invalid SSL keys
  • connection interruptions (client abort)
  • service unavailable
  • connection timeouts
  • 404
  • and many more
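A hedged Python sketch of such a check – the URLs are placeholders, and a real synthetic monitor would run this from many locations on a schedule – that polls each third party and maps the outcome onto the failure types above:

import requests

THIRD_PARTY_URLS = [
    "https://tracker.example-adnetwork.com/pixel.gif",   # hypothetical
    "https://cdn.example-widgets.net/button.js",         # hypothetical
]

def classify(url):
    # Return a rough failure category for one third-party URL.
    try:
        resp = requests.get(url, timeout=10)
        if resp.status_code >= 500:
            return f"service unavailable ({resp.status_code})"
        if resp.status_code == 404:
            return "404"
        return "ok"
    except requests.exceptions.SSLError:
        return "invalid SSL certificate"
    except requests.exceptions.ConnectTimeout:
        return "connection timeout"
    except requests.exceptions.ConnectionError:
        return "connection interrupted/reset"
    except requests.exceptions.RequestException as exc:
        return f"other error: {exc.__class__.__name__}"

for url in THIRD_PARTY_URLS:
    print(url, "->", classify(url))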

Most of the "unavailability" is measured during peak user hours. Imagine how many ad impressions got lost. How many users were not able to click on an ad? Did the page look broken or unfinished to them? Did the wait time for the ad block the rest of the page from loading?

You can imagine that ad availability is hard to control. Maybe one little server somewhere in the advertisement delivery chain has issues. When I talk about the ad delivery chain I mean something like publisher – ad server – ad network – ad network two – geo-targeting ad service – real-time bidding – and many, many more – and of course the final server where the ad is located.

Well, knowing how much traffic the page got and how many ads were not delivered, it is easy to calculate the amount of money being lost. And believe me – I am talking about big money in the case above.

So, what do you think – would it be worth knowing how your 3rd parties are delivering content?

Ad Delivery – Depending on the News of the Day

I found this curious ad today on GuteFrage.net.

Google now seems to serve ads depending on current events as well.
Was that really intentional?

Figure: Screenshot of the ad

Is your 3rd Party SSL key trusted?

Today I became aware of an issue that might affect a bunch of websites.

Imagine you have a shop and you do some affiliate marketing. This means you or your marketer have to include a bunch of 3rd parties in your website – even in the payment process, where the "conversion funnel" recognises the click origin.

Now imagine one – or more – of these content partners participating in the SSL-secured payment process does not have a valid SSL key.

Will this affect your business?
The answer is: YES.

Modern browsers double-check the validity of SSL keys by asking the key originator whether the key is valid (Wikipedia on OCSP).
This means: for every 3rd party, the browser checks whether their key is valid. If you run a lot of affiliate programs, this can be quite a few checks. A few OCSP checks definitely cause longer load times – even one check can. As long as the key is not approved, the content will not be loaded by the browser.
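A hedged sketch (using the third-party cryptography package, with a placeholder host) that times the TCP connect plus TLS handshake to one third party and reads, from the certificate's Authority Information Access extension, the OCSP responder URL the browser would have to consult:

import socket
import ssl
import time
from cryptography import x509
from cryptography.x509.oid import AuthorityInformationAccessOID

HOST = "thirdparty.example.com"   # hypothetical affiliate/tracker host

ctx = ssl.create_default_context()
start = time.monotonic()
with socket.create_connection((HOST, 443), timeout=10) as sock:
    with ctx.wrap_socket(sock, server_hostname=HOST) as tls:
        handshake_s = time.monotonic() - start
        der_cert = tls.getpeercert(binary_form=True)

cert = x509.load_der_x509_certificate(der_cert)
# This raises ExtensionNotFound if the certificate carries no AIA extension - fine for a sketch.
aia = cert.extensions.get_extension_for_class(x509.AuthorityInformationAccess).value
ocsp_urls = [d.access_location.value for d in aia
             if d.access_method == AuthorityInformationAccessOID.OCSP]

print(f"TCP connect + TLS handshake to {HOST}: {handshake_s:.3f}s")
print("OCSP responder(s) the browser would have to consult:", ocsp_urls)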

Figure: OCSP sometimes takes a very long time to process – depending on the SSL key originator's ability to answer the browser's request for key approval

Even more important is what happens if the key of one or more of the 3rd parties is not trusted or its validity cannot be proved (due to an OCSP server outage)!
Usually the browser will abort the connection (to the affiliate's server) – and the affiliate will not get recognised.

You might sometimes have HttpWatch open or work with Firebug and have seen something like "HTTP 200 abort". The web client fires an abort if the content does not come with a valid key.

Even worse (depending on where the 3rd-party content is called), the page (your page) will return a security error and the user cannot buy what he was supposed to buy (from you).

(Check here what a failure can look like.)

It is not a bad decision to keep a permanent eye on this in these days when SSL key vendors are under constant DoS attacks or being hacked.

Bizarre Incident

Today I got a call from a friend. He reported ongoing performance issues on all of his pages – honestly: the page was out of business.

All of a sudden the load time of the web page increased without any configuration change, without a release change and without any measurable hardware defect.


The friend uses various technologies to monitor – such as real user monitoring from the browsers of his users and synthetic external monitoring, besides various internal monitors.

Our first thought was a hack attack by a bot that is not able to render JavaScript (otherwise the number of page views would have risen in the JavaScript-based browser monitoring).
Hack attacks could also easily be excluded by checking the log files for a particular agent string or a dedicated set of IP ranges trying to access the page.
Then we excluded other issues by checking the synthetic monitoring, which clearly showed no issues with bandwidth (long content download times), connectivity or DNS times. The symptoms were high CPU, long first-byte times and a freaking load of traffic from all over Germany.
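The log check mentioned above can be as simple as counting requests per user agent, client IP range and referrer; a minimal sketch assuming a combined-format access log (the file name is a placeholder):

import re
from collections import Counter

LOG_FILE = "access.log"   # hypothetical
# host ident user [time] "request" status bytes "referer" "user-agent"
line_re = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d+ \S+ "([^"]*)" "([^"]*)"')

agents, prefixes, referrers = Counter(), Counter(), Counter()
with open(LOG_FILE, encoding="utf-8", errors="replace") as f:
    for line in f:
        m = line_re.match(line)
        if not m:
            continue
        ip, referrer, agent = m.groups()
        prefixes[".".join(ip.split(".")[:3]) + ".0/24"] += 1
        referrers[referrer] += 1
        agents[agent] += 1

for name, counter in (("user agents", agents), ("client /24s", prefixes), ("referrers", referrers)):
    print(f"Top {name}:")
    for value, count in counter.most_common(5):
        print(f"  {count:8d}  {value}")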

After excluding all the issues which could originate from the inside, we started to investigate external issues. But again – none of the included 3rd parties caused any issues (not the ads, not the CDN).

By checking the analytics tool (tracking pixel) for referrers we finally found the issue!

An issue which is really bizarre – and not really measurable.

One advertiser had included the complete web page instead of the banner on the ad server. So with every call of the other pages (where the ad should have appeared) the complete web page of the customer was requested…

Like a DDoS attack provoked by a little typo.

Strange things can happen all the time…

The "Online Marketing" Effect

What this is about:

If you browse to the landing page of any e-commerce offering today, you will very likely be captured by various affiliate trackers via several redirects and JavaScripts. This usually happens while the landing page is actually loading, and without you noticing. Whether or not you actually arrived via an affiliate program plays basically no role, since every visitor could have come from a "marketer" (how the principle works is explained quite understandably here).

You will also very likely be asked whether you "like" the page and be shown how many other people already like it. This integration is also mostly done via JavaScripts.

What all these JavaScripts from affiliates and social networks have in common is this: they trigger a request to yet another server, which returns more or less content. Depending on the response capacity of these third-party servers, each of these calls becomes a business-critical factor for the load time of your own website.

The problem:

In the last few days in particular there have repeatedly been problems with the content Facebook delivers. Considering that every piece of content – especially external content – adds overhead to the load time, outliers can reach business-critical dimensions.

Figure: Facebook outliers over the past 7 days (website: Zalando)

Can we judge from this (chart above – Facebook delivers only after 25 seconds) what influence these outliers have on one's own business?

No. Not per se.

For that, you have to be able to answer how many users are affected and whether the rest of the page load is blocked by this problem.

If, besides the "social marketing", you also look at the affiliates and other advertising partners, you very quickly realize that these partners – as much as they provide reach and awareness – can just as quickly spoil the business thoroughly.

Figure: Advertising partners that do not deliver – a danger for the business

In the concrete example of Zalando.de you can see two content partners at once that had performance problems. At the same time. And don't forget that these are only "small", "hidden" pieces of content.

[Update 12.09:

As clearly visible on the chart, this was a time-limited outage [06.09, 15:00 to 23:00] and not a permanent condition. The problems were presumably caused by an excavator. See also "Bagger durchtrennt…". Many thanks to Dr. Thomas Nicolai.

End of update]

Using the Zalando example alone I can point to 3 incidents within the past 7 days (1. Facebook; 2. Avazudsp; 3. Sociomantic). There is no question of coincidence here – outages and problems of this kind can be observed daily. The only coincidence is that two problems overlapped.

It is very likely that EVERYONE who has integrated those service providers was affected by the problem.

In the case of Zalando, for example, the business was at risk for a whole day because marketing or affiliate partners did not deliver content. Right in the middle of prime business hours – and presumably nobody noticed for hours. Very likely, at that time, "only" the conversion rate dropped significantly.

For websites whose content is so heavily geared towards marketing and reach, like Zalando, where no fewer than 30 (!) 3rd-party content providers are integrated, it is absolutely necessary to keep an eye on performance and to be able to quickly draw conclusions about one's own business.

Even this poor delivery time of the affiliates/advertising partners does not yet allow any conclusion about the real business impact. The central question is: how many users are affected, and could they continue browsing? If a purchase can still be completed successfully despite the problems, the question becomes: what do I lose if the users of an entire day were not supplied with the partners' cookies?

So the decisive points for the business impact are:

* What influence does the non-delivery of the content have on my business?
* How many users are affected?
* Can the affected users continue browsing?
* How much revenue am I losing right now?
* And above all: what can I do to get back in control of the situation?

It is therefore absolutely necessary to be able to permanently check your own presence and to monitor the content providers. Technologies for this exist. Likewise, service level agreements must be negotiated with the providers that carry penalties for outages as well as for performance problems.

Figure: Zalando – the waterfall chart clearly shows the blocking by the content provider Sociomantic

If the content of the affiliates/advertising partners slows down the delivery of the business-critical content – as can clearly be seen in the Zalando example – all efforts to make your own page fast and reliable are torpedoed. The slowed response time of the affiliates also accompanies your own page as constant background noise and is perceived by the user. The result is bad reviews, customer churn and lengthy troubleshooting in the company/IT department which, if affiliates are not considered as a potential source of error, remains unsuccessful and pointlessly burns budget.

Solution?

Many companies already have suitable monitoring measures at their disposal. These could send them the corresponding analyses and alerts – but these capabilities are almost never exploited.

How many content providers proactively report when they have delivery problems, or proactively send reports on their performance and availability? How many content providers monitor their own service quality at all?

Questions that absolutely should be asked when entering a partnership that is supposed to promote the business, not block it.

Media websites that rent out banner space face a similar problem. Ad server technologies deliver content into the web page which in turn requests further banner providers. Delivery problems are very common here too. But more on that in a forthcoming blog post.

Many thanks to Daniel for proofreading.
