Weighting Incident Impact using Web Statistics = Failure
This blog entry from the fabulous dynaTrace Blog inspired me to write about an experience I had just a few days ago with a customer. Luckily I was able to guide the customer in the right direction to calculate the user impact of an incident correctly.
I bet you are reading this blog because you are a web performance professional and you care about the performance and availability of your web page – or somebody else’s. This means you measure the performance somehow.
Hopefully you use a tool that lets you drill down to the object level, because you need to see which objects on your web page are slowing down performance or making transactions fail during monitoring. This is pretty important – otherwise you may fail when weighting the user and business impact of incidents, because you lack the necessary baseline knowledge.
If you use a count pixel for your web statistics, you must know when and how the statistics tool receives the relevant information from the client. Statistics tools are usually hosted services (uh ah… cloud…) and send the user information via a GET request shortly before the onload event, or with/after the onload event fires.
To understand the count-pixel logic and the load order of your web page, I recommend a tool like HttpWatch, Firebug, dynaTrace AJAX Edition – or, as mentioned above, the waterfall diagram from your monitoring tool.
Now you can easily identify three important zones:
- Zone where incidents usually happen: This is the area covering all objects that must load before the browser fires the “onload” event. If one of these objects fails to load in an appropriate time, the onload event is delayed by exactly that time.
- Zone where you get counted: The area where your statistics tool receives the user information. Usually this happens via a parameterized GET request to the statistics server, and it can occur shortly before or shortly after the onload event.
- Zone of things happening after onload: Not much to say here if you have read the dynaTrace blog mentioned above.
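To make the interplay of these zones concrete, here is a deliberately simplified model – a sketch, not any vendor’s actual logic; the function, the timings and the “early”/“at_onload” modes are my own assumptions – of whether a single visit ends up being counted:

```python
def is_counted(abandon_at, onload_at, pixel_fires="at_onload"):
    """Hypothetical model: does a visit get counted by the statistics pixel?

    abandon_at:  second at which the user gives up (None = user stays).
    onload_at:   second at which the browser would fire onload.
    pixel_fires: "at_onload" (pixel sent with/after onload) or
                 "early" (pixel sent before the zone of incident).
    """
    if pixel_fires == "early":
        # Counted even if the user later abandons: a false positive.
        return True
    # Counted only if the user is still there when onload fires.
    return abandon_at is None or abandon_at >= onload_at

print(is_counted(abandon_at=None, onload_at=4))    # True: normal visit
print(is_counted(abandon_at=8, onload_at=41))      # False: abandoned during the incident
print(is_counted(abandon_at=8, onload_at=41, pixel_fires="early"))  # True, but misleading
```

The two failure modes discussed later – users missing from the statistics, and users counted without ever using the page – fall directly out of this model.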
Now assume you become aware of an incident on your web page. In the best case you get the information very early via an alert from your external synthetic monitoring tool. The other excellent option is an early alert from your internal passive monitoring tool.
There are of course several kinds of incidents – some impacting the business more, some less. Some may be caused by your application or network environment, others by third-party tools. If a very critical third party is causing the issue, your internal tools might not notice the incident at all.
The incidents I mostly see at Compuware customers are of the kind where page speed goes down, with the time-consuming content located mainly in the zone of incident. Unavailability happens too, but far less often.
The incidents are mainly provoked by connectivity issues (connections not being established right away, taking 3, 9 or 21 seconds) or by a long time to first byte – meaning the client has requested the content, but the servers need a long time to send it.
Another issue we see more and more often is the time spent on the SSL handshake. This is caused by current browsers checking the revocation status of the SSL certificate with its issuer (via OCSP, the Online Certificate Status Protocol).
Whatever it is – the web page is or was slow, and you want to find out what this means for your business.
Like in my customer’s case:
Where the impact weighting fails:
Usually – after or during the incident – you would now check your statistics tool for page impressions and user behavior. The intention is to weight the incident by user and business impact:
- Was my conversion rate affected?
- How many users in total were affected?
- Was the total number of conversions affected?
Unfortunately your statistics are not telling the truth at this moment. If you use these numbers, you use numbers that are wrong and do not reflect the real user numbers.
The longer the page takes to load, the fewer users get recognized by the statistics tool.
The reason is pretty simple. During load-time incidents, users leave the page well before the browser fires the onload event. Many users never reach the “zone of being counted”. Or, depending on the statistics technology, they are counted without ever being able to use the page – this happens when the zone of being counted is located before the zone of incident. In either case your statistics are simply wrong (whether false negative or false positive).
In the sighting I had with my customer, we monitored exactly this problem. A third party was unable to deliver content that was absolutely necessary to load the page. It was especially awful because that content blocked the rest of the page from loading.
Ask yourself – would you have waited 41 seconds on a half-loaded page where you could not do what you came to the page to do? I bet the answer is: No.
In most cases you would leave the page or abandon the load.
Leaving and abandonment explained:
- The page issue was so bad that the application was unusable, and the user closed the browser or went to a different page.
- The page was usable and the users found what they were looking for (clicked a link, entered a search term and pressed search) before the onload event fired.
- The user hit the reload button or F5 because the page looked broken, hoping to get the complete page by simply reloading it.
Another, less severe problem with statistics is that they fail to count every user – whether because the GET request to the server never happens, or because users employ tools to avoid being counted.
Do you know your users’ behavior on your web page?
I guess you do – partly. What I mainly mean is: do you know whether you have fast- or slow-acting users?
Have you ever compared your server log files with the page impression numbers from your statistics tool? I bet you will find a gap of varying size.
When I talk about server log files, I mean the number of GET requests for the base page or root object (jsp, php, asp, html and so forth).
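As a minimal sketch of such a comparison – assuming a standard combined-format access log and a hypothetical pixel path `/count.gif`; both the log lines and the path are made up for illustration – counting base-page requests versus pixel requests could look like this:

```python
import re

# Hypothetical access-log lines (Apache/nginx combined format).
LOG_LINES = [
    '1.2.3.4 - - [10/Oct/2012:13:55:36 +0200] "GET /index.html HTTP/1.1" 200 5120',
    '1.2.3.4 - - [10/Oct/2012:13:55:37 +0200] "GET /style.css HTTP/1.1" 200 830',
    '1.2.3.4 - - [10/Oct/2012:13:55:39 +0200] "GET /count.gif?page=home HTTP/1.1" 200 43',
    '5.6.7.8 - - [10/Oct/2012:13:56:01 +0200] "GET /index.html HTTP/1.1" 200 5120',
    # Note: the second visitor abandoned before the count pixel was requested.
]

REQUEST_RE = re.compile(r'"GET (\S+) HTTP')

def count_requests(lines, base_suffixes=('.html', '.php', '.jsp', '.asp'),
                   pixel_path='/count.gif'):
    """Count base-page GETs and count-pixel GETs in an access log."""
    base, pixel = 0, 0
    for line in lines:
        match = REQUEST_RE.search(line)
        if not match:
            continue
        path = match.group(1).split('?', 1)[0]  # strip query string
        if path == pixel_path:
            pixel += 1
        elif path == '/' or path.endswith(base_suffixes):
            base += 1
    return base, pixel

base, pixel = count_requests(LOG_LINES)
print(base, pixel)       # 2 base-page requests, but only 1 pixel request
print(1 - pixel / base)  # 0.5 – half the visits never reached the pixel
```

The interesting number is exactly that last ratio: the share of base-page requests that never resulted in a pixel call.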
The common user (like you and me) leaves a page as soon as they have found what they were looking for.
What I am trying to point out is: you must know how many users stay on the page until they reach the zone of being recognized by the statistics.
Not just during incidents – you need to know this for normal operation as well.
If you have a knowledge gap here, your baseline for any calculation is wrong.
How to calculate the real impact?
The customer we did the incident analysis with uses browser-based real-user monitoring that lets him directly see the “abandonments” on his web page. The normal abandonment rate averages around 7%.
During the incident, the load time exploded and therefore far fewer users reached the zone of being counted – they abandoned the loading process for the reasons named above.
The page load time reflects the moment the “onload” event fires. The abandonment rate is calculated from the number of users visiting the page versus the number of counted onload events.
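Using that definition, the abandonment rate is a one-liner. The visit and onload counts below are hypothetical, chosen to land on the roughly 7% baseline mentioned above:

```python
def abandonment_rate(visits, onload_events):
    """Share of visits that never produced a counted onload event."""
    return 1 - onload_events / visits

# Hypothetical numbers: 10,000 page visits, 9,300 counted onload events.
print(abandonment_rate(10_000, 9_300))  # ≈ 0.07, the ~7% baseline
```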
Knowing that the load order of the web page is -> zone of incident -> zone of being counted, it is very likely that many users were never counted in the statistics.
A feature of the browser-based real-user monitoring lets us directly compare the number of users opening the page with the number of calls to the counter pixel (eTracker).
During “normal” times we see a 12% difference between these two request counts – which should be equal.
The reason: users abandon the page load before the request to eTracker is sent, plus errors while requesting the eTracker host.
During the incident we can see the calls to the statistics tool decrease further – at times down to 30% fewer calls.
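A rough way to correct for this – a sketch under the assumption that the 12% baseline gap and the roughly 30% incident-time gap above are the only losses, with a made-up pixel count – is to back out the real visit count from the pixel calls:

```python
def estimated_visits(pixel_calls, loss_rate):
    """Estimate real visits from pixel calls, given the share of visits lost."""
    return pixel_calls / (1 - loss_rate)

# Hypothetical: 7,000 pixel calls measured during the incident hour.
naive    = estimated_visits(7_000, 0.12)  # corrected only for the normal 12% gap
incident = estimated_visits(7_000, 0.30)  # corrected for the 30% incident-time gap
print(round(naive), round(incident))      # 7955 vs. 10000 real visits
```

The difference between the two estimates is exactly the user impact your statistics tool silently hides during the incident.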
It is absolutely necessary to know your real users’ normal behavior as well as their behavior during incidents. Otherwise you cannot make a sound statement about the user impact if your analysis relies on your web statistics tool.
You must know where in the load order users get counted, so you can tell whether all users are counted and whether they are counted correctly.
You must know when users abandon the page load and how many of them do.
You must know that the longer a page takes to load, the fewer users reach the onload event. This means everything happening in the “after onload” zone is at risk of never running. If you execute business logic – and do business – in that zone, the business is affected.
Most of all: you need to know very early that your web page is not loading properly, and in what way this impacts your business.