Weighting Incident Impact using Web Statistics = Failure

Dec 23rd, 2011

Introduction:

This blog entry from the fabulous dynaTrace Blog inspired me to write about an experience I had just a few days ago with a customer. Luckily I was able to guide the customer in the right direction, so that the user impact of an incident was calculated correctly.

I bet you are reading this blog because you are a web performance professional and you take care of the performance and availability of your own or someone else’s web pages. This means you measure performance in some way.

Hopefully you use a tool that lets you drill down to the object level, because you need to see which content on your web page slows down performance or makes transactions fail during monitoring. This is important: without that basic knowledge you may fail at weighting the user and business impact of incidents.

Why is that?

If you use a count pixel for your web statistics, you must know when and how the statistics tool receives its information from the client. Statistics tools are usually hosted services (uh, ah… cloud…), and the user information is sent via a GET request either shortly before the onload event or with/after it.

To understand the count pixel logic and the load order of your web page I recommend a tool like HttpWatch, Firebug or dynaTrace AJAX Edition, or, as mentioned above, the waterfall diagram from your monitoring tool.

Now you can easily figure out various important zones:

[Image: Counted-or-not2]

Be aware that the order of the zones might differ between vendors.

  • Zone where incidents usually happen: This is where all objects are loaded that the browser needs before it fires the “onload” event. If one of these objects fails to load in an appropriate time, the onload event is delayed by that time.
  • Zone where you get counted: The area where your statistics tool receives the user information. Usually this happens via a parameterized GET request to the host of the statistics server. It can happen shortly before or shortly after the onload event.
  • Zone of things happening after onload: Not much to say here if you have read the dynaTrace blog entry mentioned above.

 

Incident:

You become aware that an incident is happening on your web page. In the best case you learn about it very early through an alert from your external synthetic monitoring tool. The other excellent option is an early alert from your internal passive monitoring tool.

If a critical third party is causing the issue, your internal tools might not notice the incident at all.

There are of course several kinds of incidents, some impacting the business more and some less. Some originate in your own application or network environment and some are caused by third-party tools.

The incidents I mainly see at Compuware customers are of the kind where web page speed goes down. The time-consuming content is mostly located in the zone of incident. Unavailability happens too, but far less often.

These incidents are mainly caused by connectivity issues (connections are not established directly and take 3, 9 or 21 seconds) or by a long time to first byte, which means the client has requested the content but the servers need some time to send it.

Another issue we see more and more often is the time for the SSL handshake. This is caused by current browsers checking the revocation status of the SSL certificate with its issuer (this is called OCSP).

Whatever it is, the web page is or was slow and you want to find out what this means for the business.

As in my customer’s case:

[Image: Incident2]

Incident: Page was available but load time exploded.

 

Where impact weighting fails:

Usually, during or after the incident, you would now check your statistics tool for page impressions and user behavior. The intention is to weight the incident by user and business impact:

  • Was my conversion rate affected?
  • How many users in total were affected?
  • Was the total number of conversions affected?

Unfortunately your statistics are not telling the truth at this moment. The numbers you would be using are wrong and do not reflect the real user numbers.

The longer the page takes to load, the fewer users are recognized by the statistics tool.

The reason is pretty simple. During load-time incidents users leave the page well before the browser fires the onload event, so many users never reach the “zone of being counted”. Or, depending on the statistics technology, they were counted without being able to use the page. This happens when the zone of being counted is located before the zone of incident. In both cases your statistics are simply wrong, whether as a false negative or a false positive.

[Image: Waterfall2]

Monitoring was able to point to a global problem. If you check your page from only one location (like your own workstation) you cannot tell whether the problem is local or global.

In the case I looked at with my customer we monitored the problem shown above. A 3rd party was not able to deliver content that was absolutely necessary to load the page. It was awful, because this content blocked the rest of the page from loading.

Ask yourself: would you have waited 41 seconds on a half-loaded page where you could not do what you came to do? I bet the answer is no.

In most cases you would leave the page or abandon the load.

Leaving and abandonment explained:

  • The page issue was so bad that the application was not usable, and the user closed the browser or went to a different page.
  • The page was usable and the users found what they were looking for (clicked on a link, entered a search word and pressed search) before the onload event was fired.
  • The user hit the reload button or the F5 key because the page looked broken, hoping to get the complete page by simply reloading it.

Another, less severe problem with statistics is that they fail to count every user anyway, either because the GET request to the statistics server never happens or because users run tools that prevent them from being counted.

 

Do you know your users’ behavior on your web page?

I guess you do, partly. What I mainly mean is: do you know whether your users act fast or slow?

Have you ever compared your server’s log files with the number of page impressions from your statistics tool? I bet you will find a gap, larger or smaller.

When I talk about the server log files I mean the number of GET requests for the base page object or root object (jsp, php, asp, html and so forth).
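A minimal sketch of such a comparison, in Python. It assumes the server writes Common Log Format style access logs and that the base page is reachable under the paths passed in `base_paths`. The regular expression, the function names and the sample lines are illustrative assumptions and not part of any particular tool.

```python
import re

# Assumption: the request part of each log line looks like
# "GET /index.php?x=1 HTTP/1.1" 200
REQUEST_RE = re.compile(r'"GET (\S+) HTTP/[\d.]+" (\d{3})')


def count_base_page_hits(log_lines, base_paths):
    """Count successful GET requests for the base page objects."""
    hits = 0
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if not match:
            continue
        path, status = match.group(1), match.group(2)
        # Ignore the query string when comparing paths.
        if path.split("?", 1)[0] in base_paths and status == "200":
            hits += 1
    return hits


def counting_gap(server_hits, page_impressions):
    """Share of server-side base page requests missing from the statistics."""
    if server_hits == 0:
        return 0.0
    return (server_hits - page_impressions) / server_hits


# Made-up sample lines (documentation IP range), for illustration only.
sample_log = [
    '192.0.2.1 - - [23/Dec/2011:10:00:00 +0100] "GET /index.php HTTP/1.1" 200 10240',
    '192.0.2.2 - - [23/Dec/2011:10:00:01 +0100] "GET /style.css HTTP/1.1" 200 2048',
    '192.0.2.3 - - [23/Dec/2011:10:00:02 +0100] "GET /index.php?ref=ad HTTP/1.1" 200 10240',
    '192.0.2.4 - - [23/Dec/2011:10:00:03 +0100] "GET /index.php HTTP/1.1" 500 512',
]

hits = count_base_page_hits(sample_log, {"/", "/index.php"})
# Hypothetical: the statistics tool reported 1 page impression for the same period.
print(hits, counting_gap(hits, 1))
```

Running this regularly during normal operation gives the baseline gap; the same calculation during an incident shows how much further the statistics drift from what the server actually delivered.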

The common user (like you and me) leaves a page as soon as they have found what they were looking for.

What I am trying to point out is: you must know how many users stay on the page until they have reached the zone where the statistics recognize them.

Not just during incidents: you need to know this for normal operation.

If you have a knowledge gap here, the baseline for any calculation is wrong.

 

How to calculate the real impact?

The customer with whom we did the incident analysis uses browser-based real user monitoring that lets him see the “abandonments” on his web page directly. In normal operation the abandonment rate averages around 7%.

During the incident the load time exploded and therefore far fewer users reached the zone of being counted, because they abandoned the load process (for the reasons named above).

[Image: Abandonment22]

We can see the usual abandonment of 7% and the explosion of abandonment when the incident started.

The customer’s real-user monitoring is Gomez Real-User Monitoring – Browser. Technically it is a JavaScript tag embedded in the web page which collects all metrics of the real users’ experience directly from the browser.

The page load time reflects the moment the “onload” event happens. The abandonment rate is calculated from the number of users visiting the page versus the number of counted onload events.
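As a rough sketch of that calculation in Python: the formula below (page views without an onload event, divided by all page views) is my reading of the description above and not the vendor’s documented definition, and the numbers are made up to match the 7% baseline.

```python
def abandonment_rate(page_views, onload_events):
    """Share of page views that never reached the onload event.

    Assumption: abandonment = (views - onloads) / views.
    """
    if page_views <= 0:
        raise ValueError("page_views must be positive")
    return (page_views - onload_events) / page_views


# Hypothetical numbers: 1000 page views, 930 of them reached onload.
baseline = abandonment_rate(1000, 930)
print(f"baseline abandonment: {baseline:.0%}")

# Hypothetical incident hour: only 550 of 1000 page views reached onload.
incident = abandonment_rate(1000, 550)
print(f"incident abandonment: {incident:.0%}")
```

The difference between the incident value and the baseline, not the incident value alone, is what can be attributed to the incident.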

Knowing that the load order of the web page is zone of incident -> zone of counting, it is most likely that many users were not counted in the statistics.

A feature of Real-User Monitoring – Browser allows us to directly compare the number of users opening the page with the number of calls to the counter pixel (eTracker).

[Image: Chart2]

Difference between page views and tracking pixel success.

During “normal” times we see a difference of 12% between these two numbers, which should be equal.

Reason: users abandon the page before the request is sent to eTracker, plus errors while requesting the eTracker host.

During the incident we can see the number of calls to the statistics tool decrease, at times by as much as 30%.

Conclusion:

It is absolutely necessary to know your real users’ usual behavior and their behavior during incidents. Otherwise you cannot make a statement about user impact if your analysis relies on your web statistics tool.

You must know where in the load order users get counted, to judge whether all users are counted and whether they are counted correctly.

You must know when users abandon the page load and how many do.

You must know that the longer a page takes to load, the fewer users ever reach the onload event. This means everything in the “after onload” zone is at risk of never running. If you do business logic and business in that zone, the business is affected.

[Image: Abandonment3]

Page views compared with page load abandonments.

Most of all: you need to know very early that your web page is not loading properly and in what way this impacts your business.

Oh no, not another 3rd Party performance blog..

Nov 19th, 2011

I have noticed that over the past 3 months the topic “3rd Party performance” has been getting more and more visibility in the field.
It is not that no one paid attention to it before. But in the past few months the number of blog and Twitter entries has risen constantly, and the performance companies (those who are able to monitor 3rd parties) have been inviting people to “3rd Party monitoring” webinars.

And now here is the next entry about 3rd Party performance. Just because I really believe that this is fucking important.

Honestly! I think 3rd Party monitoring is absolutely necessary if your business model relies on others to deliver content, whether you use a CDN with which you have signed strict SLAs or your web page is financed by advertising. The advertising aspect is the topic of this post.

I recently noticed on a web page that lives entirely from advertising that some ad areas were not filled when I opened the page (which I do very often).

There could be three reasons for the space not being filled:

  1. no advertiser booked the space
  2. the ad was not delivered because of an issue in the ad delivery chain
  3. the webmaster programmed bullshit

Knowing that the page is very “successful” in Germany, it would be very astonishing if there were times when no one had booked the available ad space.

Therefore I decided to put the page into monitoring and found the following (monitored with a Firefox synthetic monitoring agent, 1 test per hour from each of 6 locations in Germany, agents located in the main carrier networks):

[Image: Metrics]

2 days of measurements show huge gaps in availability of 3rd parties

It is no secret that some of the Facebook servers do not deliver as expected, but we can also see that in some cases availability is below 80%. Yes, not just below 99.8%; check the numbers yourself by clicking on the image.

I have to admit that I only show those 3rd parties whose availability was below 99.8%.

What does availability mean in this case? The content could not be delivered due to:

  • invalid SSL certificates
  • connection interruptions (client abort)
  • service unavailable
  • connection timeouts
  • 404
  • and many more

Most of the “unavailability” was measured during peak user hours. Imagine how many ad impressions were lost. How many users were not able to click on an ad? Did the page look broken or unfinished to them? Did waiting for the ad block the rest of the page from loading?

You can imagine that ad availability is hard to control. One little server somewhere in the ad delivery chain might have issues. By ad delivery chain I mean something like publisher, ad server, ad network, second ad network, geo-targeting ad service, real-time bidding and many, many more, and of course the final server where the ad is hosted.

Knowing how much traffic the page got and how many ads were not delivered, it is easy to calculate the amount of money lost. And believe me, I am talking about big money in the case above.
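A back-of-the-envelope sketch of that calculation in Python. Everything here is an assumption for illustration: the pricing model (a flat price per thousand impressions, CPM), the parameter names and all the numbers. The real figures of the monitored page are not given in this post.

```python
def lost_ad_revenue(page_views, ad_slots_per_page, ad_availability, cpm):
    """Estimate lost ad impressions and the revenue they would have earned.

    ad_availability: measured share of ad requests that were delivered (0..1)
    cpm: assumed price per 1000 delivered impressions
    """
    expected_impressions = page_views * ad_slots_per_page
    lost_impressions = expected_impressions * (1.0 - ad_availability)
    lost_revenue = lost_impressions * cpm / 1000.0
    return lost_impressions, lost_revenue


# Hypothetical day: 1,000,000 page views, 3 ad slots per page,
# 80% ad availability, CPM of 2.00 (currency left open).
impressions, revenue = lost_ad_revenue(1_000_000, 3, 0.80, 2.0)
print(f"lost impressions: {impressions:,.0f}, lost revenue: {revenue:,.2f}")
```

Since the unavailability was concentrated in the peak user hours, a per-hour calculation with hourly traffic and hourly availability would be more accurate than one daily average.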

So, what do you think: would it be worth knowing how your 3rd parties deliver content?

Ad Delivery – Depending on the Day’s Events

Nov 16th, 2011

I found this curious ad today on GuteFrage.net.

Google now seems to serve ads depending on the day’s events as well.
Was that really intentional?

[Image: Rassezucht]

Is your 3rd Party SSL key trusted ?

Oct 31st, 2011

Today I became aware of an issue that might affect a bunch of websites.

Imagine you have a shop and you do some affiliate marketing. This means you or your marketer have to include a bunch of 3rd parties in your website, even in the payment process, where the “conversion funnel” recognises the click origin.

Now imagine that one or more of the content partners participating in the SSL-secured payment process does not have a valid SSL certificate.

Will this affect your business?
The answer is: YES

Modern browsers double-check the validity of SSL certificates by asking the issuer whether the certificate is still valid (Wikipedia on OCSP).
This means the browser checks for every 3rd party whether its certificate is valid. If you take part in a lot of affiliate programs, this could be quite a few. Several OCSP checks definitely cause longer load times; even one check can. As long as the certificate is not confirmed, the browser will not load the content.

[Image: Trustedornot]

OCSP sometimes takes very long to process, depending on the certificate issuer’s ability to answer the browser’s request for approval

Even more important is what happens if the certificate of one or more of the 3rd parties is not trusted or its validity cannot be confirmed (due to an OCSP server outage)!
Usually the browser will abort the connection (to the affiliate’s server) and the affiliate will not get recognised.

You may have had HttpWatch or Firebug open and seen something like “HTTP 200 abort”. The web client fires an abort if the content does not come with a valid certificate.

Even worse (depending on where the 3rd party content is called), your page will return a security error and the user cannot buy what he was supposed to buy (from you).

(Check here what a failure can look like)

It is not a bad decision to permanently keep an eye on this, in these days when SSL certificate vendors are under permanent DoS attack or being hacked.

 

Bizarre Incident

Oct 18th, 2011

Today I got a call from a friend. He reported ongoing performance issues on all of his pages. Honestly: the page was out of business.

All of a sudden the load time of the web page had increased, without any configuration change, without a release change and without any measurable hardware defect.

[Image: Clipboard01]

The friend uses various technologies for monitoring, such as real user monitoring from his users’ browsers and external synthetic monitoring, besides various internal monitors.

Our first thought was a hack attack by a bot that is not able to execute JavaScript (otherwise the number of page views would have risen in the JavaScript-based browser monitoring).
A hack attack could easily be excluded, though, by checking the log files for a particular user agent string or a limited number of IP ranges trying to access the page.
Then we excluded other issues by checking the synthetic monitoring, which clearly showed no issues with bandwidth (long content download times), connectivity or DNS times. The symptoms were high CPU load, long first byte times and a freaking load of traffic from all over Germany.
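The log file check mentioned above can be sketched roughly like this in Python. It assumes access logs in the widely used “combined” format, where the client IP is the first field and the user agent is the last quoted field; the regular expression, the /24 grouping and the sample lines are my own illustrative choices, not what we actually ran that day.

```python
import re
from collections import Counter

# Assumption: combined log format, i.e.
# ip - - [date] "request" status bytes "referer" "user agent"
LINE_RE = re.compile(r'^(\S+) .* "([^"]*)"\s*$')


def top_sources(log_lines, top_n=5):
    """Return the most frequent user agents and /24 IP prefixes."""
    agents = Counter()
    prefixes = Counter()
    for line in log_lines:
        match = LINE_RE.match(line)
        if not match:
            continue
        ip, agent = match.group(1), match.group(2)
        agents[agent] += 1
        # Group IPv4 addresses by their first three octets.
        prefixes[".".join(ip.split(".")[:3])] += 1
    return agents.most_common(top_n), prefixes.most_common(top_n)


# Made-up sample lines using documentation IP ranges.
sample_log = [
    '192.0.2.10 - - [18/Oct/2011:09:00:00 +0200] "GET / HTTP/1.1" 200 5120 "-" "BadBot/1.0"',
    '192.0.2.11 - - [18/Oct/2011:09:00:01 +0200] "GET / HTTP/1.1" 200 5120 "-" "BadBot/1.0"',
    '192.0.2.12 - - [18/Oct/2011:09:00:01 +0200] "GET / HTTP/1.1" 200 5120 "-" "BadBot/1.0"',
    '198.51.100.7 - - [18/Oct/2011:09:00:02 +0200] "GET / HTTP/1.1" 200 5120 "http://example.org/" "Mozilla/5.0"',
]

top_agents, top_prefixes = top_sources(sample_log)
print(top_agents)
print(top_prefixes)
```

If one agent string or a handful of prefixes dominates the counts, an attack is plausible. In the case described here the traffic was spread over ordinary browsers from all over Germany, which is why an attack could be ruled out.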

After excluding all issues that could come from inside, we started to investigate external causes. But again, none of the included 3rd parties caused any issues (neither the ads nor the CDN).

By checking the analytics tool (tracking pixel) for referrers we finally found the issue!

An issue which is really bizarre and not really measurable.

One advertiser had placed the complete web page on the ad server instead of the banner. So with every call of other pages (where the ad should appear) the customer’s complete web page was requested…

Like a DDoS attack provoked by a little typo.

Strange things can happen all the time…

 

The “Online Marketing” Effect

Sep 7th, 2011

What this is about:

When you browse to the landing page of any e-commerce site today, you will very likely be registered by various affiliate trackers via a number of redirects and JavaScript files. This usually happens while the landing page is actually loading and without you noticing. Whether or not you actually arrived via an affiliate program is initially irrelevant, since every visitor could have come from a “marketer” (how the principle works is explained quite clearly here).

You will also very likely be asked whether you “like” the page, and be shown how many other people already like it. This integration, too, is mostly done via JavaScript.

What all these scripts from affiliates and social networks have in common is this: they trigger a request to another server, which returns more or less content. Depending on how quickly these third-party servers respond, each of these calls becomes a business-critical factor for the load time of your own website.

The problem:

In the last few days in particular there have been repeated problems with the content delivered by Facebook. Considering that every piece of content, and external content in particular, adds “overhead” to the load time, outliers can reach business-critical dimensions.

[Image: Facebook]

Facebook outliers over the past 7 days (website: Zalando)

Can we judge from this (chart above: Facebook only delivers after 25 seconds) what influence these outliers have on your own business?

No. Not per se.

To do that, you need to be able to say how many users are affected and whether the further loading of the page is blocked by the problem.

If you look beyond “social marketing” at the affiliates and other advertising partners as well, you quickly realise that these partners, however much they provide reach and awareness, can just as quickly spoil the business thoroughly.

[Image: Affiliates]

Advertising partners that do not deliver: a danger to the business

The concrete example of Zalando.de shows two content partners that had performance problems. At the same time. And one must not forget that these are only “small”, “hidden” pieces of content.

[Update, Sep 12th:

As the chart clearly shows, this was a temporary outage [Sep 6th, 15:00 to 23:00] and not a permanent state. The cause of the problems was presumably an excavator. See also “Bagger durchtrennt…” (“Excavator severs…”). With many thanks to Dr. Thomas Nicolai.

End of update]

Using Zalando alone as an example I show 3 incidents within the past 7 days (1. Facebook; 2. Avazudsp; 3. Sociomantic). This is no coincidence: outages and problems of this kind can be observed daily. The only coincidence is that two problems added up.

It is very likely that EVERYONE who has integrated these service providers is affected by the problem.

In Zalando’s case, for example, the business was at risk for a day because marketing or affiliate partners did not deliver content. In the middle of peak business hours, and presumably nobody noticed for hours. Very probably “only” the conversion rate dropped noticeably at that time.

For websites whose content is as geared towards marketing and reach as Zalando’s, where as many as 30 (!) 3rd party content providers are integrated, it is imperative to keep an eye on performance and to be able to quickly draw conclusions for your own business.

Even with delivery times this bad from the affiliates/advertising partners, no conclusion about a real business impact can be drawn yet. So the central question is: how many users are affected, and were they able to continue browsing? If the purchase could be completed successfully despite the problems, you have to ask: what do I lose if a whole day’s users were not supplied with the partners’ cookies?

So the decisive points for the business impact are:

* What influence does the non-delivery of the content have on my business?
* How many users are affected?
* Can the affected users continue browsing?
* How much revenue am I losing right now?
* And above all: what can I do to get back in control of the situation?

It is therefore imperative to be able to check your own site permanently and to monitor the content providers. The technologies for this exist. Likewise, service level agreements must be agreed with the providers that carry penalties not only for outages but also for performance problems.

[Image: Zal]

Zalando: the waterfall chart clearly shows the blockage caused by the content provider Sociomantic (click to enlarge)

If the content of the affiliates/advertising partners slows down the delivery of the business-critical content, as can clearly be seen in the Zalando example, all efforts to make your own site fast and reliable are torpedoed. The slower response time of the affiliates also accompanies your own site as constant background noise and is noticed by the user. The result is bad reviews, customers drifting away and lengthy troubleshooting in the company or IT department which, if affiliates are not considered as a potential source of error, remains unsuccessful and burns budget pointlessly.

Solution?

Many companies already have suitable monitoring in place. It could send them the relevant reports and alerts, but these possibilities are almost never fully used.

How many content providers get in touch proactively when they have delivery problems, or proactively send reports on performance and availability? How many content providers monitor their own service quality at all?

These are questions that should definitely be asked when entering into a partnership that is supposed to promote the business, not block it.

Media websites that rent out banner space face a similar problem. Ad server technologies deliver content into the web page, which in turn requests further banner providers. Delivery problems are very common here too. But more on that in a later blog entry.

Many thanks to Daniel for the proofreading.

Performance Optimization Options – Example: Wikileaks

Sep 6th, 2011

Usually I write articles here in German and will continue to do so in future entries. In this case I decided to switch to English, as the topic might be important or interesting to developers all over planet Earth.

This article has no political background, but the example used currently has huge political potential and is of great public interest these days, so for me it is a good choice for a small “lesson” on performance optimization.

Let’s start with how the performance of wikileaks.org is measured:
I have set up measurement agents spread around Germany on real users’ PCs (http://www.gomezpeerzone.com/; as an employee of the company I can do so for testing and training purposes). Currently 10 of these agents, randomly selected within the region “Germany” and with a connection faster than 512 kbit/s downstream, measure the performance and availability of wikileaks.org every hour with a Firefox 3.6 browser. This means approx. 240 tests are performed on wikileaks.org per day.

Now let’s start the analysis:

[Image: Chart11]

(Performance chart of Wikileaks – Load Time/Page complete Time – click for good size)

The graph above shows the performance, or rather the response time, over the past 5 days, in which the network named above performed 1290 tests on the page. The load time is clearly inconsistent. While it is often below 2 seconds, it rises (unfortunately often enough) to 8 seconds and above. (Every dot represents 10 measurements, or one hour.)

[Image: Chart2]

The average load time over the 1290 tests is pretty good at 3.3 seconds (click on the image for a good size). The availability is only 95%. Unavailable means in this case: the page could not be loaded completely within 60 seconds. Every 20th call of the web page ran into an issue.

Drilling down into the individual network components of the requests shows where the performance bottleneck is:

[Image: Chart3]

(TCP component chart for all tests performed on wikileaks.org – click for good size)

Explanation of this chart:
DNS Lookup Time: Time for the client’s DNS server to translate the host name into an IP address to connect to
Initial Connection Time: Time for the client to complete the TCP handshake (SYN, SYN/ACK, ACK) with the server, or more simply: to get a socket on the server on which a request can be placed
SSL Encryption: Time to establish the encryption based on an SSL certificate (does not apply in this example)
1st Byte Time: Time it takes the server to deliver the first byte of what the client requested
Content Download Time: Time it takes to receive all content (text, binaries) from the server
Total Time: Time to complete all requests, i.e. to download the complete page with all objects included

The component chart shows that the main bottleneck is the connect time (yellow line). It seems that the agents often have to wait to get a socket on the server to place a request. Usually the browser (more precisely, the TCP stack underneath it) re-requests a socket after increasing waiting times: after 3, 9 and 21 seconds. If there is still no socket after 21 seconds, the connection is not requested again; we call this “timing out”. Reasons can be (please do not expect me to go too deep into detail or to name all possible connection symptoms):
- Server refused the connection
- Server reset the connection
- Socket connection times out
- Socket connection aborted
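The 3, 9 and 21 second steps are what you get when the waiting time starts at 3 seconds and doubles with every attempt. That is an assumption on my side (the exact timers depend on the operating system), but a small Python sketch with hypothetical parameter names shows that it reproduces the numbers:

```python
def connect_attempt_times(initial_wait=3.0, attempts=3, factor=2.0):
    """Cumulative seconds after which each waiting period ends.

    Assumption: the wait before the next step starts at `initial_wait`
    and is multiplied by `factor` after every attempt.
    """
    times = []
    elapsed = 0.0
    wait = initial_wait
    for _ in range(attempts):
        elapsed += wait
        times.append(elapsed)
        wait *= factor
    return times


print(connect_attempt_times())  # [3.0, 9.0, 21.0]
```

This is why connect times in a component chart tend to cluster around these values instead of being spread evenly: a connect that does not succeed immediately jumps straight to roughly 3 or 9 seconds, and after about 21 seconds the attempt is given up.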

[Image: Chart4]

(Error chart for those tests which failed – click for good size)

So the charts above clearly tell us that there seems to be an issue with connectivity. On September 4th we see a peak in the DNS time too (in the component chart), but so far this does not seem to be a constant problem.
Connectivity issues happen mainly for 3 reasons (there are many more, but let’s look only at the MAIN ones):
- Heavy packet loss between client and server
- The server/firewall does not allow the specific client to connect
- The server cannot handle more requests because it is fully loaded and all sockets are occupied.

We can exclude packet loss, because all measurements come from a healthy and stable internet infrastructure, and the issue does not show a consistent pattern either.
In this wikileaks example it is most likely that the sockets are blocked or taken by other requests, whether due to a constant hack attempt or because of heavy usage by ordinary real visitors. As wikileaks is currently all over the press and a group of people has announced it wants to bring the page down, either could be the reason for this issue.

To avoid connectivity issues we have to consider only one thing:
How can we ourselves reduce the number of connections to the specific IP address? Options are:

  • Using a third party to serve parts of the content (e.g. a CDN, cloud storage or mirrors)
  • Reducing the number of connections that have to be established for each single page.

While the first solution is neither cheap nor easy to apply, the second might be quick to adopt. But for that we need to know how we can reduce the number of connections in this case. So let’s look at what is requested from the server. The good thing: the tests run by the measurement network show us, for each single test, what the client requested.

[Image: Chart5]

(Waterfall chart for wikileaks.org main page – click for a good size)

We see: only a few objects are requested on the main page, so it is pretty challenging to reduce the number of connections. But it is possible and should be considered, at least as a way to reduce the impact of the issue.

The first visible finding is: the page is served over HTTP 1.0, which does not keep connections alive. The number of objects and the number of connections are equal.
HTTP 1.0 can be a good choice and has an advantage in rare cases, so it can make sense not to use HTTP 1.1 with persistent connections (where more than one object can be downloaded over one established connection). The reason: one user cannot block many connections on the server for long (imagine one badly connected user downloading 100 files over 6 connections: those 6 connections are blocked for a long time). With HTTP 1.0 everyone gets a chance to find a “free” socket. That said, for pages with a large number of objects HTTP 1.0 should be reviewed carefully to see whether it makes sense to switch to 1.1, keeping in mind that HTTP 1.1 can cause issues especially in peak hours with slow users.
So let’s simply say: non-persistent connections are a good choice for wikileaks from a performance perspective in general.

Let’s take a deeper look. We see one JavaScript file and one CSS file; each forces the client to establish a connection to the server. Together they are just a bit over 7 kB in size. Inlining them into the HTML source (10 kB) might be a good option. We would reduce the number of connections per page view by 2, or, even more impressive, by nearly 30%. (Roughly 30% fewer connections per visitor means correspondingly more people can be served, or hack attempts need correspondingly more power to bring down the page.)
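To make the arithmetic explicit, here is a small Python sketch. The total of 7 connections per page view is an assumption: the text above only says that saving 2 connections is “nearly 30%”, which fits 2 out of 7. The function and variable names are made up for illustration.

```python
def connection_savings(connections_before, connections_saved):
    """Effect of removing connections from a single page view.

    Returns (connections_after, relative_reduction, capacity_gain), where
    capacity_gain is how many more page views the same number of server
    sockets could serve, assuming sockets are the only bottleneck.
    """
    connections_after = connections_before - connections_saved
    if connections_before <= 0 or connections_after <= 0:
        raise ValueError("need at least one connection before and after")
    relative_reduction = connections_saved / connections_before
    capacity_gain = connections_before / connections_after - 1.0
    return connections_after, relative_reduction, capacity_gain


# Assumed: 7 connections per page view, 2 saved by inlining JS and CSS.
after, reduction, gain = connection_savings(7, 2)
print(f"{after} connections, {reduction:.1%} fewer, {gain:.0%} more page views")
```

Under these assumptions the reduction per page view is about 29%, and the same socket budget would serve about 40% more page views, so the “30% more people” estimate is, if anything, on the cautious side.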

The risk of doing so is nearly zero, because the component chart shows us that bandwidth is not an issue.

The other side of the coin: inlined files cannot be cached separately, so the 7 KB are delivered again with every request for the page.
The same goes for the images: these can be “inlined” too (look at some of the older entries; there you can see how inlining works, though not for IE browsers). But OK, let's keep the pictures externally referenced. Maybe one day a good guy will give Wikileaks the chance to host the binaries elsewhere.
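For reference, inlining an image usually means embedding it as a base64 data URI. This is a minimal sketch; the helper name is made up, and the sample bytes are only the PNG file signature, not a complete image.

```python
import base64


def to_data_uri(data: bytes, mime_type: str) -> str:
    """Encode raw bytes as a data URI usable in an <img src="..."> attribute."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# Placeholder bytes (PNG signature only); a real page would read the image file.
sample = b"\x89PNG\r\n\x1a\n"
print('<img src="' + to_data_uri(sample, "image/png") + '" alt="logo">')
```

Keep in mind that base64 grows the payload by about one third, so this trades bandwidth for saved connections.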
In times of trouble it could also be considered whether it makes sense to deliver an icon (.ico) for the browser at all. Dropping it would save one more connection per page view.

What have we learned? Every coin has two sides, and trade-offs have to be considered, of course. But honestly: would I prefer to look nice, or would I prefer to be available? I would take the second choice.

Wikileaks…hours of Hack-attack

Aug 31st, 2011

This is the chart of errors on wikileaks.ch over the last 24 hours.

 in Wikileaks...hours of Hack-attack

It clearly shows the server's problems accepting connections, plus some gateway timeouts.

(Measurements take place 10 times per hour, which explains the number of errors per hour.)


Wikileaks – what the hack….

Aug 31st, 2011

Over the past few days Wikileaks announced that it is under constant hacking attacks.

For several months I have been monitoring the server that acts as a mirror under wikileaks.ch. That is, measurement agents in Germany request the URL at regular intervals, which lets me look at the individual TCP-level phases of each request. The result: shocking.

Over the past weeks wikileaks.ch suffered massively from “connectivity” problems. That is, the TCP handshake (SYN, SYN-ACK, ACK) could not be completed at all, or only after a long delay.

The following charts show how dramatically this affected performance:

 in Wikileaks - what the hack....

The chart clearly shows how performance has improved since August 24.

 in Wikileaks - what the hack....

The initial connection time can clearly be identified as the bottleneck here. This is very typical behavior when, for example, so many connections are already established on the server that further requests have to queue up.

 in Wikileaks - what the hack....

The average initial connection time confirms this.

 in Wikileaks - what the hack....

The availability trend also clearly shows the improvement as of August 24.
Availability in this monitoring context means the following: a test counts as failed if the web page, including all of its objects, could not be loaded within 60 seconds. Availability is the share of tests that did not fail.
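As a rough illustration of that definition, the following sketch computes availability from a list of test results. The function name, the use of None for a test that never completed, and the sample values are assumptions; the monitoring tool's exact rules beyond the 60 second limit are not described here.

```python
def availability_percent(load_times_s, limit_s=60.0):
    """Share of tests (in percent) that loaded the full page within the limit.

    `load_times_s` holds one entry per test: the total load time in
    seconds, or None if the page could not be loaded at all (assumption).
    """
    if not load_times_s:
        raise ValueError("at least one test result is required")
    ok = sum(1 for t in load_times_s if t is not None and t <= limit_s)
    return 100.0 * ok / len(load_times_s)


# Ten hypothetical tests: eight succeed, one times out, one fails outright.
results = [2.8, 9.5, 12.0, 3.1, 75.0, 10.2, None, 8.7, 11.4, 2.9]
print(availability_percent(results))  # 80.0
```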

I would like to stress at this point that no conclusions about Wikileaks.org can be drawn from this.

With an average availability of 80%, reachability was certainly extremely poor. And when the server was reached, visitors had to wait 10 seconds on average (with an excellent internet connection on the measuring side) for a page that, as of now, is fully rendered in the browser in under 3 seconds.

Interested and attentive observers will surely ask: why is the total load time (response time) lower than the initial connection time? The answer: browsers open several connections in parallel. In the report the connection times are added up and not calculated in parallel. So if I have 6 connections to the host and wait 3 seconds for each to be established, all connections can be up in about 3.2 seconds when opened in parallel. Added up, however, that gives 18 seconds.
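The difference between the two views is simply sum versus maximum. A small sketch, assuming six connections that each take exactly 3 seconds to establish (the function name is made up, and the small overhead that turns 3 seconds into about 3.2 in practice is ignored):

```python
def connection_time_views(connect_times_s):
    """Return (summed, parallel) connection time in seconds.

    summed:   how a report adds up every connection's setup time
    parallel: what the user waits when all connections open at once
    """
    return (sum(connect_times_s), max(connect_times_s))


summed, parallel = connection_time_views([3.0] * 6)
print(summed, parallel)  # 18.0 3.0
```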

 

Stock market offline – well, at least at Manager-Magazin.de

Aug 8th, 2011

Maybe that is actually a good thing at the moment. It is no fun looking at it anyway.

On the other hand, the ad space there is surely attractive and expensive…

 in Stock market offline - well, at least at Manager-Magazin.de