We often receive tickets alerting us that live betting was down for a brief period but has since been manually brought back up. Although the service has been restored, we still need to investigate the cause to make sure it does not point to a larger problem down the line.


First-line responsibilities for this process are outlined below:


Brief:

 Ticket Received


 1. - Check emails to ensure BetRadar have not raised an issue

 2. - Check network monitoring (http://netmon-at.stanleybet.com:4808/), look for the VPN connection to the affected country

     2a. - If spike in ping or packet loss correlates with outage time:

           - Paste screenshot of graph into ticket, add note advising issue was caused by minor network issues, resolve ticket.

     2b. - If no spike in ping or packet loss can be found:

           - Raise with Application Support, add screenshot of network graph, add note advising we couldn't determine the cause to be network related, request further investigation


Detailed:


The ticket received will look as follows

The time reported will be in the local time zone of the country where the service was lost, e.g. 15:16 local to Belgium (14:16 UK). Keep this in mind when looking at the network monitoring tool, as it will report in your local time.
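The time conversion can be double-checked with a short script. This is a minimal sketch using Python's standard zoneinfo module. The function name, the example date and the choice of time zones are assumptions for illustration and do not come from a real ticket. On Windows, zoneinfo may also need the tzdata package installed.

```python
from datetime import datetime
from zoneinfo import ZoneInfo


def to_viewer_time(reported: str, date: str, source_tz: str, target_tz: str) -> datetime:
    """Convert a ticket's reported local time into the viewer's time zone.

    reported  -- time from the ticket, "HH:MM"
    date      -- date of the outage, "YYYY-MM-DD"
    source_tz -- IANA zone of the affected country, e.g. "Europe/Brussels"
    target_tz -- IANA zone of the person reading the graphs, e.g. "Europe/London"
    """
    naive = datetime.strptime(f"{date} {reported}", "%Y-%m-%d %H:%M")
    # Attach the source zone, then convert to the target zone.
    return naive.replace(tzinfo=ZoneInfo(source_tz)).astimezone(ZoneInfo(target_tz))


# Hypothetical ticket: outage reported at 15:16 Belgian time on an example date.
uk_time = to_viewer_time("15:16", "2024-03-05", "Europe/Brussels", "Europe/London")
print(uk_time.strftime("%H:%M"))  # 14:16
```

Using named zones rather than a fixed one-hour offset means the script stays correct across daylight saving changes.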


To triage, first check whether any BetRadar emails have been received advising of issues on their end. If this is the case, we would most likely lose the feed for all countries, so we would probably know about it early on.


Assuming nothing has been received from BetRadar, we then check network monitoring. It's important to check the connection from Austria to the satellite experiencing issues. The URL for this is http://netmon-at.stanleybet.com:4808/ and the login is your AD credentials.



Network Monitoring Tool:


The homepage will look like this:


The monitoring tool we want to use is the "Statistics" tab on the right-hand side of the page. From here, you'll be shown a list of connection names. The one we're looking for is the affected country's VPN. Once found, click on the server name.


This will display analytics of the connection from Austria to the server in question. By default, this will be filtered to the last 5 minutes of activity. To change this, head to the top right of the page and set the dropdown box to a time range that covers the reported outage.


On the left of the page will be a quick rundown of general performance, such as percentage packet loss, latency ranges and outages. On the right will be graphs displaying the same information. Clicking on these graphs will expand the page and give you the option to drag a selection over a specific time frame.


What we're looking for is ping spikes or packet loss at the same time the service outage was reported. In this example, the graph would look as follows:


As seen, there was a huge spike in latency at the same time the outage was reported. Because the services are so sensitive to network conditions, even minor packet loss or delays in packets being received can cause service outages. Usually, this isn't a sign of any further issues, as minor latency spikes and packet loss are expected every now and then. However, if a trend is noticed, it is worth bringing this to the attention of Infra for further investigation.


In the case of this example, we can see the issue was caused by a latency spike. For the ticket, paste a screenshot of this graph, add a note advising the cause of the issue and stating that no further investigation is needed, then resolve the ticket.
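The check done by eye on the graph (was there a spike in ping or packet loss around the reported outage time?) can also be written as a small script. This is only a sketch: the Sample structure, the thresholds and the readings are made up for illustration, and nothing here reflects how the monitoring tool actually stores or exports its data.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Sample:
    timestamp: datetime
    latency_ms: float
    packet_loss_pct: float


def anomalies_near_outage(samples, outage_time, window=timedelta(minutes=5),
                          max_latency_ms=150.0, max_loss_pct=1.0):
    """Return samples within `window` of the outage that breach either threshold.

    The default window and thresholds are hypothetical; pick values that match
    what is normal for the connection being checked.
    """
    return [
        s for s in samples
        if abs(s.timestamp - outage_time) <= window
        and (s.latency_ms > max_latency_ms or s.packet_loss_pct > max_loss_pct)
    ]


# Made-up readings, already converted to the same time zone as the outage time.
outage = datetime(2024, 3, 5, 14, 16)
samples = [
    Sample(datetime(2024, 3, 5, 14, 0), 32.0, 0.0),
    Sample(datetime(2024, 3, 5, 14, 15), 480.0, 3.5),
    Sample(datetime(2024, 3, 5, 14, 30), 35.0, 0.0),
]

hits = anomalies_near_outage(samples, outage)
print(len(hits))  # 1
```

An empty result corresponds to the case where no spike in ping or packet loss can be found at the time of the outage.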


If we cannot correlate the service outage to a network issue, we then need to pass the ticket on to the Application Support Team for further investigation.
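For reference, the overall triage flow in this document can be summarised as a small decision function. This is an illustrative sketch only; the names are hypothetical, and the follow-up when BetRadar has reported an issue is not covered in this document, so it is left as a plain marker.

```python
from enum import Enum


class Action(Enum):
    # Follow-up for this case is not described in this document.
    BETRADAR_ISSUE_REPORTED = "BetRadar has reported an issue on their end"
    RESOLVE_AS_NETWORK = "Attach graph screenshot, note minor network issue, resolve ticket"
    ESCALATE_TO_APP_SUPPORT = ("Attach graph screenshot, note that no network cause "
                               "was found, raise with Application Support")


def triage(betradar_issue_reported: bool, network_spike_at_outage: bool) -> Action:
    """Map the two first-line checks to the next action on the ticket."""
    # Step 1: BetRadar emails are checked before anything else.
    if betradar_issue_reported:
        return Action.BETRADAR_ISSUE_REPORTED
    # Step 2a: a ping or packet loss spike lines up with the outage time.
    if network_spike_at_outage:
        return Action.RESOLVE_AS_NETWORK
    # Step 2b: nothing network related could be found.
    return Action.ESCALATE_TO_APP_SUPPORT


print(triage(False, True).value)
print(triage(False, False).value)
```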