02/14 - Thursday

How many times did you check if the forum was back up yesterday?


  • Total voters: 55
  • Poll closed.
Status
Not open for further replies.

Metallica

Never free, never me
Contributor
Joined
Jan 17, 2016
Messages
5,319
Reaction score
14,228
Points
1,113
Age
33
Gender
Male
Welcome back, mturkcrowd. So what happened? I assume from the message that the hosting provider went down somehow?
Around 1 am yesterday, it was reported to me that the forum was inaccessible and was showing a database error landing page. A certificate had expired and there was also a database memory issue, but I did not have the access needed to fix either. When I was added to the DigitalOcean account, I was given access to only one server, and that became a problem yesterday. We had to get in touch with the "old" server guy to get this fixed. He ended up setting everything up as a managed cluster. He had to balance this with his job, which is why it took so long. Regardless, this makes administration a fair bit easier in the long run, and I now have access to everything like I should. It goes without saying that I am terribly sorry all this happened, but we should be better off if something like this happens again. More detailed info below:

We are now using Kubernetes for the orchestration engine - the thing that manages how all the containers work together.

It uses TLS certificates so the different pieces of the puzzle can authenticate to each other.

The master cert expired, and in that version of Kubernetes there wasn't an easy way to renew those certs. That has since been fixed. Anyway, the expired certificate made the control utility (kubectl) useless, so we couldn't restart the failed MySQL database pod.

So, the cluster cert had to be fixed before mysql could be fixed.
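
For anyone curious how an expiry like this could be caught ahead of time, here is a minimal sketch, not what was actually run on the cluster. It assumes you already have the certificate's notAfter string (the value that `openssl x509 -enddate -noout` prints after `notAfter=`). The function names and the 30-day warning window are made up for illustration.

```python
import ssl
import time
from typing import Optional

SECONDS_PER_DAY = 86400


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Return how many days remain before a certificate expires.

    not_after is the certificate's notAfter timestamp in the usual
    "Jan  5 09:34:43 2018 GMT" form. A negative result means the
    certificate has already expired.
    """
    if now is None:
        now = time.time()
    expires_at = ssl.cert_time_to_seconds(not_after)
    return (expires_at - now) / SECONDS_PER_DAY


def needs_renewal(not_after: str, now: Optional[float] = None,
                  warn_days: int = 30) -> bool:
    """True if the certificate is expired or inside the warning window.

    warn_days is an arbitrary threshold chosen for this example.
    """
    return days_until_expiry(not_after, now) <= warn_days


if __name__ == "__main__":
    # Hypothetical expiry date, only for demonstration.
    sample = "Jan  5 09:34:43 2018 GMT"
    print(f"days left: {days_until_expiry(sample):.1f}")
    print(f"needs renewal: {needs_renewal(sample)}")
```

Run from a daily cron job against each cert, something like this would have flagged the master cert weeks before kubectl stopped working.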

This basically meant redoing the Kubernetes cluster.

It was decided to use the new managed Kubernetes cluster option, instead of rolling our own again as was done before, to make it easier to admin going forward.

You won't have to worry about expiring certs and such. It's something DO takes care of.

However, this meant copying various data from one place to another, exporting and reimporting the MySQL database, and recreating the nginx, MySQL, PHP, and memcached deployments.
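
The export and reimport step could look roughly like the sketch below. This is an illustration under assumptions, not the commands that were actually used: the host names, user, database name, and dump path are all hypothetical, credentials are assumed to come from a MySQL option file, and `--single-transaction` assumes InnoDB tables.

```python
import subprocess
from typing import List


def dump_command(host: str, user: str, database: str, out_file: str) -> List[str]:
    """Build a mysqldump invocation that writes one database to out_file."""
    return [
        "mysqldump",
        f"--host={host}",
        f"--user={user}",
        "--single-transaction",  # consistent snapshot, assumes InnoDB tables
        f"--result-file={out_file}",
        database,
    ]


def restore_command(host: str, user: str, database: str) -> List[str]:
    """Build a mysql client invocation; the dump is fed to it on stdin."""
    return ["mysql", f"--host={host}", f"--user={user}", database]


def migrate(old_host: str, new_host: str, user: str,
            database: str, dump_file: str) -> None:
    """Export from the old server, then import into the new one."""
    subprocess.run(dump_command(old_host, user, database, dump_file), check=True)
    with open(dump_file, "rb") as fh:
        subprocess.run(restore_command(new_host, user, database),
                       stdin=fh, check=True)


if __name__ == "__main__":
    # Hypothetical names, printed only to show the resulting commands.
    print(" ".join(dump_command("old-db", "forum", "forum_db", "/tmp/forum.sql")))
    print(" ".join(restore_command("new-db", "forum", "forum_db")))
```

The target database has to exist on the new server before the import, since a plain single-database dump does not create it.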

This also gave me an excuse to give the night crew some more visibility into it.
 

turkleton

Muddarator
Joined
Jan 12, 2016
Messages
17,317
Reaction score
30,592
Points
1,814
Gender
Male
Oh, and here I thought it was because we were getting SO much traffic that it crashed the site.
 