implement haproxy check page that can be forced to return a 500 error

Bug #688503 reported by Tom Haddon
Affects: Launchpad itself
Status: Fix Released
Importance: High
Assigned to: Francis J. Lacoste
Milestone: (none)

Bug Description

This bug is intended to describe the solution to RT#41503. Essentially what we're looking for is:

- A lightweight page that haproxy can check (this would be checked every 2 seconds)
- A means of forcing that page to return a 500 (this could be triggered by sending a signal, checking for the presence of a file, etc.); a minimal sketch of this follows the list
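
As an illustration of the idea (a minimal sketch, not the actual Launchpad implementation), a tiny WSGI app could serve the check page and latch into the broken state on a signal. The /+check path, the port, and the response bodies are assumptions for this sketch; "groovy" is borrowed from the QA comment further down.

    import signal
    from wsgiref.simple_server import make_server

    going_down = False  # one-way flag: once down, stay down until restart

    def mark_going_down(signum, frame):
        global going_down
        going_down = True

    # SIGHUP flips the page into the broken state (and never back).
    signal.signal(signal.SIGHUP, mark_going_down)

    def app(environ, start_response):
        if environ.get('PATH_INFO') != '/+check':
            start_response('404 Not Found', [('Content-Type', 'text/plain')])
            return [b'not found']
        if going_down:
            # haproxy treats any non-2xx/3xx response as a failed check.
            start_response('500 Internal Server Error',
                           [('Content-Type', 'text/plain')])
            return [b'disabled']
        start_response('200 OK', [('Content-Type', 'text/plain')])
        return [b'groovy']

    if __name__ == '__main__':
        make_server('0.0.0.0', 8080, app).serve_forever()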

This will allow us to implement "no downtime" rolling upgrades (a sample haproxy configuration follows the list) by:

1) Force the check page to return a 500
2) Watch haproxy to confirm the service is reported as down, and wait until it reports no active connections to the instance in question
3) Stop the instance in question using the normal shutdown process
4) Start the service again using the new code
5) Haproxy will automatically see the new instance as "up" and start sending traffic to it
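
On the haproxy side, the check would look something like the following backend stanza. The backend name, the addresses, and the /+check path are made up for illustration; "inter 2000" gives the 2-second check interval mentioned above.

    backend launchpad-appservers
        # Any response other than 2xx/3xx counts as a failed check.
        option httpchk GET /+check
        # inter 2000: poll every 2000 ms; fall 3: mark down after three
        # failed checks; rise 2: mark up again after two good ones.
        server app1 10.0.0.1:8080 check inter 2000 fall 3 rise 2
        server app2 10.0.0.2:8080 check inter 2000 fall 3 rise 2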


Tom Haddon (mthaddon)
tags: added: canonical-losa-lp
Gary Poster (gary)
Changed in launchpad-foundations:
status: New → Triaged
importance: Undecided → High
tags: added: bugjam2010
Changed in launchpad:
assignee: nobody → Francis J. Lacoste (flacoste)
status: Triaged → In Progress
Revision history for this message
Launchpad QA Bot (lpqabot) wrote: Bug fixed by a commit
tags: added: qa-needstesting
Changed in launchpad:
status: In Progress → Fix Committed
Gary Poster (gary)
tags: added: qa-bad
removed: qa-needstesting
Revision history for this message
Gary Poster (gary) wrote:

The fix did not work on qastaging. Chex HUPped the appserver processes, but I continued to get 200/"groovy" responses from wget afterwards. Chex verified that neither haproxy nor squid was in front of qastaging.

lifeless requested that the HUP always turn the appserver into the "broken" state, rather than flipping it back and forth.
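
To make the "flip back and forth" concern concrete, a toggling handler along these lines (hypothetical, not Launchpad's actual code) would silently re-enable the check page on every second HUP; the one-way latch sketched under the bug description is what lifeless is asking for instead.

    import signal

    broken = False

    def toggle_on_hup(signum, frame):
        # Toggling: a stray or repeated HUP puts the appserver straight
        # back into rotation mid-upgrade.
        global broken
        broken = not broken

    signal.signal(signal.SIGHUP, toggle_on_hup)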

I am not sure right now why this did not work.

If I have to be the one to complete it, I will try to get it working locally, as Francis did. Then, if nothing comes to mind as to what might have gone wrong, I may add some logging (while making the change lifeless suggests). I don't know yet whether I'll have time for this.

Revision history for this message
Gary Poster (gary) wrote:

I'm changing this to qa-ok because there is no reason I know of for this not to be deployed. It does not break anything, and it does not expose broken functionality to the end user. That said, I'll move this bug back to Triaged.

tags: added: qa-ok
removed: qa-bad
Changed in launchpad:
status: Fix Committed → Triaged
Changed in launchpad:
status: Triaged → Fix Committed
Revision history for this message
Francis J. Lacoste (flacoste) wrote:

OK, it doesn't work in production because the servers are started through nohup... which makes them ignore SIGHUP!

I need to use another signal: SIGWINCH? SIGRTMIN?
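
For context, this is standard Unix behaviour rather than anything Launchpad-specific: nohup starts the child with SIGHUP set to SIG_IGN. An explicitly installed handler still overrides that, but code following the convention of leaving inherited-ignored signals alone would never install one, e.g. (a generic sketch):

    import signal

    # Under nohup, signal.getsignal(signal.SIGHUP) reports the inherited
    # SIG_IGN disposition, so this guard skips installing the handler.
    if signal.getsignal(signal.SIGHUP) is not signal.SIG_IGN:
        signal.signal(signal.SIGHUP, lambda signum, frame: None)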

Changed in launchpad:
status: Fix Committed → In Progress
Revision history for this message
Francis J. Lacoste (flacoste) wrote:

It's working fine, actually. The HUP disappearing was because the signal was sent to the wrong process!

Changed in launchpad:
status: In Progress → Fix Released