Virtual Party Space Devlog #23: Connection interrupted

Spent a solid week tweaking and testing my AWS setup, only to discover mysterious connection issues that would cause all players to get booted off the server if more than a few people connected at a time.

With only 2 weeks left before the originally scheduled event date, there’s no realistic way to fix this issue, actually finish the interface, and hold the event as planned. Putting this project on hold for now, with the mystery connection issue still unresolved.

Log

Thursday, February 4

turns out scripts are constantly scraping my app looking for admin pages or vulns!
- for some reason source IPs are not preserved though
- https://aws.amazon.com/premiumsupport/knowledge-center/elb-capture-client-ip-addresses/
finally put code on github in case something happens to my computer
looks like JVB has a memory leak? memory usage increases permanently after every CPU spike
- https://github.com/jitsi/jitsi-videobridge/issues/1396
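
A rough sketch of one way to recover the scraper IPs noted above, assuming the load balancer has an HTTP(S) listener that appends the connecting address to the X-Forwarded-For header (a TCP/TLS listener would need proxy protocol instead, as the linked article describes). The function name and the trustedProxies parameter are made up for illustration and are not part of this app.

```javascript
// Hypothetical helper: pick the client IP out of an X-Forwarded-For value.
// Assumption: each trusted proxy (here, just the load balancer) appends the
// address that connected to it, so the entry to trust is counted from the right.
// Entries further left can be set by the client and should not be trusted.
function clientIpFromForwardedFor(headerValue, trustedProxies = 1) {
  if (typeof headerValue !== 'string') return null;

  const entries = headerValue
    .split(',')
    .map((entry) => entry.trim())
    .filter((entry) => entry !== '');

  const index = entries.length - trustedProxies;
  return index >= 0 && index < entries.length ? entries[index] : null;
}

// Example: a scraper that sends its own fake header through the load balancer.
console.log(clientIpFromForwardedFor('10.0.0.1, 203.0.113.7'));
```

Counting from the right matters here because the requests come from scanners, which can send any X-Forwarded-For value they like.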

Friday, February 5

looked at malleus, going to take some Java work to try to make it test my frontend
manual testing by opening a bunch of windows first
- default config starts dropping connections at >4 people in conference
Can keep a permanent loadbalancer while killing the rest of the stack
- Use x-aws-loadbalancer as a top-level element in your Compose file to set the ARN of an existing LoadBalancer.

Sunday, February 7

trying out new cloudformation configs
- permanent load balancer, set using x-aws-loadbalancer
  - took a long time for this to stabilize after bringing up, but seems to work now
- setting HTTP2Preferred on TLS endpoint
  - broke everything, probably because of the hack to make non-SSL port 443 in nginx.
  - need to figure out how to re-encrypt traffic to instance while still keeping ELB certificate
- greater capacity for JVB server (8 GB RAM)
  - problem seems to be CPU, not RAM, but 8 GB is recommended for high load
  - can’t set both CPU and memory, so going with RAM
  - setting 8 GB forces 1 vCPU, rather than default of .25 or .5
  - https://aws.amazon.com/premiumsupport/knowledge-center/ecs-fargate-high-cpu-utilization/
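
Why 8 GB ends up with 1 vCPU: Fargate only accepts certain CPU/memory pairs, and 1 vCPU is the smallest size that allows 8 GB. A small lookup sketch, using the task sizes as I understand them from the AWS docs at the time (larger sizes may have been added since). The names FARGATE_SIZES and smallestCpuFor are made up for illustration.

```javascript
// Inclusive range helper: range(1024, 4096, 1024) -> [1024, 2048, 3072, 4096]
function range(start, end, step) {
  const values = [];
  for (let v = start; v <= end; v += step) values.push(v);
  return values;
}

// Assumed Fargate task sizes: CPU units (1024 = 1 vCPU) -> allowed memory in MiB.
const FARGATE_SIZES = [
  { cpu: 256, memory: [512, 1024, 2048] },
  { cpu: 512, memory: range(1024, 4096, 1024) },
  { cpu: 1024, memory: range(2048, 8192, 1024) },
  { cpu: 2048, memory: range(4096, 16384, 1024) },
  { cpu: 4096, memory: range(8192, 30720, 1024) },
];

// Smallest CPU size (in CPU units) that permits the requested memory,
// or null if no size in the table allows that exact amount.
function smallestCpuFor(memoryMiB) {
  const match = FARGATE_SIZES.find((size) => size.memory.includes(memoryMiB));
  return match ? match.cpu : null;
}

console.log(smallestCpuFor(8192)); // 1024 CPU units, i.e. 1 vCPU
```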

Oh no a big problem

Can’t seem to stably handle more than 4 concurrent users
- jicofo reports that users are sending “MemberLeft” messages “Chat room event ChatRoomMemberPresenceChangeEvent”
- JS: Lots of “Ping timeout” from modules/xmpp/strophe.ping.js
- JS: Often gets stuck in “Bridge Channel send: no opened channel.” from modules/RTC/BridgeChannel.js (groups of 4 every 10s)
- THEORY: nginx connections get saturated, leading to timeouts
  - default configs for Fargate task of .25 vCPU probably means work gets blocked if co-sharing
  - tried creating a second task instance, but HTTPS traffic was never routed to it (WHY?)
- turning on access logs for ELB

Tuesday, February 9

character creator
- https://www.dropzonejs.com/#usage
- http://camanjs.com/
- no good existing dollmaker open source projects

Thursday, February 11

trying to figure out why connections dropping
- testing locally
  - local build requires local environment variables to be built into JS project
    - add path parameter to Dotenv instantiation in webpack.config to set to .env.local
    - rebuild docker container for local tag: docker build -t sfarqu/party-app:latest .
    - use docker-compose and local .env file: docker-compose --env-file .env.local up -d
  - 5 connections locally seem to run fine, but tons of “Bridge Channel send: no opened channel.” errors
    - probably related to “Firefox can’t establish a connection to the server at wss://meet.chicazul.com/colibri-ws/172.25.0.5/…”
    - does this mean that app is failing to set up videobridge connections and is instead using P2P?
      - that could explain why local is way more stable than AWS version
    - NEW INVESTIGATION TRACK: figure out why colibri websocket connections are failing
    - don’t recall consistently seeing that error in AWS version, so may be red herring
- Investigating “connection interrupted” error in lib-jitsi-meet
  - sometimes it was a different error, so it’s not just this thing, but it was an issue
  - both P2P errors and JVB errors throw the same interruption message.
  - in theory interruptions should be recoverable…how?
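
One possible shape for the "recoverable" question above: wait out a grace period after an interruption and only give up if no restore arrives. This is a sketch, not something tested against lib-jitsi-meet. watchInterruptions and its options are hypothetical names, the event names are passed in rather than hard-coded, and the only thing assumed about the conference object is an on(eventName, handler) method.

```javascript
// Hypothetical helper: call onGiveUp only if an interruption lasts longer
// than graceMs without a matching restore event.
// setTimer/clearTimer default to the real timers and can be swapped for testing.
function watchInterruptions(conference, events, options) {
  const {
    graceMs,
    onGiveUp,
    setTimer = setTimeout,
    clearTimer = clearTimeout,
  } = options;
  let timer = null;

  conference.on(events.interrupted, () => {
    if (timer !== null) return; // already waiting on an earlier interruption
    timer = setTimer(() => {
      timer = null;
      onGiveUp(); // e.g. show a "rejoin" prompt or reconnect
    }, graceMs);
  });

  conference.on(events.restored, () => {
    if (timer !== null) {
      clearTimer(timer); // recovered in time, nothing to do
      timer = null;
    }
  });
}
```

If I remember right, lib-jitsi-meet exposes CONNECTION_INTERRUPTED and CONNECTION_RESTORED under JitsiMeetJS.events.conference, which would be the values to pass as events.interrupted and events.restored. Since P2P and JVB failures throw the same interruption message, a watcher like this can't tell them apart on its own.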

About this series

Back in mid-December I started an ambitious project to create a custom platform for a virtual birthday party in February. I kept notes on my progress, both for personal reference and to turn into a series of blog posts. It quickly became apparent that I did not have time to both do the project and blog about the project. I have retroactively decided to post my raw notes as a dev log.