#1
by
jkemp
Hi, I'm working on an online experiment in oTree. We plan to have 1,800 participants take part over five hours of participation time, i.e. 360 participants per hour. The experimental game is played over 10 rounds in fixed 3-person groups. I expect this to be quite a heavy load, so I tested our experiment with browser bots and Locust, a load-testing framework.

First, the Heroku server setup:

Web dyno: Standard-2X
Worker dyno: Standard-2X
PostgreSQL: Standard-0

These are the highest tiers available in the oTree Hub interface. Still, when I load-test with Locust and browser bots (201 participants, one participant per second joining the experiment, total test duration about 15 minutes), I get a median response time of over 10 seconds.

I checked the PostgreSQL database: the index cache hit rate and the table cache hit rate are both at 100%. Under Diagnose -> "Most time consuming" I found no query exceeding an average time per invocation of 5 ms. So I believe the database tier is set properly.

I checked the dynos: my web dyno had a maximum dyno load of 0.97, and the load average over the duration of the test was about 0.5. My worker dyno had a dyno load of 0.

The Locust data confirms the response-time problem: the server handled only about 10 requests per second, and the median response time was 14 seconds. No failures or exceptions were raised.

My questions are:

1. Why did the worker dyno have no load? Could this explain the slow performance? What can I do to shift load to the worker dyno?
2. What can I do to improve performance? Would a higher-tier database or higher-tier dynos help (e.g. going for Performance-L dynos)?

Let me know if you need more information. Thanks in advance and best regards
Jakob
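PS: in case it helps to see what I mean by the Locust side, a simulated participant is just a standard Locust HttpUser. Below is a stripped-down illustration of the shape only, not my actual script; the join URL, page path and wait times are placeholders.

```python
from locust import HttpUser, task, between


class SimulatedParticipant(HttpUser):
    """Rough sketch of one simulated participant (placeholder URLs)."""

    # How long a simulated participant "reads" before the next request;
    # this is the main knob for how aggressive the load test is.
    wait_time = between(5, 15)

    def on_start(self):
        # Hypothetical session-wide join link; replace with your own.
        self.client.get("/join/placeholder_code", name="join session")

    @task
    def load_current_page(self):
        # Placeholder request; the full script also submits each page's form.
        self.client.get("/placeholder/current_page", name="current page")
```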
#2
by
Chris_oTree
Hi, thanks for providing the useful details.

In oTree 5, the worker dyno is not used. That is normal and you can actually turn it off.

10 requests per second seems fine to me. Do you expect the experiment to get more than that in practice? It seems many people would need to be clicking through pages very quickly, at the exact same time, to get up to that amount of traffic.

In your experiment, how long were the bots staying on each page? I guess about 20 seconds, because if each request takes 10 seconds, then that would be 20 seconds for GET+POST combined. The browser bots produce extreme congestion on the server because they submit the page immediately, but in real life, participants typically spend time on the page, read the content, etc. So if the average page takes 20 seconds or more for a participant to get through, then your current configuration should be fine (assuming 200 participants playing online at the same time). See the note in the performance section here: https://otree.readthedocs.io/en/latest/server/heroku.html#server-performance.

The games with the highest server load are the ones with players going through many pages very quickly, like 100+ rounds. In those cases, you can consider switching your game to use live pages instead, which are faster. (I don't know how Locust works with oTree so I can't comment on that part, just normal browser bots.)

From what you showed, it seems CPU is the bottleneck. So if the above advice doesn't solve the issue, you can switch to Heroku's more expensive dynos, which can be done through the Heroku dashboard (not oTree Hub, since it typically isn't necessary).
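To illustrate the live-pages suggestion: instead of a full GET+POST round trip for each decision, the template calls liveSend() and the server handles the message in a live_method on the Page class. A minimal sketch (the page and field names here are just examples, not from your app):

```python
from otree.api import *

# ... models omitted ...


class Contribute(Page):
    @staticmethod
    def live_method(player, data):
        # 'data' is whatever the template sent via liveSend().
        # Return a dict keyed by id_in_group to reply to specific players;
        # {0: ...} would broadcast to the whole group.
        return {player.id_in_group: dict(status='received', echo=data)}
```

On the template side, liveSend(...) sends the message and a liveRecv(data) function receives the reply, so many small interactions can happen without reloading the page.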
#3
by
jkemp
Hi Chris, thank you very much for the in-depth reply.

I expect at most 200 concurrent participants, so if I assume a more conservative estimate of participants taking 10 seconds per page, I get to 20 requests per second on the server. Because I'd rather err on the side of caution, I will try a higher dyno tier for the upcoming tests.

We only have 10 rounds, so switching to live pages is likely not the solution for us.

I added the script I used to load-test my app with Locust to GitHub: https://github.com/thegempie/locust-otree

Best regards
Jakob
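PS: for anyone who wants to redo the back-of-the-envelope estimate, here it is as a tiny snippet. Whether a page counts as one request or two (GET and POST separately) is an assumption you can vary; the numbers below are just the illustrative cases from this thread.

```python
def requests_per_second(participants, seconds_per_page, requests_per_page):
    """Average request rate if each participant finishes a page every
    `seconds_per_page` seconds and each page costs `requests_per_page`
    HTTP requests."""
    return participants * requests_per_page / seconds_per_page


# 200 concurrent participants, 10 seconds per page:
print(requests_per_second(200, 10, requests_per_page=1))  # 20.0 (page submissions only)
print(requests_per_second(200, 10, requests_per_page=2))  # 40.0 (GET and POST counted separately)
```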