OYO Life : Journey from blackbox to stable release…

memories of Memory leak 😎

May 29, 2021 · 7 min read

Overview

Have you ever worked on a Node application or tried to scale one up for a larger audience? Hmm... then you know what I am about to describe here.

A couple of weeks back, at the end of March 2021, we completed our ~4-month-long project of redesigning m-web (the mobile site).

Finally, the time came that I guess every developer dreams 💭 of: releasing a consumer-side feature that is going to directly boost long-term stays of people in India.

Yeah, absolutely correct! The Deployment Day…

On release day we were ready with the staging tests, and it felt like a normal deployment day, because at that point we had no idea that D-day was about to begin for our team.

So, the first deployment went into production on 15th March.

Yes, you read that right... the first deployment 🙈!!

And then Boom 💥 🤯 !!!

After 15 minutes, all the API calls started failing, and gradually, as the release reached its maximum audience, our website started returning 503, i.e. Service Temporarily Unavailable.

For people from a non-tech background, or those who haven’t heard of this status code, see this nightmare:

Our whole consumer website was down; after some time it came back up. So the immediate action was to revert and deploy the older build to make things stable.

Now we were bombarded with questions, personal messages, mails, etc.: the site was down, why did we revert the code? Why can't we see the new m-web? Are we not launching today? The website is throwing 503!! Blah blah...

In the middle of all this, what I found was that our older build worked fine, but releasing the new redesign code crashed the website.

So the culprit was the new redesign code. Hurray 🎉 …. we conquered!

But wait… ✋

Where to start? What to look at? Should we investigate the whole code of 544 files, or the libraries added in package.json 🤔 !!

Liveness probe failed: Get http://{some-ip}: dial tcp {some-ip:port}: connect : connection refused.

With help from DevOps, we tried tuning the kubeapps parameters, like increasing memory, adjusting the health status check, timeoutSeconds, etc. But unfortunately nothing worked.

The next day we made some code changes and were ready for the second deployment.

Unfortunately, no luck… ! 😞

After a discussion with our team manager (currently DoE), we came up with an approach to debug this situation without impacting our end customers: create a temporary environment similar to production.

The next morning we set up the temporary ENV. By that time we were half-drowned in frustration and gradually losing the moment of triumph we had imagined before the release.

Initially it worked on the temp ENV, but as the load increased it reached the same point. So we took help from cross-team devs and ran a load test on our website using JMeter (testers usually use this tool for testing use cases).

Why did we need that? The load test helped us capture the min and max of our CPU and memory usage. If there are crashes or logs you want to check without affecting live users, you can try it out.
From these insights we found that at ~8,000 (8k) concurrent connection requests our site started crashing.

~~~~~~~~ Weekend Arrives ~~~~~~~

During the weekend I had enough time to dedicate to resolving this issue: no meetings, no Slack.

Picking up from the last lead we got, I measured CPU and memory usage while stepping the load from 4k to 8k to 20k connection requests and noted the max/min for each.
Based on that, I fine-tuned our service YAML file (the Kubernetes configuration file) and set limits/requests for memory and CPU usage. I also reduced the target average utilization of CPU/memory to 50%, so that the pods auto-scale once usage crosses that threshold instead of hitting the max (based on suggestions from my manager on 📞).
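To see why lowering the target to 50% makes scaling kick in earlier, here is a rough sketch of the replica arithmetic. It assumes the autoscaler is the Kubernetes Horizontal Pod Autoscaler (the post does not name it) and uses the formula from the Kubernetes docs; the function name and all the numbers are made up for illustration. The real autoscaler also applies a tolerance band and min/max replica bounds, which this sketch ignores.

```javascript
// Replica calculation documented for the Kubernetes Horizontal Pod Autoscaler:
//   desired = ceil(currentReplicas * currentUtilization / targetUtilization)
// Utilization is a percentage of the pod's resource *request*.
// desiredReplicas is a hypothetical helper name; the inputs are example values.
function desiredReplicas(currentReplicas, currentUtilization, targetUtilization) {
  if (targetUtilization <= 0) {
    throw new RangeError('targetUtilization must be positive');
  }
  return Math.ceil(currentReplicas * (currentUtilization / targetUtilization));
}

// 4 pods averaging 80% CPU against a 50% target: scale out to 7 pods.
console.log(desiredReplicas(4, 80, 50)); // 7

// The same load against an 80% target would leave the deployment at 4 pods.
console.log(desiredReplicas(4, 80, 80)); // 4
```

With a lower target, the same traffic produces more replicas, so each pod keeps more headroom before it reaches its memory or CPU limit.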

I monitored the whole weekend and things were fine: no pod restarts, a good signal :)
The kubectl top pods command gives you the real-time CPU and memory usage of the currently running pods.

Again, ready for the next battle, because it was Monday. From the morning, calls and messages started coming in, asking about the release. But I was happy that I had a backup configuration ready that had been stable during the whole weekend.

By midday, after all kinds of testing, we got sign-off for the release on prod.
The third release was ready to go and strike in production. Yeah 🤞 !
After the release I monitored everything for 30 minutes; things were stable and normal (no downtime, no restarts so far).
Three hours in, things were still stable, and as time passed we got closer to victory ✌️ .
And yes, we did it….

There were a few restarts around midnight but no downtime.

So the next day we again connected with cross-team devs whose applications run on Node and got a lead from them: watch the pod logs just before the restarts to see what causes them.

Kubernetes command to check the logs of a terminated pod (for example, kubectl logs <pod-name> --previous)

Digging deep into code 👨‍💻

Now it’s time to dig deep into each function, API call, log flow, installed library/package, etc.

NPM dependency check command:

npx depcheck

In a Node application, memory management is a crucial part.

Overview of Single threaded application

JavaScript automatically allocates memory when objects are created and frees it when they are not used anymore (garbage collection). This automatic process is a potential source of confusion: it can give developers the false impression that they don’t need to worry about memory management.
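One low-effort way to stop guessing is to look at the numbers Node itself reports. The snippet below is a minimal sketch built on process.memoryUsage(), which is part of Node; the helper names toMB and memorySnapshot are made up for this example and are not from our codebase.

```javascript
// Convert bytes to megabytes, rounded to one decimal place.
const toMB = (bytes) => Math.round((bytes / 1024 / 1024) * 10) / 10;

// memorySnapshot is a hypothetical helper name, not part of Node.js.
// process.memoryUsage() itself is a built-in Node.js API.
function memorySnapshot() {
  const { rss, heapTotal, heapUsed, external } = process.memoryUsage();
  return {
    rssMB: toMB(rss),             // total memory held by the process
    heapTotalMB: toMB(heapTotal), // heap reserved by V8
    heapUsedMB: toMB(heapUsed),   // heap actually occupied by live objects
    externalMB: toMB(external),   // memory of C++ objects bound to JS objects
  };
}

console.log(memorySnapshot());
```

Logging this at an interval, or on every Nth request, shows whether heapUsedMB keeps climbing under steady load, which is the usual hint that something is leaking.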

We were getting Uncaught Exceptions and Unhandled Rejections in the logs, so we added this piece to our server file.

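process
  // Log promise rejections that no .catch() handled.
  .on('unhandledRejection', (reason, p) => {
    console.error(reason, 'Unhandled Rejection at Promise', p);
  })
  // Log the exception, then exit: the process state is no longer reliable.
  .on('uncaughtException', err => {
    console.error(err, 'Uncaught Exception thrown');
    process.exit(1);
  });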

Reference: StackOverflow

Right after adding this code and deploying it to the temp ENV, we were able to capture the last log, i.e. the reason the restart was happening (since process.exit() terminates the process right there).

Javascript Heap Out of Memory Sample

Mark-and-sweep algorithm

This algorithm reduces the definition of “an object is no longer needed” to “an object is unreachable”

The algorithm starts from the root of the application. For the browser, the root is the window, and for Node.js it is the global object.

Using this algorithm, the GC will identify the reachable and unreachable objects. All the unreachable objects will be automatically garbage collected.
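To make the idea concrete, here is a toy version of mark-and-sweep over a hand-written object graph. This is only a teaching sketch: the graph, the ids and the function names are invented, and V8's real collector is far more sophisticated (generational, incremental and so on).

```javascript
// Toy heap: each key is an object id, each value lists the ids it references.
// The ids and the shape of this graph are invented for illustration.
const heap = {
  root: ['config', 'cache'],
  config: [],
  cache: ['user1'],
  user1: [],
  orphanA: ['orphanB'], // orphanA and orphanB reference each other...
  orphanB: ['orphanA'], // ...but nothing reachable from root points at them
};

// Mark phase: walk every reference starting from the root.
function mark(graph, rootId) {
  const reachable = new Set();
  const stack = [rootId];
  while (stack.length > 0) {
    const id = stack.pop();
    if (reachable.has(id)) continue; // already visited
    reachable.add(id);
    for (const ref of graph[id] || []) stack.push(ref);
  }
  return reachable;
}

// Sweep phase: anything that was not marked is garbage.
function sweep(graph, reachable) {
  return Object.keys(graph).filter((id) => !reachable.has(id));
}

const reachable = mark(heap, 'root');
console.log(sweep(heap, reachable)); // [ 'orphanA', 'orphanB' ]
```

Notice that the two orphans point at each other and still get collected, because reachability from the root is all that matters.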

Memory Leaks

JavaScript memory leaks are caused by invalid logical flow in the code.

A memory leak can be defined as a piece of memory that is no longer used or required by an application but, for some reason, is not returned to the OS and stays occupied needlessly.
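A typical way this happens in a long-running Node server is a module-level collection that only ever grows. The sketch below is a generic illustration, not the actual leak from our codebase; the function names and MAX_ENTRIES are hypothetical.

```javascript
// Leaky pattern: a module-level Map that gains an entry per request and is
// never cleared. Every value stays reachable, so the GC cannot free it.
const leakyCache = new Map();
function rememberForever(requestId, payload) {
  leakyCache.set(requestId, payload);
}

// One possible fix: cap the number of entries and evict the oldest one.
// MAX_ENTRIES and both function names are hypothetical.
const MAX_ENTRIES = 1000;
const boundedCache = new Map();
function rememberBounded(requestId, payload) {
  if (boundedCache.size >= MAX_ENTRIES && !boundedCache.has(requestId)) {
    // A Map iterates in insertion order, so the first key is the oldest.
    const oldestKey = boundedCache.keys().next().value;
    boundedCache.delete(oldestKey);
  }
  boundedCache.set(requestId, payload);
}

// Simulate 5000 requests hitting both caches.
for (let i = 0; i < 5000; i++) {
  rememberForever(i, { body: 'x'.repeat(100) });
  rememberBounded(i, { body: 'x'.repeat(100) });
}
console.log(leakyCache.size, boundedCache.size); // 5000 1000
```

The leaky version grows with traffic until the heap limit is hit, while the bounded one settles at a fixed size.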

So we checked the official Node.js documentation, where we found this:

The max amount of available heap memory can be increased with a flag

We tweaked this parameter according to our service usage and infrastructure.

One strict callout for the above command: put --max-old-space-size after the node command, not after the filename index.js, i.e. node --max-old-space-size=<size in MB> index.js.
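If you want to confirm that the flag was actually picked up, you can ask V8 for its heap limit from inside the process. This is a small sketch using the built-in v8 module; heapLimitMB is a made-up helper name.

```javascript
const v8 = require('v8');

// heap_size_limit is reported in bytes; convert it to whole megabytes.
// heapLimitMB is a hypothetical helper name for this example.
function heapLimitMB() {
  return Math.round(v8.getHeapStatistics().heap_size_limit / 1024 / 1024);
}

console.log(`V8 heap limit: ~${heapLimitMB()} MB`);
```

Run it once with the flag and once without. When the flag is placed correctly, the reported limit should move to roughly the value you passed; when it is placed after the filename, the number stays at the default.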

Summary

Just to summarise, here are the further steps we took (we were getting one lead after another):

1. Nginx configuration file changes:

2. Remove or turn off all logging/tracking middlewares, such as APM, Kafka, UTM tracking, Prerender (SEO), etc.

3. Now code-level debugging: in your main server file, remove unwanted console.log calls that just print a message.

4. Now check every server route, i.e. app.get(), app.post() ..., for the scenarios below:

The error part: some people name the variable error and others err, which creates confusion and mistakes, like this:

Remove winston, elastic-apm-node and other unused libraries; the npx depcheck command helps you find them.
In the axios service file, check whether the methods and their logging are handled properly, like:

Save yourself from using stringify, parse, etc. on excessively large datasets.
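Going back to the error versus err naming point in the list above, here is one way that mix-up can bite. This is a stripped-down sketch with plain functions instead of real route handlers; buggyHandler and fixedHandler are hypothetical names.

```javascript
// Buggy: the parameter is named `err`, but the body reads `error`.
// `error` is not defined here, so the handler throws a ReferenceError
// while trying to report the original failure.
function buggyHandler(err) {
  return `Request failed: ${error.message}`;
}

// Fixed: one name, used consistently.
function fixedHandler(err) {
  return `Request failed: ${err.message}`;
}

const failure = new Error('upstream timeout');

try {
  buggyHandler(failure);
} catch (e) {
  console.log(e.name); // ReferenceError
}
console.log(fixedHandler(failure)); // Request failed: upstream timeout
```

Inside a promise chain or an async route, that second error can surface as an unhandled rejection, so picking one name and sticking to it across the codebase is worth the small effort.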

Last but not least, every time your application crashes or pods restart, check the logs. In the logs, look specifically for this section: Security context (refer to the Javascript Heap Out of Memory Sample screenshot). This will tell you why, where and who is the culprit behind the crash.

And finally, on 25th March 2021, we successfully released our m-web in production, and it was a stable release. 🎉 🥳

Thanks for your time; I hope you enjoyed and learnt from this journey! 😉

Do check out our mobile site 👉🏻 OYO Life. 🕺🏼

References

https://hacks.mozilla.org/2012/11/tracking-down-memory-leaks-in-node-js-a-node-js-holiday-season/
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Memory_Management
https://stackoverflow.com/questions/38558989/node-js-heap-out-of-memory/66914674#66914674
Talk to your teammates /cross-team devs
Do sessions with your Managers
https://blog.sessionstack.com/how-javascript-works-memory-management-how-to-handle-4-common-memory-leaks-3f28b94cfbec