Open Sourcing iris-message-processor | LinkedIn Engineering


One measure of a successful network is uptime – providing consistent, reliable service for members and customers. If there are frequent connection errors or downtime notifications, it becomes difficult to deliver an experience where people can connect and interact with ease. When faced with uptime challenges, being able to quickly escalate issues to network engineers helps ensure that people can work the way that they want to.

At LinkedIn, escalations encompass various events, including alerts, system change notifications, and automated actions that require an engineer’s acknowledgment to proceed. These events follow a customizable escalation plan that generates notifications (often with increasing urgency) until an engineer claims the event or the needed steps are completed.
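To make the idea of an escalation plan concrete, here is a minimal Python sketch of the loop described above: notifications go out step by step, with increasing urgency, until someone claims the event. This is an illustration only, not Iris's actual data model or API. The `Step` class, the `run_escalation` function, the role names, and the notification modes are all hypothetical, and a real system would wait between attempts and deliver real messages rather than record strings.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One stage of a hypothetical escalation plan (not Iris's real schema)."""
    target: str   # who gets notified, e.g. a role name
    mode: str     # how they are notified, e.g. "email", "sms", "call"
    repeat: int   # notifications to send at this stage before escalating


def run_escalation(plan: List[Step], is_claimed: Callable[[int], bool]) -> List[str]:
    """Walk the plan in order and stop as soon as the event is claimed.

    `is_claimed` receives the number of notifications sent so far and
    returns True once an engineer has claimed the event.
    """
    sent: List[str] = []
    for step in plan:
        for attempt in range(1, step.repeat + 1):
            if is_claimed(len(sent)):
                return sent
            # A real system would deliver the message here and then wait.
            sent.append(f"{step.mode} -> {step.target} (attempt {attempt})")
    return sent


# Example plan: urgency rises from email to SMS to a phone call.
example_plan = [
    Step(target="primary on-call", mode="email", repeat=2),
    Step(target="primary on-call", mode="sms", repeat=1),
    Step(target="secondary on-call", mode="call", repeat=2),
]
```

Calling `run_escalation(example_plan, lambda n: n >= 3)` simulates an engineer claiming the event after the third notification, so the secondary on-call is never paged. Passing `lambda n: False` runs the plan to exhaustion.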

To manage our on-call escalations, we built Iris and Oncall, two open-source tools that we introduced to the community approximately six years ago. Oncall lets our teams manage their on-call shifts through automated scheduling and a suite of calendar management tools. Iris uses the data provided by Oncall to promptly alert on-call engineers when issues arise and to escalate if required. Developers can create their own escalation plans and message templates, which gives them control over who receives alerts and what those alerts say. Because of its ease of use and flexibility, Iris has also become LinkedIn’s internal message delivery platform, sending out alerts, deployment notifications, security notices, and more.

Together, these tools deliver flexibility, customization, and simplicity in managing on-call escalations and can be used as low-cost replacements for off-the-shelf incident response platforms like PagerDuty. Currently, Iris and Oncall have more than 350 forks and 1,700 stars on GitHub.

In this post, we’ll discuss how we used Iris to both scale up (~2000%) and speed up (~86x) our incident management system. We’ll also share how this journey resulted in an incredibly robust and effective system that we open-sourced, allowing it to be easily and freely deployed by any company.

Growth of Iris within LinkedIn

When we first open sourced Iris, we published a blog post that included a graph showing the month-by-month growth of Iris escalations at LinkedIn. At the time, adoption of Iris had been growing steadily, and Iris was integrated into the major components of our alerting infrastructure.


