These are my reading notes for Chapter 10 of the book “System Design Interview – An insider’s guide (Vol. 1)”.
Overview
A notification system is essential for many applications, enabling them to communicate directly with users through various channels such as push notifications, SMS, and email. The goal is to design a scalable, fault-tolerant, and responsive system that supports multiple notification types while respecting user preferences, handling errors, and operating reliably at scale.
Problem Scope and Requirements
The chapter sets out the basic requirements for the notification system:
- Scale: The system must support millions of notifications daily (e.g., 10 million push notifications, 1 million SMS, and 5 million emails).
- Delivery Time Sensitivity: Time-critical notifications (e.g., instant alerts for messages) should go out in real time, while less urgent ones (e.g., promotional emails) can tolerate near-real-time delivery.
- User Preferences: Users must be able to opt in/out of notifications for specific types of communication, and the system must respect these settings.
- Multi-Platform Support: The system must deliver notifications to multiple platforms such as iOS (APNS), Android (FCM), and web platforms.
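The daily totals above translate into fairly modest average rates. Here is a quick back-of-the-envelope sketch; the peak factor of 5 is my own assumption for illustration, not a number from the chapter.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# Daily volumes quoted in the requirements above.
daily_volume = {"push": 10_000_000, "sms": 1_000_000, "email": 5_000_000}

# Assumed peak-to-average ratio (hypothetical, not from the chapter).
PEAK_FACTOR = 5


def average_per_second(per_day: int) -> float:
    """Average send rate if traffic were spread evenly over the day."""
    return per_day / SECONDS_PER_DAY


for channel, per_day in daily_volume.items():
    avg = average_per_second(per_day)
    print(f"{channel}: ~{avg:.0f}/s average, ~{avg * PEAK_FACTOR:.0f}/s at assumed peak")
```

For example, 10 million push notifications a day works out to roughly 116 per second on average, so the real sizing question is how bursty the traffic is.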
Example Requirement: A payment reminder system might need to notify a user via push notification if their payment fails, followed by an email summarizing the issue, and finally, an SMS if they haven’t responded in 24 hours.
High-Level Design
The chapter breaks the system down into key components, emphasizing decoupling and scalability:
- Service Layer: Different services within the application (e.g., payment service, messaging service) send requests to the notification system.
- Notification Service: This component is the entry point for notification requests. It accepts requests from the service layer and queues them for delivery, which ultimately goes through third-party services like APNS (iOS), FCM (Android), SMS gateways, and email providers.
- Message Queue: To ensure reliability and fault tolerance, a message queue (e.g., RabbitMQ, Kafka, or AWS SQS) is introduced to handle the asynchronous nature of notification delivery. This allows the system to queue notifications and process them even if some components (like third-party gateways) are temporarily unavailable.
- Worker Service: Dedicated workers are responsible for dequeuing messages from the queue and processing them, including sending the messages to the appropriate third-party service.
Example of the Flow:
- A payment failure triggers an event in the Payment Service.
- The event is sent to the Notification Service, which queues the request.
- A worker retrieves the message from the queue and sends the notification via the appropriate channel (push, SMS, or email) depending on the user’s preferences and the urgency of the message.
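The three steps above can be sketched with an in-memory queue. This is only a toy model under my own assumptions: `queue.Queue` stands in for RabbitMQ/Kafka/SQS, the sender functions just record what they would send, and all names (`NotificationRequest`, `enqueue`, `worker_drain`) are hypothetical.

```python
import queue
from dataclasses import dataclass


@dataclass
class NotificationRequest:
    user_id: str
    channel: str  # "push", "sms", or "email"
    body: str


# In-memory stand-in for the message queue.
notification_queue = queue.Queue()

# Records what was "sent"; a real worker would call APNS, FCM, an SMS
# gateway, or an email provider here.
sent_log = []


def make_sender(channel):
    def send(req):
        sent_log.append((channel, req.user_id, req.body))
    return send


SENDERS = {name: make_sender(name) for name in ("push", "sms", "email")}


def enqueue(req):
    """Step 2: the Notification Service accepts the event and queues it."""
    notification_queue.put(req)


def worker_drain():
    """Step 3: a worker dequeues messages and dispatches them by channel."""
    processed = 0
    while True:
        try:
            req = notification_queue.get_nowait()
        except queue.Empty:
            return processed
        SENDERS[req.channel](req)
        notification_queue.task_done()
        processed += 1


# Step 1: the Payment Service raises a payment-failure event.
enqueue(NotificationRequest("user-42", "push", "Your payment failed"))
worker_drain()
```

The point of the queue is that `enqueue` returns immediately, so the Payment Service never waits on a slow or unavailable third-party gateway.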
Reliability and Fault Tolerance
Single Point of Failure (SPOF)
The initial design could suffer from a single point of failure: if, for example, the only notification server goes down, no notifications are sent. To address this, the system is distributed across multiple servers, and horizontal scaling lets it absorb growing notification volume by adding more of them.
Retries and Dead Letter Queues
In a real-world scenario, failures happen (e.g., network issues, third-party outages). To ensure reliability, the system needs a retry mechanism. If sending a notification fails, the message is requeued for retry after a short delay. After a predefined number of failed attempts, the message is moved to a dead-letter queue for further analysis.
Example:
- A worker attempts to send a push notification to APNS, but the request fails due to a network issue. The message is requeued for retry after 5 seconds.
- After three failed attempts, the message is moved to a dead-letter queue and logged for investigation, possibly triggering an alert for manual review.
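A minimal sketch of that retry policy, using the numbers from the example (5-second delay, three attempts). The `Message` class, the simulated clock, and the always-failing sender are my own illustrative assumptions; a real system would rely on the queue's delayed-delivery and dead-letter features rather than a Python `deque`.

```python
from collections import deque
from dataclasses import dataclass

MAX_ATTEMPTS = 3         # dead-letter after three failed attempts
RETRY_DELAY_SECONDS = 5  # requeue with a 5-second delay


@dataclass
class Message:
    payload: str
    attempts: int = 0
    not_before: float = 0.0  # earliest time the next attempt may run


def process(message, send, now, retry_queue, dead_letter_queue):
    """Try one delivery; on failure, requeue with a delay or dead-letter."""
    message.attempts += 1
    try:
        send(message.payload)
        return True
    except ConnectionError:
        if message.attempts >= MAX_ATTEMPTS:
            dead_letter_queue.append(message)  # keep for investigation
        else:
            message.not_before = now + RETRY_DELAY_SECONDS
            retry_queue.append(message)
        return False


def always_failing_send(payload):
    """Hypothetical sender that simulates a persistent network failure."""
    raise ConnectionError("simulated network failure")


retry_queue = deque([Message("payment failed")])
dead_letter_queue = []
clock = 0.0  # simulated time, so the sketch runs without sleeping

while retry_queue:
    msg = retry_queue.popleft()
    clock = max(clock, msg.not_before)  # "wait" until the retry is due
    process(msg, always_failing_send, clock, retry_queue, dead_letter_queue)
```

After the loop, the message sits in `dead_letter_queue` with `attempts == 3`. A fixed delay keeps the sketch close to the example; exponential backoff is a common alternative when a third-party outage lasts longer.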
Scalability Considerations
The design must support millions of notifications per day, which requires careful planning of system resources:
- Horizontal Scaling: By decoupling the components (e.g., service layer, queue, worker services), the system can scale each part independently. More workers can be added during peak load times to handle the higher volume of notifications.
- Rate Limiting: To avoid overwhelming users and causing them to disable notifications, the system should implement rate limiting, ensuring that users don’t receive too many notifications in a short period. This is especially important for real-time notifications like those for social media apps, where multiple events could happen in quick succession.
Example: A social media app might rate-limit push notifications to prevent a user from receiving more than 5 notifications in a 10-minute window, even if there are 10 events during that time.
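The 5-per-10-minutes rule can be sketched as a per-user sliding window. The chapter does not prescribe a particular algorithm, so the sliding-window approach and the class name here are my own choices; timestamps are passed in explicitly to keep the example deterministic.

```python
from collections import defaultdict, deque


class SlidingWindowRateLimiter:
    def __init__(self, max_events: int, window_seconds: float):
        self.max_events = max_events
        self.window_seconds = window_seconds
        # user_id -> timestamps of notifications that were allowed
        self._sent = defaultdict(deque)

    def allow(self, user_id: str, now: float) -> bool:
        timestamps = self._sent[user_id]
        # Drop sends that have fallen out of the window.
        while timestamps and now - timestamps[0] >= self.window_seconds:
            timestamps.popleft()
        if len(timestamps) >= self.max_events:
            return False
        timestamps.append(now)
        return True


limiter = SlidingWindowRateLimiter(max_events=5, window_seconds=600)

# Ten events, one per minute, all inside a single 10-minute window.
decisions = [limiter.allow("user-42", now=minute * 60) for minute in range(10)]
```

Here the first five events are allowed and the next five are suppressed. What to do with suppressed events (drop them, or batch them into a digest) is a product decision the limiter itself does not answer.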
User Preferences and Opt-Outs
A key aspect of the design is respecting user preferences:
- Opt-In/Opt-Out Management: Users should be able to control what types of notifications they receive (e.g., promotional emails vs. security alerts). The notification system must check these preferences before sending any notification.
- Preference Storage: Preferences are stored in a database, allowing the system to query user settings before attempting to send a notification.
Example: A user opts out of promotional emails but continues to receive security alerts. When the system sends a marketing campaign, it first checks the preferences and skips users who have opted out.
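A small sketch of that check, with a dictionary standing in for the preferences table. The key layout, the type names, and the defaults for users with no stored row are all assumptions of mine rather than details from the chapter.

```python
# Hypothetical in-memory stand-in for the preferences table.
# Key: (user_id, notification_type); value: True if the user has opted in.
preferences = {
    ("alice", "promotional_email"): False,
    ("alice", "security_alert"): True,
    ("bob", "promotional_email"): True,
}

# Assumed defaults when no row exists: marketing is opt-in,
# security alerts are on unless the user turns them off.
DEFAULT_OPT_IN = {"promotional_email": False, "security_alert": True}


def is_opted_in(user_id: str, notification_type: str) -> bool:
    default = DEFAULT_OPT_IN.get(notification_type, False)
    return preferences.get((user_id, notification_type), default)


def recipients(user_ids, notification_type):
    """Filter a campaign audience down to users who accept this type."""
    return [u for u in user_ids if is_opted_in(u, notification_type)]


campaign_recipients = recipients(["alice", "bob", "carol"], "promotional_email")
```

With this data, only `bob` receives the marketing campaign, while all three users would still get a security alert.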
Security and Privacy
Since notifications often involve sensitive information, the system must ensure that data is handled securely:
- API Security: Communication between services is secured with an AppKey and AppSecret pair, so that only authenticated clients can send notifications.
- Data Encryption: Sensitive data (e.g., user phone numbers or email addresses) should be encrypted both in transit and at rest to comply with privacy regulations (e.g., GDPR, CCPA).
Example: The system encrypts email addresses stored in the database to prevent leakage in case of a breach. When sending an email, the system decrypts the address just before delivering the message via the email service provider.
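The notes above only name AppKey and AppSecret; they do not say how the pair is used. One common pattern, shown here purely as an assumption, is for the client to sign each request body with its secret using HMAC-SHA256 and for the server to recompute and compare the signature. The registry and function names are hypothetical.

```python
import hashlib
import hmac

# Hypothetical registry of client credentials: AppKey -> AppSecret.
APP_SECRETS = {"payment-service": b"example-secret-do-not-use"}


def sign(app_secret: bytes, body: bytes) -> str:
    """Client side: sign the request body with the shared secret."""
    return hmac.new(app_secret, body, hashlib.sha256).hexdigest()


def is_authentic(app_key: str, body: bytes, signature: str) -> bool:
    """Server side: recompute the signature and compare in constant time."""
    secret = APP_SECRETS.get(app_key)
    if secret is None:
        return False  # unknown client
    expected = sign(secret, body)
    return hmac.compare_digest(expected, signature)


body = b'{"user_id": "user-42", "channel": "push"}'
signature = sign(APP_SECRETS["payment-service"], body)
accepted = is_authentic("payment-service", body, signature)
```

A tampered body or an unknown AppKey fails the check. In production the secret would come from a secrets manager rather than source code, and requests would usually carry a timestamp or nonce to prevent replay.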
Monitoring and Analytics
The final part of the design involves monitoring the system’s health and performance. Key metrics include:
- Success/Failure Rates: Track the success rates of notifications across different channels (e.g., how many emails were successfully sent vs. failed).
- Latency: Measure how long it takes from the time a notification is triggered to when it is delivered to the user. This is particularly important for real-time notifications.
- User Engagement Metrics: Analyze how users interact with notifications (e.g., click rates for emails or push notifications). This data can help optimize future campaigns and improve the user experience.
Example: An e-commerce platform tracks email open rates for abandoned cart reminders. If engagement is low, the marketing team can adjust the content or timing of these notifications to improve effectiveness.
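The first two metrics are simple to compute once each delivery attempt is logged as an event. A minimal sketch, assuming a hypothetical `DeliveryEvent` record with trigger and delivery timestamps in seconds:

```python
from dataclasses import dataclass
from statistics import median
from typing import Optional


@dataclass
class DeliveryEvent:
    channel: str
    succeeded: bool
    triggered_at: float
    delivered_at: Optional[float] = None  # None if never delivered


def success_rate(events, channel):
    """Fraction of attempts on a channel that succeeded (0.0 if none)."""
    relevant = [e for e in events if e.channel == channel]
    if not relevant:
        return 0.0
    return sum(e.succeeded for e in relevant) / len(relevant)


def median_latency(events, channel):
    """Median trigger-to-delivery time for successful sends, or None."""
    latencies = [
        e.delivered_at - e.triggered_at
        for e in events
        if e.channel == channel and e.succeeded and e.delivered_at is not None
    ]
    return median(latencies) if latencies else None


events = [
    DeliveryEvent("email", True, 0.0, 2.0),
    DeliveryEvent("email", True, 10.0, 14.0),
    DeliveryEvent("email", False, 20.0),
    DeliveryEvent("push", True, 0.0, 0.5),
]

email_success = success_rate(events, "email")
email_latency = median_latency(events, "email")
```

In this sample, two of three emails succeed and the median email latency is 3 seconds. For real-time notifications, tail percentiles (p95, p99) are usually more telling than the median.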
Takeaways
- Decoupling and Scalability: By decoupling the services and using message queues, the system can scale effectively while maintaining reliability and fault tolerance.
- Retry Logic and Error Handling: Implementing robust retry logic and dead-letter queues ensures that the system can handle failures without losing notifications.
- User Preferences: Respecting user preferences and implementing rate limiting are crucial to maintaining a positive user experience and avoiding notification fatigue.
- Monitoring and Analytics: Tracking key metrics allows for continuous improvement and helps maintain the performance of the system, ensuring notifications are timely and effective.
In summary, the chapter shows how to architect a scalable, reliable, and user-friendly notification service: one that handles millions of messages a day, respects user preferences, and stays healthy through retries and monitoring.