
Sunday, 11 August 2024

Introduction to Microservices


Components of Microservices Architecture


Microservices architecture breaks down applications into smaller, independent services. Here's a rundown of the 10 key components in this architecture:


1. Client

These are the end users who interact with the application through different interfaces such as web, mobile, or desktop clients.


2. CDN (Content Delivery Network)

CDNs deliver static content like images, stylesheets, and JavaScript files efficiently by caching them closer to the user's location, reducing load times.


3. Load Balancer

It distributes incoming network traffic across multiple servers, ensuring no single server becomes a bottleneck and improving the application's availability and reliability.


4. API Gateway

An API Gateway acts as an entry point for all clients, handling tasks like request routing, composition, and protocol translation, which helps manage multiple microservices behind the scenes.


5. Microservices

Each microservice is a small, independent service that performs a specific business function. They communicate with each other via APIs. 


6. Message Broker

A message broker facilitates communication between microservices by sending messages between them, ensuring they remain decoupled and can function independently.


7. Databases

Each microservice typically has its own database to ensure loose coupling. Different services can even use different database technologies, each suited to its workload.


8. Identity Provider

This component handles user authentication and authorization, ensuring secure access to services.


9. Service Registry and Discovery

This system keeps track of all microservices and their instances, allowing services to find and communicate with each other dynamically (a minimal sketch of this mechanism appears at the end of this rundown).


10. Service Coordination (e.g., ZooKeeper)

Tools like ZooKeeper help manage and coordinate distributed services, ensuring they work together smoothly.
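
To make component 9 concrete, here is a minimal, illustrative Python sketch of a service registry with round-robin discovery. It is a toy in-memory version under stated assumptions - the service names and addresses are hypothetical, and production systems use dedicated tools such as Consul, etcd, or Eureka instead:

    import itertools
    from collections import defaultdict

    class ServiceRegistry:
        # Toy in-memory registry; real systems add health checks, TTLs, replication.
        def __init__(self):
            self._instances = defaultdict(list)  # service name -> ["host:port", ...]
            self._cursors = {}                   # service name -> round-robin iterator

        def register(self, service, address):
            self._instances[service].append(address)
            self._cursors[service] = itertools.cycle(self._instances[service])

        def deregister(self, service, address):
            self._instances[service].remove(address)
            self._cursors[service] = itertools.cycle(self._instances[service])

        def discover(self, service):
            # Hand out instances in round-robin order to spread the load.
            if not self._instances[service]:
                raise LookupError(f"no registered instances of {service!r}")
            return next(self._cursors[service])

    registry = ServiceRegistry()
    registry.register("orders", "10.0.1.12:8080")   # hypothetical addresses
    registry.register("orders", "10.0.2.34:8080")
    print(registry.discover("orders"))  # 10.0.1.12:8080
    print(registry.discover("orders"))  # 10.0.2.34:8080

Each microservice would call register() on startup and deregister() on shutdown, while callers use discover() to obtain a live instance address instead of hard-coding it.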


 

Image source: Adnan Maqbool Khan's post on LinkedIn

Thursday, 8 August 2024

Load Balancing Algorithms

Load balancing:
  • Used in distributed systems to distribute incoming network traffic across multiple servers or resources
  • Crucial for optimizing performance and ensuring even distribution of workload
  • Enhances system reliability by ensuring no single server becomes a bottleneck, thus reducing the risk of server overload and potential downtime





 
image source: Post | LinkedIn


Some popular load balancing algorithms:

  • Round Robin
    • distributes incoming requests sequentially to each server in a circular manner
    • simple and easy to implement, but does not account for server load or capacity
    • one of the most widely used algorithms
  • Weighted Round Robin
    • similar to Round Robin, but with the ability to assign different weights to servers based on their capacity or performance
    • Servers with higher weights receive more requests
  • IP Hash
    • Uses the client's IP address to determine which server to send the request to
    • Requests from the same IP address are consistently routed to the same server
  • Least Connections
    • directs incoming requests to the server with the fewest active connections at the time
    • helps distribute the load evenly among servers based on their current workload
  • Least Response Time
    • Routes requests to the server with the lowest response time or latency
    • Aims to optimize performance by sending requests to the fastest server.
  • Random
    • Randomly selects a server from the pool to handle each request
    • While simple, it may not ensure even distribution of load across servers

Each load balancing algorithm has its own advantages and considerations.
The choice of algorithm depends on the specific requirements of the system and the desired load distribution strategy; the sketch below shows three of these strategies in miniature.
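
Here is a minimal, framework-free Python sketch of Round Robin, Weighted Round Robin, and IP Hash selection. The server names and weights are hypothetical; a real load balancer would also track health and live metrics:

    import hashlib
    import itertools

    servers = ["app-1", "app-2", "app-3"]            # hypothetical backend pool
    weights = {"app-1": 3, "app-2": 1, "app-3": 1}   # app-1 can handle 3x the load

    # Round Robin: cycle through the pool in a fixed circular order.
    _rr = itertools.cycle(servers)
    def round_robin():
        return next(_rr)

    # Weighted Round Robin: repeat each server in the cycle according to its weight.
    _wrr = itertools.cycle([s for s in servers for _ in range(weights[s])])
    def weighted_round_robin():
        return next(_wrr)

    # IP Hash: hash the client IP so the same client always lands on the same server.
    def ip_hash(client_ip):
        digest = hashlib.sha256(client_ip.encode()).hexdigest()
        return servers[int(digest, 16) % len(servers)]

    print([round_robin() for _ in range(4)])           # app-1, app-2, app-3, app-1
    print([weighted_round_robin() for _ in range(5)])  # app-1 appears 3 times per cycle
    print(ip_hash("203.0.113.7"), ip_hash("203.0.113.7"))  # identical picks

Least Connections and Least Response Time have the same structure, but instead of a precomputed order they pick the server with the minimum of a live metric (active connections or measured latency).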



Disclaimer:

All credit for the inspiration for this article, the infographic image, and part of the content goes to Sina Riyahi [https://www.linkedin.com/in/sina-riyahi/].

Monday, 5 August 2024

Introduction to Amazon Simple Queue Service (SQS)



Amazon Simple Queue Service (SQS) is a fully managed message queuing service provided by Amazon Web Services (AWS). It enables decoupling and scaling of microservices, distributed systems, and serverless applications. 


Here's an overview of how Amazon SQS works:

Key Concepts


  • Queue:
    • A queue is a temporary storage location for messages waiting to be processed. There are two types of queues in SQS:
      • Standard Queue: Offers maximum throughput, best-effort ordering, and at-least-once delivery.
      • FIFO Queue: Ensures exactly-once processing and preserves the exact order of messages.
  • Message:
    • A message is the data that is sent between different components. It can be up to 256 KB in size and contains the information needed for processing.
  • Producer:
    • The producer (or sender) sends messages to the queue.
  • Consumer:
    • The consumer (or receiver) retrieves and processes messages from the queue.
  • Visibility Timeout:
    • A period during which a message is invisible to other consumers after a consumer retrieves it from the queue. This prevents other consumers from processing the same message concurrently.
  • Dead-Letter Queue (DLQ):
    • A queue for messages that could not be processed successfully after a specified number of attempts. This helps in isolating and analyzing problematic messages.
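
As an illustration of the last two concepts, a dead-letter queue is wired up through the main queue's RedrivePolicy attribute. A minimal boto3 sketch - the queue names and region are assumptions, not part of the original article:

    import json
    import boto3

    sqs = boto3.client("sqs", region_name="us-east-1")  # region is an assumption

    # Create the dead-letter queue first and look up its ARN.
    dlq_url = sqs.create_queue(QueueName="orders-dlq")["QueueUrl"]
    dlq_arn = sqs.get_queue_attributes(
        QueueUrl=dlq_url, AttributeNames=["QueueArn"]
    )["Attributes"]["QueueArn"]

    # Create the main queue; after 5 failed receives a message moves to the DLQ.
    sqs.create_queue(
        QueueName="orders",
        Attributes={
            "VisibilityTimeout": "30",  # seconds a received message stays hidden
            "RedrivePolicy": json.dumps(
                {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": "5"}
            ),
        },
    )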

Workflow


  • Sending Messages:
    • A producer sends messages to an SQS queue using the SendMessage action. Each message is assigned a unique ID and placed in the queue.
  • Receiving Messages:
    • A consumer retrieves messages from the queue using the ReceiveMessage action. This operation can specify:
      • number of messages to retrieve (up to 10) 
      • duration to wait if no messages are available
  • Processing Messages:
    • After receiving a message, the consumer processes it. The message remains invisible to other consumers for a specified visibility timeout.
  • Deleting Messages:
    • Once processed, the consumer deletes the message from the queue using the DeleteMessage action. If not deleted within the visibility timeout, the message becomes visible again for other consumers to process.
  • Handling Failures:
    • If a message cannot be processed successfully within a specified number of attempts, it is moved to the Dead-Letter Queue for further investigation.
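
The whole send/receive/process/delete loop fits in a short boto3 sketch. The queue name and the process_order stub are hypothetical; the calls themselves are the standard SQS actions described above:

    import boto3

    sqs = boto3.client("sqs", region_name="us-east-1")             # region is an assumption
    queue_url = sqs.get_queue_url(QueueName="orders")["QueueUrl"]  # hypothetical queue

    def process_order(body):
        print("processing", body)  # stand-in for real business logic

    # Producer: SendMessage
    sqs.send_message(QueueUrl=queue_url, MessageBody='{"order_id": 42}')

    # Consumer: ReceiveMessage with long polling (WaitTimeSeconds, see below)
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,   # up to 10 per call
        WaitTimeSeconds=20,       # wait up to 20s if the queue is empty
        VisibilityTimeout=30,     # hide the message for 30s while we work
    )

    for message in response.get("Messages", []):
        process_order(message["Body"])
        # DeleteMessage only after success; otherwise the message becomes
        # visible again after the visibility timeout and is retried.
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])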


Additional Features

  • Long Polling:
    • Reduces the number of empty responses by allowing the ReceiveMessage action to wait for a specified amount of time until a message arrives in the queue.
  • Message Attributes:
    • Metadata about the message that can be used for filtering and routing.
  • Batch Operations:
    • SQS supports batch sending, receiving, and deleting of messages, which can improve efficiency and reduce costs.
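
Batching replaces up to ten SendMessage calls with one SendMessageBatch call; each entry needs an Id that is unique within the batch. A small sketch, reusing the hypothetical queue from above:

    import boto3

    sqs = boto3.client("sqs", region_name="us-east-1")
    queue_url = sqs.get_queue_url(QueueName="orders")["QueueUrl"]  # hypothetical queue

    # One SendMessageBatch call instead of ten SendMessage calls.
    sqs.send_message_batch(
        QueueUrl=queue_url,
        Entries=[
            {"Id": str(i), "MessageBody": f'{{"order_id": {i}}}'} for i in range(10)
        ],
    )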


Security and Access Control


  • IAM Policies:
    • Use AWS Identity and Access Management (IAM) policies to control access to SQS queues.
  • Encryption:
    • Messages can be encrypted in transit using SSL/TLS and at rest using AWS Key Management Service (KMS).

Use Cases


  • Decoupling Microservices:
    • SQS allows microservices to communicate asynchronously, improving scalability and fault tolerance.
  • Work Queues:
    • Distributing tasks to multiple workers for parallel processing.
  • Event Sourcing:
    • Storing a series of events to track changes in state over time.

Example Scenario


Order Processing System:

  • An e-commerce application has separate microservices for handling orders, inventory, and shipping.
  • The order service sends an order message to an SQS queue.
  • The inventory service retrieves the message, processes it (e.g., reserves stock), and then sends an updated message to another queue.
  • The shipping service retrieves the updated message and processes it (e.g., ships the item).


By using Amazon SQS, these microservices can operate independently and scale as needed, ensuring reliable and efficient order processing.


Message Queuing Service - Amazon Simple Queue Service - AWS

Thursday, 1 August 2024

Designing Systems Architecture in AWS


In this article I want to explore patterns and building blocks (AWS managed services) used when designing systems in AWS.

Global:

  • Choose region(s)
  • Each region contains Availability Zones

Networking:

  • VPC
    • one or more - per Region
    • can be default or nondefault
    • CIDR
      • Default VPC CIDR is 172.31.0.0/16.
      • VPC CIDR needs to be within the allowed range of private IP addresses:
        • 10.0.0.0/8 IP addresses: 10.0.0.0 – 10.255.255.255
        • 172.16.0.0/12 IP addresses: 172.16.0.0 – 172.31.255.255
        • 192.168.0.0/16 IP addresses: 192.168.0.0 – 192.168.255.255
  • Subnets
    • one or more - per AZ
    • CIDR
      •  e.g. VPC is 10.0.0.0/16
        • 10.0.0.0/24 - for a range of 256 addresses: 10.0.0.0 to 10.0.0.255 (AWS reserves the first four addresses and the last one in each subnet, so only 251 are assignable)
        • 10.0.1.0/24 - for range of 256 addresses: 10.0.1.0 to 10.0.1.255
        • 10.0.0.0/20 - to get a bit larger subnet - with 2^(32-20)=2^12=4096 IP addresses. 
          • To calculate the adjacent range: /20 means the first two octets plus the first 4 bits of the 3rd octet are fixed. In the 3rd octet we have 0000xxxx, where xxxx runs from 0000 = 0 to 1111 = 15, so the first subnet spans 10.0.0.0 - 10.0.15.255 and the next adjacent subnet is 10.0.16.0/20. Use the ipcalc tool (IP Calculator / IP Subnetting) for faster results, or see the Python sketch after this Networking list.
    • access to Internet
      • Private
        • assigned a 'private' routing table which routes all (non-local) traffic to the NAT Gateway (no direct route to the IGW)
        • if destination is within local CIDR range, traffic goes to "local"
      • Public
        • assigned a 'public' route table which routes all (non-local) traffic to the IGW, so they have direct routes to the IGW
        • instances launched into these subnets will be assigned a public IP address (AWS charges for these public IP addresses until instance is terminated and IP address is released)
        • if destination is within local CIDR range, traffic goes to "local"
  • Internet Gateway
    • Attached to VPC (can be default or nondefault)
    • allows instances with public IPs to access the Internet
    • There is no charge for an internet gateway, but there are data transfer charges for EC2 instances that use internet gateways.
  • (Public) NAT Gateway
    • required only if instances in private networks need to access Internet
    • used by instances in private subnets (these instances have no public IP assigned) so they can reach Internet but prevents the Internet from initiating a connection directly to the instances
    • must be created in (it is attached to) a public subnet (so its traffic can be routed to Internet Gateway)
      • that's why this NAT gateway is called 'public'
      • that's why it's bound to a single AZ
    • has to have Elastic IP Address (public IPv4 address) attached to it
    • NAT Gateway's traffic is routed to Internet via Internet Gateway
  • Routing Tables
    • Route Destination = cidr_block
    • Route Target = gateway_id, nat_gateway_id
    • types by routing to IGW
      • public
        • routes all traffic (0.0.0.0/0) to IGW
      • private
        • routes all traffic (0.0.0.0/0) to NAT GW 
  • Transit Gateways
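
The subnet arithmetic above is easy to verify with Python's standard ipaddress module; a quick sketch using the example VPC CIDR from the Subnets item:

    import ipaddress

    vpc = ipaddress.ip_network("10.0.0.0/16")

    # Carve the VPC range into /20 subnets: 2^(32-20) = 4096 addresses each.
    subnets = list(vpc.subnets(new_prefix=20))
    print(subnets[0])                 # 10.0.0.0/20  -> 10.0.0.0  - 10.0.15.255
    print(subnets[1])                 # 10.0.16.0/20 -> 10.0.16.0 - 10.0.31.255
    print(subnets[0].num_addresses)   # 4096

    # Containment check: which subnet does an address fall into?
    print(ipaddress.ip_address("10.0.15.255") in subnets[0])  # True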

Compute:

  • EC2
    • standalone or created by ASG
  • ASG
    • gets assigned (operates on) a list of subnets - it will create new instances in them
  • ALB
    • Target Groups:
      • associated with ASG; this is how ALB knows which instances it works with
    • Listeners
  • Lambda
    • API Gateway


Storage:

  • EBS
    • root and data volumes
    • gets mounted to EC2 instances
  • EFS
    • gets mounted to EC2 instances
    • network-attached and addressed via a DNS name => can be mounted across networks/AWS accounts! (so this is another way to share data across AWS accounts, apart from S3 or a DB)
  • RDS
    • MySQL
  • S3
    • global; bucket needs to have a unique name

Logging, Monitoring, Alerting:

  • CloudWatch

Security:

  • IAM
    • users
    • user groups
    • roles
    • policies
  • KMS



image source: NAT gateway use cases - Amazon Virtual Private Cloud




image source: Load balancer subnets and routing - AWS Prescriptive Guidance


image source: Example: VPC with servers in private subnets and NAT - Amazon Virtual Private Cloud



image source: Example: VPC for web and database servers - Amazon Virtual Private Cloud




High Availability, Fault Tolerance and IT disaster recovery


image source: Comparing High Availability Vs Fault Tolerance Vs Disaster Recovery



High availability means that an IT system, component, or application can operate at a high level, continuously, without intervention, for a given time period.

High-availability infrastructure and services are designed to:
  • deliver quality performance
  • handle different loads and failures
  • run with minimal or zero downtime - be available 99.999% of the time during both planned and unplanned outages. Known as 'five nines' reliability, such a system is essentially always on (a worked calculation follows this list).
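
As a quick sanity check on the 'five nines' figure, the permitted downtime per year follows directly from the availability percentage; a tiny Python calculation:

    availability = 0.99999                     # five nines
    minutes_per_year = 365.25 * 24 * 60        # ~525,960 minutes
    downtime = (1 - availability) * minutes_per_year
    print(f"{downtime:.2f} minutes of downtime per year")  # ~5.26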

High-availability clusters


High-availability clusters (also known as failover clusters):
  • servers grouped together to operate as a single, unified system
  • share the same storage but use different networks
  • share the same mission, in that they can run the same workloads of the primary system they support

If a server in the cluster fails, another server or node can take over immediately to help ensure the application or service supported by the cluster remains operational. Using high-availability clusters helps ensure there is no single point of failure for critical IT and reduces or eliminates downtime.

High-availability clusters are tested regularly to confirm nodes are always at the ready. IT administrators will often use an open-source heartbeat program to monitor the health of the cluster. The program sends data packets to each machine in a cluster to confirm that it is functioning as intended.
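
A toy Python illustration of the heartbeat idea - the node hostnames and port are hypothetical, and real clusters use purpose-built tooling (e.g., Pacemaker/Corosync) rather than a script like this:

    import socket

    NODES = ["node-a.example.internal", "node-b.example.internal"]  # hypothetical hosts
    PORT = 8080

    def is_alive(host, port, timeout=2.0):
        # Heartbeat probe: a node counts as healthy if it accepts a TCP connection.
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    for node in NODES:
        status = "OK" if is_alive(node, PORT) else "FAILED - candidate for failover"
        print(f"{node}: {status}")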


High-availability software


High-availability software:
  • used to operate/enable high-availability clusters
  • typically provides:
    • load balancing and redirecting
    • automatic application failover
    • real-time file replication
    • automatic failback capabilities

In a high-availability IT system, there are different layers (physical, data link, network, transport, session, presentation, and application) that have different software needs.

At the application layer, for example, load-balancing software—which is used to distribute network traffic and application workloads across servers—is considered critical to help ensure high availability of an application.



IT disaster recovery



If critical IT infrastructure fails, but is supported by high availability architecture, the backup system or component takes over. This allows users and applications to keep working without disruption and access the same data available before the failure occurred.

IT disaster recovery refers to the policies, tools, and procedures IT organizations must adopt to bring critical IT components and services back online following a catastrophe. An example of an IT disaster is the destruction of a data center due to a natural event like a major earthquake.

Think of high availability as a strategy for managing small but critical failures in IT infrastructure components that can be easily restored. 

IT disaster recovery is a process for overcoming major events that can sideline entire IT infrastructures.

Both high availability and disaster recovery are important for enhancing business continuity. So, too, is fault tolerance, as described later in this article. Planning for high availability includes identifying the IT systems and services deemed as essential to help ensure business continuity.


Elements of high-availability infrastructure


Redundancy


Redundancy means the IT components in a high-availability cluster, like servers or databases, can perform the same tasks.

High-availability IT infrastructure features:
  • hardware redundancy
  • software and application redundancy
  • data redundancy

Redundancy is also essential for fault tolerance, which complements high availability and IT disaster recovery.

Replication


Replication of data is essential to achieving high availability. Data needs to be replicated and shared across the nodes in a cluster. The nodes must communicate with each other and share the same information, so that any one of them can step in to provide optimal service when the server or network device they are supporting fails.

Data can also be replicated between clusters to help ensure both high availability and business continuity in the event a data center fails.

Failover


A failover occurs when a process performed by the failed primary component moves to a backup component in a high-availability cluster. A best practice for high availability—and disaster recovery—is to maintain a failover system that is located off-premises.

IT administrators monitoring the health of critical primary systems can quickly switch traffic to the failover system when primary systems become overloaded or fail.

Fault tolerance


High availability and disaster recovery are both important for business continuity. Together, they help organizations to build high levels of fault tolerance, which refers to a system's ability to keep operating without interruption even if multiple hardware or software components fail.

Fault tolerance aims for zero downtime, while high availability is focused on delivering minimal downtime. A high-availability system designed to provide 99.999%, or five nines, operational uptime expects to see 5.26 minutes of downtime per year.

Unlike high availability, delivering high-quality performance is not a priority for fault tolerance. The purpose of fault-tolerance design in IT infrastructure is to prevent a mission-critical application from experiencing downtime.

Fault tolerance is a more expensive approach to ensuring uptime than high availability because it can involve backing up entire hardware and software systems and power supplies. High-availability systems do not require replication of physical components.

High availability and fault tolerance complement each other in that they help to support IT disaster recovery. Most business continuity strategies include high-availability, fault-tolerance, and disaster-recovery measures. These strategies help the organization maintain essential operations and support users when facing any type of critical IT failure, small or large.

