
Sunday, 11 August 2024

Introduction to Microservices


Components of Microservices Architecture


Microservices architecture breaks down applications into smaller, independent services. Here's a rundown of the 10 key components in this architecture:


1. Client

These are the end users who interact with the application through different interfaces such as web, mobile, or desktop clients.


2. CDN (Content Delivery Network)

CDNs deliver static content like images, stylesheets, and JavaScript files efficiently by caching them closer to the user's location, reducing load times.


3. Load Balancer

It distributes incoming network traffic across multiple servers, ensuring no single server becomes a bottleneck and improving the application's availability and reliability.


4. API Gateway

An API Gateway acts as an entry point for all clients, handling tasks like request routing, composition, and protocol translation, which helps manage multiple microservices behind the scenes.


5. Microservices

Each microservice is a small, independent service that performs a specific business function. They communicate with each other via APIs. 


6. Message Broker

A message broker facilitates communication between microservices by sending messages between them, ensuring they remain decoupled and can function independently.


7. Databases

Each microservice typically has its own database to ensure loose coupling. Different services can even use different database technologies, each suited to its workload.


8. Identity Provider

This component handles user authentication and authorization, ensuring secure access to services.


9. Service Registry and Discovery

This system keeps track of all microservices and their instances, allowing services to find and communicate with each other dynamically (a minimal sketch of this mechanism appears at the end of this rundown).


10. Service Coordination (e.g., ZooKeeper)

Tools like ZooKeeper help manage and coordinate distributed services, ensuring they work together smoothly.
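
To make component 9 concrete, here is a minimal, illustrative Python sketch of a service registry with round-robin discovery. It is a toy in-memory version under stated assumptions - the service names and addresses are hypothetical, and production systems use dedicated tools such as Consul, etcd, or Eureka instead:

    import itertools
    from collections import defaultdict

    class ServiceRegistry:
        # Toy in-memory registry; real systems add health checks, TTLs, replication.
        def __init__(self):
            self._instances = defaultdict(list)  # service name -> ["host:port", ...]
            self._cursors = {}                   # service name -> round-robin iterator

        def register(self, service, address):
            self._instances[service].append(address)
            self._cursors[service] = itertools.cycle(self._instances[service])

        def deregister(self, service, address):
            self._instances[service].remove(address)
            self._cursors[service] = itertools.cycle(self._instances[service])

        def discover(self, service):
            # Hand out instances in round-robin order to spread the load.
            if not self._instances[service]:
                raise LookupError(f"no registered instances of {service!r}")
            return next(self._cursors[service])

    registry = ServiceRegistry()
    registry.register("orders", "10.0.1.12:8080")   # hypothetical addresses
    registry.register("orders", "10.0.2.34:8080")
    print(registry.discover("orders"))  # 10.0.1.12:8080
    print(registry.discover("orders"))  # 10.0.2.34:8080

Each microservice would call register() on startup and deregister() on shutdown, while callers use discover() to obtain a live instance address instead of hard-coding it.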


 

Image source: Adnan Maqbool Khan's post on LinkedIn

Thursday, 8 August 2024

Load Balancing Algorithms

Load balancing:
  • Used in distributed systems to distribute incoming network traffic across multiple servers or resources
  • Crucial for optimizing performance and ensuring even distribution of workload
  • Enhances system reliability by ensuring no single server becomes a bottleneck, thus reducing the risk of server overload and potential downtime





 
image source: Post | LinkedIn


Some popular load balancing algorithms:

  • Round Robin
    • distributes incoming requests sequentially to each server in a circular manner
    • simple and easy to implement, but does not account for server load or capacity
    • one of the most widely used algorithms
  • Weighted Round Robin
    • similar to Round Robin, but with the ability to assign different weights to servers based on their capacity or performance
    • Servers with higher weights receive more requests
  • IP Hash
    • Uses the client's IP address to determine which server to send the request to
    • Requests from the same IP address are consistently routed to the same server
  • Least Connections
    • directs incoming requests to the server with the fewest active connections at the time
    • helps distribute the load evenly among servers based on their current workload
  • Least Response Time
    • Routes requests to the server with the lowest response time or latency
    • Aims to optimize performance by sending requests to the fastest server.
  • Random
    • Randomly selects a server from the pool to handle each request
    • While simple, it may not ensure even distribution of load across servers

Each load balancing algorithm has its own advantages and considerations.
The choice of algorithm depends on the specific requirements of the system and the desired load distribution strategy; the sketch below shows three of these strategies in miniature.
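
Here is a minimal, framework-free Python sketch of Round Robin, Weighted Round Robin, and IP Hash selection. The server names and weights are hypothetical; a real load balancer would also track health and live metrics:

    import hashlib
    import itertools

    servers = ["app-1", "app-2", "app-3"]            # hypothetical backend pool
    weights = {"app-1": 3, "app-2": 1, "app-3": 1}   # app-1 can handle 3x the load

    # Round Robin: cycle through the pool in a fixed circular order.
    _rr = itertools.cycle(servers)
    def round_robin():
        return next(_rr)

    # Weighted Round Robin: repeat each server in the cycle according to its weight.
    _wrr = itertools.cycle([s for s in servers for _ in range(weights[s])])
    def weighted_round_robin():
        return next(_wrr)

    # IP Hash: hash the client IP so the same client always lands on the same server.
    def ip_hash(client_ip):
        digest = hashlib.sha256(client_ip.encode()).hexdigest()
        return servers[int(digest, 16) % len(servers)]

    print([round_robin() for _ in range(4)])           # app-1, app-2, app-3, app-1
    print([weighted_round_robin() for _ in range(5)])  # app-1 appears 3 times per cycle
    print(ip_hash("203.0.113.7"), ip_hash("203.0.113.7"))  # identical picks

Least Connections and Least Response Time have the same structure, but instead of a precomputed order they pick the server with the minimum of a live metric (active connections or measured latency).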



Disclaimer:

All credit for the inspiration for this article, the infographic image, and part of the content goes to Sina Riyahi [https://www.linkedin.com/in/sina-riyahi/].

Monday, 5 August 2024

Introduction to Amazon Simple Queue Service (SQS)



Amazon Simple Queue Service (SQS) is a fully managed message queuing service provided by Amazon Web Services (AWS). It enables decoupling and scaling of microservices, distributed systems, and serverless applications. 


Here's an overview of how Amazon SQS works:

Key Concepts


  • Queue:
    • A queue is a temporary storage location for messages waiting to be processed. There are two types of queues in SQS:
      • Standard Queue: Offers maximum throughput, best-effort ordering, and at-least-once delivery.
      • FIFO Queue: Ensures exactly-once processing and preserves the exact order of messages.
  • Message:
    • A message is the data that is sent between different components. It can be up to 256 KB in size and contains the information needed for processing.
  • Producer:
    • The producer (or sender) sends messages to the queue.
  • Consumer:
    • The consumer (or receiver) retrieves and processes messages from the queue.
  • Visibility Timeout:
    • A period during which a message is invisible to other consumers after a consumer retrieves it from the queue. This prevents other consumers from processing the same message concurrently.
  • Dead-Letter Queue (DLQ):
    • A queue for messages that could not be processed successfully after a specified number of attempts. This helps in isolating and analyzing problematic messages.
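
As an illustration of the last two concepts, a dead-letter queue is wired up through the main queue's RedrivePolicy attribute. A minimal boto3 sketch - the queue names and region are assumptions, not part of the original article:

    import json
    import boto3

    sqs = boto3.client("sqs", region_name="us-east-1")  # region is an assumption

    # Create the dead-letter queue first and look up its ARN.
    dlq_url = sqs.create_queue(QueueName="orders-dlq")["QueueUrl"]
    dlq_arn = sqs.get_queue_attributes(
        QueueUrl=dlq_url, AttributeNames=["QueueArn"]
    )["Attributes"]["QueueArn"]

    # Create the main queue; after 5 failed receives a message moves to the DLQ.
    sqs.create_queue(
        QueueName="orders",
        Attributes={
            "VisibilityTimeout": "30",  # seconds a received message stays hidden
            "RedrivePolicy": json.dumps(
                {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": "5"}
            ),
        },
    )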

Workflow


  • Sending Messages:
    • A producer sends messages to an SQS queue using the SendMessage action. Each message is assigned a unique ID and placed in the queue.
  • Receiving Messages:
    • A consumer retrieves messages from the queue using the ReceiveMessage action. This operation can specify:
      • number of messages to retrieve (up to 10) 
      • duration to wait if no messages are available
  • Processing Messages:
    • After receiving a message, the consumer processes it. The message remains invisible to other consumers for a specified visibility timeout.
  • Deleting Messages:
    • Once processed, the consumer deletes the message from the queue using the DeleteMessage action. If not deleted within the visibility timeout, the message becomes visible again for other consumers to process.
  • Handling Failures:
    • If a message cannot be processed successfully within a specified number of attempts, it is moved to the Dead-Letter Queue for further investigation.
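
The whole send/receive/process/delete loop fits in a short boto3 sketch. The queue name and the process_order stub are hypothetical; the calls themselves are the standard SQS actions described above:

    import boto3

    sqs = boto3.client("sqs", region_name="us-east-1")             # region is an assumption
    queue_url = sqs.get_queue_url(QueueName="orders")["QueueUrl"]  # hypothetical queue

    def process_order(body):
        print("processing", body)  # stand-in for real business logic

    # Producer: SendMessage
    sqs.send_message(QueueUrl=queue_url, MessageBody='{"order_id": 42}')

    # Consumer: ReceiveMessage with long polling (WaitTimeSeconds, see below)
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,   # up to 10 per call
        WaitTimeSeconds=20,       # wait up to 20s if the queue is empty
        VisibilityTimeout=30,     # hide the message for 30s while we work
    )

    for message in response.get("Messages", []):
        process_order(message["Body"])
        # DeleteMessage only after success; otherwise the message becomes
        # visible again after the visibility timeout and is retried.
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])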


Additional Features

  • Long Polling:
    • Reduces the number of empty responses by allowing the ReceiveMessage action to wait for a specified amount of time until a message arrives in the queue.
  • Message Attributes:
    • Metadata about the message that can be used for filtering and routing.
  • Batch Operations:
    • SQS supports batch sending, receiving, and deleting of messages, which can improve efficiency and reduce costs.
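
Batching replaces up to ten SendMessage calls with one SendMessageBatch call; each entry needs an Id that is unique within the batch. A small sketch, reusing the hypothetical queue from above:

    import boto3

    sqs = boto3.client("sqs", region_name="us-east-1")
    queue_url = sqs.get_queue_url(QueueName="orders")["QueueUrl"]  # hypothetical queue

    # One SendMessageBatch call instead of ten SendMessage calls.
    sqs.send_message_batch(
        QueueUrl=queue_url,
        Entries=[
            {"Id": str(i), "MessageBody": f'{{"order_id": {i}}}'} for i in range(10)
        ],
    )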


Security and Access Control


  • IAM Policies:
    • Use AWS Identity and Access Management (IAM) policies to control access to SQS queues.
  • Encryption:
    • Messages can be encrypted in transit using SSL/TLS and at rest using AWS Key Management Service (KMS).

Use Cases


  • Decoupling Microservices:
    • SQS allows microservices to communicate asynchronously, improving scalability and fault tolerance.
  • Work Queues:
    • Distributing tasks to multiple workers for parallel processing.
  • Event Sourcing:
    • Storing a series of events to track changes in state over time.

Example Scenario


Order Processing System:

  • An e-commerce application has separate microservices for handling orders, inventory, and shipping.
  • The order service sends an order message to an SQS queue.
  • The inventory service retrieves the message, processes it (e.g., reserves stock), and then sends an updated message to another queue.
  • The shipping service retrieves the updated message and processes it (e.g., ships the item).


By using Amazon SQS, these microservices can operate independently and scale as needed, ensuring reliable and efficient order processing.


Message Queuing Service - Amazon Simple Queue Service - AWS

Thursday, 1 August 2024

Designing Systems Architecture in AWS


In this article I want to explore patterns and building blocks (AWS managed services) used when designing systems in AWS.

Global:

  • Choose region(s)
  • Each region contains Availability Zones

Networking:

  • VPC
    • one or more - per Region
    • can be default or nondefault
    • CIDR
      • Default VPC CIDR is 172.31.0.0/16.
      • VPC CIDR needs to be within the allowed range of private IP addresses:
        • 10.0.0.0/8 IP addresses: 10.0.0.0 – 10.255.255.255
        • 172.16.0.0/12 IP addresses: 172.16.0.0 – 172.31.255.255
        • 192.168.0.0/16 IP addresses: 192.168.0.0 – 192.168.255.255
  • Subnets
    • one or more - per AZ
    • CIDR
      •  e.g. VPC is 10.0.0.0/16
        • 10.0.0.0/24 - for a range of 256 addresses: 10.0.0.0 to 10.0.0.255 (AWS reserves the first four addresses and the last one in each subnet, so only 251 are assignable)
        • 10.0.1.0/24 - for range of 256 addresses: 10.0.1.0 to 10.0.1.255
        • 10.0.0.0/20 - to get a bit larger subnet - with 2^(32-20)=2^12=4096 IP addresses. 
          • To calculate the adjacent range: /20 means the first two octets plus the first 4 bits of the 3rd octet are fixed. In the 3rd octet we have 0000xxxx, where xxxx runs from 0000 = 0 to 1111 = 15, so the first subnet spans 10.0.0.0 - 10.0.15.255 and the next adjacent subnet is 10.0.16.0/20. Use the ipcalc tool (IP Calculator / IP Subnetting) for faster results, or see the Python sketch after this Networking list.
    • access to Internet
      • Private
        • assigned a 'private' routing table which routes all (non-local) traffic to the NAT Gateway (no direct route to the IGW)
        • if destination is within local CIDR range, traffic goes to "local"
      • Public
        • assigned a 'public' route table which routes all (non-local) traffic to the IGW, so they have direct routes to the IGW
        • instances launched into these subnets will be assigned a public IP address (AWS charges for these public IP addresses until instance is terminated and IP address is released)
        • if destination is within local CIDR range, traffic goes to "local"
  • Internet Gateway
    • Attached to VPC (can be default or nondefault)
    • allows instances with public IPs to access the Internet
    • There is no charge for an internet gateway, but there are data transfer charges for EC2 instances that use internet gateways.
  • (Public) NAT Gateway
    • required only if instances in private networks need to access Internet
    • used by instances in private subnets (these instances have no public IP assigned) so they can reach Internet but prevents the Internet from initiating a connection directly to the instances
    • must be created in (it is attached to) a public subnet (so its traffic can be routed to Internet Gateway)
      • that's why this NAT gateway is called 'public'
      • that's why it's bound to a single AZ
    • has to have Elastic IP Address (public IPv4 address) attached to it
    • NAT Gateway's traffic is routed to Internet via Internet Gateway
  • Routing Tables
    • Route Destination = cidr_block
    • Route Target = gateway_id, nat_gateway_id
    • types by routing to IGW
      • public
        • routes all traffic (0.0.0.0/0) to IGW
      • private
        • routes all traffic (0.0.0.0/0) to NAT GW 
  • Transit Gateways
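
The subnet arithmetic above is easy to verify with Python's standard ipaddress module; a quick sketch using the example VPC CIDR from the Subnets item:

    import ipaddress

    vpc = ipaddress.ip_network("10.0.0.0/16")

    # Carve the VPC range into /20 subnets: 2^(32-20) = 4096 addresses each.
    subnets = list(vpc.subnets(new_prefix=20))
    print(subnets[0])                 # 10.0.0.0/20  -> 10.0.0.0  - 10.0.15.255
    print(subnets[1])                 # 10.0.16.0/20 -> 10.0.16.0 - 10.0.31.255
    print(subnets[0].num_addresses)   # 4096

    # Containment check: which subnet does an address fall into?
    print(ipaddress.ip_address("10.0.15.255") in subnets[0])  # True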

Compute:

  • EC2
    • standalone or created by ASG
  • ASG
    • gets assigned (operates on) a list of subnets - it will create new instances in them
  • ALB
    • Target Groups:
      • associated with ASG; this is how ALB knows which instances it works with
    • Listeners
  • Lambda
    • API Gateway


Storage:

  • EBS
    • root and data volumes
    • gets mounted to EC2 instances
  • EFS
    • gets mounted to EC2 instances
    • network-attached and addressed via a DNS name => can be mounted across networks/AWS accounts! (so this is another way to share data across AWS accounts, apart from S3 or a DB)
  • RDS
    • MySQL
  • S3
    • global; bucket needs to have a unique name

Logging, Monitoring, Alerting:

  • CloudWatch

Security:

  • IAM
    • users
    • user groups
    • roles
    • policies
  • KMS



image source: NAT gateway use cases - Amazon Virtual Private Cloud




image source: Load balancer subnets and routing - AWS Prescriptive Guidance


image source: Example: VPC with servers in private subnets and NAT - Amazon Virtual Private Cloud



image source: Example: VPC for web and database servers - Amazon Virtual Private Cloud




High Availability, Fault Tolerance and IT disaster recovery


image source: Comparing High Availability Vs Fault Tolerance Vs Disaster Recovery



High availability means that an IT system, component, or application can operate at a high level, continuously, without intervention, for a given time period.

High-availability infrastructure and services are designed to:
  • deliver quality performance
  • handle different loads and failures
  • run with minimal or zero downtime - be available 99.999% of the time during both planned and unplanned outages. Known as 'five nines' reliability, such a system is essentially always on (a worked calculation follows this list).
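
As a quick sanity check on the 'five nines' figure, the permitted downtime per year follows directly from the availability percentage; a tiny Python calculation:

    availability = 0.99999                     # five nines
    minutes_per_year = 365.25 * 24 * 60        # ~525,960 minutes
    downtime = (1 - availability) * minutes_per_year
    print(f"{downtime:.2f} minutes of downtime per year")  # ~5.26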

High-availability clusters


High-availability clusters (also known as failover clusters):
  • servers grouped together to operate as a single, unified system
  • share the same storage but use different networks
  • share the same mission, in that they can run the same workloads of the primary system they support

If a server in the cluster fails, another server or node can take over immediately to help ensure the application or service supported by the cluster remains operational. Using high-availability clusters helps ensure there is no single point of failure for critical IT and reduces or eliminates downtime.

High-availability clusters are tested regularly to confirm nodes are always at the ready. IT administrators will often use an open-source heartbeat program to monitor the health of the cluster. The program sends data packets to each machine in a cluster to confirm that it is functioning as intended.
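
A toy Python illustration of the heartbeat idea - the node hostnames and port are hypothetical, and real clusters use purpose-built tooling (e.g., Pacemaker/Corosync) rather than a script like this:

    import socket

    NODES = ["node-a.example.internal", "node-b.example.internal"]  # hypothetical hosts
    PORT = 8080

    def is_alive(host, port, timeout=2.0):
        # Heartbeat probe: a node counts as healthy if it accepts a TCP connection.
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    for node in NODES:
        status = "OK" if is_alive(node, PORT) else "FAILED - candidate for failover"
        print(f"{node}: {status}")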


High-availability software


High-availability software:
  • used to operate/enable high-availability clusters
  • typically provides:
    • load balancing and redirecting
    • automatic application failover
    • real-time file replication
    • automatic failback capabilities

In a high-availability IT system, there are different layers (physical, data link, network, transport, session, presentation, and application) that have different software needs.

At the application layer, for example, load-balancing software—which is used to distribute network traffic and application workloads across servers—is considered critical to help ensure high availability of an application.



IT disaster recovery



If critical IT infrastructure fails, but is supported by high availability architecture, the backup system or component takes over. This allows users and applications to keep working without disruption and access the same data available before the failure occurred.

IT disaster recovery refers to the policies, tools, and procedures IT organizations must adopt to bring critical IT components and services back online following a catastrophe. An example of an IT disaster is the destruction of a data center due to a natural event like a major earthquake.

Think of high availability as a strategy for managing small but critical failures in IT infrastructure components that can be easily restored. 

IT disaster recovery is a process for overcoming major events that can sideline entire IT infrastructures.

Both high availability and disaster recovery are important for enhancing business continuity. So, too, is fault tolerance, as described later in this article. Planning for high availability includes identifying the IT systems and services deemed as essential to help ensure business continuity.


Elements of high-availability infrastructure


Redundancy


Redundancy means the IT components in a high-availability cluster, like servers or databases, can perform the same tasks.

High-availability IT infrastructure features:
  • hardware redundancy
  • software and application redundancy
  • data redundancy

Redundancy is also essential for fault tolerance, which complements high availability and IT disaster recovery.

Replication


Replication of data is essential to achieving high availability. Data needs to be replicated and shared across the nodes in a cluster. The nodes must communicate with each other and share the same information, so that any one of them can step in to provide optimal service when the server or network device they are supporting fails.

Data can also be replicated between clusters to help ensure both high availability and business continuity in the event a data center fails.

Failover


A failover occurs when a process performed by the failed primary component moves to a backup component in a high-availability cluster. A best practice for high availability—and disaster recovery—is to maintain a failover system that is located off-premises.

IT administrators monitoring the health of critical primary systems can quickly switch traffic to the failover system when primary systems become overloaded or fail.

Fault tolerance


High availability and disaster recovery are both important for business continuity. Together, they help organizations to build high levels of fault tolerance, which refers to a system's ability to keep operating without interruption even if multiple hardware or software components fail.

Fault tolerance aims for zero downtime, while high availability is focused on delivering minimal downtime. A high-availability system designed to provide 99.999%, or five nines, operational uptime expects to see 5.26 minutes of downtime per year.

Unlike high availability, delivering high-quality performance is not a priority for fault tolerance. The purpose of fault-tolerance design in IT infrastructure is to prevent a mission-critical application from experiencing downtime.

Fault tolerance is a more expensive approach to ensuring uptime than high availability because it can involve backing up entire hardware and software systems and power supplies. High-availability systems do not require replication of physical components.

High availability and fault tolerance complement each other in that they help to support IT disaster recovery. Most business continuity strategies include high-availability, fault-tolerance, and disaster-recovery measures. These strategies help the organization maintain essential operations and support users when facing any type of critical IT failure, small or large.

