Friday, 31 October 2025

Elasticsearch Nodes


Elasticsearch nodes are individual instances of Elasticsearch servers that are part of a cluster. Each node stores data and participates in the cluster’s indexing and search capabilities, playing a critical role in the distributed architecture of Elasticsearch.​

Key Points about Elasticsearch Nodes:


A node is a single server or instance running Elasticsearch, identified by a unique name.

Nodes collectively form a cluster, which is a group of Elasticsearch nodes working together.

Nodes can have different roles:
  • Master Node: Manages the cluster state and handles cluster-wide actions like adding/removing nodes and creating/deleting indices.
  • Data Node: Stores data and executes data-related operations such as searches and aggregations.
  • Client (Coordinating) Node: Routes requests to the appropriate nodes but does not hold data.
  • Other special roles include ingestion and machine learning nodes.

Nodes communicate through TCP ports (commonly 9200 for REST API and 9300 for node-to-node communication).

Elasticsearch distributes data across nodes using shards, enabling horizontal scalability, fault tolerance, and high availability.​

In essence, nodes are the building blocks of an Elasticsearch cluster, with each node running on a server (physical or virtual) and working in coordination to provide fast search and analytics on distributed data.

To list all nodes with their attributes we can run this command in Kibana DevTools:


GET /_cat/nodes?v

Output example:

ip            heap.percent ram.percent cpu load_1m load_5m load_15m    node.role       master name
10.199.43.136           44          61   5    1.69    1.71     1.51 cdfhilmrstw -      default-2
10.199.6.164            38          55   4    0.96    1.40     1.33 cdfhilmrstw -      default-1
10.199.30.70            25          51   9    1.61    1.57     1.06 cdfhilmrstw -      data-0
10.199.38.215           46         100  13    1.69    1.71     1.51 cdfhilmrstw -      data-1
10.199.1.249            81          76  30    0.96    1.40     1.33 cdfhilmrstw *      monitoring-1
10.199.32.134           75         100  27    1.69    1.71     1.51 cdfhilmrstw -      monitoring-0
10.199.23.94            77         100  26    1.61    1.57     1.06 cdfhilmrstw -      monitoring-2
10.199.18.75            23          91  19    1.61    1.57     1.06 cdfhilmrstw -      default-0
10.199.15.193           59          56   5    0.96    1.40     1.33 cdfhilmrstw -      data-2


---

Elasticsearch Indices




An Elasticsearch index is a logical namespace that stores and organizes a collection of related JSON documents, similar to a database table in relational databases but designed for full-text search and analytics. 

Each index is uniquely named and can contain any number of documents, where each document is a set of key-value pairs (fields) representing your data.​

Key Features of an Elasticsearch Index


  • Structure: An index is composed of one or more shards, which are distributed across nodes in the Elasticsearch cluster for scalability and resilience.
  • Mapping and Search: Indexes define mappings that control how document fields are stored and searched.
  • Indexing Process: Data is ingested and stored as JSON documents in the index, and Elasticsearch builds an inverted index to allow for fast searches.​
  • Use Case: Indices are used to organize datasets in log analysis, search applications, analytics, or any scenario where rapid search/retrieval is needed.​

In summary, an Elasticsearch index is the foundational storage and retrieval structure enabling efficient search and analytics on large datasets.
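To see which indices exist in a cluster and how much storage each uses, a quick example using the cat indices API in Kibana DevTools (the column and sort parameters below are optional):

GET /_cat/indices?v&h=index,health,status,pri,rep,docs.count,store.size&s=store.size:desc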


Index Lifecycle Policy (ILM)


An Index Lifecycle Management (ILM) policy defines what happens to an index as it ages — automatically. It’s a set of rules for retention, rollover, shrink, freeze, and delete.

Example:

PUT _ilm/policy/functionbeat
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_age": "30d", "max_size": "50GB" }
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": { "delete": {} }
      }
    }
  }
}


This says:
  • Keep the index hot (actively written to) until it’s 30 days old or 50 GB big.
  • Then roll over (create a new index and switch writes to it).
  • After 90 days, delete the old index.
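To check which lifecycle phase and step an ILM-managed index is currently in, the ILM explain API can be used; for example, assuming indices matching the functionbeat-* pattern from above:

GET functionbeat-*/_ilm/explain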

ILM can be applied to a standard (non–data stream) index; we can attach an ILM policy to any index, not just data streams. However, there are important differences (see the data stream sketch after this list):

  • Rollover alias required:
    • Standard Index: Yes. We must manually set up an alias to make rollover work!
    • Data Stream: No (handled automatically - Elastic manages the alias and the backing indices)
  • Multiple backing indices:
    • Standard Index: Optional (via rollover)
    • Data Stream: Always (that’s how data streams work)
  • Simplified management:
    • Standard Index: Manual setup
    • Data Stream: Built-in
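As a sketch of the data stream alternative: a data stream is created from an index template that declares "data_stream": {}, and no rollover alias is needed. The names below (logs-myapp, logs-myapp-default) are placeholders:

PUT _index_template/logs-myapp
{
  "index_patterns": ["logs-myapp-*"],
  "data_stream": {},
  "template": {
    "settings": { "index.lifecycle.name": "functionbeat" }
  }
}

PUT _data_stream/logs-myapp-default

Documents written to a data stream must contain an @timestamp field; Elasticsearch then manages the backing indices and their rollover behind the scenes.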

Index Rollover vs Data Stream


If we have a continuous stream of documents (e.g. logs) being written to Elasticsearch, we should not write them to a single regular index, as its size will grow over time and we'll need to keep increasing node storage. Instead, we should consider one of the following options:

  1. Data Stream
  2. Index with an ILM policy which defines rollover conditions

What does rollover mean for a standard index?

When a rollover is triggered (by size, age, or doc count):

  • Elasticsearch creates a new index with the same alias.
  • The alias used for writes (e.g. functionbeat-write) is moved from the old index to the new one.
  • Functionbeat or Logstash continues writing to the same alias, unaware that rollover happened.


Example:

# Initially
functionbeat-000001  (write alias: functionbeat-write)

# After rollover
functionbeat-000001  (read-only)
functionbeat-000002  (write alias: functionbeat-write)


This keeps the write flow continuous and allows you to:
  • Manage old data (delete, freeze, move to cold tier)
  • Limit index size for performance

How to apply ILM to a standard index?

Here’s a minimal configuration:

PUT _ilm/policy/functionbeat
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_age": "30d", "max_size": "50GB" }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": { "delete": {} }
      }
    }
  }
}

PUT _template/functionbeat
{
  "index_patterns": ["functionbeat-*"],
  "settings": {
    "index.lifecycle.name": "functionbeat",
    "index.lifecycle.rollover_alias": "functionbeat-write"
  }
}


The following command creates a new index called functionbeat-000001 (if it doesn’t already exist); if the index does exist, it only updates the aliases section. It also creates an alias named functionbeat-write that points to this index. Aliases are like virtual index names: we can send reads or writes to the alias instead of a specific index, and they’re lightweight and flexible. Setting "is_write_index": true tells Elasticsearch: “When someone writes to this alias, route the write operations to this index.” If you later have functionbeat-000001 and functionbeat-000002, and both share the alias functionbeat-write, then only the one with "is_write_index": true will receive new documents.

PUT functionbeat-000001
{
  "aliases": {
    "functionbeat-write": { "is_write_index": true }
  }
}


ILM rollover works by:
  • Watching the alias (functionbeat-write), not a specific index.
  • When rollover conditions are met (e.g. 50 GB or 30 days), Elasticsearch:
    • Creates a new index (functionbeat-000002)
    • Moves "is_write_index": true from 000001 to 000002. From that moment, all new Functionbeat writes go to the new index — automatically.
After rollover:
  • functionbeat-000001 becomes read-only, but still searchable.
  • ILM will later delete it when it ages out (based on your policy).

So that last command effectively bootstraps the first generation of an ILM-managed index family.
  • ILM policy: Automates rollover, delete, etc.
  • Rollover action: Creates a new index and shifts the alias
  • Alias requirement: Required, used for write continuity
  • Data stream alternative: Better option, handles rollover and aliasing for you
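To verify the bootstrap worked, and to test rollover without waiting for the conditions to be met, these examples can be run in DevTools (the second one forces an immediate rollover, so use it only on a test index family):

GET _cat/aliases/functionbeat-write?v

POST functionbeat-write/_rollover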

Index Template

Index templates do not retroactively apply to existing indices. They only apply automatically to new indices created after the template exists.

When we define an index template like:

PUT _index_template/functionbeat
{
  "index_patterns": ["functionbeat-*"],
  "template": {
    "settings": {
      "index.lifecycle.name": "functionbeat"
    }
  }
}


That template becomes part of the index creation logic.

So:

When a new index is created (manually or via rollover),
→ Elasticsearch checks all templates matching the name.
→ The matching template(s) are merged into the new index settings.

Existing indices are not touched or updated.

If we already have an index — e.g. functionbeat-8.7.1 — that matches the template pattern, it won’t automatically get the template settings.

We need to apply those manually, for example:

PUT functionbeat-8.7.1/_settings
{
  "index.lifecycle.name": "functionbeat",
  "index.lifecycle.rollover_alias": "functionbeat-write"
}

Now the existing index is under ILM control (using the same settings the template would have applied if it were created fresh).

Elasticsearch treats index templates as blueprints for new indices, not as live configurations.
This is intentional — applying settings automatically to existing indices could cause:
  • unintended allocation moves,
  • mapping conflicts,
  • or lifecycle phase resets.
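To preview what settings a matching template would contribute to a not-yet-created index (without actually creating it), newer Elasticsearch versions offer a simulate index API; a minimal sketch using the index pattern above:

POST _index_template/_simulate_index/functionbeat-000002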

We want to keep as little data as possible in Elasticsearch. If the stored data are logs, we want to:
  • make sure apps are sending only meaningful logs
  • make sure we capture repetitive error messages so the app can be fixed and stop emitting them

Shards and Replicas


We can set the number of shards and replicas per index in Elasticsearch when we create the index, and we can dynamically update the number of replicas (but not the number of primary shards) for existing indices.​

Setting Shards and Replicas on Index Creation


Specify the desired number in the index settings payload:


PUT /indexName
{
  "settings": {
    "index": {
      "number_of_shards": 6,
      "number_of_replicas": 2
    }
  }
}

This creates the index with 6 primary shards and 2 replicas per primary shard.​

Adjusting Replicas After Creation


You can adjust the number of replicas for an existing index using the settings API:


PUT /indexName/_settings
{
  "index": {
    "number_of_replicas": 3
  }
}

Replicas can be changed at any time, but the number of primary shards is fixed for the lifetime of the index.​

Shard and Replica Principles


Each index has a configurable number of primary shards.
Each primary shard can have multiple replica shards (copies).
Replicas improve fault tolerance and can spread search load.​

We should choose shard and replica counts based on data size, node count, and performance needs. Adjusting these settings impacts resource usage and indexing/search performance.


Index Size


To find out the size of each index (shards) we can use the following Kibana DevTools query:


GET /_cat/shards?v&h=index,shard,prirep,state,unassigned.reason,node,store&s=store:desc

The output contains the following columns:
  • index - index name
  • shard - the shard number. With 2 primary shards and 1 replica, we'd have 4 rows: shard=0 for the first two rows (the first primary and its replica) and shard=1 for the next two rows (the second primary and its replica)
  • prirep - whether the shard is a primary (p) or a replica (r)
  • state - e.g. STARTED
  • unassigned.reason - the reason the shard is unassigned (if it is)
  • node - name of the node holding the shard
  • store - storage used (in gb, mb or kb)


Each shard should not be larger than 50GB. We can impose this via Index Lifecycle Policy where we can set rollover criteria.
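Recent Elasticsearch versions also support a max_primary_shard_size rollover condition, which targets shard size directly rather than whole-index size; a hedged variant of the earlier policy:

PUT _ilm/policy/functionbeat
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_primary_shard_size": "50gb", "max_age": "30d" }
        }
      }
    }
  }
}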

Friday, 24 October 2025

Introduction to Kubernetes CoreDNS



CoreDNS is a DNS server that runs inside Kubernetes and is responsible for service discovery — i.e., translating service names (like my-service.default.svc.cluster.local) into IP addresses.


What CoreDNS Does

In a Kubernetes cluster:
  • Every Pod and Service gets its own DNS name.
  • CoreDNS listens for DNS queries from Pods (via /etc/resolv.conf).
  • It looks up the name in the cluster’s internal DNS records and returns the correct ClusterIP or Pod IP.
So if a Pod tries to reach mysql.default.svc.cluster.local, CoreDNS will resolve it to the IP of the mysql service.
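A quick way to test resolution from inside the cluster is to run a throwaway pod and query a service name; a sketch (the busybox image and the mysql service name are placeholders):

kubectl run -it --rm dns-test --image=busybox:1.36 --restart=Never -- \
  nslookup mysql.default.svc.cluster.local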

How It Works

Runs as a Deployment in the kube-system namespace.
Has a Service called kube-dns (for backward compatibility).
Uses a ConfigMap (coredns) to define how DNS queries are processed.
Listens on port 53 (UDP/TCP), the standard DNS port.

Example CoreDNS ConfigMap snippet:

apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        forward . /etc/resolv.conf
        cache 30
        loop
        reload
        loadbalance
    }

Key Plugins


CoreDNS is modular — it uses plugins for specific functionality:
  • kubernetes: handles DNS for cluster Services/Pods.
  • forward: forwards queries to upstream resolvers for external domains.
  • cache: caches responses for faster resolution.
  • prometheus: exposes metrics for monitoring.
  • health: adds a health endpoint.
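Because the Corefile is just plugin configuration, adding behaviour usually means adding a plugin line or a new server block to the coredns ConfigMap. For example, a hedged snippet that forwards queries for a private zone to a dedicated resolver (corp.example.com and 10.0.0.10 are placeholders):

corp.example.com:53 {
    errors
    cache 30
    forward . 10.0.0.10
}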

Why It Matters


Without CoreDNS, Pods can’t resolve service names.
It’s essential for communication between microservices.
It’s a critical cluster component — if it breaks, DNS resolution (and often your workloads) fail.

Common Commands


Check CoreDNS pods:

kubectl get pods -n kube-system -l k8s-app=kube-dns


View CoreDNS logs:

kubectl logs -n kube-system -l k8s-app=kube-dns


Edit CoreDNS config:

kubectl -n kube-system edit configmap coredns


---

Thursday, 16 October 2025

AWS Web Application Firewall (WAF)



A web application firewall is in place to protect public-facing web applications from outside threats.
The WAF should be deployed and configured appropriately so that it blocks malicious traffic.
There should be a single Global WAF ACL set.

If WAF is not enabled or there are no rulesets configured:

Recommended Solution:

Enable a Web Application Firewall (WAF) for all public-facing web applications. Configure it with standard security rulesets (e.g., OWASP Core Rule Set) to detect and block common attacks such as SQL injection, XSS, and request smuggling. Regularly update and tune the rules to minimize false positives.

Risks and Side Effects:

Implementing or tightening WAF rules may initially block legitimate traffic or cause application performance degradation if not tuned properly. Continuous monitoring and rule adjustments are required to balance security and usability.

How to Check if Affected Resources Handle Customer Data:

Review application documentation and data flow diagrams to determine which web applications process or serve customer data. Check whether endpoints behind the affected web servers are used for customer authentication, data input forms, or APIs that exchange customer information. If logs or configuration files reference customer databases or CRM integrations, those applications are considered to handle customer data.


What are the resources in aws that should be behind WAF?

In AWS, any internet-facing resource that serves web traffic or APIs should ideally be protected by a Web Application Firewall (WAF). Here’s a breakdown of the main AWS resources that should be behind WAF, depending on how your architecture is set up:

1. Amazon CloudFront Distributions

Why: CloudFront is often the first entry point for global web traffic. Applying WAF at this layer helps block malicious requests before they reach your origin servers.

Use case: Static websites, dynamic web applications, APIs, or content delivery networks (CDNs).

2. Application Load Balancers (ALB)

Why: ALBs route traffic to backend EC2 instances, containers, or Lambda functions. WAF at this level protects internal applications or microservices exposed to the internet.

Use case: Web apps, REST APIs, or services behind an ALB.

3. Amazon API Gateway

Why: API Gateway endpoints are public by default and can expose critical business logic. WAF helps prevent attacks like injection, scraping, and DoS attempts.

Use case: Public or partner-facing APIs.

4. AWS AppSync (GraphQL APIs)

Why: GraphQL APIs are vulnerable to overly complex queries and enumeration attacks. WAF can enforce query depth limits and request validation.

Use case: Mobile or web applications using GraphQL.

5. Amazon Cognito Hosted UI (optional)

Why: If you expose Cognito’s hosted UI for user sign-up or login, WAF can help protect against brute-force or injection attacks.

Use case: Authentication portals for customers or employees.

6. AWS Elastic Beanstalk Environments

Why: Beanstalk apps typically use an ALB or CloudFront. Apply WAF at the load balancer or CloudFront layer to protect the Beanstalk environment.

Use case: Managed web applications deployed through Elastic Beanstalk.

7. Public-Facing EC2 Instances (if not behind ALB)

Why: Directly exposed EC2 web servers are vulnerable entry points. A WAF can protect HTTP/HTTPS traffic through an AWS WAF-enabled CloudFront distribution or ALB placed in front.

Use case: Legacy applications or custom web servers.


Rule of Thumb

If the AWS resource:
  • Accepts HTTP/HTTPS requests directly from the internet, and
  • Handles customer data or serves data to customers
...then it should be behind AWS WAF (via CloudFront, ALB, or API Gateway integration).
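For regional resources (ALB, API Gateway stage, AppSync API), the association is made with the wafv2 API; a hedged CLI example with placeholder ARNs (CloudFront is different: the WebACL is set on the distribution itself):

aws wafv2 associate-web-acl \
    --web-acl-arn arn:aws:wafv2:us-east-1:123456789012:regional/webacl/my-web-acl/EXAMPLE-ID \
    --resource-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/my-alb/50dc6c495c0c9188 \
    --region us-east-1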



AWS WAF Coverage Checklist


Here’s a short, practical checklist you can use to verify that all applicable AWS resources are protected by a WAF — either for internal compliance tracking (e.g., Drata) or automated auditing (e.g., AWS Config, Security Hub).


1. Identify Public-Facing Entry Points

  • List all CloudFront distributions, ALBs, API Gateways, and AppSync APIs.

Use the AWS CLI or console:

aws cloudfront list-distributions
aws elbv2 describe-load-balancers
aws apigateway get-rest-apis
aws appsync list-graphql-apis


  • Confirm which ones have a public DNS name or public IP (internet-facing).

2. Check for WAF Associations

  • For each CloudFront distribution, confirm it has an associated WAF WebACL:

aws wafv2 list-web-acls --scope CLOUDFRONT


  • For each Application Load Balancer:

aws wafv2 list-web-acls --scope REGIONAL


  • For each API Gateway or AppSync API:

aws wafv2 list-web-acls --scope REGIONAL


  • Verify that the WebACLs are actively associated with the resources above.
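To check a specific regional resource directly, the following command returns the WebACL associated with it, if any (the ARN is a placeholder):

aws wafv2 get-web-acl-for-resource \
    --resource-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/my-alb/50dc6c495c0c9188 \
    --region us-east-1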

3. Review WAF Configuration

  • Ensure WAF WebACLs use AWS Managed Rules (e.g., AWSManagedRulesCommonRuleSet, AWSManagedRulesSQLiRuleSet).

  • Check for custom rules or rate-based rules to block brute force or scraping.

  • Verify logging is enabled to CloudWatch Logs or S3 for auditability.

4. Confirm Coverage for Customer-Data Applications

  • Identify which web apps/APIs process customer data or serve data to customers (e.g., sign-in pages, dashboards, APIs).

  • Ensure those endpoints are behind a CloudFront distribution or ALB with WAF enabled.

  • For internal-only services, document why WAF protection is not required (for audit traceability).

5. Ongoing Monitoring

  • Enable the AWS Config rule wafv2-webacl-resource-association-check
→ Automatically detects if CloudFront, ALB, or API Gateway resources lack WAF association.

  • Integrate findings into AWS Security Hub or Drata evidence collection for continuous compliance.

Extended Arguments (xargs) Unix command




xargs builds and executes command lines from standard input.

While the pipe operator (|) takes the stdout of the previous command and forwards it to the stdin of the next command, xargs takes whitespace-separated strings from that stdin and converts them into arguments of the command it runs.

Example:

$ echo "/dirA /dirB" | xargs ls

will be converted to:

ls /dirA /dirB

-n1 causes the command specified by xargs to be executed with one argument at a time taken from its input (arguments can each be on their own line or simply separated by spaces or tabs), running the command separately for each individual item. The -n1 option means "use just one argument per command invocation". Each item from standard input (for example, a filename from a list) is passed separately to the command, so the command runs as many times as there are items in the input, once for each.

Example:

echo -e "a\nb\nc" | xargs -n1 echo

...will run the echo command three times:

echo a
echo b
echo c

So each invocation receives only one argument from the input.​

This is useful when a command should process only one item at a time, such as deleting files one by one, or when handling commands that cannot accept multiple arguments simultaneously.
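For example, a hedged sketch that deletes old log files one at a time (the path and age are placeholders); -print0 together with -0 keeps filenames containing spaces intact:

find /var/log/myapp -name "*.log" -mtime +30 -print0 | xargs -0 -n1 rm -v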

-I{} replaces {} in the command with the input item (in the following example that's the Lambda function name).

In the following example we use -I to replace {} with the incoming argument for xargs, and then we use positional parameters ($1, $2, ...) to interpolate inputs for sh:

Let's assume we have previously defined variables like...

AWS_PROFILE=...
AWS_REGION=...
SG_ID=...


aws lambda list-functions \
    --profile "$AWS_PROFILE" \
    --region "$AWS_REGION" \
    --query "Functions[].FunctionName" \
    --output text | \
xargs \
    -n1 \
    -I{} \
    sh -c \
        "aws lambda get-function-configuration \
            --profile \"\$1\" \
            --region \"\$2\" \
            --function-name \"\$3\" \
            --query \"VpcConfig.SecurityGroupIds\" \
            --output text 2>/dev/null | \
        grep \
            -w \"\$4\" && \
        echo \
            \"Found in Lambda function: \$3\"" \
    _ "$AWS_PROFILE" "$AWS_REGION" {} "$SG_ID"


The sh -c command allows passing multiple arguments, which are referenced as $1, $2, $3, and $4 inside the shell script.

The underscore (_) is used as a placeholder for the $0 positional parameter inside the sh -c subshell.

When you use sh -c 'script' arg0 arg1 arg2 ..., the first argument after the script (arg0) is assigned to $0 inside the script, and the rest (arg1, arg2, etc.) are assigned to $1, $2, etc.

In this context, _ is a common convention to indicate that the $0 parameter is not used or is irrelevant. It simply fills the required position so that $1, $2, $3, and $4 map correctly to "$AWS_PROFILE", "$AWS_REGION", {} (the function name), and "$SG_ID".
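A tiny illustration of this mapping (with a hypothetical input string), showing that the underscore lands in $0 while the xargs item lands in $1:

$ echo "fileA" | xargs -I{} sh -c 'echo "0=$0 1=$1"' _ {}
0=_ 1=fileA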



Monday, 13 October 2025

AWS VPC Endpoint

 

A VPC Endpoint is a network component that enables private connectivity between AWS resources in a VPC and supported AWS services, without requiring public IP addresses or traffic to traverse the public internet.​

When to Use a VPC Endpoint


Use VPC Endpoints when security and privacy are priorities, as it allows your resources in private subnets to access AWS services (like S3, DynamoDB, or other supported services) without exposure to the internet.​

VPC Endpoints can improve performance, reduce latency, and simplify network architecture by removing dependencies on NAT gateways or internet gateways.​

They help in scenarios where compliance or regulatory requirements dictate that traffic must remain entirely within the AWS network backbone.​

Use them to save on NAT gateway or data transfer costs when large amounts of traffic are sent to or from AWS services.​
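As an illustration, a gateway endpoint for S3 can be created with a single CLI call (the IDs and region below are placeholders); S3 traffic from the associated route tables then stays on the AWS network:

aws ec2 create-vpc-endpoint \
    --vpc-id vpc-0abc1234 \
    --vpc-endpoint-type Gateway \
    --service-name com.amazonaws.us-east-1.s3 \
    --route-table-ids rtb-0abc1234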

When Not to Use a VPC Endpoint


They may not be suitable if you require internet access for your workloads (e.g., accessing third-party services).​

If your use case does not require private connectivity and your infrastructure already relies on internet/NAT gateways, VPC Endpoints could add unnecessary complexity.​

There is an additional cost for interface endpoints, charged per hour and data transferred, which may be a consideration for cost-sensitive environments.​

Service support is not universal—gateway endpoints only work for S3 and DynamoDB, and not all AWS services support PrivateLink/interface endpoints.​

Alternatives to VPC Endpoints


NAT Gateway or NAT Instance: Provides private subnets with internet access, but all traffic goes over the public internet and incurs NAT gateway/data transfer costs.​

VPN Connection or AWS Direct Connect: Used for private connectivity between on-premises networks and AWS VPCs. These are more suitable for hybrid cloud requirements and broader connectivity scenarios.​

Internet Gateway: Needed if your resources require general internet access, though this exposes them to the public internet.

---

Policy types in AWS

 


There are several types of AWS policies, but the primary and most commonly referenced categories are identity-based policies and resource-based policies.

Main AWS Policy Types



Identity-based policies are attached to AWS IAM identities (users, groups, or roles) and define what actions those entities can perform on which resources.​


Resource-based policies are attached directly to AWS resources (such as S3 buckets or SNS topics), specifying which principals (identities or accounts) can access those resources and what actions are permitted.​

Resource-based policy example:

{
   "Version": "2012-10-17",
   "Id": "Policy1415115909152",
   "Statement": [
     {
       "Sid": "Access-to-specific-VPCE-only",
       "Principal": "*",
       "Action": "s3:GetObject",
       "Effect": "Allow",
       "Resource": ["arn:aws:s3:::yourbucketname",
                    "arn:aws:s3:::yourbucketname/*"],
       "Condition": {
         "StringEquals": {
           "aws:SourceVpce": "vpce-1a2b3c4d"
         }
       }
     }
   ]
}
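For contrast, a minimal identity-based policy granting the same kind of S3 read access might look like the following; note there is no Principal element, because the policy is attached to the user, group, or role itself:

{
   "Version": "2012-10-17",
   "Statement": [
     {
       "Effect": "Allow",
       "Action": ["s3:GetObject", "s3:ListBucket"],
       "Resource": ["arn:aws:s3:::yourbucketname",
                    "arn:aws:s3:::yourbucketname/*"]
     }
   ]
}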



Other AWS Policy Types

In addition to the above, AWS supports several other policy types, including:
  • Managed policies (AWS managed and customer managed)
  • Inline policies (directly embedded on a single identity)
  • Permissions boundaries (set maximum permissions for identities)
  • Service Control Policies (SCPs, used in AWS Organizations)
  • Access Control Lists (ACLs, primarily for resources like S3 buckets)
  • Session policies (restrict permissions for sessions created with temporary credentials).​
While identity-based and resource-based are the two fundamental categories most often discussed, the broader IAM ecosystem incorporates additional forms for more advanced governance and restrictions.​




How do resource-based and identity-based policies differ?


Resource-based and identity-based policies in AWS differ primarily in their attachment location and in how they control access permissions to AWS resources.​

Key Differences


Identity-based policies are attached to IAM entities (users, groups, or roles) and specify what actions these identities can perform on which resources. For example, an IAM user can have a policy that permits reading from specific DynamoDB tables or starting EC2 instances.​

Resource-based policies are attached directly to AWS resources (such as S3 buckets, SNS topics, or KMS keys). These policies define which principals (users, roles, accounts) can access the resource and what actions they are allowed to perform. Resource-based policies allow for fine-grained control, including granting access to principals outside of the resource owner’s AWS account.​

Attachment and Usage


Identity-based policies are managed at the IAM level and generally offer broader access control for multiple resources through one principal.​

Resource-based policies are applied specifically to resources and are used when the access control needs to be defined at the resource level, possibly for cross-account or external identity access.​

Policy Evaluation


When a request to access a resource is made, AWS evaluates all applicable identity-based and resource-based policies together.

An explicit 'Deny' found in any policy will override any 'Allow' statements, ensuring comprehensive and secure permission management.​

Both types can be used separately or together, depending on security architecture and the granularity of control required for identities and resources.


If resource based policy allows access to some user, do we need a separate identity-based policy which allows access to that resource to be attached to that user?


If a resource-based policy allows access to a user, there is generally no need to attach a separate identity-based policy for that specific action—access will be granted solely based on the resource-based policy's allowance.​​

Details on Policy Evaluation Logic


When a request is made, AWS checks both identity-based and resource-based policies that apply to the request.​

If any policy contains an explicit 'Deny', access is denied regardless of any 'Allow' statements elsewhere.​

If at least one policy (whether identity-based or resource-based) contains an 'Allow' and there are no explicit Deny statements, the action is permitted.​​

For resource-based policies, the permission is granted directly to the designated principal (user or role), even if the user or role does not have a corresponding identity-based policy that allows the action.​​

In cross-account scenarios, both the identity-based policy in the requesting account and the resource-based policy on the target resource must allow the action; otherwise, access is denied.​

A user can be granted access solely by a resource-based policy; identity-based policies become useful when finer-grained or multiple permissions are needed across various resources. A resource-based policy alone can grant a user access to that resource and action, without a separate identity-based policy, as long as no explicit deny applies elsewhere. AWS evaluates all applicable identity-based and resource-based policies for a request: if any contains an explicit deny, access is rejected; otherwise, if at least one policy allows the action, access is permitted.

This means a user with no identity-based permission, but with permission in a resource-based policy, can still access that specific resource unless a deny blocks them. However, in cross-account situations, both a corresponding identity-based policy in the user's account and a resource-based policy in the resource owner's account must allow the action for access to succeed.

Are policies listed in IAM in AWS Console, only identity-based policies?


Yes, the policies listed in the IAM section of the AWS Console are only identity-based policies—specifically managed policies and inline policies that are attached to IAM users, groups, or roles.​

The policies you see under "Policies" are AWS managed or customer managed identity-based policies; inline policies are shown on the individual user, group, or role in which they are embedded.

Resource-based policies (such as S3 bucket policies, SNS topic policies, or Lambda resource policies) are not centrally listed in IAM “Policies” in the Console; instead, they are managed from the respective resource consoles (e.g., via the S3 or Lambda management screens).​

The IAM Console does not display resource-based policies in the Policies list, since these are stored on resources, not IAM identities.​

To summarize, only identity-based policies are listed in the IAM Policies view in the AWS Console (managed policies in the central list, inline policies on each identity). Resource-based policies are managed and reviewed from the console page of each AWS service resource.

---

Tuesday, 7 October 2025

Amazon RDS (Relational Database Service)

 


Amazon Relational Database Service (RDS) 
  • Distributed relational database service
  • Simplifies the setup, operation, and scaling of a relational database
  • Automates admin tasks like patching the database software, backing up databases and enabling point-in-time recovery 
  • Scaling storage and compute resources is done via API call

Amazon RDS supports eight major database engines:
  • Oracle (proprietary)
  • Microsoft SQL Server (proprietary)
  • IBM Db2 (proprietary)
  • Amazon Aurora with MySQL compatibility (AWS proprietary)
  • Amazon Aurora with PostgreSQL compatibility (AWS proprietary)
  • MySQL (open-source)
  • MariaDB (open-source)
  • PostgreSQL (open-source)


Networking

We can launch Amazon RDS databases in the public or private subnet of a VPC. 

If the DB instance is in a public subnet and we want it to be accessible from the Internet:

  • Publicly Accessible property of the DB instance needs to be set to Yes
  • Inbound rules for the security group of the RDS instance need to allow connections from the source IP (see the example after this list)
  • Internet Gateway needs to be attached to VPC
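For the security group rule, a hedged CLI example (the security group ID and source IP below are placeholders):

aws ec2 authorize-security-group-ingress \
    --group-id sg-0abc1234 \
    --protocol tcp \
    --port 3306 \
    --cidr 203.0.113.10/32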

Troubleshooting


To test if RDS is accessible from Internet (and also that it's up and listening) we can use Telnet or Netcat.

If using MacOS, Telnet is not installed by default so install it via brew:

% brew install telnet

% telnet test-example-com-13-20240712155825.ckmh7hyrsza3.us-east-1.rds.amazonaws.com 3306 
Trying 121.65.2.95...
Connected to ec2-10-10-129-99.compute-1.amazonaws.com.
Escape character is '^]'.
Connection closed by foreign host.

If we want to use Netcat:

% nc test-example-com-13-20240712155825.ckmh7hyrsza3.us-east-1.rds.amazonaws.com 3306 -v
Connection to test-example-com-13-20240712155825.ckmh7hyrsza3.us-east-1.rds.amazonaws.com port 3306 [tcp/mysql] succeeded!


MySQL

Terraform


These are the privileges that allow a user to execute DROP USER in an RDS MySQL DB:

# Global privileges
resource "mysql_grant" "bojan_global" {
  user       = mysql_user.bojan.user
  host       = mysql_user.bojan.host
  database   = "*"
  table      = "*"
  privileges = ["CREATE USER"]
}

# Database-level privileges
resource "mysql_grant" "bojan" {
  user       = mysql_user.bojan.user
  host       = mysql_user.bojan.host
  database   = "%"
  privileges = ["SELECT", "SHOW VIEW", "INSERT", "UPDATE", "EXECUTE", "DELETE"]
}

The root user in RDS MySQL does not have all the admin privileges it would have on a regular, self-managed MySQL instance.

The root user in RDS MySQL can't grant the SYSTEM_USER privilege to another user, as this error occurs:

Error 1227 (42000): Access denied; you need (at least one of) the RDSADMIN USER privilege(s) for this operation

The root user also does not have privileges to grant privileges on the mysql system database, as this error occurs:

Error running SQL (GRANT DELETE ON `mysql`.* TO 'bojan'@'%'): Error 1044 (42000): Access denied for user 'root'@'%' to database 'mysql'


---


Friday, 5 September 2025

Introduction to Amazon Kinesis Data Streams

 


Amazon Kinesis Data Streams is one of 4 Amazon Kinesis services. It helps to easily stream data at any scale.

  • There are no servers to manage
  • Two capacity modes:
    • on-demand: eliminates the need to provision or manage capacity required for running applications; provisioning and scaling are automatic
    • provisioned
  • Pay only for what you use 
  • Built-in integrations with other AWS services to create analytics, serverless, and application integration solutions
  • To ingest and collect terabytes of data per day from application and service logs, clickstream data, sensor data, and in-app user events to power live dashboards, generate metrics, and deliver data into data lakes
  • To build applications for high-frequency event data such as clickstream data, and gain access to insights in seconds, not days, using AWS Lambda or Amazon Managed Service for Apache Flink
  • To pair with AWS Lambda to respond to or adjust immediate occurrences within the event-driven applications in your environment, at any scale.

The producers continually push data to Kinesis Data Streams, and the consumers process the data in real time. A producer can be, for example, a CloudWatch Logs subscription, and a consumer can be a Lambda function.

A producer puts data records into shards and a consumer gets data records from shards. Consumers use shards for parallel data processing and for consuming data in the exact order in which they are stored. If writes and reads exceed the shard limits, the producer and consumer applications will receive throttles, which can be handled through retries.
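A minimal producer-side example using the CLI (stream name, partition key, and payload are placeholders; AWS CLI v2 expects the --data value to be Base64-encoded by default):

aws kinesis put-record \
    --stream-name my-stream \
    --partition-key user-42 \
    --data "aGVsbG8gd29ybGQ=" \
    --region us-east-2 \
    --profile my-profile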

Capacity Modes

  • Provisioned
    • data stream capacity is fixed
    • 1 shard has fixed capacities: 
      • Write: Maximum 1 MiB/second or 1,000 records/second
      • Read: Maximum 2 MiB/second
    • N shards will multiply R/W capacity by N
  • On-demand
    • data stream capacity scales automatically

Data Retention Period

A Kinesis data stream stores records for 24 hours by default, and retention can be increased up to 8760 hours (365 days). You can update the retention period via the Kinesis Data Streams console or by using the IncreaseStreamRetentionPeriod and DecreaseStreamRetentionPeriod operations.

The default retention period of 24 hours covers scenarios where intermittent lags in processing require catch-up with the real-time data. A seven-day retention lets you reprocess data for up to seven days to resolve potential downstream data losses. Long-term data retention greater than seven days and up to 365 days lets you reprocess old data for use cases such as algorithm back testing, data store backfills, and auditing.
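For example, to extend retention to seven days (168 hours) on an existing stream:

aws kinesis increase-stream-retention-period \
    --stream-name my-stream \
    --retention-period-hours 168 \
    --region us-east-2 \
    --profile my-profile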


Data Stream << Shards << Data Records


Kinesis data stream is a set of shards. Each shard has a sequence of data records.

A data record is the unit of data stored in a Kinesis data stream. Data records are composed of:
  • Sequence number
    • within the shard
    • assigned by Kinesis Data Streams
  • Partition key
    • used to isolate and route records to different shards of a data stream
    • specified by your data producer while adding data to a Kinesis data stream. For example, let’s say you have a data stream with two shards (shard 1 and shard 2). You can configure your data producer to use two partition keys (key A and key B) so that all records with key A are added to shard 1 and all records with key B are added to shard 2.
  • Data blob
    • an immutable sequence of bytes
    • data of interest your data producer adds to a data stream
    • Kinesis Data Streams does not inspect, interpret, or change the data in the blob in any way
    • A data blob (the data payload before Base64-encoding)  can be up to 1 MB.

How to find all shards in a stream?

% aws kinesis list-shards \
--stream-name my-stream \
--region us-east-2 \
--profile my-profile
{
    "Shards": [
        {
            "ShardId": "shardId-000000000000",
            "HashKeyRange": {
                "StartingHashKey": "0",
                "EndingHashKey": "440282366920938463463374607431768211455"
            },
            "SequenceNumberRange": {
                "StartingSequenceNumber": "49663754454378916691333541504734985347376184017408753666"
            }
        }
    ]
}


Note HashKeyRange - the hash of a record's partition key is mapped into this range to determine which shard the record is written to.

Shard Iterator


Shard iterators are a fundamental tool for consumers to process data in the correct order and without missing records. They provide fine control over the reading position, making it possible to build scalable, reliable stream processing workflows.

A shard iterator is a reference marker that specifies the exact position within a shard from which data reading should begin and continue sequentially. It enables consumers to access, read, and process records from a particular location within a data stream shard.

Different types of shard iterators control where reading starts:
  • AT_SEQUENCE_NUMBER: Reads from the specific sequence number.
  • AFTER_SEQUENCE_NUMBER: Reads just after the specified sequence number.
  • TRIM_HORIZON: Starts from the oldest available data in the shard.
  • LATEST: Starts after the most recent record, so only records added after the iterator is created are read
  • AT_TIMESTAMP: Starts from a specific timestamp in the shard
When reading repeatedly from a stream, the shard iterator is updated after each GetRecords request (with the NextShardIterator returned by the API).

This mechanism allows applications to seamlessly resume reading from the correct point in the stream.

Proper management of shard iterators is essential to avoid data loss or duplicate reads, especially as data retention policies and processing speeds can affect the availability of data in the stream.


How to get an iterator in a shard?


A shard iterator is obtained using the GetShardIterator API, which marks the spot in the shard to start reading records.

The command aws kinesis get-shard-iterator is used to obtain a pointer (iterator) to a specific position in a shard of a Kinesis stream. We need this iterator to actually read records from the stream using aws kinesis get-records.


% ITERATOR=$(aws kinesis get-shard-iterator \
    --stream-name my-stream \
    --shard-id shardId-000000000000 \
    --shard-iterator-type TRIM_HORIZON \
    --query 'ShardIterator' \
    --output text \
    --region us-east-2 \
    --profile my-profile)

ShardIterator is a token we pass to aws kinesis get-records to actually fetch the data.

% echo $ITERATOR
AAAAAAAAAAHf65JkbV8ZJQ...Exsy8WerU5Z8LKI8wtXm95+blpXljd0UWgDs7Seo9QlpikJI/6U=


We can now read records directly from the stream:

% aws kinesis get-records \
--shard-iterator $ITERATOR \
--limit 50 \
--region us-east-2 \
--profile my-profile
{
    "Records": [
        {
            "SequenceNumber": "49666716197061357389751170868210642185623031110770360322",
            "ApproximateArrivalTimestamp": "2025-09-03T17:37:00.343000+01:00",
            "Data": "H4sIAAAAAAAA/+3YTW/TMBgH8K8S...KaevxIQAA",
            "PartitionKey": "54dc991cd70ae6242c35f01972968478"
        },
        {
            "SequenceNumber": "49666716197061357389751170868211851111442645739945066498",
            "ApproximateArrivalTimestamp": "2025-09-03T17:37:00.343000+01:00",
            "Data": "H4sIAAAAAAAA/+3YS2vcM...889uPBV8BVXceAAA=",
            "PartitionKey": "e4e9ad254c154281a67d05a33fa0ea31"
        },
        ...
   ]
}

Data field in each record is Base64-encoded. That’s because Kinesis doesn’t know what your producers are sending (could be JSON, gzipped text, protobuf, etc.), so it just delivers raw bytes.

Why each record read from the shard has a different Partition Key?


Each record in a Kinesis stream can have a different PartitionKey because the PartitionKey is chosen by the data producer for each record and is not tied directly to the way records are read from the stream or which iterator type is used. Even when reading from a single shard using TRIM_HORIZON, records within that shard may have different PartitionKeys because the PartitionKey is used to route records to shards at the time of writing—not to group records within a shard during reading.


How Partition Keys and Shards Work


The PartitionKey determines to which shard a record is sent by hashing the key and mapping it to a shard's hash key range. Multiple records with different PartitionKeys can end up in the same shard, especially if the total number of unique partition keys is greater than the number of shards, or if the hash function maps them together. When reading from a shard (with any iterator, including TRIM_HORIZON), the records are read in order of arrival, but each can have any PartitionKey defined at ingest time.

Reading and PartitionKeys


Using TRIM_HORIZON just means starting at the oldest record available in the shard. It does not guarantee all records have the same PartitionKey, only that they are the oldest records remaining for that shard. Records from a single shard will often have various PartitionKeys, all mixed together as per their original ingest. Therefore, it is normal and expected to see a variety of PartitionKeys when reading a batch of records from the same shard with TRIM_HORIZON.


Writing and Reading

Data records written by the same producer can end up in different shards if the producer chooses different PartitionKeys for those records. The distribution of records across shards is determined by the hash of the PartitionKey assigned to each record, not by the producer or the order of writing.

Distribution Across Shards

If a producer sends data with varying PartitionKeys, Kinesis uses those keys' hashes to assign records to shards; thus, even the same producer's records can be spread across multiple shards.
If the producer uses the same PartitionKey for all records, then all its records will go to the same shard, preserving strict ordering for that key within the shard.

Reading N Records: Producer and Sequence

When reading N records from a shard iterator, those records are the next available records in that shard, in the order they arrived in that shard. These records will not necessarily all be from the same producer, nor are they guaranteed to be N consecutive records produced by any single producer.

Records from different producers, as well as from the same producer if it used multiple PartitionKeys, can appear in any sequence within a shard, depending on how PartitionKeys are mapped at write time.
In summary, unless a producer always uses the same PartitionKey, its records may spread across shards, and any batch read from a shard iterator will simply reflect the ordering of records within that shard, including records from multiple producers and PartitionKeys.


How to get records content in a human-readable format?

We need to extract the Data field from a record, Base64-decode it, and then process the data further. In our example the payload is gzip-compressed JSON (as that's what a CloudWatch Logs → Kinesis subscription delivers), so we need to decompress it and parse it as JSON:


% aws kinesis get-records \
--shard-iterator $ITERATOR \
--limit 1 \
--query 'Records[0].Data' \
--output text \
--region us-east-2 \
--profile my-profile \
| base64 --decode \
| gzip -d \
| jq .

{
  "messageType": "DATA_MESSAGE",
  "owner": "123456789999",
  "logGroup": "/aws/lambda/my-lambda",
  "logStream": "2025/09/03/[$LATEST]202865ca545544deb61360d571180d45",
  "subscriptionFilters": [
    "my-lambda-MainSubscriptionFilter-r1YlqNvCVDqk"
  ],
  "logEvents": [
    {
      "id": "39180581104302115620949753249891540826922898325656305664",
      "timestamp": 1756918020250,
      "message": "{\"log.level\":\"info\",\"@timestamp\":\"2025-09-03T16:47:00.250Z\",\"log.origin\":{\"function\":\"github.com/elastic/apm-aws-lambda/app.(*App).Run\",\"file.name\":\"app/run.go\",\"file.line\":98},\"message\":\"Exiting due to shutdown event with reason spindown\",\"ecs.version\":\"1.6.0\"}\n"
    },
    {
      "id": "39180581104302115620949753249891540826922898325656305665",
      "timestamp": 1756918020250,
      "message": "{\"log.level\":\"warn\",\"@timestamp\":\"2025-09-03T16:47:00.250Z\",\"log.origin\":{\"function\":\"github.com/elastic/apm-aws-lambda/apmproxy.(*Client).forwardLambdaData\",\"file.name\":\"apmproxy/apmserver.go\",\"file.line\":357},\"message\":\"Dropping lambda data due to error: metadata is not yet available\",\"ecs.version\":\"1.6.0\"}\n"
    },
    {
      "id": "39180581104302115620949753249891540826922898325656305666",
      "timestamp": 1756918020250,
      "message": "{\"log.level\":\"warn\",\"@timestamp\":\"2025-09-03T16:47:00.250Z\",\"log.origin\":{\"function\":\"github.com/elastic/apm-aws-lambda/apmproxy.(*Client).forwardLambdaData\",\"file.name\":\"apmproxy/apmserver.go\",\"file.line\":357},\"message\":\"Dropping lambda data due to error: metadata is not yet available\",\"ecs.version\":\"1.6.0\"}\n"
    }
  ]
}


How can Kinesis ingest logs from lambda functions in CloudWatch?


To ingest logs from AWS Lambda functions into Amazon Kinesis, you can follow these steps to set up an integration between CloudWatch Logs (where Lambda logs are stored) and Kinesis Data Streams (for further processing). Here’s the high-level process:

1. Create an Amazon Kinesis Data Stream


Go to AWS Management Console → Kinesis → Data Streams → "Create data stream".

Specify the name of your stream and choose the number of shards (a shard is a unit of throughput in a Kinesis stream).

Click Create to set up your data stream.

2. Set Up CloudWatch Logs Subscription to Kinesis


Go to CloudWatch → Logs → Select the Log Group that captures your Lambda logs.

Click on the Actions dropdown → Select Create a subscription filter.

In this step, you are creating a subscription filter that will forward logs from CloudWatch to Kinesis.

Destination: Choose Kinesis stream as the destination.

Kinesis Stream: Select the Kinesis Data Stream you created in step 1.

Log Format Transformation (optional): You can specify a filter pattern if you only want certain logs forwarded to Kinesis (e.g., to capture error logs only).

IAM Role: Ensure that an IAM role with appropriate permissions exists (i.e., it can put records into the Kinesis stream). CloudWatch will need permission to send data to Kinesis on behalf of your Lambda function.
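The same subscription filter can also be created from the CLI; a hedged example with placeholder names and ARNs:

aws logs put-subscription-filter \
    --log-group-name /aws/lambda/my-lambda \
    --filter-name kinesis-forwarder \
    --filter-pattern "" \
    --destination-arn arn:aws:kinesis:REGION:ACCOUNT_ID:stream/STREAM_NAME \
    --role-arn arn:aws:iam::ACCOUNT_ID:role/CWLtoKinesisRole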

3. Add Permissions to the IAM Role


Ensure that the IAM role you are using for CloudWatch Logs has the correct permissions to publish data to Kinesis. You can create or modify a role with the following permissions:


{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "kinesis:PutRecord",
        "kinesis:PutRecords"
      ],
      "Resource": "arn:aws:kinesis:REGION:ACCOUNT_ID:stream/STREAM_NAME"
    }
  ]
}
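The role also needs a trust policy that lets the CloudWatch Logs service assume it; a minimal sketch:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "logs.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}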



4. Lambda Logs Flow into Kinesis Stream


Once the subscription filter is set up, any logs generated by your Lambda functions that are written to CloudWatch will be forwarded to your Kinesis Data Stream. You can then use Kinesis for further processing, such as real-time analytics, storage in Amazon S3, or triggering alerts.

5. Processing Data from Kinesis


Kinesis Data Firehose: If you want to load logs into storage destinations like Amazon S3 or Redshift, you can use Kinesis Data Firehose to batch, compress, and send logs from the stream to your preferred storage.

Kinesis Data Analytics: You can use Kinesis Data Analytics to process logs in real-time using SQL queries.

AWS Lambda: You can also configure another Lambda function to process logs directly from the Kinesis stream for custom event-driven processing.

By setting up this pipeline, you ensure that all logs generated by your Lambda function in CloudWatch Logs are captured in Kinesis for further real-time processing, storage, or analytics.


How to check current number of records in the stream?

There is no direct API call in Kinesis Data Streams that returns “how many records are currently in the stream”.

Kinesis is a streaming system, not a queue, so it retains data by time (e.g., 24 hours or 7 days), not until it’s consumed — therefore the service does not track or expose a “current record count”.

However, you can approximate or monitor data volume in a few useful ways:

1. Use CloudWatch Metrics


Kinesis automatically emits several CloudWatch metrics you can use:
  • IncomingRecords
    • Meaning : Number of records ingested per second
    • Useful For: Write rate monitoring
  • IncomingBytes
    • Meaning: Number of bytes ingested per second
    • Useful For: Throughput monitoring
  • GetRecords.IteratorAgeMilliseconds
    • Meaning: Age of the last record your consumer read
    • Useful For: Backlog (lag) detection

To check stream lag, look at:

GetRecords.IteratorAgeMilliseconds


If this is growing → your consumer is behind (i.e., backlog increasing).

2. Estimate Backlog (Number of Unread Records)


The Iterator Age metric allows estimating backlog:

Backlog (records) ≈ iterator_age_seconds × incoming_records_per_second

Example:

IteratorAge = 300 seconds (5 minutes)
IncomingRecords = 100 records/sec
Backlog ≈ 300 * 100 = 30,000 pending records
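The iterator age itself can be pulled from CloudWatch with the CLI; a hedged example (stream name, time range, region, and profile are placeholders):

aws cloudwatch get-metric-statistics \
    --namespace AWS/Kinesis \
    --metric-name GetRecords.IteratorAgeMilliseconds \
    --dimensions Name=StreamName,Value=my-stream \
    --statistics Maximum \
    --period 300 \
    --start-time 2025-09-05T00:00:00Z \
    --end-time 2025-09-05T06:00:00Z \
    --region us-east-2 \
    --profile my-profile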

3. Use Enhanced Monitoring (Per-Shard Metrics)


You can turn on Enhanced Shard-Level Metrics, which gives you shard-specific counts instead of aggregated metrics.

Terraform example to enable:

resource "aws_kinesis_stream" "example" {
  name        = "my-stream"
  shard_count = 1

  shard_level_metrics = [
    "IncomingBytes",
    "IncomingRecords",
    "OutgoingBytes",
    "OutgoingRecords",
    "IteratorAgeMilliseconds",
    "ReadProvisionedThroughputExceeded",
    "WriteProvisionedThroughputExceeded",
  ]
}

4. Manually Count Records (Only for Special Cases)


You can scan shards and count records, but this is slow, expensive, and not recommended for production:

aws kinesis get-shard-iterator \
    --stream-name my-stream \
    --shard-id shardId-000000000000 \
    --shard-iterator-type TRIM_HORIZON \
    --query 'ShardIterator' \
    --output text | \
    xargs -I {} aws kinesis get-records --shard-iterator {} --limit 10000


You would have to keep looping with the returned NextShardIterator until GetRecords stops returning new records — this gets costly fast.


Metrics to Observe


Basic and enhanced CloudWatch Metrics


Kinesis Data Streams and Amazon CloudWatch are integrated so that you can collect, view, and analyze CloudWatch metrics for your Kinesis data streams. This integration supports basic stream-level and enhanced shard-level monitoring for Kinesis data streams.
  • Basic (stream-level) metrics – Stream-level data is sent automatically every minute at no charge.
  • Enhanced (shard-level) metrics – Shard-level data is sent every minute for an additional cost.

A stream has producers and consumers. If the rate of writing new records is higher than the rate of reading them, records will reach their retention age (24 hours by default) and the oldest unread records will start being removed from the stream, lost forever.

To prevent this from happening we can monitor some stream metrics and also set alarms when they reach critical thresholds.


GetRecords.IteratorAgeMilliseconds

It measures how old the oldest record returned by GetRecords is (how far our consumer lags). A very large value → our consumer(s) aren’t keeping up with the incoming write rate.

The age of the last record in all GetRecords calls made against a Kinesis stream, measured over the specified time period. Age is the difference between the current time and when the last record of the GetRecords call was written to the stream. The Minimum and Maximum statistics can be used to track the progress of Kinesis consumer applications. A value of zero indicates that the records being read are completely caught up with the stream. Shard-level metric name: IteratorAgeMilliseconds.

Meaningful Statistics: Minimum, Maximum, Average, Samples

Unit info: Milliseconds

There are 86,400,000 milliseconds in a day, so with the default 24-hour retention, if this metric goes above that value some records will already have expired and been lost.

Iterator-age number is a classic “consumer is falling behind” symptom.

IteratorAgeMilliseconds = 86.4M → very high, backlog building


GetRecords.Bytes

The number of bytes retrieved from the Kinesis stream, measured over the specified time period. Minimum, Maximum, and Average statistics represent the bytes in a single GetRecords operation for the stream in the specified time period.  
             
Shard-level metric name: OutgoingBytes
             
Meaningful Statistics: Minimum, Maximum, Average, Sum, Samples
             
Unit info: Bytes



GetRecords - sum (MiB/second)

GetRecords - sum (Count)
  • GetRecords.Records decreasing → Lambda is lagging


GetRecords iterator age - maximum (Milliseconds)
GetRecords latency - average (Milliseconds)

GetRecords success - average (Ratio)
  • GetRecords.Success decreasing → Lambda is not keeping up

Incoming data - sum (MiB/second)
Incoming data - sum (Count)

PutRecord - sum (MiB/second)
PutRecords - sum (MiB/second)
PutRecord latency - average (Milliseconds)
PutRecords latency - average (Milliseconds)
PutRecord success - average (Ratio)
PutRecords successful records - average (Percent)
PutRecords failed records - average (Percent)
PutRecords throttled records - average (Percent)

Read throughput exceeded - average (Ratio)
Write throughput exceeded - average (Count)







Addressing Bottlenecks


If a Lambda function is the stream consumer, then with just 1 shard (and default settings) only 1 Lambda invocation processes records at a time per shard. If our log rate exceeds what that shard can handle, the iterator age skyrockets.

It is possible to change number of shards on a live Kinesis stream:

aws kinesis update-shard-count \
  --stream-name your-stream-name \
  --target-shard-count 4 \
  --scaling-type UNIFORM_SCALING


Each shard = 1 MiB/s write and 2 MiB/s read, so 4 shards = 4 MiB/s write and 8 MiB/s read.
Lambda will then process 4 records batches in parallel (one per shard).


To speed up Lambda processing, consider increasing its memory, e.g. from 256 MB to 512 MB (more memory also means proportionally more CPU). In AWS Console, this can be done in Lambda >> Configuration >> General configuration.

In Kinesis stream trigger, we can try to increase:
  • batch size: 100 --> 400
  • concurrent batches per shard: 1 --> 5

Example of settings for Kinesis stream trigger:

  • Activate trigger: Yes
  • Batch size: 400
  • Batch window: None
  • Concurrent batches per shard: 5
  • Event source mapping ARN: arn:aws:lambda:us-east-2:123456789012:event-source-mapping:80bd81a9-c175-4af5-9aa9-8926b0587f40
  • Last processing result: OK
  • Maximum age of record: -1
  • Metrics: None
  • On-failure destination: None
  • Report batch item failures: No
  • Retry attempts: -1
  • Split batch on error: No
  • Starting position: TRIM_HORIZON
  • Tags: View
  • Tumbling window duration: None
  • UUID: 80bd81a9-c175-4af5-9aa9-8926b0587f40

In AWS Console, this can be done in Lambda >> Configuration >> Triggers.
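The same settings can be changed from the CLI via the event source mapping (the UUID is the one shown above; the batch size and parallelization factor values are examples):

aws lambda update-event-source-mapping \
    --uuid 80bd81a9-c175-4af5-9aa9-8926b0587f40 \
    --batch-size 400 \
    --parallelization-factor 5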


What “Concurrent batches per shard” does?
  • Each shard can have multiple batches being processed at the same time.
  • Default is 1, meaning: the next batch from a shard won’t be sent to Lambda until the previous batch finishes.
  • If your Lambda is slow or variable in duration, this can create a backlog because only one batch per shard is in flight at a time.

When to increase it?
  • If IteratorAgeMilliseconds is very high → Lambda cannot keep up with the stream.
  • If Lambda execution duration is variable → a single batch in flight per shard limits throughput.
  • If you have sufficient Lambda concurrency (which you do — up to 15) → you can safely allow multiple batches per shard.

Recommended approach
  • Start with 2–5 concurrent batches per shard
    • This allows Kinesis to send multiple batches from the same shard to Lambda simultaneously.
    • Observe if IteratorAgeMilliseconds decreases.
  • Monitor Lambda throttles and duration
    • Ensure your Lambda’s memory/cpu and timeout can handle multiple concurrent batches.
    • if no throttling currently → room to increase concurrency.
  • Adjust batch size if needed
    • Larger batch sizes may help throughput, but smaller batch sizes + higher concurrency often reduce latency.

Important notes
  • Increasing concurrency per shard does not increase shard limits; it just allows more parallelism per shard.
  • If Lambda fails batches, retries are per batch → more concurrency increases the number of batches being retried simultaneously.

Practical starting point:
  • Set Concurrent batches per shard = 5
  • Keep batch size = 400
  • Monitor:
    • IteratorAgeMilliseconds
    • GetRecords.Records
    • Lambda duration / concurrency
  • Adjust up/down based on observed backlog.

Whatever you change, make a single change at a time and then monitor performance. That will give a clear picture of the impact of that particular parameter and its value.

