In Elasticsearch, a data stream is an abstraction layer designed to simplify the management of continuously generated time-series data, such as logs, metrics, and events.
Key Characteristics
- Time-Series Focus: Every document indexed into a data stream must contain a @timestamp field, which is used to organize and query the data.
- Append-Only Design: Data streams are optimized for use cases where data is rarely updated or deleted. We cannot send standard update or delete requests directly to the stream; these must be performed via _update_by_query or _delete_by_query, or directed at a specific backing index.
- Unified Interface: Users interact with a single named resource (the data stream name) for both indexing and searching, even though the data is physically spread across multiple underlying indices.
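For example, indexing into a stream looks just like indexing into a regular index, as long as each document carries a @timestamp (the stream name my-logs below is hypothetical):
POST my-logs/_doc
{
  "@timestamp": "2025-01-01T12:00:00Z",
  "message": "user login failed"
}
A document without a @timestamp field is rejected.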
Architecture: Backing Indices
A data stream consists of one or more hidden, auto-generated backing indices:
- Write Index: The most recently created backing index. All new documents are automatically routed here.
- Rollover: When the write index reaches a specific size or age, Elasticsearch automatically creates a new backing index (rollover) and sets it as the new write index.
- Search: Search requests sent to the data stream are automatically routed to all of its backing indices to return a complete result set.
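As a quick sketch (again with a hypothetical stream name), we can inspect the backing indices or trigger a rollover manually:
# Lists the stream's generation and hidden backing indices,
# whose names look like .ds-my-logs-2025.01.01-000001
GET _data_stream/my-logs

# Forces an immediate rollover instead of waiting for the size/age conditions
POST my-logs/_rollover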
Automated Management
Data streams rely on two primary automation tools:
- Index Templates: These define the stream's structure, including field mappings and settings, and must include a data_stream object to enable the feature.
- Lifecycle Management (ILM/DSL): Tools like Index Lifecycle Management (ILM) or the newer Data Stream Lifecycle automate tasks like moving old indices to cheaper hardware (hot/warm/cold tiers) and eventually deleting them based on retention policies.
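A minimal sketch of such a template, with hypothetical names; the empty data_stream object is what enables the feature for matching index names:
PUT _index_template/my-logs-template
{
  "index_patterns": ["my-logs-*"],
  "data_stream": {},
  "template": {
    "mappings": {
      "properties": {
        "@timestamp": { "type": "date" }
      }
    }
  }
}
Creating the data stream is then typically as simple as indexing the first document into a name matching my-logs-*.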
When to Use
- Ideal for: Logs, events, performance metrics, and security traces.
- Avoid for: Use cases requiring frequent updates to existing records (like a product catalog) or data that lacks a timestamp.
How does a data stream know when to roll over?
Data streams are typically managed by:
- Index Lifecycle Management (ILM)
- Data Stream Lifecycle (DSL) - newer concept
In cluster settings, data_streams.lifecycle.poll_interval defines how often Elasticsearch goes over each data stream, checks whether it is eligible for a rollover, and performs it if so.
To find this interval value, check the output of
GET _cluster/settings
By default, the GET _cluster/settings command only returns settings that have been manually overridden, so if we are using default values, we need to add ?include_defaults=true.
The default interval is 5 minutes, which can be verified by checking the cluster's default settings:
GET _cluster/settings?include_defaults=true&filter_path=defaults.data_streams.lifecycle.poll_interval
Output:
{
  "defaults": {
    "data_streams": {
      "lifecycle": {
        "poll_interval": "5m"
      }
    }
  }
}
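Since this is a dynamic cluster setting, it can be changed on the fly; the 1m value below is purely illustrative:
PUT _cluster/settings
{
  "persistent": {
    "data_streams.lifecycle.poll_interval": "1m"
  }
}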
On each poll, Elasticsearch rolls over the write index of a data stream if it fulfills the conditions defined by cluster.lifecycle.default.rollover. If we are using default cluster settings, we can check its default value:
GET _cluster/settings?include_defaults=true&filter_path=defaults.cluster.lifecycle
Output:
{
  "defaults": {
    "cluster": {
      "lifecycle": {
        "default": {
          "rollover": "max_age=auto,max_primary_shard_size=50gb,min_docs=1,max_primary_shard_docs=200000000"
        }
      }
    }
  }
}
- max_age=auto: Elasticsearch derives the maximum index age from the stream's retention period (explained below). With our 90-day retention this resolves to 7 days, which is why our indices are rolling over every week.
- max_primary_shard_size=50gb: Prevents shards from becoming too large and slow.
- min_docs=1: Prevents rollover while the write index is still empty.
- max_primary_shard_docs=200000000: A built-in limit to maintain search performance, even if the 50GB size hasn't been reached yet.
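These conditions can be overridden cluster-wide if needed; a hedged sketch, assuming the setting accepts the same comma-separated condition string shown in the defaults (here with max_age pinned to a fixed 7 days):
PUT _cluster/settings
{
  "persistent": {
    "cluster.lifecycle.default.rollover": "max_age=7d,max_primary_shard_size=50gb,min_docs=1,max_primary_shard_docs=200000000"
  }
}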
In our case max_age=auto which means Elasticsearch is using a dynamic rollover strategy based on our retention period. If we look at https://github.com/elastic/elasticsearch/blob/main/server/src/main/java/org/elasticsearch/action/admin/indices/rollover/RolloverConfiguration.java#L174-L195 we can see the comment:
/**
* When max_age is auto we’ll use the following retention dependent heuristics to compute the value of max_age:
* - If retention is null aka infinite (default), max_age will be 30 days
* - If retention is less than or equal to 1 day, max_age will be 1 hour
* - If retention is less than or equal to 14 days, max_age will be 1 day
* - If retention is less than or equal to 90 days, max_age will be 7 days
* - If retention is greater than 90 days, max_age will be 30 days
*/
So, the maximum age of a backing index before rollover depends on how long we want to keep data in the data stream overall. For example, if retention is 90 days, Elasticsearch will perform a rollover and create a new backing index every 7 days.
Instead of a single fixed value for every data stream, auto adjusts the rollover age to ensure that indices aren't kept too long or rolled over too frequently for their specific retention settings.
max_age=auto is a "smart" setting designed to prevent "small index bloat" while ensuring data is deleted on time. It ensures our max_age is always a fraction of our total retention so that we have several backing indices to delete sequentially as they expire.
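Because max_age=auto is derived from retention, shortening a stream's retention also shortens its rollover cadence. A sketch using the DSL lifecycle endpoint (hypothetical stream name); per the heuristics above, a 10-day retention would make auto resolve to 1 day:
PUT _data_stream/my-logs/_lifecycle
{
  "data_retention": "10d"
}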
Data Stream Lifecycle (DSL)
This is a streamlined, automated alternative to the older Index Lifecycle Management (ILM).
While ILM focuses on "how" data is stored (tiers, hardware, merging), the lifecycle block focuses on "what" happens to the data based on business needs, primarily focusing on retention and automated optimization.
How to find out whether a data stream is managed by Index Lifecycle Management (ILM) or by Data Stream Lifecycle (DSL)?
Get the data stream's details and look at the template, lifecycle, next_generation_managed_by, and prefer_ilm attributes. Example:
GET _data_stream/ilm-history-7
Output snippet:
"template": "ilm-history-7",
"lifecycle": {
"enabled": true,
"data_retention": "90d",
"effective_retention": "90d",
"retention_determined_by": "data_stream_configuration"
},
"next_generation_managed_by": "Data stream lifecycle",
"prefer_ilm": true,
The lifecycle block in our data stream's index template refers to the Data Stream Lifecycle (DSL).
Inside that lifecycle block, we typically see these children:
- enabled (Boolean):
- Interpretation: Determines if Elasticsearch should actively manage this data stream using DSL.
- Behavior: When set to true, Elasticsearch automatically handles rollover (based on cluster defaults) and deletion (based on our retention settings). If the flag is omitted, it defaults to true.
- data_retention (String):
- Interpretation: The minimum amount of time Elasticsearch is guaranteed to store our data.
- Format: Uses time units like 90d (90 days), 30m (30 minutes), or 1h (1 hour).
- Behavior: This period is calculated starting from the moment a backing index is rolled over (it becomes "read-only"), not from its creation date.
- effective_retention
- This is the final calculated value that Elasticsearch actually uses to delete data.
- What it represents: It is the minimum amount of time our data is guaranteed to stay in the cluster after an index has rolled over.
- Why it might differ from our setting: We might set data_retention: "90d", but the cluster might have a global "max retention" or "default retention" policy that overrides our specific request.
- retention_determined_by
- This attribute identifies the source of the effective_retention value. Common values include:
- data_stream_configuration: The retention is coming directly from the data_retention we set in our index template or data stream.
- default_retention: We didn't specify a retention period, so Elasticsearch is using the cluster-wide default (e.g., data_streams.lifecycle.retention.default).
- max_retention: We tried to set a very long retention (e.g., 1 year), but a cluster admin has capped all streams at a lower value (e.g., 90 days) using data_streams.lifecycle.retention.max.
- downsampling (Object/Array):
- Interpretation: Configures the automatic reduction of time-series data resolution over time.
- Behavior: It defines when (e.g., after 7 days) and how (e.g., aggregate 1-minute metrics into 1-hour blocks) data should be condensed to save storage space while keeping historical trends searchable.
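Putting these attributes together, a minimal sketch of an index template whose streams would be fully DSL-managed (all names and values are illustrative):
PUT _index_template/my-metrics-template
{
  "index_patterns": ["my-metrics-*"],
  "data_stream": {},
  "template": {
    "lifecycle": {
      "enabled": true,
      "data_retention": "90d",
      "downsampling": [
        { "after": "7d", "fixed_interval": "1h" }
      ]
    }
  }
}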
Elasticsearch determines the final retention value using this priority:
- If a Max Retention is set on the cluster and our setting exceeds it, Max Retention wins.
- If we have configured Data Retention on the stream, it is used (as long as it's under the max).
- If we have not configured anything, the Default Retention for the cluster is used.
- If no defaults or maxes exist and we haven't set a value, retention is Infinite.
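Both the default and the cap are ordinary cluster settings, so an admin could set them like this (values illustrative):
PUT _cluster/settings
{
  "persistent": {
    "data_streams.lifecycle.retention.default": "30d",
    "data_streams.lifecycle.retention.max": "90d"
  }
}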
What if data retention is set both in the lifecycle block (DSL) and in an ILM policy associated with the data stream's index template?
If retention-related settings appear in both the lifecycle block and the settings block of an index template, each backing index is still managed by exactly one of the two systems, and the index setting index.lifecycle.prefer_ilm decides which. It defaults to true, so while a valid ILM policy is attached, ILM takes precedence and the DSL retention is not applied to those indices. Only when prefer_ilm is false, or when no ILM policy applies, does Elasticsearch manage retention exclusively through the DSL background process, ignoring any traditional ILM "Delete" phase settings.
If a data stream's index template carries both a lifecycle block and an ILM policy, like:
"settings": {
  "index.lifecycle.name": "my-ilm-policy"
}
...then:
- With the default prefer_ilm: true, the ILM policy wins: Elasticsearch manages the backing indices through ILM and the lifecycle block stays dormant for them.
- With prefer_ilm: false, the lifecycle block wins: Elasticsearch prioritizes the Data Stream Lifecycle (DSL) for retention and rollover, and the ILM policy is bypassed. The per-index managed_by field returned by GET _data_stream shows which system is actually in charge of each backing index.
If we have a custom setting in the settings block, such as a metadata field or a legacy retention setting like index.lifecycle.retention, it is ignored by the lifecycle logic: DSL only looks at the lifecycle object. Any other setting is treated as a static index setting and will not trigger the deletion of indices.
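To deliberately hand a stream over to DSL while an ILM policy is still attached, prefer_ilm can be disabled in the template; a sketch with hypothetical names:
PUT _index_template/my-logs-template
{
  "index_patterns": ["my-logs-*"],
  "data_stream": {},
  "template": {
    "settings": {
      "index.lifecycle.name": "my-ilm-policy",
      "index.lifecycle.prefer_ilm": false
    },
    "lifecycle": {
      "data_retention": "30d"
    }
  }
}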