
Query lookback enforcement doesn't return full dataset when federating across tenants of different retention periods #9727

Open
jonDufty opened this issue Oct 23, 2024 · 1 comment



jonDufty commented Oct 23, 2024

Describe the bug

After upgrading to Mimir 2.14 we found that all of our queries longer than 7 days were only returning 7 days' worth of data. We run a multi-tenanted system with different-sized tenants and different retention periods, typically 7 days, 1 month, and 1 year. Users of the platform need to be able to query across multiple tenants, so our default Mimir datasource uses query federation to do this.

We found that querying one of the large tenants in isolation for a longer period (e.g. 30 days) would return the full data set, but when querying across all tenants, we would only get 7 days' worth of data.

I believe this behaviour is a result of #8388, where the query frontend takes the minimum retention period across all tenants and then the minimum of that and the max query lookback. So when we included the smaller tenants with a 7-day retention period, the max lookback for any query became 7 days, even though the query also covered larger tenants holding more than 7 days of data.
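
To illustrate, here is a minimal Go sketch of the resolution logic as I understand it from that PR; the function and variable names are illustrative, not the actual Mimir identifiers:

```go
package main

import (
	"fmt"
	"time"
)

// effectiveLookback mimics the behaviour described above (illustrative only):
// the smallest positive per-tenant retention wins, and the result is further
// capped by max_query_lookback if one is set.
func effectiveLookback(retentions []time.Duration, maxQueryLookback time.Duration) time.Duration {
	lookback := maxQueryLookback
	for _, r := range retentions {
		if r > 0 && (lookback == 0 || r < lookback) {
			lookback = r
		}
	}
	return lookback
}

func main() {
	// Tenants with 1y, 4w and 1w retention (as in the overrides below), no max_query_lookback set.
	retentions := []time.Duration{365 * 24 * time.Hour, 4 * 7 * 24 * time.Hour, 7 * 24 * time.Hour}
	fmt.Println(effectiveLookback(retentions, 0)) // 168h0m0s, i.e. 7 days for the whole federated query
}
```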

To Reproduce

Steps to reproduce the behavior:

  1. Have multiple tenants with varying retention periods, e.g. 1h and 1d
  2. Query Mimir for a duration greater than the smallest retention period (i.e. 1d), and only specify the largest tenant. It should return a full day of data
  3. Run the query again but query both tenants. It should only return 1h of data (a request sketch follows the list)
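
For concreteness, here is a sketch of steps 2 and 3 against the query-frontend's Prometheus-compatible API. The endpoint URL and the tenant IDs (`large`, `small`) are placeholders; tenant federation is selected via a pipe-separated X-Scope-OrgID header:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"time"
)

// query runs a range query over the last `lookback` against Mimir's
// Prometheus-compatible API. The endpoint and tenant IDs are placeholders.
func query(tenants string, lookback time.Duration) (string, error) {
	now := time.Now()
	params := url.Values{}
	params.Set("query", "up")
	params.Set("start", fmt.Sprint(now.Add(-lookback).Unix()))
	params.Set("end", fmt.Sprint(now.Unix()))
	params.Set("step", "300")

	req, err := http.NewRequest(http.MethodGet,
		"http://mimir-query-frontend:8080/prometheus/api/v1/query_range?"+params.Encode(), nil)
	if err != nil {
		return "", err
	}
	// Tenant federation: pipe-separated tenant IDs in X-Scope-OrgID.
	req.Header.Set("X-Scope-OrgID", tenants)

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	// Step 2: only the largest tenant -> a full day of data is returned.
	_, _ = query("large", 24*time.Hour)
	// Step 3: both tenants federated -> only 1h of data is returned.
	_, _ = query("large|small", 24*time.Hour)
}
```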

Expected behavior

When querying across multiple tenants, I would expect the max lookback to be the largest retention period among the queried tenants, to avoid truncating any results.
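
A sketch of the resolution I would expect instead (again illustrative, not the actual Mimir code): cap the lookback by the largest per-tenant retention so the biggest tenant is not truncated, with per-tenant retention still enforced for each tenant downstream.

```go
package main

import (
	"fmt"
	"time"
)

// expectedLookback: the largest per-tenant retention, still capped by
// max_query_lookback if one is configured (illustrative sketch only).
func expectedLookback(retentions []time.Duration, maxQueryLookback time.Duration) time.Duration {
	var lookback time.Duration
	for _, r := range retentions {
		if r > lookback {
			lookback = r
		}
	}
	if maxQueryLookback > 0 && (lookback == 0 || maxQueryLookback < lookback) {
		lookback = maxQueryLookback
	}
	return lookback
}

func main() {
	retentions := []time.Duration{365 * 24 * time.Hour, 4 * 7 * 24 * time.Hour, 7 * 24 * time.Hour}
	fmt.Println(expectedLookback(retentions, 0)) // 8760h0m0s, i.e. 1 year
}
```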

Environment

  • Infrastructure: Kubernetes EKS
  • Deployment tool: helm + jsonnet
  • Mimir version: 2.14

Mimir config

Main Config
  compactor:
    compaction_concurrency: 2
    compaction_interval: "30m"
    data_dir: "/data"
    deletion_delay: "2h"
    first_level_compaction_wait_period: "25m"
    max_closing_blocks_concurrency: 2
    max_opening_blocks_concurrency: 4
    sharding_ring:
      heartbeat_period: "1m"
      heartbeat_timeout: "4m"
      wait_stability_min_duration: "1m"
    symbols_flushers_concurrency: 4
  frontend:
    cache_results: true
    log_queries_longer_than: "5s"
    max_outstanding_per_tenant: 4096
    parallelize_shardable_queries: true
    query_sharding_target_series_per_shard: 2500
    results_cache:
      backend: "memcached"
      memcached:
        max_item_size: 1048576
        timeout: "500ms"
  frontend_worker:
    grpc_client_config:
      max_send_msg_size: 419430400
  limits:
    align_queries_with_step: true
    cardinality_analysis_enabled: true
    max_cache_freshness: "15m"
    max_query_parallelism: 400
    max_total_query_length: "12000h"
    native_histograms_ingestion_enabled: true
    out_of_order_time_window: "15m"
    query_sharding_max_sharded_queries: 640
    query_sharding_total_shards: 32
  querier:
    max_concurrent: 20
    timeout: "2m"
  query_scheduler:
    max_outstanding_requests_per_tenant: 4096
  runtime_config:
    file: "/var/mimir/runtime.yaml"
  server:
    grpc_server_max_recv_msg_size: 524288000
    grpc_server_max_send_msg_size: 524288000
    log_format: "json"
  store_gateway:
    sharding_ring:
      heartbeat_period: "1m"
      heartbeat_timeout: "4m"
      kvstore:
        prefix: "multi-zone/"
      tokens_file_path: "/data/tokens"
      unregister_on_shutdown: false
      wait_stability_min_duration: "1m"
      zone_awareness_enabled: true
  tenant_federation:
    enabled: true
  usage_stats:
    enabled: false
    installation_mode: "helm"
Runtime/Tenant Config
Indicative example of our tenant setup:
overrides:
      large:
        compactor_blocks_retention_period: "1y"
        compactor_split_and_merge_shards: 6
        compactor_split_groups: 12
        ingestion_rate: 30000000
        max_fetched_chunks_per_query: 10000000
        max_label_names_per_series: 50
      medium:
        compactor_blocks_retention_period: "4w"
        ingestion_rate: 100000
        max_label_names_per_series: 50
      small:
        compactor_blocks_retention_period: "1w"
        ingestion_rate: 100000
        max_label_names_per_series: 50

Additional context

The specific line of code in question is here: https://github.com/grafana/mimir/pull/8388/files#diff-92de40a72c3f7eb8744de54750b3f1279255c1e494ca560878a6407a4c46e3e9R133


bcrisp4 commented Nov 8, 2024

+1 - also seeing this behavior and it is causing problems in my multi-tenanted system.
