
Query lookback enforcement doesn't return full dataset when federating across tenants of different retention periods #9727

Open
jonDufty opened this issue Oct 23, 2024 · 1 comment



jonDufty commented Oct 23, 2024

Describe the bug

After upgrading to Mimir 2.14 we found that all of our queries longer than 7 days were only returning 7 days' worth of data. We run a multi-tenanted system with different-sized tenants and different retention periods, typically 7 days, 1 month, and 1 year. Users of the platform need to be able to query across multiple tenants, so our default Mimir datasource uses query federation to do this.

We found that querying one of the large tenants in isolation for a longer period (e.g. 30 days) would return the full data set, but when querying across all tenants, we would only get 7 days' worth of data.

I believe this behaviour is a result of #8388, where the query frontend takes the minimum retention period across all tenants and then the minimum of that and the max query lookback. So when we included the smaller tenants with a 7-day retention period, the max lookback for any query became 7 days, even though the query also covered larger tenants holding more than 7 days of data.
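
To illustrate, here is a minimal Go sketch of the resolution logic as I understand it from that PR; the function and variable names are illustrative, not the actual Mimir identifiers:

```go
package main

import (
	"fmt"
	"time"
)

// effectiveLookback mimics the behaviour described above (illustrative only):
// the smallest positive per-tenant retention wins, and the result is further
// capped by max_query_lookback if one is set.
func effectiveLookback(retentions []time.Duration, maxQueryLookback time.Duration) time.Duration {
	lookback := maxQueryLookback
	for _, r := range retentions {
		if r > 0 && (lookback == 0 || r < lookback) {
			lookback = r
		}
	}
	return lookback
}

func main() {
	// Tenants with 1y, 4w and 1w retention (as in the overrides below), no max_query_lookback set.
	retentions := []time.Duration{365 * 24 * time.Hour, 4 * 7 * 24 * time.Hour, 7 * 24 * time.Hour}
	fmt.Println(effectiveLookback(retentions, 0)) // 168h0m0s, i.e. 7 days for the whole federated query
}
```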

To Reproduce

Steps to reproduce the behavior:

  1. Have multiple tenants with varying retention periods, e.g. 1h and 1d
  2. Query Mimir for a duration greater than the smallest retention period (i.e. 1d), and only specify the largest tenant. It should return a full day of data
  3. Run the query again but query both tenants. It should only return 1h of data (a request sketch follows the list)
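
For concreteness, here is a sketch of steps 2 and 3 against the query-frontend's Prometheus-compatible API. The endpoint URL and the tenant IDs (`large`, `small`) are placeholders; tenant federation is selected via a pipe-separated X-Scope-OrgID header:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"time"
)

// query runs a range query over the last `lookback` against Mimir's
// Prometheus-compatible API. The endpoint and tenant IDs are placeholders.
func query(tenants string, lookback time.Duration) (string, error) {
	now := time.Now()
	params := url.Values{}
	params.Set("query", "up")
	params.Set("start", fmt.Sprint(now.Add(-lookback).Unix()))
	params.Set("end", fmt.Sprint(now.Unix()))
	params.Set("step", "300")

	req, err := http.NewRequest(http.MethodGet,
		"http://mimir-query-frontend:8080/prometheus/api/v1/query_range?"+params.Encode(), nil)
	if err != nil {
		return "", err
	}
	// Tenant federation: pipe-separated tenant IDs in X-Scope-OrgID.
	req.Header.Set("X-Scope-OrgID", tenants)

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	// Step 2: only the largest tenant -> a full day of data is returned.
	_, _ = query("large", 24*time.Hour)
	// Step 3: both tenants federated -> only 1h of data is returned.
	_, _ = query("large|small", 24*time.Hour)
}
```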

Expected behavior

When querying across multiple tenants, I would expect the max lookback to be the largest retention period among the queried tenants, to avoid truncating any results.
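
A sketch of the resolution I would expect instead (again illustrative, not the actual Mimir code): cap the lookback by the largest per-tenant retention so the biggest tenant is not truncated, with per-tenant retention still enforced for each tenant downstream.

```go
package main

import (
	"fmt"
	"time"
)

// expectedLookback: the largest per-tenant retention, still capped by
// max_query_lookback if one is configured (illustrative sketch only).
func expectedLookback(retentions []time.Duration, maxQueryLookback time.Duration) time.Duration {
	var lookback time.Duration
	for _, r := range retentions {
		if r > lookback {
			lookback = r
		}
	}
	if maxQueryLookback > 0 && (lookback == 0 || maxQueryLookback < lookback) {
		lookback = maxQueryLookback
	}
	return lookback
}

func main() {
	retentions := []time.Duration{365 * 24 * time.Hour, 4 * 7 * 24 * time.Hour, 7 * 24 * time.Hour}
	fmt.Println(expectedLookback(retentions, 0)) // 8760h0m0s, i.e. 1 year
}
```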

Environment

  • Infrastructure: Kubernetes EKS
  • Deployment tool: helm + jsonnet
  • Mimir version: 2.14

Mimir config

Main Config
  compactor:
    compaction_concurrency: 2
    compaction_interval: "30m"
    data_dir: "/data"
    deletion_delay: "2h"
    first_level_compaction_wait_period: "25m"
    max_closing_blocks_concurrency: 2
    max_opening_blocks_concurrency: 4
    sharding_ring:
      heartbeat_period: "1m"
      heartbeat_timeout: "4m"
      wait_stability_min_duration: "1m"
    symbols_flushers_concurrency: 4
  frontend:
    cache_results: true
    log_queries_longer_than: "5s"
    max_outstanding_per_tenant: 4096
    parallelize_shardable_queries: true
    query_sharding_target_series_per_shard: 2500
    results_cache:
      backend: "memcached"
      memcached:
        max_item_size: 1048576
        timeout: "500ms"
  frontend_worker:
    grpc_client_config:
      max_send_msg_size: 419430400
  limits:
    align_queries_with_step: true
    cardinality_analysis_enabled: true
    max_cache_freshness: "15m"
    max_query_parallelism: 400
    max_total_query_length: "12000h"
    native_histograms_ingestion_enabled: true
    out_of_order_time_window: "15m"
    query_sharding_max_sharded_queries: 640
    query_sharding_total_shards: 32
  querier:
    max_concurrent: 20
    timeout: "2m"
  query_scheduler:
    max_outstanding_requests_per_tenant: 4096
  runtime_config:
    file: "/var/mimir/runtime.yaml"
  server:
    grpc_server_max_recv_msg_size: 524288000
    grpc_server_max_send_msg_size: 524288000
    log_format: "json"
  store_gateway:
    sharding_ring:
      heartbeat_period: "1m"
      heartbeat_timeout: "4m"
      kvstore:
        prefix: "multi-zone/"
      tokens_file_path: "/data/tokens"
      unregister_on_shutdown: false
      wait_stability_min_duration: "1m"
      zone_awareness_enabled: true
  tenant_federation:
    enabled: true
  usage_stats:
    enabled: false
    installation_mode: "helm"
Runtime/Tenant Config
Indicative example of our tenant setup:
overrides:
      large:
        compactor_blocks_retention_period: "1y"
        compactor_split_and_merge_shards: 6
        compactor_split_groups: 12
        ingestion_rate: 30000000
        max_fetched_chunks_per_query: 10000000
        max_label_names_per_series: 50
      medium:
        compactor_blocks_retention_period: "4w"
        ingestion_rate: 100000
        max_label_names_per_series: 50
      small:
        compactor_blocks_retention_period: "1w"
        ingestion_rate: 100000
        max_label_names_per_series: 50

Additional context

The specific line of code in question is here: https://github.com/grafana/mimir/pull/8388/files#diff-92de40a72c3f7eb8744de54750b3f1279255c1e494ca560878a6407a4c46e3e9R133


bcrisp4 commented Nov 8, 2024

+1 - also seeing this behavior and it is causing problems in my multi-tenanted system.
