Crashing with "panic: failed to migrate ds" #4588

Open
shashank42 opened this issue Apr 18, 2024 · 19 comments


@shashank42

shashank42 commented Apr 18, 2024

Describe the bug

Deployed using https://github.com/rudderlabs/rudderstack-helm
The RudderStack backend keeps crashing and restarting with the following error:

Screenshot 2024-04-18 at 12 19 59 AM
Defaulted container "rudderstack-backend" out of: rudderstack-backend, rudderstack-telegraf-sidecar
Executing docker entrypoint script
Computed db host to namaste-rudderstack-rudderstack-postgresql-0.namaste-rudderstack-rudderstack-postgresql-headless
2024/04/18 06:57:30 maxprocs: Updating GOMAXPROCS=1: determined from CPU quota
INFO: No .env file found.
2024-04-18T06:57:30.671Z	INFO	rudder-server/main.go:27	Setting memory limit to	{"limit": 1073741824}
2024-04-18T06:57:30.673Z	INFO	stats	stats/statsd.go:44	StatsD client setup succeeded.
2024-04-18T06:57:30.673Z	INFO	stats	stats/statsd.go:89	Stats started successfully in mode "StatsD" with address "localhost:8125"
2024-04-18T06:57:30.675Z	INFO	app	app/app.go:55	Open source version of rudder-server
2024-04-18T06:57:30.681Z	INFO	backend-config	backend-config/single_workspace.go:60	Setup backend config complete	{"workspaceId": ""}
2024-04-18T06:57:33.131Z	INFO	runner	runner/runner.go:344	warehousemode embedded
2024-04-18T06:57:33.133Z	INFO	db.recovery	db/embeddedRecovery.go:42	Starting in normal mode
2024-04-18T06:57:33.479Z	INFO	backend-config	backend-config/backend-config.go:215	Workspace Config changed{"workspaces": 1, "sources": 6}
2024-04-18T06:57:36.430Z	INFO	runner.warehouse.notifier	notifier/notifier.go:165	Initializing Notifier...
2024-04-18T06:57:37.141Z	INFO	runner	runner/runner.go:344	warehousemode embedded
2024-04-18T06:57:37.142Z	INFO	admin	admin/admin.go:167	Serving on admin interface @ /data/rudderstack/rudder-server.sock
2024-04-18T06:57:37.142Z	INFO	apphandlers.EMBEDDED	apphandlers/embeddedAppHandler.go:87	Embedded mode: Starting Rudder Core
2024-04-18T06:57:37.143Z	INFO	apphandlers.EMBEDDED	apphandlers/embeddedAppHandler.go:95	Configured deployment type: "DEDICATED"
2024-04-18T06:57:37.143Z	INFO	apphandlers.EMBEDDED	apphandlers/embeddedAppHandler.go:105	Clearing DB false
2024-04-18T06:57:37.143Z	INFO	debugger.transformation	cache/cache.go:29	Using in-memory cache
2024-04-18T06:57:37.144Z	INFO	debugger.destination	cache/cache.go:29	Using in-memory cache
2024-04-18T06:57:37.145Z	INFO	debugger.source	cache/cache.go:29	Using in-memory cache
2024-04-18T06:57:37.146Z	INFO	runner.warehouse	warehouse/app.go:310	Starting Warehouse service...
2024-04-18T06:57:37.146Z	INFO	runner.warehouse	warehouse/app.go:375	Starting warehouse slave...
2024-04-18T06:57:37.146Z	INFO	runner.warehouse	warehouse/app.go:391	[WH]: Starting warehouse master...
2024-04-18T06:57:37.147Z	INFO	runner.warehouse.api	api/http.go:157	waiting for BackendConfig before starting on 8082
2024-04-18T06:57:37.147Z	INFO	runner.warehouse.api	api/http.go:142	Starting warehouse master service on8082
2024-04-18T06:57:37.148Z	INFO	runner.warehouse.notifier	notifier/notifier.go:312	Deleting all jobs for workspace: default::f597437212cb10a56115df53cd0f6881
2024-04-18T06:57:37.150Z	INFO	runner.warehouse	warehouse/app.go:475	Starting a new Warehouse Destination Router: RS
2024-04-18T06:57:37.150Z	INFO	runner.warehouse.router.RS	router/router.go:147	WH: Warehouse Router started: RS
2024-04-18T06:57:38.035Z	INFO	runner.warehouse.source	source/source.go:143	starting source jobs processing
2024-04-18T06:57:38.240Z	INFO	runner.warehouse.router.RS	router/router.go:254	Received updated workspace config
2024-04-18T06:57:38.635Z	INFO	transformer-features	transformer/features_impl.go:62	Fetching transformer features from http://namaste-rudderstack-transformer:9090
2024-04-18T06:57:39.030Z	INFO	jobsdb.gw	jobsdb/jobsdb.go:840	Connected to gw DB
2024-04-18T06:57:50.231Z	INFO	jobsdb.gw	jobsdb/jobsdb.go:840	Connected to gw DB
2024-04-18T06:57:56.737Z	INFO	jobsdb.rt	jobsdb/jobsdb.go:840	Connected to rt DB
2024-04-18T06:57:59.233Z	INFO	jobsdb.batch_rt	jobsdb/jobsdb.go:840	Connected to batch_rt DB
2024-04-18T06:58:54.035Z	INFO	jobsdb.proc_error	jobsdb/jobsdb.go:840	Connected to proc_error DB
2024-04-18T06:58:55.035Z	INFO	jobsdb.proc_error	jobsdb/jobsdb.go:840	Connected to proc_error DB
2024-04-18T06:58:57.340Z	INFO	jobsdb.esch	jobsdb/jobsdb.go:840	Connected to esch DB
2024-04-18T06:58:58.930Z	INFO	jobsdb.arc	jobsdb/jobsdb.go:840	Connected to arc DB
2024-04-18T06:59:00.036Z	INFO	apphandlers.EMBEDDED	apphandlers/setup.go:114	using Static Cluster Manager
2024-04-18T06:59:00.237Z	INFO	enterprise.suppress-user	suppress-user/factory.go:48	Suppress User feature is enterprise only
2024-04-18T06:59:00.238Z	WARN	rsources	rsources/handler.go:770	shared database is not configured, skipping logical replication monitoring
2024-04-18T06:59:00.239Z	INFO	gateway	gateway/handle_lifecycle.go:363	WebHandler waiting for BackendConfig before starting on 8080
2024-04-18T06:59:00.239Z	INFO	cluster	cluster/dynamic.go:100	Got trigger to change the mode, new mode: NORMAL, old mode: DEGRADED
2024-04-18T06:59:00.239Z	INFO	cluster	cluster/dynamic.go:218	Transiting the server from DegradedMode to NormalMode
2024-04-18T06:59:00.239Z	INFO	cluster	cluster/dynamic.go:123	Starting the server
2024-04-18T06:59:00.242Z	INFO	gateway	gateway/handle_lifecycle.go:195	BackendConfig initialised	{"sources": 6}
2024-04-18T06:59:00.242Z	INFO	gateway	gateway/handle_lifecycle.go:365	WebHandler Starting on 8080
2024-04-18T06:59:00.530Z	INFO	jobsdb.proc_error	jobsdb/backup.go:36	backupDSLoop is running
2024-04-18T06:59:01.737Z	INFO	jobsdb.gw	jobsdb/backup.go:36	backupDSLoop is running
2024-04-18T06:59:02.237Z	INFO	jobsdb.esch	jobsdb/backup.go:36	backupDSLoop is running
2024-04-18T06:59:03.536Z	INFO	jobsdb.arc	jobsdb/backup.go:36	backupDSLoop is running
2024-04-18T06:59:04.738Z	INFO	jobsdb.rt	jobsdb/backup.go:36	backupDSLoop is running
2024-04-18T06:59:13.138Z	INFO	jobsdb.batch_rt	jobsdb/backup.go:36	backupDSLoop is running
2024-04-18T07:08:49.633Z	INFO	archiver	archiver/archiver.go:99	Starting archiver
2024-04-18T07:08:49.635Z	INFO	processor	processor/processor.go:628	Starting processor in isolation mode: source
2024-04-18T07:08:49.637Z	INFO	processor	processor/processor.go:659	Starting pinger loop
2024-04-18T07:08:49.637Z	INFO	processor	processor/processor.go:661	Backend config received
2024-04-18T07:08:49.637Z	INFO	processor	processor/processor.go:664	Waiting for async init group
2024-04-18T07:08:49.637Z	INFO	processor	processor/processor.go:671	Async init group done
2024-04-18T07:08:49.637Z	INFO	processor	processor/processor.go:674	Waiting for transformer features
2024-04-18T07:08:49.637Z	INFO	processor	processor/processor.go:681	Transformer features received
2024-04-18T07:08:49.830Z	INFO	processor.stash	stash/stash.go:296	Processor errors stash loop started
2024-04-18T07:09:43.545Z	ERROR	utils.misc	misc/misc.go:926	Panic detected. Application will crash.{"stack": "goroutine 5537 [running]:\nruntime/debug.Stack()\n\t/usr/local/go/src/runtime/debug/stack.go:24 +0x5e\ngithub.com/rudderlabs/rudder-server/utils/misc.BugsnagNotify.func1.1()\n\t/rudder-server/utils/misc/misc.go:927 +0x37f\nsync.(*Once).doSlow(0x4778d2?, 0xc002607d78?)\n\t/usr/local/go/src/sync/once.go:74 +0xc2\nsync.(*Once).Do(...)\n\t/usr/local/go/src/sync/once.go:65\ngithub.com/rudderlabs/rudder-server/utils/misc.BugsnagNotify.func1()\n\t/rudder-server/utils/misc/misc.go:916 +0xbb\npanic({0x34155c0?, 0xc006a39540?})\n\t/usr/local/go/src/runtime/panic.go:770 +0x132\ngithub.com/rudderlabs/rudder-server/jobsdb.(*Handle).migrateDSLoop(0xc000cdd188, {0x4377970, 0xc0025f6ff0})\n\t/rudder-server/jobsdb/migration.go:54 +0x1ff\ngithub.com/rudderlabs/rudder-server/jobsdb.(*Handle).startMigrateDSLoop.func1()\n\t/rudder-server/jobsdb/migration.go:28 +0x1f\ngithub.com/rudderlabs/rudder-server/jobsdb.(*Handle).startMigrateDSLoop.WithBugsnag.func2()\n\t/rudder-server/utils/misc/misc.go:945 +0x5c\ngolang.org/x/sync/errgroup.(*Group).Go.func1()\n\t/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:78 +0x56\ncreated by golang.org/x/sync/errgroup.(*Group).Go in goroutine 2162\n\t/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:75 +0x96\n", "panic": "failed to migrate ds: pq: canceling statement due to user request; driver: bad connection", "team": "Core", "goRoutines": 685, "version": "1.23.0", "releaseStage": "development"}
2024-04-18T07:09:43.545Z	ERROR	utils.misc	misc/misc.go:926	goroutine 5537 [running]:
github.com/rudderlabs/rudder-go-kit/logger.(*logger).Fatalw(0xc00128f9e0, {0x3a85163?, 0x4c6?}, {0xc0011ae600?, 0x1?, 0x34155c0?})
	/go/pkg/mod/github.com/rudderlabs/[email protected]/logger/logger.go:332 +0xa5
github.com/rudderlabs/rudder-server/utils/misc.BugsnagNotify.func1.1()
	/rudder-server/utils/misc/misc.go:926 +0x5b3
sync.(*Once).doSlow(0x4778d2?, 0xc002607d78?)
	/usr/local/go/src/sync/once.go:74 +0xc2
sync.(*Once).Do(...)
	/usr/local/go/src/sync/once.go:65
github.com/rudderlabs/rudder-server/utils/misc.BugsnagNotify.func1()
	/rudder-server/utils/misc/misc.go:916 +0xbb
panic({0x34155c0?, 0xc006a39540?})
	/usr/local/go/src/runtime/panic.go:770 +0x132
github.com/rudderlabs/rudder-server/jobsdb.(*Handle).migrateDSLoop(0xc000cdd188, {0x4377970, 0xc0025f6ff0})
	/rudder-server/jobsdb/migration.go:54 +0x1ff
github.com/rudderlabs/rudder-server/jobsdb.(*Handle).startMigrateDSLoop.func1()
	/rudder-server/jobsdb/migration.go:28 +0x1f
github.com/rudderlabs/rudder-server/jobsdb.(*Handle).startMigrateDSLoop.WithBugsnag.func2()
	/rudder-server/utils/misc/misc.go:945 +0x5c
golang.org/x/sync/errgroup.(*Group).Go.func1()
	/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:78 +0x56
created by golang.org/x/sync/errgroup.(*Group).Go in goroutine 2162
	/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:75 +0x96

2024/04/18 07:09:43 notifying bugsnag: failed to migrate ds: pq: canceling statement due to user request; driver: bad connection
panic: failed to migrate ds: pq: canceling statement due to user request; driver: bad connection [recovered]
	panic: failed to migrate ds: pq: canceling statement due to user request; driver: bad connection [recovered]
	panic: failed to migrate ds: pq: canceling statement due to user request; driver: bad connection

goroutine 5537 [running]:
github.com/bugsnag/bugsnag-go/v2.AutoNotify({0xc00740dac0, 0x3, 0xc001969500?})
	/go/pkg/mod/github.com/bugsnag/bugsnag-go/[email protected]/bugsnag.go:114 +0x3ef
panic({0x34155c0?, 0xc006a39540?})
	/usr/local/go/src/runtime/panic.go:770 +0x132
github.com/rudderlabs/rudder-server/utils/misc.BugsnagNotify.func1.1()
	/rudder-server/utils/misc/misc.go:935 +0x5d4
sync.(*Once).doSlow(0x4778d2?, 0xc002607d78?)
	/usr/local/go/src/sync/once.go:74 +0xc2
sync.(*Once).Do(...)
	/usr/local/go/src/sync/once.go:65
github.com/rudderlabs/rudder-server/utils/misc.BugsnagNotify.func1()
	/rudder-server/utils/misc/misc.go:916 +0xbb
panic({0x34155c0?, 0xc006a39540?})
	/usr/local/go/src/runtime/panic.go:770 +0x132
github.com/rudderlabs/rudder-server/jobsdb.(*Handle).migrateDSLoop(0xc000cdd188, {0x4377970, 0xc0025f6ff0})
	/rudder-server/jobsdb/migration.go:54 +0x1ff
github.com/rudderlabs/rudder-server/jobsdb.(*Handle).startMigrateDSLoop.func1()
	/rudder-server/jobsdb/migration.go:28 +0x1f
github.com/rudderlabs/rudder-server/jobsdb.(*Handle).startMigrateDSLoop.WithBugsnag.func2()
	/rudder-server/utils/misc/misc.go:945 +0x5c
golang.org/x/sync/errgroup.(*Group).Go.func1()
	/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:78 +0x56
created by golang.org/x/sync/errgroup.(*Group).Go in goroutine 2162
	/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:75 +0x96

Meanwhile, the Postgres machine logs the following errors:

2024-04-18 07:57:09.839 UTC [16135] STATEMENT:  DELETE FROM "batch_rt_job_status_435_2"
							WHERE NOT id = ANY(
								SELECT DISTINCT ON (job_id) id from "batch_rt_job_status_435_2" ORDER BY job_id ASC, id DESC
							)
2024-04-18 07:57:10.730 UTC [16111] LOG:  unexpected EOF on client connection with an open transaction
2024-04-18 07:57:11.328 UTC [18493] LOG:  could not receive data from client: Connection reset by peer
2024-04-18 07:57:11.347 UTC [18494] LOG:  could not receive data from client: Connection reset by peer
2024-04-18 07:57:11.429 UTC [18490] LOG:  could not receive data from client: Connection reset by peer
2024-04-18 07:57:11.441 UTC [18488] LOG:  could not send data to client: Broken pipe
2024-04-18 07:57:11.453 UTC [18488] FATAL:  connection to client lost
2024-04-18 07:57:11.538 UTC [18491] LOG:  could not receive data from client: Connection reset by peer
2024-04-18 07:57:11.543 UTC [18435] LOG:  could not send data to client: Connection reset by peer
2024-04-18 07:57:11.543 UTC [18435] STATEMENT:  SELECT * FROM (SELECT
										jobs.job_id, jobs.uuid, jobs.user_id, jobs.parameters, jobs.custom_val, jobs.event_payload, jobs.event_count,
										jobs.created_at, jobs.expire_at, jobs.workspace_id,
										pg_column_size(jobs.event_payload) as payload_size,
										sum(jobs.event_count) over (order by jobs.job_id asc) as running_event_counts,
										sum(pg_column_size(jobs.event_payload)) over (order by jobs.job_id) as running_payload_size,
										job_latest_state.job_state, job_latest_state.attempt,
										job_latest_state.exec_time, job_latest_state.retry_time,
										job_latest_state.error_code, job_latest_state.error_response, job_latest_state.parameters
									FROM
										"gw_jobs_545" AS jobs
										LEFT JOIN "gw_job_status_545" job_latest_state ON jobs.job_id=job_latest_state.job_id
									    WHERE ((job_latest_state.job_id IS NULL)) AND ((jobs.custom_val='GW')) AND (jobs.parameters->>'source_id'='2dPZSbNyzCt11TNn88HyTi3hF7N')
										ORDER BY jobs.job_id  LIMIT 772 ) t WHERE running_event_counts - t.event_count <= $1 AND running_payload_size - t.payload_size <= $2
2024-04-18 07:57:11.543 UTC [18435] FATAL:  connection to client lost
2024-04-18 07:57:11.543 UTC [18435] STATEMENT:  SELECT * FROM (SELECT
										jobs.job_id, jobs.uuid, jobs.user_id, jobs.parameters, jobs.custom_val, jobs.event_payload, jobs.event_count,
										jobs.created_at, jobs.expire_at, jobs.workspace_id,
										pg_column_size(jobs.event_payload) as payload_size,
										sum(jobs.event_count) over (order by jobs.job_id asc) as running_event_counts,
										sum(pg_column_size(jobs.event_payload)) over (order by jobs.job_id) as running_payload_size,
										job_latest_state.job_state, job_latest_state.attempt,
										job_latest_state.exec_time, job_latest_state.retry_time,
										job_latest_state.error_code, job_latest_state.error_response, job_latest_state.parameters
									FROM
										"gw_jobs_545" AS jobs
										LEFT JOIN "gw_job_status_545" job_latest_state ON jobs.job_id=job_latest_state.job_id
									    WHERE ((job_latest_state.job_id IS NULL)) AND ((jobs.custom_val='GW')) AND (jobs.parameters->>'source_id'='2dPZSbNyzCt11TNn88HyTi3hF7N')
										ORDER BY jobs.job_id  LIMIT 772 ) t WHERE running_event_counts - t.event_count <= $1 AND running_payload_size - t.payload_size <= $2
2024-04-18 07:57:11.932 UTC [18470] LOG:  unexpected EOF on client connection with an open transaction
2024-04-18 07:57:12.031 UTC [18499] LOG:  incomplete startup packet
2024-04-18 07:57:17.140 UTC [18514] ERROR:  relation "rsources_failed_keys_v2" already exists
2024-04-18 07:57:17.140 UTC [18514] STATEMENT:  create table "rsources_failed_keys_v2" (
			id VARCHAR(27) COLLATE "C",
			db_name text not null default 'namaste-rudderstack-rudderstack-postgresql-0.namaste-rudderstack-rudderstack-postgresql-headless',
			job_run_id text not null,
			task_run_id text not null,
			source_id text not null,
			destination_id text not null,
			primary key (id),
			unique (job_run_id, task_run_id, source_id, destination_id, db_name)
		)
2024-04-18 07:57:17.143 UTC [18514] ERROR:  relation "rsources_failed_keys_v2_records" already exists
2024-04-18 07:57:17.143 UTC [18514] STATEMENT:  create table "rsources_failed_keys_v2_records" (
			id VARCHAR(27) COLLATE "C",
			record_id text not null,
	    	ts timestamp not null default NOW(),
			primary key (id, record_id)
		)
2024-04-18 07:57:17.338 UTC [18514] ERROR:  relation "rsources_stats" already exists
2024-04-18 07:57:17.338 UTC [18514] STATEMENT:  create table "rsources_stats" (
			db_name text not null default 'namaste-rudderstack-rudderstack-postgresql-0.namaste-rudderstack-rudderstack-postgresql-headless',
			job_run_id text not null,
			task_run_id text not null,
			source_id text not null,
			destination_id text not null,
			in_count integer not null default 0,
			out_count integer not null default 0,
			failed_count integer not null default 0,
			ts timestamp not null default NOW(),
			primary key (db_name, job_run_id, task_run_id, source_id, destination_id)
		)
2024-04-18 07:57:20.533 UTC [18425] ERROR:  canceling autovacuum task
2024-04-18 07:57:20.533 UTC [18425] CONTEXT:  automatic vacuum of table "jobsdb.public.gw_job_status_511_1"
2024-04-18 07:57:21.128 UTC [18148] LOG:  unexpected EOF on client connection with an open transaction

Meanwhile, there are no errors on the transformer:

➜  rudderstack-helm git:(master) ✗ kubectl logs --tail=10000 -f namaste-rudderstack-transformer-f45745c8f-8psss

> [email protected] start
> cd dist;node ./src/index.js;cd ..

setting up statsd client
Stack Depth set: 64
Interval Bytes set: 524288
Transformer: Swagger route loading
Transformer: Swagger route loaded
{"extraData":{},"level":"info","message":"Using new routes","timestamp":"2024-04-18 06:48:30"}
Master (pid: 19) has started
Worker (pid: 26) is online
setting up statsd client
Stack Depth set: 64
Interval Bytes set: 524288
Transformer: Swagger route loading
Transformer: Swagger route loaded
{"extraData":{},"level":"info","message":"Using new routes","timestamp":"2024-04-18 06:48:33"}
Worker (pid: 26) has started
App started. Listening on port: 9090
@gitcommitshow
Collaborator

Hi @shashank42, thank you for reporting the issue. I will get back to you soon.

@gitcommitshow
Collaborator

@shashank42 can you please share the version number of rudder-server you're using?

@shashank42
Author

For the current deployment

    Image:         rudderlabs/rudder-server:latest
    Image ID:      docker.io/rudderlabs/rudder-server@sha256:6ed8c0155ed821dfded3ba8efd6909b313989684cacfe210322f4131ecdbbe77

Here is the repo I am using: https://github.com/glip-gg/rudderstack-helm

@shashank42
Author

I think the issue is similar to the one discussed in the thread below. I have taken the steps described there, i.e. deleting all the tables inside jobsdb, but the errors have not stopped:
https://rudderstack.slack.com/archives/C01E4PLB135/p1692320392539509?thread_ts=1689977868.588139&cid=C01E4PLB135

@gitcommitshow
Collaborator

@shashank42 does the issue persist when you restart the postgres db pod?
The shared logs are not enough to draw any conclusion. It seems the migration loop's statement times out, and from the db logs it looks like the database is overloaded.
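One generic way to confirm this on the Postgres side (not specific to rudder-server) is to look at long-running, non-idle statements while the server is up; a migration DELETE stuck here for minutes would match the "canceling statement due to user request" panic above:

    -- long-running, non-idle statements, longest first
    SELECT pid,
           state,
           wait_event_type,
           now() - query_start AS runtime,
           left(query, 80)     AS query
    FROM pg_stat_activity
    WHERE state <> 'idle'
    ORDER BY runtime DESC NULLS LAST;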

@shashank42
Author

I just did a new deployment on a new cluster to check the fresh-setup case. There are still restarts, but less frequent.

Screenshot 2024-05-05 at 6 25 55 PM

Postgres doesn't seem to run out of memory at any point.

@gitcommitshow
Collaborator

@shashank42 can you check the resource (CPU, RAM) usage before and after you start the server? Please also mention what resources (CPU, memory) you have provisioned for Postgres.

Context: with the limited info, we are guessing that enough resources might not have been allocated (assuming the server is not overloaded with an unusual number of requests), which causes a query timeout.

@gitcommitshow
Collaborator

@shashank42 can you please help with more information as mentioned above?

@Digital365Staking

The error logs indicate that the RudderStack backend is crashing and restarting. This could be due to a variety of reasons, including issues with the database connection, memory limits, or problems with the Docker entrypoint script. Here are some steps you can take to troubleshoot this issue:

1. Check the Database Connection: The log message Computed db host to namaste-rudderstack-rudderstack-postgresql-0.namaste-rudderstack-rudderstack-postgresql-headless suggests that the application is trying to connect to a PostgreSQL database. Ensure that the database is running and accessible from the RudderStack backend. If the database is running on a different server, check that the network connection between the two servers is working correctly.
2. Check the Docker Entrypoint Script: The message Executing docker entrypoint script suggests that there might be an issue with the Docker entrypoint script. Make sure the script is correctly formatted and executable. If the script is not executable, you can make it executable by running chmod +x /path/to/your/script.sh.
3. Check the Memory Limit: The log message Setting memory limit to {"limit": 1073741824} indicates that the application is setting a memory limit. If the application is running out of memory, it could cause crashes. You might need to increase the memory limit or optimize the application to use less memory.
4. Check the StatsD Configuration: The log messages StatsD client setup succeeded and Stats started successfully in mode "StatsD" with address "localhost:8125" suggest that the application is using StatsD for metrics. Ensure that the StatsD server is running and accessible from the RudderStack backend.
5. Check the RudderStack Version: The log message Open source version of rudder-server indicates that you’re using the open-source version of RudderStack. If there’s a newer version available, consider updating to see if the issue has been resolved in a later release.

@shashank42
Author

shashank42 commented Aug 13, 2024

Thank you!

I tried to resolve it multiple times, but I think Kubernetes was overkill. I am just using Docker right now and it works great. I do have one issue where I am running out of memory (200 GB) in the local Postgres every few days, so I clean it up and restart. Is there an option to set an expiry for the local archive?

I have these variables right now

- RSERVER_JOBS_DB_DS_LIMIT=10
- RSERVER_JOBS_DB_QUERY_REQUEST_TIMEOUT=1800s
- RSERVER_JOBS_DB_BACKUP_ENABLED=false
- RSERVER_PROCESSOR_FIXED_LOOP_SLEEP_IN_MS=2000
- RSERVER_ROUTER_FIXED_LOOP_SLEEP_IN_MS=3000
- RSERVER_ARCHIVAL_ENABLED=false

@gitcommitshow
Collaborator

So sorry @shashank42, I missed your reply. Good to hear that the Docker installation is working for you.

Is there an option to have an expiry of local archive?

Let me check if there's a configuration for this
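In the meantime, a generic Postgres query (assuming the 200 GB refers to database storage rather than RAM) can show which jobsdb tables are taking up the space:

    -- top 20 tables by total size (table + indexes + TOAST) in the public schema
    SELECT c.relname                                     AS table_name,
           pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size
    FROM pg_class c
    JOIN pg_namespace n ON n.oid = c.relnamespace
    WHERE n.nspname = 'public'
      AND c.relkind = 'r'
    ORDER BY pg_total_relation_size(c.oid) DESC
    LIMIT 20;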

@kalavt

kalavt commented Nov 25, 2024

Same issue here:

2024-11-25T09:37:05.896Z INFO jobsdb.gw jobsdb/jobsdb.go:1234 [JobsDB] gw_jobs_9225 is full by rows. Count: 100954, Size: 241360896
2024-11-25T09:37:05.906Z INFO jobsdb.gw jobsdb/jobsdb.go:2463 [[ gw : addNewDSLoop ]]: NewDS
2024-11-25T09:37:05.906Z INFO jobsdb.gw jobsdb/jobsdb.go:1313 Creating new DS {JobTable:gw_jobs_9226 JobStatusTable:gw_job_status_9226 Index:9226}
2024-11-25T09:37:14.820Z INFO jobsdb.batch_rt jobsdb/jobsdb.go:1234 [JobsDB] batch_rt_jobs_18392 is full by rows. Count: 104350, Size: 295190528
2024-11-25T09:37:14.822Z INFO jobsdb.batch_rt jobsdb/jobsdb.go:2463 [[ batch_rt : addNewDSLoop ]]: NewDS
2024-11-25T09:37:14.822Z INFO jobsdb.batch_rt jobsdb/jobsdb.go:1313 Creating new DS {JobTable:batch_rt_jobs_18393 JobStatusTable:batch_rt_job_status_18393 Index:18393}
2024/11/25 09:37:23 bugsnag/sessions/publisher.publish invalid API key: ''
2024-11-25T09:38:17.577Z INFO jobsdb.arc jobsdb/jobsdb.go:1234 [JobsDB] arc_jobs_9225 is full by rows. Count: 100470, Size: 224542720
2024-11-25T09:38:17.579Z INFO jobsdb.arc jobsdb/jobsdb.go:2463 [[ arc : addNewDSLoop ]]: NewDS
2024-11-25T09:38:17.579Z INFO jobsdb.arc jobsdb/jobsdb.go:1313 Creating new DS {JobTable:arc_jobs_9226 JobStatusTable:arc_job_status_9226 Index:9226}
2024/11/25 09:38:23 bugsnag/sessions/publisher.publish invalid API key: ''
2024/11/25 09:39:23 bugsnag/sessions/publisher.publish invalid API key: ''
2024-11-25T09:40:17.427Z INFO jobsdb.batch_rt jobsdb/jobsdb.go:1234 [JobsDB] batch_rt_jobs_18393 is full by rows. Count: 101014, Size: 285417472
2024-11-25T09:40:17.428Z INFO jobsdb.batch_rt jobsdb/jobsdb.go:2463 [[ batch_rt : addNewDSLoop ]]: NewDS
2024-11-25T09:40:17.428Z INFO jobsdb.batch_rt jobsdb/jobsdb.go:1313 Creating new DS {JobTable:batch_rt_jobs_18394 JobStatusTable:batch_rt_job_status_18394 Index:18394}
2024/11/25 09:40:23 bugsnag/sessions/publisher.publish invalid API key: ''
2024-11-25T09:40:56.787Z ERROR utils.misc misc/misc.go:926 Panic detected. Application will crash. {"stack": "goroutine 455 [running]:\nruntime/debug.Stack()\n\t/usr/local/go/src/runtime/debug/stack.go:24 +0x5e\ngithub.com/rudderlabs/rudder-server/utils/misc.BugsnagNotify.func1.1()\n\t/rudder-server/utils/misc/misc.go:927 +0x37e\nsync.(*Once).doSlow(0xc001ec5a50?, 0xc003e45d30?)\n\t/usr/local/go/src/sync/once.go:74 +0xbf\nsync.(*Once).Do(...)\n\t/usr/local/go/src/sync/once.go:65\ngithub.com/rudderlabs/rudder-server/utils/misc.BugsnagNotify.func1()\n\t/rudder-server/utils/misc/misc.go:916 +0xbb\npanic({0x32a8f60?, 0xc0016bb2c0?})\n\t/usr/local/go/src/runtime/panic.go:914 +0x21f\ngithub.com/rudderlabs/rudder-server/jobsdb.(*Handle).migrateDSLoop(0xc000d2c380, {0x40ae8d0, 0xc001ec5a40})\n\t/rudder-server/jobsdb/migration.go:54 +0x1ff\ngithub.com/rudderlabs/rudder-server/jobsdb.(*Handle).startMigrateDSLoop.func1()\n\t/rudder-server/jobsdb/migration.go:28 +0x1f\ngithub.com/rudderlabs/rudder-server/jobsdb.(*Handle).startMigrateDSLoop.WithBugsnag.func2()\n\t/rudder-server/utils/misc/misc.go:945 +0x5c\ngolang.org/x/sync/errgroup.(*Group).Go.func1()\n\t/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:78 +0x56\ncreated by golang.org/x/sync/errgroup.(*Group).Go in goroutine 539\n\t/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:75 +0x96\n", "panic": "failed to migrate ds: pq: canceling statement due to user request; driver: bad connection", "team": "Core", "goRoutines": 831, "version": "1.22.1", "releaseStage": "development"}
2024-11-25T09:40:56.787Z ERROR utils.misc misc/misc.go:926 goroutine 455 [running]:
github.com/rudderlabs/rudder-go-kit/logger.(*logger).Fatalw(0xc000f6d830, {0x38fbe93?, 0x4c8?}, {0xc002361b00?, 0x1?, 0x140?})
/go/pkg/mod/github.com/rudderlabs/[email protected]/logger/logger.go:332 +0xa5
github.com/rudderlabs/rudder-server/utils/misc.BugsnagNotify.func1.1()
/rudder-server/utils/misc/misc.go:926 +0x5f6
sync.(*Once).doSlow(0xc001ec5a50?, 0xc003e45d30?)
/usr/local/go/src/sync/once.go:74 +0xbf
sync.(*Once).Do(...)
/usr/local/go/src/sync/once.go:65
github.com/rudderlabs/rudder-server/utils/misc.BugsnagNotify.func1()
/rudder-server/utils/misc/misc.go:916 +0xbb
panic({0x32a8f60?, 0xc0016bb2c0?})
/usr/local/go/src/runtime/panic.go:914 +0x21f
github.com/rudderlabs/rudder-server/jobsdb.(*Handle).migrateDSLoop(0xc000d2c380, {0x40ae8d0, 0xc001ec5a40})
/rudder-server/jobsdb/migration.go:54 +0x1ff
github.com/rudderlabs/rudder-server/jobsdb.(*Handle).startMigrateDSLoop.func1()
/rudder-server/jobsdb/migration.go:28 +0x1f
github.com/rudderlabs/rudder-server/jobsdb.(*Handle).startMigrateDSLoop.WithBugsnag.func2()
/rudder-server/utils/misc/misc.go:945 +0x5c
golang.org/x/sync/errgroup.(*Group).Go.func1()
/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:78 +0x56
created by golang.org/x/sync/errgroup.(*Group).Go in goroutine 539
/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:75 +0x96

2024/11/25 09:40:56 notifying bugsnag: failed to migrate ds: pq: canceling statement due to user request; driver: bad connection
2024/11/25 09:40:56 bugsnag.Notify: bugsnag/payload.deliver: invalid api key: ''
2024/11/25 09:40:56 bugsnag/sessions/publisher.publish invalid API key: ''
panic: failed to migrate ds: pq: canceling statement due to user request; driver: bad connection [recovered]
panic: failed to migrate ds: pq: canceling statement due to user request; driver: bad connection [recovered]
panic: failed to migrate ds: pq: canceling statement due to user request; driver: bad connection



@gitcommitshow
Collaborator

@kalavt can you please describe how to reproduce your issue and how it impacted you?
Please share CPU/RAM usage before and after starting the server

@gitcommitshow
Collaborator

gitcommitshow commented Nov 25, 2024

@shashank42 this is the config to set the local archive expiry: RSERVER_ARCHIVAL_JOB_RETENTION (default value is 24h)
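For example, added to the environment-variable list you shared earlier (24h shown here only because it is the default; adjust as needed):

- RSERVER_ARCHIVAL_JOB_RETENTION=24h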

@kalavt

kalavt commented Nov 25, 2024

Hi @gitcommitshow, I have no idea how to reproduce the issue; rudder-server-0 gets restarted regularly (almost on a daily basis).

kubectl get pods --all-namespaces
data   rudder-server-0                       1/1   Running   1 (32m ago)   7h21m
data   rudder-server-postgresql-0            1/1   Running   0             7h21m
data   rudder-transformer-7f5689f479-fc26t   1/1   Running   0             2d7h
data   rudder-transformer-7f5689f479-xkzsn   1/1   Running   0             56d

rudder-server
  Limits:
    cpu: 4
    memory: 4Gi
  Requests:
    cpu: 1
    memory: 4Gi

rudder-server-postgresql-0
  Limits:
    cpu: 4
    memory: 4Gi
  Requests:
    cpu: 2
    memory: 4Gi

What I can see is that when rudder-server-0 crashes, postgresql-0 shows the logs below, which might be relevant:


2024-11-25 09:32:51.447 GMT [122614] ERROR: canceling autovacuum task
2024-11-25 09:32:51.447 GMT [122614] CONTEXT: automatic analyze of table "rudderstack.public.arc_jobs_9224"
2024-11-25 09:34:22.250 GMT [123218] ERROR: canceling autovacuum task
2024-11-25 09:34:22.250 GMT [123218] CONTEXT: automatic analyze of table "rudderstack.public.batch_rt_jobs_18391"
2024-11-25 09:40:56.785 GMT [122371] ERROR: canceling statement due to user request
2024-11-25 09:40:56.785 GMT [122371] STATEMENT: DELETE FROM "batch_rt_job_status_18389"
WHERE NOT id = ANY(
SELECT DISTINCT ON (job_id) id from "batch_rt_job_status_18389" ORDER BY job_id ASC, id DESC
)
2024-11-25 09:40:59.171 GMT [125646] LOG: incomplete startup packet
2024-11-25 09:40:59.406 GMT [125659] ERROR: relation "rsources_failed_keys_v2" already exists
2024-11-25 09:40:59.406 GMT [125659] STATEMENT: create table "rsources_failed_keys_v2" (
id VARCHAR(27) COLLATE "C",
db_name text not null default 'rudder-server-postgresql-0.rudder-server-postgresql-headless',
job_run_id text not null,
task_run_id text not null,
source_id text not null,
destination_id text not null,
primary key (id),
unique (job_run_id, task_run_id, source_id, destination_id, db_name)
)
2024-11-25 09:40:59.407 GMT [125659] ERROR: relation "rsources_failed_keys_v2_records" already exists
2024-11-25 09:40:59.407 GMT [125659] STATEMENT: create table "rsources_failed_keys_v2_records" (
id VARCHAR(27) COLLATE "C",
record_id text not null,
ts timestamp not null default NOW(),
primary key (id, record_id)
)
2024-11-25 09:40:59.413 GMT [125659] ERROR: relation "rsources_stats" already exists
2024-11-25 09:40:59.413 GMT [125659] STATEMENT: create table "rsources_stats" (
db_name text not null default 'rudder-server-postgresql-0.rudder-server-postgresql-headless',
job_run_id text not null,
task_run_id text not null,
source_id text not null,
destination_id text not null,
in_count integer not null default 0,
out_count integer not null default 0,
failed_count integer not null default 0,
ts timestamp not null default NOW(),
primary key (db_name, job_run_id, task_run_id, source_id, destination_id)
)
2024-11-25 09:41:01.695 GMT [125372] ERROR: canceling autovacuum task
2024-11-25 09:41:01.695 GMT [125372] CONTEXT: automatic analyze of table "rudderstack.public.arc_jobs_9226"
2024-11-25 09:57:21.654 GMT [130709] ERROR: canceling autovacuum task
2024-11-25 09:57:21.654 GMT [130709] CONTEXT: automatic analyze of table "rudderstack.public.batch_rt_jobs_18398"


Let me know if you would like more details or logs.

@kalavt

kalavt commented Nov 27, 2024

@gitcommitshow I think it might be related to the SQL below. After the DB has accumulated a massive amount of data, its performance becomes very low, hence we get the "failed to migrate ds: pq: canceling statement due to user request; driver: bad connection" panic and rudder-server keeps crashing.

DELETE FROM "arc_job_status_95" WHERE NOT id = ANY(
 SELECT DISTINCT ON (job_id) id from "arc_job_status_95" ORDER BY job_id ASC, id DESC )

It seems the PR below fixed the issue, but it was not merged into the 1.3.x release.

#2693
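A quick way to see how much redundant data that DELETE has to work through (table name taken from the statement above; adjust to whichever dataset is being migrated):

    -- total status rows vs. distinct jobs; the difference is what the dedup DELETE removes
    SELECT count(*)               AS total_status_rows,
           count(DISTINCT job_id) AS distinct_jobs
    FROM "arc_job_status_95";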

@gitcommitshow
Collaborator

@kalavt are you using rudder-server 1.3.x?

Please upgrade to at least 1.36.x

@kalavt

kalavt commented Nov 27, 2024

Hi @gitcommitshow, I'm currently on rudder-server 1.38.3. I saw the patch was only merged from 1.4.0 onwards.


@gitcommitshow
Collaborator

I'm currently on rudder-server 1.38.3,
I saw the patch only merged to 1.4.0 onwards.

In that case, your version is patched already. But you still see this error, right?
