Job Usage Factor Calculation Notes
create_db() creates a new table that is dynamically sized based on two options passed in when initially creating the database: --priority-decay-half-life and --priority-usage-reset-period. Eventually, I think we could create the FluxAccounting database with a configuration file, but for now, I pass them as options when running the create-db subcommand. When the FluxAccounting database gets created, there are now two additional tables that get created:
job_usage_factor_table: stores past job usage factors per user in the association table
When a user is added to association_table, they are also added to job_usage_factor_table.
The number of columns in job_usage_factor_table is determined by two parameters, PriorityUsageResetPeriod and PriorityDecayHalfLife, each of which represents a number of weeks. Together they define how many past usage factors are held, up to the point where older jobs no longer play a part in calculating a usage factor. If these options aren't specified, the table defaults to 4 usage columns, each of which represents one week's worth of jobs.
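A rough sketch of how the usage columns could be derived from the two options. It assumes the column count is PriorityUsageResetPeriod divided by PriorityDecayHalfLife, which matches the 4-column default but is my assumption, not something stated above. The helper names usage_columns and create_usage_table and the column types are hypothetical.

```python
import sqlite3


def usage_columns(reset_period_weeks=4, half_life_weeks=1):
    """Return one column name per half-life period (assumed formula)."""
    n_periods = reset_period_weeks // half_life_weeks
    return [f"usage_factor_period_{i}" for i in range(n_periods)]


def create_usage_table(conn, columns):
    """Create job_usage_factor_table with a variable number of usage columns."""
    usage_sql = ", ".join(f"{name} real DEFAULT 0.0" for name in columns)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS job_usage_factor_table ("
        "username text NOT NULL, bank text NOT NULL, "
        f"{usage_sql}, PRIMARY KEY (username, bank))"
    )


conn = sqlite3.connect(":memory:")
create_usage_table(conn, usage_columns())
# Column names as SQLite reports them (index 1 of each PRAGMA row is the name).
names = [row[1] for row in conn.execute("PRAGMA table_info(job_usage_factor_table)")]
```

With the defaults, names holds username, bank, and usage_factor_period_0 through usage_factor_period_3.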
job_archive_interface.py gets new functions:
def calc_usage_factor(jobs_conn, acct_conn, user, bank, priority_decay_half_life=None, priority_usage_reset_period=None):
The value of priority_decay_half_life determines how far back to go to retrieve jobs in the job-archive. The number of weeks is converted to seconds by multiplying it by 604800. It then uses view_job_records() to filter the job archive and retrieve the user's jobs that have completed in the time period specified.
view_job_records(jobs_conn, output_file, user=user, after_start_time=(time.time() - after_start_time))
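To make the weeks-to-seconds step concrete, here is a small sketch of the cutoff calculation. archive_cutoff and jobs_after are hypothetical helpers; jobs_after is only a stand-in for the filtering that view_job_records() does, and it assumes job records are dicts with a t_run key.

```python
import time

SECONDS_PER_WEEK = 604800  # 7 * 24 * 60 * 60


def archive_cutoff(half_life_weeks, now=None):
    """Timestamp before which jobs are ignored for the current period."""
    if now is None:
        now = time.time()
    return now - half_life_weeks * SECONDS_PER_WEEK


def jobs_after(jobs, cutoff):
    """Keep only jobs that started after the cutoff (assumed record layout)."""
    return [job for job in jobs if job["t_run"] > cutoff]


cutoff = archive_cutoff(1, now=1605637403.0)
recent = jobs_after(
    [{"jobid": 102, "t_run": 1605635403.2}, {"jobid": 1, "t_run": 1600000000.0}],
    cutoff,
)
```

Here only job 102 survives the filter, since the other job started more than one week before the reference time.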
I modeled the usage factor calculation after Slurm's model, which looks at the ratio of a user's historical usage to the rest of the cluster's historical usage.
def apply_decay_factor(decay_factor, acct_conn, user=None, bank=None):
The past usage factors have the decay factor D (0.5) applied to them before they are added to the user's current usage factor.
usage_user_past = (D * U_last_period) + (D * D * U_period-2) + ...
After the current usage factor is calculated, it is written to job_usage_factor_table along with the other, older factors. The oldest factor gets removed from the table since it is no longer needed.
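The decay and the table update can be sketched roughly like this. It assumes the stored factors are ordered newest first (so period 0 is the most recent one), which is my reading and not confirmed above. apply_decay and rotate_factors are hypothetical names, not the functions in job_archive_interface.py.

```python
def apply_decay(past_factors, decay=0.5):
    """Multiply the i-th most recent factor by decay ** (i + 1)."""
    return [factor * decay ** (i + 1) for i, factor in enumerate(past_factors)]


def rotate_factors(past_factors, current_factor):
    """Put the newest factor first and drop the oldest one."""
    return [current_factor] + past_factors[:-1]


decayed = apply_decay([16.0, 16.0, 16.0, 16.0])
usage_user_past = sum(decayed)
updated_row = rotate_factors([16.0, 16.0, 16.0, 16.0], 80000.0)
```

The row keeps the same length after rotation, so the table width set at database creation never changes.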
Then, a similar process is repeated to calculate the raw usage factor for all of the user's siblings' jobs in that same time period.
The two usage factors are then divided to get a ratio: usage_user_historical / usage_siblings_current
Users who have run a lot of jobs will have a larger usage factor than users who have not run as many jobs.
Let's say a user has the following job records from the most recent PriorityDecayHalfLife period:
UserID Username JobID T_Submit T_Run T_Inactive Nodes R
0 1002 1002 102 1605633403.22141 1605635403.22141 1605637403.22141 2 {"version":1,"execution": {"R_lite":[{"rank":"0","children": {"core": "0"}}]}}
1 1002 1002 103 1605633403.22206 1605635403.22206 1605637403.22206 2 {"version":1,"execution": {"R_lite":[{"rank":"0","children": {"core": "0"}}]}}
2 1002 1002 104 1605633403.22285 1605635403.22286 1605637403.22286 2 {"version":1,"execution": {"R_lite":[{"rank":"0","children": {"core": "0"}}]}}
3 1002 1002 105 1605633403.22347 1605635403.22348 1605637403.22348 1 {"version":1,"execution": {"R_lite":[{"rank":"0","children": {"core": "0"}}]}}
4 1002 1002 106 1605633403.22416 1605635403.22416 1605637403.22416 1 {"version":1,"execution": {"R_lite":[{"rank":"0","children": {"core": "0"}}]}}
total nodes used: 8
total time elapsed: 10000.0
usage_user_current: (total nodes used) * (total time elapsed) = 80000.0
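The same arithmetic as a sketch, using the five job records above. It follows the totals-based calculation shown here (total nodes times total elapsed time) and does not sum node-seconds per job. raw_usage and the dict layout are hypothetical.

```python
# (t_run, t_inactive, nodes) taken from the job records in the table above.
jobs = [
    {"t_run": 1605635403.22141, "t_inactive": 1605637403.22141, "nodes": 2},
    {"t_run": 1605635403.22206, "t_inactive": 1605637403.22206, "nodes": 2},
    {"t_run": 1605635403.22286, "t_inactive": 1605637403.22286, "nodes": 2},
    {"t_run": 1605635403.22348, "t_inactive": 1605637403.22348, "nodes": 1},
    {"t_run": 1605635403.22416, "t_inactive": 1605637403.22416, "nodes": 1},
]


def raw_usage(jobs):
    """Total nodes used multiplied by total elapsed run time."""
    total_nodes = sum(job["nodes"] for job in jobs)
    total_elapsed = sum(job["t_inactive"] - job["t_run"] for job in jobs)
    return total_nodes * total_elapsed


usage_user_current = raw_usage(jobs)
```

The result is approximately 80000; floating-point subtraction of large timestamps can leave a tiny remainder.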
And the user's past job usage factors (each one represents a PriorityDecayHalfLife period up to the PriorityUsageResetPeriod):
username bank usage_factor_period_0 usage_factor_period_1 usage_factor_period_2 usage_factor_period_3
0 1002 C 16.00000 16.00000 16.00000 16.00000
The past usage factors have the decay factor applied to them: [8.0, 4.0, 2.0, 1.0]
usage_user_past: 15.0
usage_user_historical: (usage_user_current) + (usage_user_past) = 80015.0
The following represents all of the jobs run by the user's siblings in that same period:
UserID Username JobID T_Submit T_Run T_Inactive Nodes R
0 1001 1001 100 1605633403.22009 1605635403.22009 1605637403.22009 1 {"version":1,"execution": {"R_lite":[{"rank":"0","children": {"core": "0"}}]}}
1 1001 1001 101 1605633403.22073 1605635403.22073 1605637403.22073 1 {"version":1,"execution": {"R_lite":[{"rank":"0","children": {"core": "0"}}]}}
total nodes used: 2
total time elapsed: 4000.0000009536743
usage_siblings_current: (total nodes used) * (total time elapsed) = 8000.0000
job usage factor: (usage_user_historical: 80015.0) / (usage_siblings_current: 8000.0000) = 10.0019
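Putting the pieces together, the final ratio can be sketched as below. job_usage_factor is a hypothetical helper, and raising an error when the siblings have no usage is my assumption; the notes above don't say how that case is handled.

```python
def job_usage_factor(user_current, decayed_past, siblings_current):
    """Ratio of the user's historical usage to the siblings' current usage."""
    if siblings_current == 0:
        # Assumption: treat "no sibling usage" as an error rather than guess a value.
        raise ValueError("siblings have no usage in this period")
    usage_user_historical = user_current + sum(decayed_past)
    return usage_user_historical / siblings_current


factor = job_usage_factor(80000.0, [8.0, 4.0, 2.0, 1.0], 8000.0)
```

With the numbers from this example, factor rounds to 10.0019.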
I see this function being called as part of a script that gets run every x period of time or so via cron, which would update the values for users in the association_table.