Job Usage Factor Calculation Notes

Christopher Moussa edited this page Apr 20, 2021 · 7 revisions

calc_usage_factor() flowchart

  • create_db() creates a new table that is dynamically sized based on two options passed in when initially creating the database: --priority-decay-half-life and --priority-usage-reset-period. Eventually, I think we could create the FluxAccounting database with a configuration file, but for now, I pass them as options when running the create-db subcommand. When the FluxAccounting database gets created, there are now two additional tables that get created:

job_usage_factor_table: stores past job usage factors per user in the association table

When a user is added to association_table, they are also added to job_usage_factor_table.

The number of columns in job_usage_factor_table is determined by two parameters: PriorityUsageResetPeriod and PriorityDecayHalfLife, each of which represents a number of weeks. Together they define how long usage factors are held, up to the point where jobs no longer play a part in calculating a usage factor. If these options aren't specified, the table defaults to 4 usage columns, each of which represents one week's worth of jobs.
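A minimal sketch of how a dynamically sized table like this could be built with sqlite3. The function and parameter names mirror the options described above, but the exact schema and helper names are assumptions, not the actual flux-accounting code:

```python
import sqlite3

# Hypothetical sketch: build a job_usage_factor_table whose number of usage
# columns depends on the two create-db options (both given in weeks).
def create_usage_table(conn, priority_usage_reset_period=4, priority_decay_half_life=1):
    # one usage column per half-life period within the reset period
    num_periods = priority_usage_reset_period // priority_decay_half_life
    usage_cols = ", ".join(
        f"usage_factor_period_{i} REAL DEFAULT 0.0" for i in range(num_periods)
    )
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS job_usage_factor_table "
        f"(username TEXT, bank TEXT, {usage_cols}, PRIMARY KEY (username, bank))"
    )

conn = sqlite3.connect(":memory:")
create_usage_table(conn)  # defaults -> 4 usage columns
cols = [row[1] for row in conn.execute("PRAGMA table_info(job_usage_factor_table)")]
```

With the defaults, `cols` contains `username`, `bank`, and `usage_factor_period_0` through `usage_factor_period_3`.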

job_archive_interface.py gets new functions:

def calc_usage_factor(jobs_conn, acct_conn, user, bank, priority_decay_half_life=None, priority_usage_reset_period=None)

The value of priority_decay_half_life determines how far back to go to retrieve jobs in the job-archive. The number of weeks is converted to seconds by multiplying it by 604800. It then uses view_job_records() to query the job archive and retrieve the user's jobs that completed in the specified time period.

view_job_records(jobs_conn, output_file, user=user, after_start_time=(time.time() - after_start_time))
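The time-window arithmetic above can be sketched as follows. The helper name and the `now` parameter are illustrative, not part of the actual interface:

```python
import time

SECONDS_PER_WEEK = 604800

# Hypothetical sketch of the cutoff computation: priority_decay_half_life
# is a number of weeks; jobs that finished after `cutoff` fall into the
# current usage period and are fetched from the job archive.
def usage_window_start(priority_decay_half_life, now=None):
    now = time.time() if now is None else now
    return now - priority_decay_half_life * SECONDS_PER_WEEK

# e.g. with a one-week half-life, only the last 604800 seconds count
cutoff = usage_window_start(1, now=1605638000.0)
```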

I modeled the usage factor calculation after Slurm's model, which looks at the ratio of a user's historical usage to the rest of the cluster's historical usage.

def apply_decay_factor(decay_factor, acct_conn, user=None, bank=None):

The past usage factors have the decay factor D (0.5) applied to them before they are added to the user's current usage factor.

usage_user_past = (D * U_period-1) + (D^2 * U_period-2) + (D^3 * U_period-3) + ...

After the current usage factor is calculated, it is written to job_usage_factor_table along with the other, older factors. The oldest factor gets removed from the table since it is no longer needed.
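The decay step can be sketched as a short function. This is a simplified stand-in for apply_decay_factor(), keeping only the geometric-decay arithmetic (the real function also reads the factors from and writes them back to the database):

```python
# Hypothetical sketch: each past period's usage is multiplied by D raised
# to its age in half-life periods, matching
#   usage_user_past = (D * U_period-1) + (D^2 * U_period-2) + ...
def apply_decay(past_factors, decay_factor=0.5):
    # past_factors[0] is the most recent period, past_factors[-1] the oldest
    return sum(
        (decay_factor ** (age + 1)) * u for age, u in enumerate(past_factors)
    )

# Four past factors of 16.0 decay to 8 + 4 + 2 + 1 = 15.0
usage_user_past = apply_decay([16.0, 16.0, 16.0, 16.0])
```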

Then, a similar process is repeated to calculate the raw usage factor for all of the user's siblings' jobs in that same time period.

The two usage factors are then divided to get a ratio: usage_user_historical / usage_siblings_current

Users who have run a lot of jobs will have a larger usage factor than users who have not run as many jobs.

An example of calculating the job usage factor

Let's say a user has the following job records from the most recent PriorityDecayHalfLife:

   UserID Username  JobID         T_Submit            T_Run       T_Inactive  Nodes                                                                               R
0    1002     1002    102 1605633403.22141 1605635403.22141 1605637403.22141      2  {"version":1,"execution": {"R_lite":[{"rank":"0","children": {"core": "0"}}]}}
1    1002     1002    103 1605633403.22206 1605635403.22206 1605637403.22206      2  {"version":1,"execution": {"R_lite":[{"rank":"0","children": {"core": "0"}}]}}
2    1002     1002    104 1605633403.22285 1605635403.22286 1605637403.22286      2  {"version":1,"execution": {"R_lite":[{"rank":"0","children": {"core": "0"}}]}}
3    1002     1002    105 1605633403.22347 1605635403.22348 1605637403.22348      1  {"version":1,"execution": {"R_lite":[{"rank":"0","children": {"core": "0"}}]}}
4    1002     1002    106 1605633403.22416 1605635403.22416 1605637403.22416      1  {"version":1,"execution": {"R_lite":[{"rank":"0","children": {"core": "0"}}]}}

total nodes used: 8

total time elapsed: 10000.0

usage_user_current: (total nodes used) * (total time elapsed) = 80000.0

And the user's past job usage factors (each one represents a PriorityDecayHalfLife period up to the PriorityUsageResetPeriod):

  username bank  usage_factor_period_0  usage_factor_period_1  usage_factor_period_2  usage_factor_period_3
0     1002    C               16.00000               16.00000               16.00000               16.00000

The past usage factors have the decay factor applied to them: [8.0, 4.0, 2.0, 1.0]

usage_user_past: 15.0

usage_user_historical: (usage_user_current) + (usage_user_past) = 80015.0

Here are all of the jobs run by the user's siblings in that same period:

    UserID Username  JobID         T_Submit            T_Run       T_Inactive  Nodes                                                                               R
0     1001     1001    100 1605633403.22009 1605635403.22009 1605637403.22009      1  {"version":1,"execution": {"R_lite":[{"rank":"0","children": {"core": "0"}}]}}
1     1001     1001    101 1605633403.22073 1605635403.22073 1605637403.22073      1  {"version":1,"execution": {"R_lite":[{"rank":"0","children": {"core": "0"}}]}}

total nodes used: 2

total time elapsed: 4000.0000009536743

usage_siblings_current: (total nodes used) * (total time elapsed) = 8000.0000

job usage factor: (usage_user_historical: 80015.0) / (usage_siblings_current: 8000.0000) = 10.0019
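The whole worked example above can be recomputed in a few lines (all numbers are taken from the tables in this example):

```python
# The user ran 5 jobs using 2+2+2+1+1 = 8 nodes; each job's elapsed time
# is T_Inactive - T_Run = 2000 s, so total time elapsed is 10000 s.
nodes_user = [2, 2, 2, 1, 1]
elapsed_user = [2000.0] * 5
usage_user_current = sum(nodes_user) * sum(elapsed_user)  # 8 * 10000.0 = 80000.0

# Four past factors of 16.0, decayed by D = 0.5 per period: 8 + 4 + 2 + 1
usage_user_past = sum(0.5 ** (i + 1) * 16.0 for i in range(4))  # 15.0

usage_user_historical = usage_user_current + usage_user_past  # 80015.0

# Siblings ran 2 jobs on 1 node each, each with 2000 s elapsed
usage_siblings_current = (1 + 1) * (2000.0 + 2000.0)  # 8000.0

job_usage_factor = usage_user_historical / usage_siblings_current
```

`job_usage_factor` comes out to 10.001875, which matches the 10.0019 shown above after rounding.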

I see this function being called as part of a script that gets run periodically (e.g. via cron) to update the values for users in the association_table.