[WIP] First Draft of OPEA Triage Tool #1185

Open
wants to merge 1 commit into
base: main

louie-tsai
Collaborator

@louie-tsai louie-tsai commented Nov 23, 2024

Description

To help customers debug OPEA issues, this PR adds a Triage Tool that runs tests and gathers the information needed for debugging.

This first draft covers ChatQnA only.

How to Use:
python Triage.py ChatQnA_Xeon.json
Only the JSON file needs to change for a different architecture such as Gaudi.

The Triage Tool runs some simple tests, including:

  1. health_check
  2. microservice tests
  3. statistics
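
As a rough illustration of the health_check step, the sketch below probes one service over HTTP and reports its status code. The function name `check_health`, the `/v1/health_check` path, and the timeout are assumptions for illustration and are not taken from this PR's code; proxies are bypassed on the assumption that the services run on the local host.

```python
import urllib.error
import urllib.request

# Ignore proxy settings: the triaged services are assumed to run on this host.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def check_health(base_url, port, path="/v1/health_check", timeout=5.0):
    """Return the HTTP status code of a health endpoint, or None if unreachable.

    `path` and `timeout` are illustrative defaults, not values from the PR.
    """
    url = f"{base_url}:{port}{path}"
    try:
        with _OPENER.open(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The service answered, but with an error status (4xx or 5xx).
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, and similar.
        return None
```

Returning `None` instead of raising lets a triage run record an unreachable service and move on to the next one.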

The following information is gathered after the tests above:

  1. system info
  2. env variable info
  3. all docker logs for microservices
  4. (optional) profiling results, such as vLLM PyTorch profiling
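
A minimal sketch of the system and environment collection, using only the standard library. The row layouts mirror the `machine`/`processor` rows and the `env`, `value`, `required` columns visible in the reviewed code; the function names and the `environ` parameter are hypothetical.

```python
import os
import platform


def collect_system_info():
    """Return [name, value] rows describing the host."""
    info = platform.uname()
    return [
        ["system", info.system],
        ["release", info.release],
        ["machine", info.machine],
        ["processor", info.processor],
    ]


def collect_env_info(required, environ=None):
    """Return [env, value, required] rows for the given variable names.

    Unset variables are reported with an empty value so they stand out.
    """
    environ = os.environ if environ is None else environ
    return [[name, environ.get(name, ""), True] for name in required]
```

Rows in this shape can be passed directly to `pd.DataFrame(rows, columns=columns)` as the PR does.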

Report:
The plan is to produce an HTML report.
So far we have:

  1. simple console output
  2. a simple HTML report with all docker logs embedded

Console output
Below is a screenshot of the console output.

[image: console output screenshot]

HTML report with all docker logs embedded

[image: HTML report screenshot]

Other info:

  1. All input, output, and port configurations live in the data.json file, so no code changes are needed for configuration or data.
  2. Each example is implemented as a separate Python class.
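
The two points above could fit together roughly as follows. This is a sketch, not the PR's implementation: the JSON is assumed to be keyed by class name (as `data[class_name]` in the reviewed code suggests), the `TriageExample` base class is hypothetical, and the port numbers are placeholders.

```python
import json

# Placeholder configuration; real files such as ChatQnA_Xeon.json may differ.
SAMPLE_CONFIG = """
{
  "ChatQnA": [
    {"service": "dataprep", "port": 6007, "output": "Data preparation succeeded"},
    {"service": "retrieval", "port": 7000, "output": ""}
  ]
}
"""


class TriageExample:
    """Base class: reads the entries stored under the subclass's own name."""

    def __init__(self, config):
        self.services = config.get(type(self).__name__, [])

    def service_names(self):
        return [entry["service"] for entry in self.services]


class ChatQnA(TriageExample):
    """One class per example; ChatQnA needs no extra behaviour in this sketch."""


config = json.loads(SAMPLE_CONFIG)
example = ChatQnA(config)
print(example.service_names())
```

Keying the file by class name means a new example only needs a new subclass and a new top-level entry in the JSON.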

ToDo:

  1. improve the HTML report
  2. make sure all microservices are tested on both Xeon and Gaudi
  3. move files into the right folder
  4. apply to all examples

Issues

n/a.

Type of change

List the type of change like below. Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds new functionality)
  • Breaking change (fix or feature that would break existing design and interface)
  • Others (enhancement, documentation, validation, etc.)

Dependencies

NA

Tests

Manual Testing

@louie-tsai
Collaborator Author

a sample report zip file
ChatQnA_Xeon.json_11-25_06-34.zip

@louie-tsai
Collaborator Author

[ToDo]: understand whether we need to debug RESTful API requests here or use telemetry instead

"output": "Data preparation succeeded"
},
{
"service": "retrival",

Suggested change
"service": "retrival",
"service": "retrieval",


import requests
unittest.TestLoader.sortTestMethodsUsing = None
#unittest.TestLoader.sortTestMethodsUsing = lambda self, a, b: (a < b) - (a > b)

remove commented out code

p = subprocess.Popen(cmd, stdout=subprocess.PIPE, shell=True)
(output, err) = p.communicate()
p_status = p.wait()
#print("Command exit status/return code : ", p_status)

remove commented out code

print("json load failed: " + self.filename)
pass
for i in data[class_name]:
#print(i["service"])

remove commented out code

line= str(line)[2:-1]
row = line.split()
if columns != []:
#print(row)

remove commented out code


import pandas as pd
df = pd.DataFrame(rows,columns=columns )
#print(df)

remove commented out code

def update_docker_report(self, port, key, value):

df = self.docker_report_df
#docker_index = df.loc[df['PORTS'] == port]

remove commented out code

self.triage_level = triage_level
self.triage_report = triage_report
self.ip = "http://0.0.0.0"
#print(command_line_param)

remove commented out code

import pandas as pd
columns = ['env','value','required']
df = pd.DataFrame(rows,columns=columns )
#print(df)

remove commented out code

rows.append(['machine', system_info.machine])
rows.append(['processor', system_info.processor])
df = pd.DataFrame(rows,columns=columns )
#print(df)

remove commented out code

#print(df)
self.triage_report.system_info_df = df

#status, output = RunCmd().run("docker ps")

remove commented out code


# Health Check
response_status_code = self.utils.service_health_check(self.ip, port, self.triage_report, self.triage_level)
#self.assertEqual(response_status_code, 200)

remove commented out code

#self.assertEqual(response_status_code, 200)

# Testing
#response_status_code = utils.service_test(self.ip, port, endpoint_name, data, self.triage_report, self.triage_level)

remove commented out code


# Health Check
response_status_code = self.utils.service_health_check(self.ip, port, self.triage_report, self.triage_level)
#self.assertEqual(response_status_code, 200)

remove commented out code


# Health Check
response_status_code = self.utils.service_health_check(self.ip, port, self.triage_report, self.triage_level)
#self.assertEqual(response_status_code, 200)

remove commented out code


# Health Check
response_status_code = self.utils.service_health_check(self.ip, port, self.triage_report, self.triage_level)
#self.assertEqual(response_status_code, 200)

remove commented out code


# Health Check
response_status_code = self.utils.service_health_check(self.ip, port, self.triage_report, self.triage_level)
#self.assertEqual(response_status_code, 200)

remove commented out code


# Health Check
response_status_code = self.utils.service_health_check(self.ip, port, self.triage_report, self.triage_level)
#self.assertEqual(response_status_code, 200)

remove commented out code

@alexsin368 alexsin368 left a comment

Remove any commented out code
