[WIP] First Draft of OPEA Triage Tool #1185

louie-tsai · 2024-11-23T00:35:57Z

Description

To help customers debug OPEA issues, this PR adds a Triage Tool that runs tests and gathers the information needed for debugging.

This first draft covers ChatQnA only.

How to Use:
python Triage.py ChatQnA_Xeon.json
Only the JSON file needs to change for a different architecture such as Gaudi.

The Triage Tool runs some simple tests, including:

health_check
microservice tests
statistics
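
A minimal sketch of how a health check followed by a single microservice test could be wired together, using only the standard library. The `/v1/health_check` path, the function names, and the injectable `status_fn` parameter are assumptions for illustration and are not taken from Triage.py.

```python
import json
import urllib.error
import urllib.request


def http_status(url, payload=None, timeout=5.0):
    """Return the HTTP status code for a GET, or a JSON POST when payload is given.

    Returns None when the service cannot be reached at all.
    """
    data = None
    headers = {}
    if payload is not None:
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    request = urllib.request.Request(url, data=data, headers=headers)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code  # the service answered, but with an error status
    except (urllib.error.URLError, OSError):
        return None  # connection refused, timeout, DNS failure, ...


def triage_service(base_url, port, endpoint, payload, status_fn=http_status):
    """Run a health check, then one test request, and report both results."""
    # Hypothetical health-check path; adjust to whatever the service exposes.
    health = status_fn(f"{base_url}:{port}/v1/health_check")
    result = {"port": port, "health": health, "test": None}
    if health == 200:
        # Only exercise the endpoint when the service reports healthy.
        result["test"] = status_fn(f"{base_url}:{port}{endpoint}", payload)
    return result
```

Passing the HTTP call in as `status_fn` keeps the triage logic testable without a running deployment.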

The following information is gathered after the tests above:

system info
env variable info
all docker logs for microservices
(optional) profiling results like vllm pytorch profiling
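
As a rough illustration of the system and environment-variable collection, the sketch below gathers both as plain rows. The function names and the `variables` mapping of name to "required" flag are assumptions; only the `machine`/`processor` fields and the `env`, `value`, `required` columns appear in the diff.

```python
import os
import platform


def collect_system_info():
    """Return (name, value) rows describing the host, via platform.uname()."""
    info = platform.uname()
    return [
        ("system", info.system),
        ("node", info.node),
        ("release", info.release),
        ("version", info.version),
        ("machine", info.machine),
        ("processor", info.processor),
    ]


def collect_env_info(variables, environ=None):
    """Return (env, value, required) rows; an unset variable has value None.

    variables maps each variable name to True (required) or False (optional).
    """
    environ = os.environ if environ is None else environ
    return [(name, environ.get(name), required) for name, required in variables.items()]


def missing_required(rows):
    """List the required variables that are unset or empty."""
    return [name for name, value, required in rows if required and not value]
```

Rows in this shape can be handed directly to `pd.DataFrame(rows, columns=[...])` as the tool already does.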

Report:
The plan is to have an HTML report.
Right now we have:

simple console output
simple HTML report with all docker logs embedded

console output
Below is the screenshot for console output.

html report with all docker logs embedded
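
One possible way to embed the docker logs in a standalone HTML page is sketched below. The `render_report` name, its arguments, and the use of collapsible `<details>` blocks are assumptions for illustration, not the layout the tool currently produces.

```python
import html


def render_report(title, results, logs):
    """Build a standalone HTML page.

    results: list of (service, status) pairs.
    logs: dict mapping a container name to its log text.
    """
    parts = [
        "<!DOCTYPE html>",
        "<html><head><meta charset='utf-8'>",
        f"<title>{html.escape(title)}</title></head><body>",
        f"<h1>{html.escape(title)}</h1>",
        "<table border='1'>",
        "<tr><th>service</th><th>status</th></tr>",
    ]
    for service, status in results:
        parts.append(
            f"<tr><td>{html.escape(service)}</td>"
            f"<td>{html.escape(str(status))}</td></tr>"
        )
    parts.append("</table>")
    for name, text in logs.items():
        # Escape the log text so that '<' or '&' in a log line cannot break the page.
        parts.append(
            f"<details><summary>{html.escape(name)}</summary>"
            f"<pre>{html.escape(text)}</pre></details>"
        )
    parts.append("</body></html>")
    return "\n".join(parts)
```

Escaping every log line matters here because container logs routinely contain angle brackets and ampersands.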

Other info:

All input, output, and port configurations live in the data.json file, so no code changes are needed for configuration or data.
Each example will be implemented as a separate Python class.
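
The sketch below illustrates the idea of keeping configuration in JSON and selecting it by class name. The `service` and `output` keys come from the diff; the `port`, `endpoint`, and `input` keys, all values for them, and the `ExampleTriage` base class are hypothetical.

```python
import json

# Hypothetical config: ports, endpoints, and inputs are made up for illustration.
SAMPLE_CONFIG = """
{
  "ChatQnA": [
    {"service": "dataprep", "port": 6007, "endpoint": "/v1/dataprep",
     "input": {"text": "hello"}, "output": "Data preparation succeeded"},
    {"service": "retrieval", "port": 7000, "endpoint": "/v1/retrieval",
     "input": {"text": "hello"}, "output": "retrieved_docs"}
  ]
}
"""


class ExampleTriage:
    """Base class: one subclass per OPEA example, named after its config key."""

    def __init__(self, config_text):
        data = json.loads(config_text)
        # The subclass name selects which section of the JSON file to use.
        self.services = data[type(self).__name__]

    def service_names(self):
        return [entry["service"] for entry in self.services]

    def expected_output(self, service):
        for entry in self.services:
            if entry["service"] == service:
                return entry["output"]
        raise KeyError(service)


class ChatQnA(ExampleTriage):
    """ChatQnA needs no extra logic in this sketch; the JSON carries the details."""


triage = ChatQnA(SAMPLE_CONFIG)
print(triage.service_names())
```

With this layout, supporting another architecture means supplying a different JSON file, and supporting another example means adding one small subclass.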

ToDo:

improve the HTML report
make sure all microservices are tested on both Xeon and Gaudi
move files into the right folders
apply the tool to all examples

Issues

n/a.

Type of change

List the type of change like below. Please delete options that are not relevant.

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds new functionality)
Breaking change (fix or feature that would break existing design and interface)
Others (enhancement, documentation, validation, etc.)

Dependencies

NA

Tests

Manual Testing

Signed-off-by: Tsai, Louie <[email protected]>

louie-tsai · 2024-11-25T06:48:03Z

a sample report zip file
ChatQnA_Xeon.json_11-25_06-34.zip

louie-tsai · 2024-11-27T19:43:33Z

[ToDo]: understand whether we need to debug RESTful API requests here or use telemetry

alexsin368 · 2024-11-27T23:01:07Z

ChatQnA/tests/ChatQnA_Xeon.json

+      "output": "Data preparation succeeded"
+    },
+    {
+      "service": "retrival",


Suggested change

-      "service": "retrival",
+      "service": "retrieval",

alexsin368 · 2024-11-27T23:01:44Z

ChatQnA/tests/Triage.py

+
+import requests
+unittest.TestLoader.sortTestMethodsUsing = None
+#unittest.TestLoader.sortTestMethodsUsing = lambda self, a, b: (a < b) - (a > b)


remove commented out code

alexsin368 · 2024-11-27T23:02:00Z

ChatQnA/tests/Triage.py

+        p = subprocess.Popen(cmd, stdout=subprocess.PIPE, shell=True)
+        (output, err) = p.communicate()
+        p_status = p.wait()
+        #print("Command exit status/return code : ", p_status)


remove commented out code
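
A related note on this hunk: `communicate()` already waits for the process to exit, so the later `wait()` call is redundant and the exit status is available as `p.returncode`. A minimal sketch of an equivalent helper built on `subprocess.run`, assuming a hypothetical `run_cmd` name and an argument list in place of `shell=True`:

```python
import subprocess
import sys


def run_cmd(args, timeout=60):
    """Run a command and return (exit_status, decoded_output)."""
    completed = subprocess.run(
        args,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # fold stderr into the captured output
        text=True,                 # decode bytes to str
        timeout=timeout,
        check=False,               # report the status instead of raising
    )
    return completed.returncode, completed.stdout


# Example: run the current Python interpreter as a stand-in for a real command.
status, output = run_cmd([sys.executable, "-c", "print('ok')"])
```

Passing a list avoids shell quoting problems when arguments come from the JSON file.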

alexsin368 · 2024-11-27T23:02:17Z

ChatQnA/tests/Triage.py

+            print("json load failed: " + self.filename)
+            pass
+    for i in data[class_name]:
+        #print(i["service"])


remove commented out code

alexsin368 · 2024-11-27T23:02:44Z

ChatQnA/tests/Triage.py

+        line= str(line)[2:-1]
+        row = line.split()
+        if columns != []:
+            #print(row)


remove commented out code
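
Also worth noting on this hunk: `str(line)[2:-1]` strips the `b'...'` wrapper from the bytes repr rather than decoding it, and splitting on whitespace breaks values that contain spaces, such as a status of `Up 2 hours`. A sketch of an alternative, assuming the output is captured with a tab-separated `--format` template; the `parse_docker_ps` name and the sample container names are hypothetical.

```python
def parse_docker_ps(raw, columns=("NAMES", "PORTS", "STATUS")):
    """Turn tab-separated `docker ps --format` output into a list of dicts."""
    rows = []
    for line in raw.decode("utf-8").splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        rows.append(dict(zip(columns, fields)))
    return rows


# Hypothetical captured output of:
#   docker ps --format "{{.Names}}\t{{.Ports}}\t{{.Status}}"
sample = (
    b"tei-embedding-server\t0.0.0.0:6006->80/tcp\tUp 2 hours\n"
    b"redis-vector-db\t0.0.0.0:6379->6379/tcp\tUp 2 hours\n"
)
rows = parse_docker_ps(sample)
```

The resulting list of dicts can be passed straight to `pd.DataFrame(rows)` to keep the existing `PORTS` lookups.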

alexsin368 · 2024-11-27T23:02:55Z

ChatQnA/tests/Triage.py

+
+    import pandas as pd
+    df = pd.DataFrame(rows,columns=columns )
+    #print(df)


remove commented out code

alexsin368 · 2024-11-27T23:03:02Z

ChatQnA/tests/Triage.py

+  def update_docker_report(self, port, key, value):
+
+    df = self.docker_report_df
+    #docker_index = df.loc[df['PORTS'] == port]


remove commented out code

alexsin368 · 2024-11-27T23:03:28Z

ChatQnA/tests/Triage.py

+        self.triage_level = triage_level
+        self.triage_report = triage_report
+        self.ip = "http://0.0.0.0"
+        #print(command_line_param)


remove commented out code

alexsin368 · 2024-11-27T23:03:36Z

ChatQnA/tests/Triage.py

+        import pandas as pd
+        columns = ['env','value','required']
+        df = pd.DataFrame(rows,columns=columns )
+        #print(df)


remove commented out code

alexsin368 · 2024-11-27T23:03:44Z

ChatQnA/tests/Triage.py

+        rows.append(['machine', system_info.machine])
+        rows.append(['processor', system_info.processor])
+        df = pd.DataFrame(rows,columns=columns )
+        #print(df)


remove commented out code

alexsin368 · 2024-11-27T23:03:50Z

ChatQnA/tests/Triage.py

+        #print(df)
+        self.triage_report.system_info_df = df
+
+        #status, output = RunCmd().run("docker ps")


remove commented out code

alexsin368 · 2024-11-27T23:04:05Z

ChatQnA/tests/Triage.py

+
+        # Health Check
+        response_status_code = self.utils.service_health_check(self.ip, port, self.triage_report, self.triage_level)
+        #self.assertEqual(response_status_code, 200)


remove commented out code

alexsin368 · 2024-11-27T23:04:10Z

ChatQnA/tests/Triage.py

+        #self.assertEqual(response_status_code, 200)
+
+        # Testing
+        #response_status_code  = utils.service_test(self.ip, port, endpoint_name, data, self.triage_report, self.triage_level)


remove commented out code

alexsin368 · 2024-11-27T23:04:16Z

ChatQnA/tests/Triage.py

+
+        # Health Check
+        response_status_code = self.utils.service_health_check(self.ip, port, self.triage_report, self.triage_level)
+        #self.assertEqual(response_status_code, 200)


remove commented out code

alexsin368 · 2024-11-27T23:04:26Z

ChatQnA/tests/Triage.py

+
+        # Health Check
+        response_status_code = self.utils.service_health_check(self.ip, port, self.triage_report, self.triage_level)
+        #self.assertEqual(response_status_code, 200)


remove commented out code

alexsin368 · 2024-11-27T23:04:33Z

ChatQnA/tests/Triage.py

+
+        # Health Check
+        response_status_code = self.utils.service_health_check(self.ip, port, self.triage_report, self.triage_level)
+        #self.assertEqual(response_status_code, 200)


remove commented out code

alexsin368 · 2024-11-27T23:04:39Z

ChatQnA/tests/Triage.py

+
+        # Health Check
+        response_status_code = self.utils.service_health_check(self.ip, port, self.triage_report, self.triage_level)
+        #self.assertEqual(response_status_code, 200)


remove commented out code

alexsin368 · 2024-11-27T23:04:44Z

ChatQnA/tests/Triage.py

+
+        # Health Check
+        response_status_code = self.utils.service_health_check(self.ip, port, self.triage_report, self.triage_level)
+        #self.assertEqual(response_status_code, 200)


remove commented out code

alexsin368

Remove any commented out code

louie-tsai requested a review from lvliang-intel as a code owner November 23, 2024 00:35

louie-tsai requested review from preethivenkatesh, yinghu5 and letonghan November 23, 2024 00:39

First Draft of OPEA Triage Tool

c90eb1b

Signed-off-by: Tsai, Louie <[email protected]>

louie-tsai force-pushed the triagetool branch from 4f82faa to c90eb1b Compare November 25, 2024 06:44

alexsin368 suggested changes Nov 27, 2024
