Thermal Control Test Plan
Overview
This test plan verifies the thermal control feature on a SONiC switch DUT. The feature has three major functions: FAN status monitoring, thermal status monitoring, and thermal control policy management.
- "FAN status monitor" reads FAN status via the platform API every 60 seconds and saves it to the redis database. Users can fetch FAN status with the command `show platform fanstatus`.
- "Thermal status monitor" reads thermal status via the platform API every 60 seconds and saves it to the redis database. Users can fetch thermal status with the command `show platform temperature`.
- The thermal control policy is defined in a JSON file and loaded by the thermal control daemon in the pmon docker. The daemon collects thermal information and evaluates the policy conditions against it. When the conditions of a policy match, the related thermal actions are triggered.
A more detailed design and function description can be found in this document.
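The condition-to-action flow described above can be sketched roughly as follows. This is a minimal illustration, not the daemon's implementation: the `Info` snapshot, the `CONDITIONS` table and `triggered_actions` are hypothetical names, and `psu.any.absence` is assumed by analogy with the condition types that appear in the policy file later in this plan.

```python
from typing import Callable, Dict, List

# Hypothetical snapshot of presence information collected in one cycle,
# e.g. {"fan": [True, True], "psu": [True, False]}.
Info = Dict[str, List[bool]]

# Illustrative predicates keyed by condition type; not the daemon's code.
CONDITIONS: Dict[str, Callable[[Info], bool]] = {
    "fan.any.absence": lambda info: not all(info["fan"]),
    "fan.all.presence": lambda info: all(info["fan"]),
    "psu.any.absence": lambda info: not all(info["psu"]),  # assumed name
    "psu.all.presence": lambda info: all(info["psu"]),
}


def triggered_actions(policies: List[dict], info: Info) -> List[dict]:
    """Return the actions of every policy whose conditions all match."""
    actions: List[dict] = []
    for policy in policies:
        if all(CONDITIONS[c["type"]](info) for c in policy["conditions"]):
            actions.extend(policy["actions"])
    return actions


# Trimmed-down policies in the same shape as the policy JSON file.
POLICIES = [
    {
        "name": "any FAN absence",
        "conditions": [{"type": "fan.any.absence"}],
        "actions": [{"type": "fan.all.set_speed", "speed": "100"}],
    },
    {
        "name": "all FAN and PSU presence",
        "conditions": [{"type": "fan.all.presence"}, {"type": "psu.all.presence"}],
        "actions": [{"type": "fan.all.set_speed", "speed": "65"}],
    },
]
```

A policy fires only when every one of its conditions holds, which is why a missing PSU prevents the "all FAN and PSU presence" policy from matching in this sketch.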
Scope
This test targets a running SONiC system with a fully functioning configuration. Its purpose is functional testing of thermal control on a SONiC system: verifying that FAN status and thermal status are shown to the user and that the correct actions are executed when predefined thermal policy conditions match.
Test Structure
Setup Configuration
Since this feature is not related to traffic or network topology, any existing topology can be used for this test.
Ansible and Pytest
This test plan extends the platform test infrastructure with additional cases and reuses the current platform tests in SONiC-mgmt. New pytest test cases will be added to test_platform_info.py. In addition, valid_policy.json, invalid_format_policy.json and invalid_value_policy.json will be added as thermal policy configuration files for testing.
Valid policy file
In the valid policy file, three policies are defined: "any PSU absence", "any FAN absence", and "all FAN and PSU presence".
- In the case of "any PSU absence" or "any FAN absence", the expected behavior based on the design and implementation is that FAN speed is set to 100% and the thermal control algorithm is disabled.
- In the case of "all FAN and PSU presence", the thermal control algorithm is enabled and FAN speed is set to 65%.
The valid_policy.json file content is like:
{
    "thermal_control_algorithm": {
        "run_at_boot_up": "false",
        "fan_speed_when_suspend": "60"
    },
    "info_types": [
        {
            "type": "fan_info"
        },
        {
            "type": "psu_info"
        }
    ],
    "policies": [
        {
            "name": "any PSU absence",
            "conditions": [
                {
                    "type": "psu.any.absence"
                }
            ],
            "actions": [
                {
                    "type": "thermal_control.control",
                    "status": "false"
                },
                {
                    "type": "fan.all.set_speed",
                    "speed": "100"
                }
            ]
        },
        {
            "name": "any FAN absence",
            "conditions": [
                {
                    "type": "fan.any.absence"
                }
            ],
            "actions": [
                {
                    "type": "thermal_control.control",
                    "status": "false"
                },
                {
                    "type": "fan.all.set_speed",
                    "speed": "100"
                }
            ]
        },
        {
            "name": "all FAN and PSU presence",
            "conditions": [
                {
                    "type": "fan.all.presence"
                },
                {
                    "type": "psu.all.presence"
                }
            ],
            "actions": [
                {
                    "type": "thermal_control.control",
                    "status": "true"
                },
                {
                    "type": "fan.all.set_speed",
                    "speed": "65"
                }
            ]
        }
    ]
}
Invalid format policy file
The content of invalid_format_policy.json is not valid JSON at all. A file containing only the text "invalid" is sufficient for this test.
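To illustrate why the text "invalid" is enough, the sketch below shows Python's standard `json` module rejecting it, and a loader that reports the error instead of propagating it. `load_policy_text` is a hypothetical helper for illustration and is not taken from the thermal control daemon.

```python
import json
from typing import Optional


def load_policy_text(text: str) -> Optional[dict]:
    """Parse policy text; return None instead of raising on malformed JSON."""
    try:
        policy = json.loads(text)
    except json.JSONDecodeError as err:
        # A real daemon would log this; printing keeps the sketch simple.
        print(f"failed to load policy: {err}")
        return None
    # A bare JSON scalar or list is also not a usable policy document.
    return policy if isinstance(policy, dict) else None
```

Returning `None` rather than raising mirrors the behavior this test plan expects: a bad policy file should disable policy handling without terminating the process.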
Invalid value policy file
In invalid_value_policy.json, the file content contains a negative value for the target speed. This file cannot cover every invalid value because there are too many possibilities. Its main purpose is to verify that the thermal control daemon does not crash while loading such an invalid configuration file. Other negative test cases are already covered by unit tests. The content of invalid_value_policy.json is:
{
    "info_types": [
        {
            "type": "fan_info"
        },
        {
            "type": "psu_info"
        }
    ],
    "policies": [
        {
            "name": "any PSU absence",
            "conditions": [
                {
                    "type": "psu.any.absence"
                }
            ],
            "actions": [
                {
                    "type": "fan.all.set_speed",
                    "speed": "-1"
                }
            ]
        }
    ]
}
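A value check of the kind this file is meant to exercise might look like the sketch below. It is an assumption-laden illustration: `invalid_speeds` is a hypothetical helper, and the accepted range of 0 to 100 is assumed from the speeds being percentages, not taken from the daemon.

```python
from typing import List, Tuple


def invalid_speeds(policy: dict) -> List[Tuple[str, str]]:
    """Collect (policy name, raw speed) pairs whose speed is not an integer in 0..100."""
    problems = []
    for item in policy.get("policies", []):
        for action in item.get("actions", []):
            if "speed" not in action:
                continue  # not a speed-setting action
            raw = action["speed"]
            try:
                value = int(raw)
            except (TypeError, ValueError):
                problems.append((item["name"], raw))
                continue
            if not 0 <= value <= 100:  # assumed valid percentage range
                problems.append((item["name"], raw))
    return problems
```

Applied to a policy shaped like invalid_value_policy.json, the helper reports the "-1" speed and nothing else.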
Test Cases
Show FAN Status Test
Show FAN status test verifies that all FAN related information is shown correctly via `show platform fanstatus`.
Procedure
- Testbed setup.
- Mock random data for "presence", "speed", "status", "target_speed", "led status".
- Issue command `show platform fanstatus`.
- Record the command output.
- Verify that command output matches the mock data.
- Restore mock data.
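The "record and verify" steps could be sketched as parsing the tabular command output and comparing it with the mock data. The column names and sample rows below are hypothetical and do not reproduce the actual output of the command; the parser simply assumes a header line, a dashed separator line, and columns separated by two or more spaces.

```python
import re
from typing import Dict, List


def parse_table(output: str) -> List[Dict[str, str]]:
    """Turn a header/separator/rows text table into a list of row dicts."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    header = re.split(r"\s{2,}", lines[0])
    rows = []
    for line in lines[2:]:  # lines[1] is the dashed separator
        cells = re.split(r"\s{2,}", line)
        rows.append(dict(zip(header, cells)))
    return rows


# Hypothetical output layout, for illustration only.
SAMPLE_OUTPUT = "\n".join([
    "FAN   Speed  Presence     Status",
    "----  -----  -----------  ------",
    "fan1  60%    Present      OK",
    "fan2  0%     Not Present  N/A",
])

# Hypothetical mock data the test would have injected beforehand.
MOCK_DATA = [
    {"FAN": "fan1", "Speed": "60%", "Presence": "Present", "Status": "OK"},
    {"FAN": "fan2", "Speed": "0%", "Presence": "Not Present", "Status": "N/A"},
]
```

Splitting on runs of two or more spaces keeps multi-word cells such as "Not Present" intact, but it would misalign rows containing empty cells, so a real test may need a stricter parser.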
Show Thermal Status Test
Show thermal status test verifies that all thermal related information is shown correctly via `show platform temperature`.
Procedure
- Testbed setup.
- Fill mock data for "temperature", "high_threshold", "high_critical_threshold".
- Issue command `show platform temperature`.
- Record the command output.
- Verify that command output matches the mock data.
- Restore mock data.
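One way to fill the mock fields is sketched below. `make_mock_thermal` is a hypothetical helper; the value ranges and the ordering temperature < high_threshold < high_critical_threshold are assumptions chosen to keep the mock data plausible, not requirements stated by this plan.

```python
import random
from typing import Dict


def make_mock_thermal(rng: random.Random) -> Dict[str, float]:
    """Build one mock thermal reading with ordered thresholds (assumed ranges)."""
    temperature = round(rng.uniform(20.0, 60.0), 1)
    # Each threshold sits at least 5 degrees above the previous value.
    high_threshold = round(temperature + rng.uniform(5.0, 20.0), 1)
    high_critical_threshold = round(high_threshold + rng.uniform(5.0, 20.0), 1)
    return {
        "temperature": temperature,
        "high_threshold": high_threshold,
        "high_critical_threshold": high_critical_threshold,
    }
```

Passing a seeded `random.Random` instance makes a failing run reproducible, since the same seed regenerates the same mock values.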
FAN Test
FAN test verifies that the proper action is taken for the following conditions: FAN absence, FAN over speed, and FAN under speed.
Procedure
- Testbed setup.
- Copy valid_policy.json to the pmon docker and back up the original policy file.
- Restart the thermal control daemon to reload the policy configuration file. Verify that the thermal control algorithm is disabled and FAN speed is set to 60% according to the configuration file.
- Make mock data: first FAN absent.
- Wait for at least 65 seconds. Verify that the target speed of all FANs is set to 100% according to valid_policy.json. Verify that there is a warning log for FAN absence.
- Make mock data: first FAN present.
- Wait for at least 65 seconds. Verify that the target speed of all FANs is set to 65% according to valid_policy.json. Verify that there is a notice log for FAN presence.
- Make mock data: first FAN speed is below the threshold (speed < target speed), second FAN speed is above the threshold (speed > target speed).
- Wait for at least 65 seconds. Verify that the LED turns red for the first and second FAN. Verify that there is a warning log for under speed and a warning log for over speed.
- Make mock data: first and second FAN speeds recover to normal.
- Wait for at least 65 seconds. Verify that the LED turns green for the first and second FAN. Verify that there are two notice logs for speed recovery.
- Restore the original policy file. Restore mock data.
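The under speed and over speed checks in the procedure can be pictured with the sketch below. The tolerance of 20 percentage points is a placeholder assumption, since this plan does not state the threshold, and both function names are hypothetical.

```python
def classify_fan_speed(speed: int, target_speed: int, tolerance: int = 20) -> str:
    """Return 'under', 'over', or 'normal' for a speed/target pair in percent.

    `tolerance` is an assumed allowed deviation, not a value from the design.
    """
    if speed < target_speed - tolerance:
        return "under"
    if speed > target_speed + tolerance:
        return "over"
    return "normal"


def expected_led(speed: int, target_speed: int, tolerance: int = 20) -> str:
    """LED color the procedure expects: red on any deviation, green otherwise."""
    state = classify_fan_speed(speed, target_speed, tolerance)
    return "green" if state == "normal" else "red"
```

With a target of 65%, a mocked speed of 30% would count as under speed and 95% as over speed under this assumed tolerance, so both FANs would be expected to show a red LED.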
Note: We wait at least 65 seconds because the thermal policy runs every 60 seconds according to the design.
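Instead of a fixed sleep, a test could poll for the expected state until the 65 second budget runs out. The helper below is a minimal sketch; `poll_until` is a hypothetical name and is not an existing utility from the test infrastructure.

```python
import time
from typing import Callable


def poll_until(timeout: float, interval: float, condition: Callable[[], bool]) -> bool:
    """Call `condition` every `interval` seconds until it is true or `timeout` expires."""
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A call such as `poll_until(65, 5, check)` returns as soon as `check()` succeeds, which keeps passing runs short while still allowing one full 60 second policy cycle plus margin.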
PSU Absence Test
PSU absence test verifies that when any PSU is absent, the speed of all FANs is set to the proper value according to the policy file.
Procedure
- Testbed setup.
- Copy valid_policy.json to the pmon docker and back up the original policy file.
- Restart the thermal control daemon to reload the policy configuration file.
- Turn off one PSU.
- Wait for at least 65 seconds. Verify that the target speed of all FANs is set to 100% according to valid_policy.json.
- Turn on that PSU and turn off the other PSU.
- Wait for at least 65 seconds. Verify that the target speed of all FANs is still 100% according to valid_policy.json.
- Turn on all PSUs.
- Wait for at least 65 seconds. Verify that the target speed of all FANs is set to 65% according to valid_policy.json.
- Restore the original policy file.
Note: We wait at least 65 seconds because the thermal policy runs every 60 seconds according to the design. For a switch that has only one PSU, steps 6 and 7 are skipped.
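The expected values in this procedure can be summarized as a small table of PSU states. The sketch assumes a two-PSU switch with all FANs present; `expected_target_speed` and `TWO_PSU_SEQUENCE` are hypothetical names, while the 100% and 65% values come from valid_policy.json.

```python
from typing import List


def expected_target_speed(psu_present: List[bool]) -> int:
    """Target FAN speed (percent) implied by valid_policy.json for a PSU state.

    Assumes all FANs are present, so only PSU presence decides the result.
    """
    return 65 if all(psu_present) else 100


# (step description, presence of each PSU, expected target speed in percent)
TWO_PSU_SEQUENCE = [
    ("turn off one PSU", [False, True], 100),
    ("turn it back on and turn off the other", [True, False], 100),
    ("turn on all PSUs", [True, True], 65),
]
```

A table like this maps naturally onto a parametrized pytest case, with one parameter set per row.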
Invalid Policy Format Load Test
Invalid policy format test verifies that the thermal control daemon does not exit when loading an invalid_format_policy.json file. The thermal control daemon cannot apply any thermal policy in this case, but the FAN monitor and thermal monitor should still work.
Procedure
- Testbed setup.
- Copy invalid_format_policy.json to the pmon docker and back up the original policy file.
- Restart the thermal control daemon to reload the policy configuration file.
- Verify that the thermal control daemon starts up. Verify that an error log about loading the invalid policy file is emitted.
- Restore the original policy file.
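The back up, replace, and restore pattern used by these procedures can be expressed as a context manager so that the original file is restored even when a check fails. This sketch works on local files only; a real test would perform the equivalent copy operations inside the pmon docker, and `swapped_policy` and `demo` are hypothetical names.

```python
import contextlib
import os
import shutil
import tempfile


@contextlib.contextmanager
def swapped_policy(original: str, replacement: str):
    """Temporarily replace `original` with `replacement`, then restore it."""
    backup = original + ".bak"
    shutil.copy2(original, backup)  # back up the original policy file
    try:
        shutil.copy2(replacement, original)
        yield original
    finally:
        shutil.move(backup, original)  # restore even if the test body fails


def demo() -> tuple:
    """Exercise the helper on throwaway files in a temporary directory."""
    with tempfile.TemporaryDirectory() as tmp:
        original = os.path.join(tmp, "policy.json")
        replacement = os.path.join(tmp, "invalid_format_policy.json")
        with open(original, "w") as f:
            f.write("{}")
        with open(replacement, "w") as f:
            f.write("invalid")
        with swapped_policy(original, replacement):
            with open(original) as f:
                during = f.read()
        with open(original) as f:
            after = f.read()
    return during, after
```

Inside the `with` block the policy path holds the test content, and afterwards it holds the original content again, which is the invariant the "Restore the original policy file" step relies on.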
Invalid Policy Value Load Test
Invalid policy value test verifies that the thermal control daemon does not exit when loading an invalid_value_policy.json file. The thermal control daemon cannot apply any thermal policy in this case, but the FAN monitor and thermal monitor should still work.
Procedure
- Testbed setup.
- Copy invalid_value_policy.json to the pmon docker and back up the original policy file.
- Restart the thermal control daemon to reload the policy configuration file.
- Verify that the thermal control daemon starts up. Verify that an error log about loading the invalid policy file is emitted.
- Restore the original policy file.