Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Scpi psu reconnect #726

Merged
merged 20 commits into from
Sep 9, 2024
Merged

Scpi psu reconnect #726

merged 20 commits into from
Sep 9, 2024

Conversation

agthomas-uc
Copy link
Contributor

I have implemented a robust reconnecting agent to re-establish the PsuInterface, clear the PSU's event register, and begin reading out data. Let me know if I can help work with Daniel to get these PRs synced up. As my code is ignorant to a direct ethernet connection, there will be a compatibility issue around self.port which will require me to modify _initialize_module agent.py in the same way Daniel did to the main body of in his PR.

Description

Implemented in the agent with one new driver function, these changes prevent the agent from crashing due to any disconnect after initialization and from block errors due disconnects during data readout. These changes attempt to reconnect the agent on a 5 second cadence and log each attempt and whether it was successful. Upon reconnecting, the event register is cleared to ensure previously MEAS? commands are not outputted to incorrect channels.

Long form description (copied from simonsobs/daq-discussions#106).

  1. As Brian suspected, implementing a reliable reconnection required clearing the PSU. To do so I created a clear command in the driver which is called once the agent reconnects. This ensures no queries from when the PSU was disconnected are sent and no bad measurements are received. Attempting to query leftover data from a disconnect can return 2 or 3 values in one query separated by \n characters.
  • I initially resolved these bad measurements by adding .splitlines() and attempting to use this data. However, reading data with a newline character creates a driver bug where measuring ch1 voltage returns 0 (integer zero), measuring ch1 current returns ch1 voltage, measuring ch2 voltage returns ch1 current, ..., measuring ch3 current returns ch3 voltage, and ch3 current is unrecoverable. Repeating this process does not turn additional channels to zero and instances of two \n characters still only "cycled" the data by 1. Querying the power as well did not stop the cycle. I tried many iterations of driver code and read the 2230G user's manual but did not determine why this behavior would occur.
  1. The block structure error would occur when a disconnect (socket.timeout()) caused the data acquisition for loop to break without reading all values to data. Since psu is set to None when a disconnect occurs, I prevent block structure errors by checking that psu is still initialized before publishing.
  • I initially implemented a more heavy handed solution comparing the length of data['data'] to twice the number of channels. However, I believe checking if psu is initialized accomplishes the same goal more efficiently. I can revert if I missed a case.
  1. The initialize module sees two types of Exceptions. When the module fails to initialize because it is still disconnected, a socket.timeout() occurs after 5 seconds (as set in prologix_interface.py#L10). If the connection has not been established within one minute, OSError [Errno 113] "No route to host" occurs. I deal with both by excepting any Exception to enable reconnect to attempt indefinitely.

Motivation and Context

This change is required because our PSU disconnects on a regular basis, between hourly and weekly, and previously needed to be manually power cycled to be restored. This prevented us from using the PSU to heat the focal plane and has been a standing issue in our lab for over 2 years.

This addresses one agent listed here socs/issues/721 and the issue I created to break this out socs/issues/725.

How Has This Been Tested?

I tested routinely during the creation of this branch and documented issues in simonsobs/daq-discussions#69 and (simonsobs/daq-discussions#75. After attempting to resolve these issues, I disconnected and reconnected the ethernet cable irregularly over a 20 minute period. 11/11 Reconnections were successful as shown below.

On the final test I made sure to disconnect the agent during the reconnection process multiple times and disconnected and reconnected the ethernet as fast as possible for 5 seconds. As I figured, this seemed to severely confuse the device. However, it recovered after ~2.5 minutes.
image

I am now running a duration test to ensure the connection stays reliable for an extended period of time and have remained connected for 5 days. I will update this PR if any disconnects and automatic reconnects occur.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality), the driver command to CLR may carry additional functionality.
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.

Copy link
Member

@BrianJKoopman BrianJKoopman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great, thanks @agthomas-uc! And thanks for including the results of your thorough testing. A few suggestions below.

As far as coordinating with #715, currently that adds a separate class for the direct Ethernet connection, so I think a merge would be quite minimal, though we'd have to re-implement this reconnection logic in that class as well. I pointed out this PR in a recent re-review there, so maybe we can see what @CAWNoodle says?

socs/agents/scpi_psu/drivers.py Outdated Show resolved Hide resolved
socs/agents/scpi_psu/agent.py Outdated Show resolved Hide resolved
socs/agents/scpi_psu/agent.py Outdated Show resolved Hide resolved
socs/agents/scpi_psu/agent.py Outdated Show resolved Hide resolved
socs/agents/scpi_psu/agent.py Outdated Show resolved Hide resolved
socs/agents/scpi_psu/agent.py Outdated Show resolved Hide resolved
socs/agents/scpi_psu/agent.py Outdated Show resolved Hide resolved
socs/agents/scpi_psu/agent.py Outdated Show resolved Hide resolved
socs/agents/scpi_psu/agent.py Outdated Show resolved Hide resolved
socs/agents/scpi_psu/agent.py Outdated Show resolved Hide resolved
I think this snuck in during a git merge, looks like a 'git' command got added
to the comment in 4567a0b.
Copy link
Member

@BrianJKoopman BrianJKoopman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The updates look good, thanks for all the changes! I made one small fix to the drivers imports, which were causing most of the checks to fail. Also one small comment edit. Will merge shortly. Thanks again!

@BrianJKoopman BrianJKoopman merged commit 7d2f158 into main Sep 9, 2024
5 checks passed
@BrianJKoopman BrianJKoopman deleted the scpi_psu_reconnect branch September 9, 2024 17:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants