Scylla Tests Hang: Debugging SSH Process Issues

by Omar Yusuf

Introduction

Hey guys! We've been diving deep into some tricky issues with our Scylla tests and SCT (Scylla Cluster Tests) integration, and we've hit a snag where tests are hanging even after they've supposedly finished running successfully. This is super frustrating, because it leaves processes lingering and makes our whole testing pipeline less efficient. So, let's break down what's happening, why it's happening, and how we can fix it. We'll look at the specific threads and processes that stay behind, dig into the SSH connection handling that appears to be keeping them alive, and go through the most likely root causes and fixes. Scylla tests are a critical part of our development and release process, so getting this resolved quickly matters for the quality of everything we ship.

The Problem: Hanging Tests and Lingering Processes

So, what's the deal? After a test run completes without any apparent errors, the job just… hangs. When we peek under the hood, we see processes still kicking around that shouldn't be: threads from modules like multiprocessing.queues, concurrent.futures.thread, sdcm.remote.libssh2_client, socketserver, and cassandra.io.libevreactor. This is a major pain because it ties up resources and prevents subsequent tests from running smoothly. One example of a job exhibiting this behavior is the run-restore-aws-multi-dc job. Beyond delaying the pipeline, these lingering processes leave us unsure about the actual state of the system after the tests, eat resources on the runner, and can interfere with later runs, producing unpredictable results. So finding and fixing the root cause is essential for keeping the testing environment reliable.
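
As a quick way to see what's keeping the interpreter alive, a small diagnostic like the one below can be run at the end of a test. This is an illustrative sketch only, not part of SCT, and dump_leftover_threads is a hypothetical helper: it lists every thread that's still alive and prints a Python-level stack for each, which is how traces like the ones later in this article can be collected.

import faulthandler
import sys
import threading

def dump_leftover_threads() -> None:
    # Hypothetical helper: list every thread still alive besides the main
    # thread, then dump a Python-level stack trace for all of them.
    for thread in threading.enumerate():
        if thread is threading.main_thread():
            continue
        print(f"still alive: name={thread.name!r} daemon={thread.daemon}", file=sys.stderr)
    faulthandler.dump_traceback(file=sys.stderr, all_threads=True)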

Historical Context: It Used to Work!

Interestingly, this wasn't always the case. We've seen the same tests running perfectly fine on older versions of SCT using sct-runner (version 1.9) and Python 3.12, and we have an example of such a successful run (though even there some processes were left over, which we'll get to). That history gives us a crucial clue: the issue is most likely a regression introduced by a more recent change in our tools or dependencies. By comparing the configurations and dependencies of the working and non-working setups, we can narrow down the potential causes and focus the debugging effort, and it's also a reminder of why regression testing matters in the first place.

Pinpointing the Culprits: concurrent.futures.thread and sdcm.remote.libssh2_client

Okay, so we know it used to work, and now it doesn't. What's changed? Looking at the logs, the primary suspects are the concurrent.futures.thread and sdcm.remote.libssh2_client threads. Older successful runs also left some processes behind (multiprocessing.queues, socketserver, and cassandra.io.libevreactor), but the current hangs seem to be specifically tied to these two. They handle concurrent task execution and remote SSH connections respectively, so their involvement points at how worker threads are managed or how SSH sessions are torn down. Let's dive into the stack traces to understand why.

Examining concurrent.futures.thread

The stack trace for a thread from concurrent.futures.thread looks like this:

File: "/usr/local/lib/python3.13/threading.py", line 1014, in _bootstrap
  self._bootstrap_inner()
File: "/usr/local/lib/python3.13/threading.py", line 1043, in _bootstrap_inner
  self.run()
File: "/usr/local/lib/python3.13/threading.py", line 994, in run
  self._target(*self._args, **self._kwargs)
File: "/usr/local/lib/python3.13/concurrent/futures/thread.py", line 93, in _worker
  work_item.run()
File: "/usr/local/lib/python3.13/concurrent/futures/thread.py", line 59, in run
  result = self.fn(*self.args, **self.kwargs)
File: "/home/ubuntu/scylla-cluster-tests/sdcm/sct_events/decorators.py", line 26, in wrapper
  return func(*args, **kwargs)
File: "/home/ubuntu/scylla-cluster-tests/sdcm/utils/remote_logger.py", line 98, in _journal_thread
  self._retrieve(since=read_from_timestamp)
File: "/home/ubuntu/scylla-cluster-tests/sdcm/utils/remote_logger.py", line 127, in _retrieve
  self._remoter.run(
File: "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 690, in run
  result = _run()
File: "/home/ubuntu/scylla-cluster-tests/sdcm/utils/decorators.py", line 79, in inner
  return func(*args, **kwargs)
File: "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 681, in _run
  return self._run_execute(cmd, timeout, ignore_status, verbose, new_session, watchers)
File: "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 615, in _run_execute
  result = connection.run(**command_kwargs)
File: "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 605, in run
  self._process_output(watchers, encoding, stdout, stderr, reader, timeout,
File: "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 464, in _process_output
  sleep(timeout_read_data_chunk or 0.5)

This trace points to a thread that's part of a ThreadPoolExecutor and is stuck in a sleep call within the _process_output function of the sdcm.remote.libssh2_client module. The worker is running remote_logger's _journal_thread, which repeatedly runs a command over SSH and polls for its output; here it is waiting for data that apparently never arrives or never completes. Since executor workers are regular (non-daemon) threads that the interpreter waits for at shutdown, a worker that never returns is enough on its own to keep the whole process from exiting. In other words, the thread is sleeping while waiting for a condition that is never going to be met, and understanding how the thread pool and the SSH client interact is the key to finding out why.
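
To make the shape of the problem concrete, here's a rough sketch of that kind of polling loop. It is illustrative only, with made-up parameter names, and is not the real _process_output: the loop exits only when the reader thread finishes, so if the reader never finishes (for example because the SSH channel is never closed), the worker sleeps in half-second increments forever.

import queue
import threading
from time import sleep

def process_output_sketch(reader_thread: threading.Thread,
                          stdout_queue: "queue.Queue[bytes]",
                          timeout_read_data_chunk: float = 0.5) -> None:
    # Poll until the reader thread finishes, draining whatever output has
    # arrived so far. If the reader never exits, neither does this loop.
    while reader_thread.is_alive():
        while not stdout_queue.empty():
            print(stdout_queue.get())              # stand-in for real chunk handling
        sleep(timeout_read_data_chunk or 0.5)      # the sleep() seen in the stack trace
    while not stdout_queue.empty():                # final drain once the reader is done
        print(stdout_queue.get())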

Diving into sdcm.remote.libssh2_client

Let's look at the sdcm.remote.libssh2_client side of things. We have a thread dump for Thread-305, which is an SSHReaderThread:

File: "/usr/local/lib/python3.13/threading.py", line 1014, in _bootstrap
  self._bootstrap_inner()
File: "/usr/local/lib/python3.13/threading.py", line 1043, in _bootstrap_inner
  self.run()
File: "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 77, in run
  self._read_output(self._session, self._channel, self._timeout,
File: "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 103, in _read_output
  session.simple_select(timeout=timeout_read_data)
File: "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/session.py", line 52, in simple_select
  select(readfds, writefds, (), timeout)

This thread is responsible for reading data from the SSH session's socket, and it's stuck in a select call, a low-level function that waits for data to become available on a file descriptor. Being parked here suggests that the SSH connection isn't being properly closed, or that something is wrong with the data stream: the socket never reports EOF, so the reader never gets a reason to stop. The SSHReaderThread exists precisely to read from the session asynchronously, and select with a timeout is the standard way to multiplex that kind of I/O, so the problem isn't select itself but the loop around it, which nothing ever tells to finish.
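
For comparison, here's what a select-based reader loop typically looks like. Again, this is a sketch rather than the actual SSHReaderThread, and sock, stop_event, and the prints are stand-ins: select() returns either when the socket has data or when the timeout expires, so the loop only ends when the peer closes the connection or something explicitly tells it to stop. If neither happens, the thread lives forever, which matches the trace above.

import socket
import threading
from select import select

def read_loop_sketch(sock: socket.socket,
                     stop_event: threading.Event,
                     timeout_read_data: float = 0.5) -> None:
    while not stop_event.is_set():                 # an explicit stop flag gives the loop an exit
        readable, _, _ = select([sock], [], [], timeout_read_data)
        if not readable:
            continue                               # timed out: go around and re-check the flag
        data = sock.recv(4096)
        if not data:                               # empty read means the peer closed the socket
            break
        print(data.decode(errors="replace"))       # stand-in for queueing the chunk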

Possible Causes and Solutions

Alright, armed with this information, let's brainstorm some potential causes and how we might fix them:

  1. Unclosed SSH Connections: The most obvious culprit is that SSH connections aren't being properly closed after the test completes. This could leave the SSHReaderThread waiting indefinitely for data that will never arrive.
    • Solution: Double-check that every SSH connection is explicitly closed in a finally block or via a context manager, and look for places where an exception could skip the close. An unclosed connection leaves the reader thread blocked on the socket forever, so making cleanup unconditional removes the hang at its source; a minimal sketch of the pattern follows this list.
  2. Hangs in Output Processing: The _process_output function in sdcm.remote.libssh2_client might be getting stuck while processing the output from the SSH session. This could be due to large amounts of data, unexpected data formats, or some other issue in the processing logic.
    • Solution: Review _process_output, add more robust error handling and logging, and put an overall timeout on the processing loop so it cannot wait indefinitely on a reader that never finishes (see the second sketch after this list).
  3. Deadlocks in Threading: There might be a deadlock situation where threads are waiting for each other to release resources, leading to the hang. This is especially likely given the involvement of concurrent.futures.thread.
    • Solution: Analyze the threading logic for potential deadlocks, use thread dumps or threading analysis tools to see which thread is waiting on what, and make sure locks are always acquired and released in a consistent order. A circular wait between the executor worker and the reader thread would produce exactly this kind of indefinite hang.
  4. Issues with libssh2: There could be an underlying issue with the libssh2 library itself, such as a bug or a resource leak. This is less likely, but it's worth considering.
    • Solution: Check for known issues or updates to libssh2 and the Python bindings around it, and consider trying a different SSH library or implementation to see if the problem persists. A library-level bug is less likely than a cleanup bug on our side, but it's cheap to rule out.
  5. Changes in Python 3.13: The fact that this worked in Python 3.12 but not 3.13 suggests there might be a change in Python's threading or socket handling that's triggering the issue.
    • Solution: Investigate the changes between Python 3.12 and 3.13, particularly in the threading, concurrent.futures, and socket modules, and look for related bug reports. One plausible angle is how threads and executor workers are joined at interpreter shutdown, since a leftover thread that used to be tolerated could now block the exit.
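
Here's a minimal sketch of the closure pattern from point 1, assuming a hypothetical connection object with connect()/disconnect() methods rather than the real SCT API. The context manager guarantees that disconnect() runs even if the body raises, so no reader thread is left parked on an open socket.

import contextlib
import logging

logger = logging.getLogger(__name__)

@contextlib.contextmanager
def ssh_connection(conn):
    # 'conn' is a hypothetical client object with connect()/disconnect()/run().
    conn.connect()
    try:
        yield conn
    finally:
        with contextlib.suppress(Exception):       # cleanup errors must not mask the real failure
            conn.disconnect()
        logger.debug("SSH connection closed: %r", conn)

# Usage sketch: the connection is closed on the way out, whatever happens inside.
# with ssh_connection(make_connection()) as conn:
#     conn.run("journalctl --no-pager -n 100")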

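And a sketch for point 2: bound the polling loop with an overall deadline so a stuck reader cannot hang the worker forever. The names are illustrative, not the real SCT implementation; the idea is simply that the caller gets control back and can close the channel or raise instead of sleeping indefinitely.

import logging
import threading
from time import monotonic, sleep

logger = logging.getLogger(__name__)

def wait_for_reader(reader_thread: threading.Thread,
                    overall_timeout: float,
                    poll_interval: float = 0.5) -> bool:
    # Returns True if the reader finished in time, False if we gave up waiting.
    deadline = monotonic() + overall_timeout
    while reader_thread.is_alive():
        if monotonic() > deadline:
            logger.error("SSH reader still alive after %ss, giving up", overall_timeout)
            return False
        sleep(poll_interval)
    return True
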
Next Steps

So, where do we go from here? The next steps are to:

  1. Implement Logging: Add detailed logging to the sdcm.remote.libssh2_client module, especially around the SSH connection and output processing logic. This will give us more insight into what's happening during the test run.
  2. Reproduce Locally: Try to reproduce the issue locally to make debugging easier. This will allow us to use debuggers and other tools to step through the code.
  3. Test SSH Connection Closure: Write a simple test case that specifically focuses on opening and closing SSH connections to make sure they're being handled correctly; a sketch of such a test follows this list.
  4. Review Recent Changes: Carefully review the recent changes in SCT and the test setup to identify any potential regressions.
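
Here's a sketch of the focused test from step 3, written pytest-style with a hypothetical make_connection factory (not a real SCT fixture). It opens a connection, runs a trivial command, closes it, and then asserts that no new threads survived the disconnect.

import threading

def test_ssh_connection_is_closed(make_connection):
    # 'make_connection' is a hypothetical factory returning an SSH client
    # with run()/disconnect() methods.
    threads_before = set(threading.enumerate())
    conn = make_connection()
    try:
        conn.run("true")                           # any cheap remote command
    finally:
        conn.disconnect()
    leftover = [t for t in set(threading.enumerate()) - threads_before if t.is_alive()]
    assert not leftover, f"threads still alive after disconnect: {leftover}"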

By systematically investigating these areas, we should be able to track down the root cause of the hanging tests and get our testing pipeline back on track. Breaking the problem into small, verifiable steps (logging, local reproduction, focused tests) is the fastest way to find out which piece is actually misbehaving.

Conclusion

The hanging tests issue is definitely a tough one, but by understanding the processes involved, examining the stack traces, and systematically exploring potential causes, we can get to the bottom of it. We've identified the key areas to focus on: SSH connection management, output processing, and potential deadlocks. By implementing the suggested fixes and continuing the investigation, we're confident we can resolve this and get our Scylla tests running smoothly again. Remember, guys, persistence and collaboration are key! Sorting this out doesn't just unblock the pipeline; it makes our testing infrastructure, and the software we ship on top of it, that much more reliable.