Skip to content

Latest commit

 

History

History
125 lines (101 loc) · 7.04 KB

sftp_filesystem.md

File metadata and controls

125 lines (101 loc) · 7.04 KB

The SftpFileSystem

Class SftpFileSystem is an implementation of java.nio.file.FileSystem and lets client code treat an SFTP server like any other file system. It is a remote file system, though, and that has some effects that clients have to aware of. Because operations are remote operations involving network requests and answers, the performance characteristics are different than most other file systems.

Creating an SftpFileSystem

An SftpFileSystem needs an SSH session to be able to talk to the SFTP server.

There are two ways to create an SftpFileSystem:

  1. If you already have an SSH ClientSession, you can create the file system off that session using SftpClientFactory.instance().createSftpFileSystem(). The file system remains valid until it is closed, or until the session is closed. When the file system is closed, the session will not be closed. When the session closes, so will the file system.

  2. You can create an SftpFileSystem with a sftp:// URI using the standard Java factory java.nio.file.FileSystems.newFileSystem(). This will automatically create an SshClient with default settings, and the file system will open an SSH session itself. This session has heartbeats enabled to keep it open for as long as the file system is open. The file system remains valid until closed, at which point it will close the session it had created. If the session closes while the file system is not closed, the next operation on such a file system will open a new session.

SSH Resource Management

Most operations on an SftpFileSystem, or on streams returned, produce SFTP requests over the network, and wait for a reply to be received. This works internally by using SftpClient to talk to the SFTP server. An SftpClient is the client-side implementation of the SFTP protocol over an SSH channel (a ChannelSession).

The SSH channel and the SftpClient are tightly coupled: the channel is opened when the SftpClientis initialized, and the channel is closed when the SftpClient is closed, and when the channel closes, so is the SftpClient.

For SftpFileSystem it would be rather inefficient to use a new SftpClient for each new operation. That would create a new channel every time, and tear it down after the operation. But channels have a setup cost, and the SFTP protocol also has to initialized, both of which involve exchanging messages over the network. This is not efficient if one wants to perform multiple operations, such as transferring multiple files with java.nio.file.Files.copy().

The Channel Pool

The SftpFileSystem thus employs a pool of SftpClients. This pool is initially empty. The first operation will create an SftpClient, initialize it, and then perform its operation. But then it will add the still open SftpClient to the pool for use by subsequent operations instead of closing it. The next operation can then simply grab this already initialized SftpClient with its open channel and perform its operation.

The pool is limited by a maximum size of SftpModuleProperties.POOL_SIZE (by default 8). The pool can grow to this size if there are that many threads that perform operations on the SftpFileSystem concurrently.

SftpClients in the pool need to be closed at some point. Consider an application that has a burst of file transfers and uses 8 threads to perform them. Afterwards, the pool will contain 8 SftpClients: that's 8 open SSH channels, each with an SFTP subsystem at the server's end. If the application then does only little, like transferring a few files sequentially over the next few hours, until the next burst (which may never come), then we don't want to keep all 8 channels open and consuming resources not only in the client but also on the server side.

(This assumes that the whole SSH session remains open for that long, which can be accomplished by using heartbeats on the session.)

The SftpFileSystem handles this by expiring inactive clients from the pool. If a client has been in the pool for SftpModuleProperties.POOL_LIFE_TIME (default is 10 seconds), it is removed from the pool and closed. (If it was in the pool for that time, this means it was idle for that time: no operation was performed on it.) If no operation on the SftpFileSystem occurs at all for this time, it's possible that the pool is emptied, and the next operation has to create and initialize a new SftpClientand channel.

If an application doesn't want this, it can define SftpModuleProperties.POOL_CORE_SIZE, which must be smaller than POOL_SIZE. By default, it is zero. If greater than zero, that many SftpClients are kept in the pool (and that many channels are kept open) even if they are idle.

It should be noted that the SFTP server may also decide to close channels whenever it wants. This will close the channel and the SftpClienton the client side. If it happens while a client-side operation is ongoing, the operation will fail with an exception; it it happens on an idle SftpClient in the pool, the SftpClient is simply removed from the pool.

If the whole SSH session is closed, the SftpFileSystem is closed. When a SftpFileSystem is closed, all SftpClients in the pool are closed, and no new clients will be added to the pool.

Choosing the Pool Size

If there are more then POOL_SIZE threads using the same SftpFileSystem, it is possible that all POOL_SIZE clients are already in use when a thread tries to do a file system operation. In this case, a new SftpClient is created, which will be closed after the operation. This is a sign of the POOL_SIZE being too small for the application, or that the application is badly designed. Using too many threads for remote file operations is not a good idea in SFTP: all traffic in the end goes over a single network connection anyway. Using a limited number of threads may bring some speedup compared to strictly sequential operations because the handling of the data received is offloaded to these threads, while the next message can already be sent or received. But copying 1000 files using 1000 threads and SSH channels is nonsense; it's far better to handle that many files in batches with a smaller number of threads (maybe 8).

In any case, SftpModuleProperties.POOL_SIZE should be large enough to accommodate the number of threads the client application is going to use for operations on the SftpFileSystem. If there are more threads, performance may degrade.

Versions of Apache MINA sshd <= 2.10.0 tried to mitigate this performance drop for such "extra" threads by keeping the SftpClient in a ThreadLocal, so that such "extra" threads could re-use the SftpClient. This mechanism has been removed because it sometimes caused memory leaks. The mechanism was also flawed because there were use cases where it just could not work correctly.

Design your application such that is uses a small maximum number of threads that perform operations on a SftpFileSysteminstance. Set SftpModuleProperties.POOL_SIZE such that it is >= the maximum number of threads that operate concurrently on the file system. The default pool size is 8.