Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Add keysize method #28

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open

RFC: Add keysize method #28

wants to merge 1 commit into from

Conversation

xloem
Copy link

@xloem xloem commented Jul 23, 2021

Hi Lykos,

What do you think of adding this utility method to extract the data size from keys?

I'd like to switch to using your professional-looking library for git annex remotes, and I use this method a lot.

I haven't tested this code yet, but I likely will soon. Just looking for your input.

@Lykos153
Copy link
Owner

Lykos153 commented Jul 23, 2021

Hi, thanks for your PR!

I really like the idea of putting commonly needed functions into the library. I've been planning to move some generic and re-usable stuff from git-annex-remote-googledrive to this library, too.

But I want to keep them separate from the protocol functions. All public methods of Master (except for LinkRemote, LoggingHandler and Listen) are direct mappings to the commands of the special remote protocol. The location is not ideal and I'm planning to re-structure some of it for v2.0.0 (I kind of started in #26), but for now I'd prefer not to add anything else to Master.

I propose adding a utils module or class that contains additional functions for now. Or at least prepend them with something, eg. x_keysize or util_keysize. What do you think?

@Lykos153
Copy link
Owner

Ok I thought about it some more, and where I'm actually heading with v2.0.0 is adding some abstractions that make it easier to work with git-annex (without removing direct access to the protocol, of course).

My ideas so far:

  • Config class (maybe as a dict) that handles getconfig and setconfig automatically, so the remote doesn't need to keep an additional variable in sync with git-annex
  • Same for Creds
  • Key class that holds some additional information about the key and has a state property (that abstracts away setstate/getstate).

The latter would be a very good place to add this function. It could just have a size property, for example.

I haven't come around to implement any of this yet, but it should be quite straightforward and would only add to the existing functionality without breaking anything. If instances of Key have a __str__ method that returns the key, we could even default to sending out those objects wherever possible.

You are very welcome to add a Key class that implements your function as size property and returns its key when interpreted as string.

@xloem
Copy link
Author

xloem commented Jul 23, 2021

Here's a draft Key class.

@xloem xloem force-pushed the keysize branch 6 times, most recently from 24858f8 to aa763b7 Compare July 23, 2021 15:20
@Lykos153
Copy link
Owner

Lykos153 commented Aug 3, 2021

Sorry for the delay. I was quite busy in the last weeks. I'm going to look into it today!

@xloem
Copy link
Author

xloem commented Aug 3, 2021

I used this Key class a little since I made it. I noted these:

  • the key.uris().add / key.uris().discard method pairs are a little awkward, it would be clearer if uris were a property or adduri were a method of key.
  • the __str__ method is not automatically called; uses of Key as a string need it wrapped as str(key)

keyparts = keyparts.split('-')

# chunking
if len(keyparts) > 2 and keyparts[-1].startswith('C') and keyparts[-2].startswith('S'):
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like the idea of making every information that we've got about the key easily accessible!

def __repr__(self):
return str(self)

class UriList(collections.MutableSet):
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's clever way to handle the URIs!

@property
def uris(self):
if self._uris is None:
self._uris = self._annex.geturls(self._key, self._prefix)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once this has been called, there's effectively caching in place, so we might as well cache the state, too.

return UriList(self.key, self.remote.annex, prefix)

@property
def state(self):
Copy link
Owner

@Lykos153 Lykos153 Aug 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In git-annex-remote-googledrive I use JSON to store arbitrary values in state. I figure this could be generally useful, so I'd like to provide an optional class with a nice interface at some point. What do you think?

Accompanying thoughts regarding the state in general:

  1. Caching: We could only read it the first time and then use a cached value. We would only update this value when the setter is called. This would save some messaging round-trips with git-annex in case the state is used multiple times by the remote. But this approach also assumes we're the only process accessing the state, which might not be a safe assumption to make.
  2. Like (1), but when writing, check beforehand if the value in the annex is equal to the cached value and throw an exception if it's not. => Saves round-trips when reading, adds an additional when writing to be safer.
  3. Like (2), but also update from the annex on each read. => None of the performance benefits, but less chance for overwriting changes than (4).
  4. Do nothing of this and let git-annex take care of merging the state, i.e. the last change always wins. And what we read is what is actually there.

Sometimes it's best not to overcomplicate stuff. However, with hundreds of thousands of keys, I found it crucial to optimize away all the unnecessary round-trips with git-annex. I think it's a question of how much work we want the library to take away from the remote application. I mean, stuff like caching can always be made configurable.

'''
Attributes
----------
remote : SpecialRemote
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the only part of which I'm not yet convinced. Do you see any benefit in tying Key to SpecialRemote instead of Master? It's something that is passed from the annex to the remote after all. And the latter should know itself already.

I think I'd prefer
annex: Master
here.

(I want to get rid of Master eventually and instead use more descriptive names like Annex and Protocol.)

@Lykos153
Copy link
Owner

Lykos153 commented Aug 4, 2021

I used this Key class a little since I made it. I noted these:

* the `key.uris().add` / `key.uris().discard` method pairs are a little awkward, it would be clearer if uris were a property or adduri were a method of key.

It looked so nice, but yeah, let's move them to properties or methods.

* the `__str__` method is not automatically called; uses of Key as a string need it wrapped as `str(key)`

Oh no! I would have loved to just swap it out. Well, if this is a breaking change, then I guess we're approaching version 1.0.0.

@xloem
Copy link
Author

xloem commented Aug 8, 2021

All your recommendations seem fine. I haven't thrown them in yet because I'm not around the remote I was working on, in order to test changes to ensure I didn't make an error.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants