This is git-annex's todo list. Link items to done when done.
git-annex unused eats memory
Posted Sat Sep 22 03:36:59 2012
parallel possibilities
Posted Tue Jul 17 17:54:57 2012
wishlist: swift backend
Posted Tue Jul 17 17:54:57 2012
tahoe lfs for reals
Posted Tue Jul 17 17:54:57 2012
union mounting
Posted Tue Jul 17 17:54:57 2012
hidden files
Posted Tue Jul 17 17:54:57 2012
optimise git-annex merge
Posted Tue Jul 17 17:54:57 2012
cache key info
Posted Tue Jul 17 17:54:57 2012
smudge
Posted Tue Jul 17 17:54:57 2012
add -all option
Posted Tue Jul 17 17:54:57 2012
windows support
Posted Tue Jul 17 17:54:57 2012
redundancy stats in status
Posted Tue Jul 17 17:54:57 2012
automatic bookkeeping watch command
Posted Tue Jul 17 17:54:57 2012
wishlist: special-case handling of Youtube URLs in Web special remote
Posted Tue Jul 17 17:54:57 2012
support S3 multipart uploads
Posted Tue Jul 17 17:54:57 2012
I have the same use case as Asheesh, but I want to be able to see which filenames point to the same objects and then decide which of the duplicates to drop myself. I think having git-annex drop them automatically would be the wrong approach, because how does git-annex know which ones to drop? There's too much potential for error.
Instead it would be great to have something like a built-in command that lists duplicates. While it's easy enough to knock up a bit of shell or Perl to achieve this, that relies on knowledge of the annex symlink structure, so I think it really belongs inside git-annex.
If this command gave output similar to the excellent fastdup utility, then you could do stuff like pipe the groups of duplicates straight into a small cleanup script.
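For instance, something along these lines, where both the finddups subcommand and its one-duplicate-per-line output are purely hypothetical:

    # hypothetical: "git annex finddups" is not a real subcommand, and the
    # output format is assumed for illustration only
    git annex finddups | while read -r dup; do
        echo "would remove: $dup"    # review first, then swap in git rm "$dup"
    done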
My main concern with putting this in git-annex is that finding duplicates necessarily involves storing a list of every key and file in the repository, and git-annex is very carefully built to avoid things that require non-constant memory use, so that it can scale to very big repositories. (The only exception is the unused command, and reducing its memory usage is a continuing goal.)
So I would rather come at this from a different angle, like providing a way to output a list of files and their associated keys, which the user can then use in their own shell pipelines to find duplicate keys:
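For example, something along these lines works with git annex find's --include and --format options (the exact format strings here are just one way to slice it):

    # keys that more than one work-tree file points at
    git annex find --include '*' --format='${key}\n' | sort | uniq -d

    # or keep the filenames alongside the keys to see which names collide
    git annex find --include '*' --format='${key} ${file}\n' | sort -k1,1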
Which is implemented now!
(Making that pipeline properly handle filenames with spaces is left as an exercise for the reader..)
Well, I spent a few hours playing this evening in the 'reorg' branch in git. It seems to be shaping up pretty well; type-based refactoring in Haskell makes this kind of big systematic change a matter of editing until it compiles. And it compiles and the test suite passes. But so far I've only covered 1, 3, and 4 on the list, and have yet to deal with upgrades.
I'd recommend you not wait before using git-annex. I am committed to providing upgradability between annexes created with all versions of git-annex, going forward. This is important because we can have offline archival drives that sit unused for years. Git-annex will upgrade a repository to the current standard the first time it sees it, and I hope the upgrade will be pretty smooth. It was not bad for the annex.version 0 to 1 upgrade earlier. The only annoyance with upgrades is that they will result in some big commits to git, as every symlink in the repo gets changed and log files get moved to new names.
(The metadata being stored with keys is data that a particular backend can use, and is static to a given key, so there are no merge issues (and it won't be used to preserve mtimes, etc).)
What about Cygwin? It emulates POSIX fairly well under Windows (including signals, forking, filesystem features such as /dev/null and /proc, and unix file permissions), and has all the standard GNU utilities. It also emulates symlinks, but they are unfortunately incompatible with the NTFS symlinks introduced in Vista, due to some stupid restrictions on Windows.
If git-annex could be modified to not require symlinks to work, then it would be a pretty neat solution (and you get a real shell, not some command.com on drugs (aka cmd.exe)).
What is the potential time-frame for this change? As I am not using git-annex for production yet, I can see myself waiting to avoid any potential hassle.
Supporting generic metadata seems like a great idea. Though if you are going down this path, wouldn't it make sense to avoid metastore for mtime etc. and support this natively, without outside dependencies?
-- RichiH
The mtime cannot be stored for all keys. Consider a SHA1 key. The mtime is irrelevant; 2 files with different mtimes, when added to the SHA1 backend, should get the same key.
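For example (a quick sketch at a shell prompt, assuming the repository is configured to use the SHA1 backend):

    echo hello > a.txt
    cp a.txt b.txt
    touch -d '2001-01-01' b.txt      # different mtime, identical content
    git annex add a.txt b.txt
    readlink a.txt b.txt             # both symlinks point at the same SHA1-keyed object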
Probably our spam filter doesn't like your work IP.
Windows support is a must. In my experience, binary file means proprietary editor, which means Windows.
Unfortunately, there's not much overlap between people who use graphical editors in Windows all day and people who are willing to tolerate Cygwin's setup.exe, compile a Haskell program, learn git and git-annex's 90-odd subcommands, and use a mintty terminal to manage their repository, especially now that there's a sexy GitHub app for Windows.
That aside, I think Windows-based content producers are still the audience for git-annex. First Windows support, then a GUI, then the world.
For what it's worth, yes, I want to actually forget I ever had the same file in the filesystem with a duplicated name. I'm not just aiming to clean up the disk's space usage; I'm also aiming to clean things up so that navigating the filesystem is easier.
I can write my own script to do that based on the symlinks' target (and I wrote something along those lines), but I still think it'd be nicer if git-annex supported this use case.
Perhaps an option could let me remove a file from git-annex if the contents are available through a different name; a rough sketch of what I mean is below. (Right now, "git annex drop" requires the name and contents match.)
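Something like this, where the option name is made up and does not exist in git-annex:

    # hypothetical flag, for illustration only
    git annex drop --if-content-exists-elsewhere backup-2009/other_dir/foo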
-- Asheesh.
Ah, OK. I assumed the metadata would be attached to a key, not part of the key. This seems to make upgrades/extensions down the line harder than they need to be, but you are right that this way, merges are not, and never will be, an issue.
Though with the SHA1 backend, changing files can be tracked. This means that tracking changes in mtime or other metadata is possible. It also means that there are potential merge issues. But I won't argue the point endlessly. I can accept design decisions :)
The prefix at work is from a university netblock so yes, it might be on a few hundred proxy lists etc.
I agree with Christian.
One should first make better use of the connections to remotes before exploring parallel possibilities, by pipelining the requests and answers.
Of course, this could be implemented using Haskell's parallelism and concurrency features.
I really do want just one filename per file, at least for some cases.
For my photos, there's no benefit to having a few filenames point to the same file. As I'm putting them all into the git-annex, that is a good time to remove the pure duplicates so that I don't e.g. see them twice when browsing the directory as a gallery. Also, I am uploading my photos to the web, and I want to avoid uploading the same photo (by content) twice.
I hope that makes things clearer!
For now I'm just handling it with a little script of my own. (Yeah, Flickr for my photos for now. I feel sad about betraying the principle of autonomo.us-ness.)
Sounds like a good idea.
What's your source for this assertion? I would expect an amortized average of O(1) per insertion, i.e. O(n) for full population.
None of which necessarily change the algorithmic complexity. However, real benchmarks are far more useful here than complexity analysis, and the dangers of premature optimization should not be forgotten.
Sure, I was aware of that, but my point still stands. Even 500k keys per 1GB of RAM does not sound expensive to me.
Why not? What's the maximum it should use? 512MB? 256MB? 32MB? I don't see the sense in the author of a program dictating thresholds which are entirely dependent on the context in which the program is run, not the context in which it's written. That's why systems have files such as /etc/security/limits.conf.
You said you want git-annex to scale to enormous repositories. If you impose an arbitrary memory restriction such as the above, that means avoiding implementing any kind of functionality which requires O(n) memory or worse. Isn't it reasonable to assume that many users use git-annex on repositories which are not enormous? Even when they do work with enormous repositories, just like with any other program, they would naturally expect certain operations to take longer or become impractical without sufficient RAM. That's why I say that this restriction amounts to throwing out the baby with the bathwater. It just means that those who need the functionality would have to reimplement it themselves, assuming they are able, which is likely to result in more wheel reinventions. I've already shared my implementation, but how many people are likely to find it, let alone get it working?
Interesting. Presumably you are referring to some undocumented behaviour, rather than --batch-size, which only applies when merging multiple files, and not when only sorting STDIN.
It's the best choice for sorting. But sorting purely to detect duplicates is a dismally bad choice.
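To make the point concrete: a single pass that remembers the keys it has already seen finds the duplicates in O(n) time with no sort at all (the --format string here is only an assumption about how the key/file listing would be produced):

    # print the 2nd and later filenames for any key that occurs more than once
    git annex find --include '*' --format='${key} ${file}\n' \
        | awk '{ key = $1; sub(/^[^ ]+ /, ""); if (seen[key]++) print }'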
I'd expect the checksumming to be disk bound, not CPU bound, on most systems.
I suggest you start off on the WORM backend, and then you can run a job later to migrate to the SHA1 backend.
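A sketch of that later migration, assuming the backend is switched via .gitattributes (check the backends and migrate documentation for the exact syntax in your version):

    # start out on WORM for cheap adds, then later switch the backend and migrate
    echo '* annex.backend=SHA1' >> .gitattributes
    git annex migrate .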
Have you checked what the smudge filter sees when the input is a symlink? Because git supports tracking symlinks, so it should also support pushing symlinks through a smudge filter, right? Either way: yes, contact the git devs, one can only ask and hope. And if you can demonstrate the awesomeness of git-annex they might get more interested :)
Hashing & segmenting seems to be around the corner, which is nice :)
Is there a chance that you will optionally add mtime to your native metadata store? If yes, I'd rather wait for v2 and use the native system from the start. If not, I will probably set it up tonight.
PS: While posting from work, my comments are held for moderation once again. I am somewhat confused as to why this happens when I can just submit directly from home. And yes, I am using the same auth provider and user in both cases.
My implementation is at https://github.com/aspiers/git-config/blob/master/bin/git-annex-finddups, but it would be better in git-annex itself ...
Only if you want to search the whole repository for duplicates, and if you do, then you're necessarily going to have to chew up memory in some process anyway, so what difference does it make whether it's git-annex or (say) a Perl wrapper?
That's a worthy goal, but if everything could be implemented with an O(1) memory footprint then we'd be in a much more pleasant world :-) Even O(n) isn't that bad ...
That aside, I like your --format="%f %k\n" idea a lot. That opens up the "black box" of .git/annex/objects and makes nice things possible, as your pipeline already demonstrates. However, I'm not sure why you think git annex find | sort | uniq would be more efficient. Not only does the sort require the very thing you were trying to avoid (i.e. the whole list in memory), but it's also O(n log n), which is significantly slower than my O(n) Perl script linked above.
More considerations about this pipeline: why --include '*'? Doesn't git annex find with no arguments already include all files, modulo the requirement above that they're locally available? Also, users who are already used to the plethora of options find(1) provides are likely to find that the git annex find | ... approach runs up against its limitations sooner rather than later. Rather than reinventing the wheel, is there some way git annex find could harness the power of find(1)?
Those considerations aside, a combined approach would be to implement such an output option in git annex find, and then alter my Perl wrapper to popen(2) from that rather than using File::Find. But I doubt you would want to ship Perl wrappers in the distribution, so if you don't provide a Haskell equivalent then users who can't code are left high and dry.
Hm... O(N^2)? I think it just takes O(N). To read an entry out of a directory you have to download the entire directory (and store it in RAM and parse it). The constants are basically "too big to be good but not big enough to be prohibitive", I think. jctang has reported that his special remote hook performs well enough to use, but it would be nice if it were faster.
The Tahoe-LAFS folks are working on speeding up mutable files, by the way, after which we would be able to speed up directories.
Whoops! You'd only told me O(N) twice before..
So this is not too high priority. I think I would like to get the per-remote storage sorted out anyway, since probably it will be the thing needed to convert the URL backend into a special remote, which would then allow ripping out the otherwise unused pluggable backend infrastructure.
Update: Per-remote storage is now sorted out, so this could be implemented if it actually made sense to do so.
Adam, to answer a lot of points briefly..
Hey Asheesh, I'm happy you're finding git-annex useful.
So, there are two forms of duplication going on here. There's duplication of the content, and duplication of the filenames pointing at that content.
Duplication of the filenames is probably not a concern, although it's what I thought you were talking about at first. It's probably info worth recording that backup-2010/some_dir/foo and backup-2009/other_dir/foo are two names you've used for the same content in the past. If you really wanted to remove backup-2009/foo, you could do it by writing a script that looks at the basenames of the symlink targets and removes files that point to the same content as other files.
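A minimal sketch of such a script, assuming the usual .git/annex/objects symlink layout and leaving the actual removal up to you:

    # print each file whose content (the key, i.e. the basename of the symlink
    # target) was already seen under another name; review before deleting anything
    find . -path ./.git -prune -o -type l -print | while read -r f; do
        printf '%s\t%s\n' "$(basename "$(readlink "$f")")" "$f"
    done | awk -F'\t' 'seen[$1]++ { print $2 }'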
Using SHA1 ensures that the same key is used for identical files, so generally avoids duplication of content. But if you have 2 disks with an identical file on each, and make them both into annexes, then git-annex will happily retain both copies of the content, one per disk. It generally considers keeping copies of content a good thing. :)
So, what if you want to remove the unnecessary copies? Well, there's a really simple way:
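A sketch of that recipe, with made-up mount points (usb-1 is the disk being cleaned up, and usb-0 holds the annex that gets added as the remote "other-disk"):

    cd /media/usb-1/annex
    git remote add other-disk /media/usb-0/annex
    git fetch other-disk      # let git-annex learn what other-disk already has
    git annex merge
    git annex add .
    git annex drop .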
This asks git-annex to add everything to the annex, but then remove any file contents that it can safely remove. What can it safely remove? Well, anything that it can verify is on another repository such as "other-disk"! So, this will happily drop any duplicated file contents, while leaving all the rest alone.
In practice, you might not want to have all your old backup disks mounted at the same time and configured as remotes. Look into configuring trust to avoid needing to do that. If usb-0 is already a trusted disk, all you need is a simple "git annex drop" on usb-1.
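For example, assuming the usb-0 repository is described as "usb-0" in the annex:

    git annex trust usb-0     # run on usb-1, once the git-annex branches have been merged
    git annex drop .          # drops anything the trusted usb-0 is recorded as having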
Unless you are forced to use a password, you should really be using an ssh key.