A walkthrough of the basic features of git-annex.
- creating a repository
- adding a remote
- adding files
- renaming files
- getting file content
- syncing
- transferring files: When things go wrong
- removing files
- removing files: When things go wrong
- modifying annexed files
- using ssh remotes
- moving file content between repositories
- unused data
- fsck: verifying your data
- fsck: when things go wrong
- backups
- automatically managing content
- more
creating a repository
This is very straightforward. Just tell it a description of the repository.
# mkdir ~/annex
# cd ~/annex
# git init
# git annex init "my laptop"
adding a remote
Like any other git repository, git-annex repositories have remotes. Let's start by adding a USB drive as a remote.
# sudo mount /media/usb
# cd /media/usb
# git clone ~/annex
# cd annex
# git annex init "portable USB drive"
# git remote add laptop ~/annex
# cd ~/annex
# git remote add usbdrive /media/usb/annex
This is all standard ad-hoc distributed git repository setup. The only git-annex specific part is telling it the name of the new repository created on the USB drive.
Notice that both repos are set up as remotes of one another. This lets either get annexed files from the other. You'll want to do that even if you are using git in a more centralized fashion.
adding files
# cd ~/annex
# cp /tmp/big_file .
# cp /tmp/debian.iso .
# git annex add .
add big_file (checksum...) ok
add debian.iso (checksum...) ok
# git commit -a -m added
When you add a file to the annex and commit it, only a symlink to the annexed content is committed. The content itself is stored in git-annex's backend.
renaming files
# cd ~/annex
# git mv big_file my_cool_big_file
# mkdir iso
# git mv debian.iso iso/
# git commit -m moved
You can use any normal git operations to move files around, or even make copies or delete them.
Notice that, since annexed files are represented by symlinks, the symlink will break when the file is moved into a subdirectory. But, git-annex will fix this up for you when you commit -- it has a pre-commit hook that watches for and corrects broken symlinks.
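To see why the move breaks the link, here is a minimal sketch using plain symlinks in a scratch directory. No git-annex is involved, and the store/ directory is only a hypothetical stand-in for wherever the annexed content really lives. The last step mimics the kind of correction the pre-commit hook makes; it is not the hook's actual code.

```shell
# Illustration only: ordinary symlinks, no git or git-annex involved.
# store/ is a made-up stand-in for the place annexed content is kept.
work=$(mktemp -d)
cd "$work"
mkdir -p store iso
echo "pretend this is a big ISO" > store/content
ln -s store/content debian.iso        # relative link, valid from the top level

mv debian.iso iso/                    # link text is unchanged, so it now dangles
if [ -e iso/debian.iso ]; then before=ok; else before=broken; fi

ln -sf ../store/content iso/debian.iso   # rewrite the link relative to iso/
if [ -e iso/debian.iso ]; then after=ok; else after=broken; fi

echo "after move: $before, after fix: $after"
```

A relative symlink is resolved from the directory that contains it, which is why moving it one level deeper needs an extra ../ in the target.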
getting file content
A repository does not always have all annexed file contents available. When you need the content of a file, you can use "git annex get" to make it available.
We can use this to copy everything in the laptop's annex to the USB drive.
# cd /media/usb/annex
# git fetch laptop; git merge laptop/master
# git annex get .
get my_cool_big_file (from laptop...) ok
get iso/debian.iso (from laptop...) ok
syncing
Notice that in the previous example, you had to git fetch and merge from laptop first. This lets git-annex know what has changed in laptop, and so it knows about the files present there and can get them.
If you have a lot of repositories to keep in sync, manually fetching and merging from them can become tedious. To automate this, there is a handy sync command, which even commits your changes for you.
# cd /media/usb/annex
# git annex sync
commit
nothing to commit (working directory clean)
ok
pull laptop
ok
push laptop
ok
After you run sync, the repository will be updated with all changes made to its remotes, and any changes in the repository will be pushed out to its remotes, where a sync will get them. This is especially useful when using git in a distributed fashion, without a central bare repository. See sync for details.
transferring files: When things go wrong
After a while, you'll have several annexes, with different file contents. You don't have to try to keep all that straight; git-annex does location tracking for you. If you ask it to get a file and the drive or file server is not accessible, it will let you know what it needs to get it:
# git annex get video/hackity_hack_and_kaxxt.mov
get video/hackity_hack_and_kaxxt.mov (not available)
Unable to access these remotes: usbdrive, server
Try making some of these repositories available:
5863d8c0-d9a9-11df-adb2-af51e6559a49 -- my home file server
58d84e8a-d9ae-11df-a1aa-ab9aa8c00826 -- portable USB drive
ca20064c-dbb5-11df-b2fe-002170d25c55 -- backup SATA drive
failed
# sudo mount /media/usb
# git annex get video/hackity_hack_and_kaxxt.mov
get video/hackity_hack_and_kaxxt.mov (from usbdrive...) ok
removing files
You can always drop files safely. Git-annex checks that some other annex has the file before removing it.
# git annex drop iso/debian.iso
drop iso/debian.iso ok
removing files: When things go wrong
Before dropping a file, git-annex wants to be able to look at other remotes, and verify that they still have a file. After all, it could have been dropped from them too. If the remotes are not mounted/available, you'll see something like this.
# git annex drop important_file other.iso
drop important_file (unsafe)
Could only verify the existence of 0 out of 1 necessary copies
Unable to access these remotes: usbdrive
Try making some of these repositories available:
58d84e8a-d9ae-11df-a1aa-ab9aa8c00826 -- portable USB drive
ca20064c-dbb5-11df-b2fe-002170d25c55 -- backup SATA drive
(Use --force to override this check, or adjust annex.numcopies.)
failed
drop other.iso (unsafe)
Could only verify the existence of 0 out of 1 necessary copies
No other repository is known to contain the file.
(Use --force to override this check, or adjust annex.numcopies.)
failed
Here you might --force it to drop important_file if you trust your backup. But other.iso looks to have never been copied to anywhere else, so if it's something you want to hold onto, you'd need to transfer it to some other repository before dropping it.
modifying annexed files
Normally, the content of files in the annex is prevented from being modified. That's a good thing, because it might be the only copy, and you wouldn't want to lose it in a fumblefingered mistake.
# echo oops > my_cool_big_file
bash: my_cool_big_file: Permission denied
In order to modify a file, it should first be unlocked.
# git annex unlock my_cool_big_file
unlock my_cool_big_file (copying...) ok
That replaces the symlink that normally points at its content with a copy of the content. You can then modify the file like any regular file. Because it is a regular file.
(If you decide you don't need to modify the file after all, or want to discard modifications, just use git annex lock.)
When you git commit, git-annex's pre-commit hook will automatically notice that you are committing an unlocked file, and add its new content to the annex. The file will be replaced with a symlink to the new content, and this symlink is what gets committed to git in the end.
# echo "now smaller, but even cooler" > my_cool_big_file
# git commit my_cool_big_file -m "changed an annexed file"
add my_cool_big_file ok
[master 64cda67] changed an annexed file
1 files changed, 1 insertions(+), 1 deletions(-)
There is one problem with using git commit like this: Git wants to first stage the entire contents of the file in its index. That can be slow for big files (sorta why git-annex exists in the first place). So, the automatic handling on commit is a nice safety feature, since it prevents the file content being accidentally committed into git. But when working with big files, it's faster to explicitly add them to the annex yourself before committing.
# echo "now smaller, but even cooler yet" > my_cool_big_file
# git annex add my_cool_big_file
add my_cool_big_file ok
# git commit my_cool_big_file -m "changed an annexed file"
using ssh remotes
So far in this walkthrough, git-annex has been used with a remote repository on a USB drive. But it can also be used with a git remote that is truly remote: a host accessed by ssh.
Say you have a desktop on the same network as your laptop and want to clone the laptop's annex to it:
# git clone ssh://mylaptop/home/me/annex ~/annex
# cd ~/annex
# git annex init "my desktop"
Now you can get files and they will be transferred (using rsync via ssh):
# git annex get my_cool_big_file
get my_cool_big_file (getting UUID for origin...) (from origin...)
SHA256-s86050597--6ae2688bc533437766a48aa19f2c06be14d1bab9c70b468af445d4f07b65f41e 100% 2159 2.1KB/s 00:00
ok
When you drop files, git-annex will ssh over to the remote and make sure the file's content is still there before removing it locally:
# git annex drop my_cool_big_file
drop my_cool_big_file (checking origin..) ok
Note that normally git-annex prefers to use non-ssh remotes, like a USB drive, before ssh remotes. They are assumed to be faster/cheaper to access, if available. There is an annex-cost setting you can configure in .git/config to adjust which repositories it prefers. See the man page for details.
Also, note that you need full shell access for this to work -- git-annex needs to be able to ssh in and run commands. Or at least, your shell needs to be able to run the git-annex-shell command.
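As a rough sketch of the annex-cost setting mentioned above: this assumes the setting is stored per remote as remote.NAME.annex-cost and that lower numbers are preferred, so check the man page before relying on it. The remote names, URLs, and cost values below are made up for illustration, and the commands only write to a scratch repository's .git/config using plain git.

```shell
# Scratch repository; only .git/config is touched, nothing is transferred.
repo=$(mktemp -d)
cd "$repo"
git init -q .

# Hypothetical remotes, named after the ones used in this walkthrough.
git remote add usbdrive /media/usb/annex
git remote add desktop ssh://mydesktop/home/me/annex

# Assumed key layout: remote.NAME.annex-cost (lower cost = tried first).
git config remote.usbdrive.annex-cost 50
git config remote.desktop.annex-cost 300

usb_cost=$(git config --get remote.usbdrive.annex-cost)
desktop_cost=$(git config --get remote.desktop.annex-cost)
echo "usbdrive=$usb_cost desktop=$desktop_cost"
```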
moving file content between repositories
Often you will want to move some file contents from a repository to some other one. For example, your laptop's disk is getting full; time to move some files to an external disk before moving another file from a file server to your laptop. Doing that by hand (by using git annex get and git annex drop) is possible, but a bit of a pain. git annex move makes it very easy.
# git annex move my_cool_big_file --to usbdrive
move my_cool_big_file (to usbdrive...) ok
# git annex move video/hackity_hack_and_kaxxt.mov --from fileserver
move video/hackity_hack_and_kaxxt.mov (from fileserver...)
SHA256-s86050597--6ae2688bc533437766a48aa19f2c06be14d1bab9c70b468af445d4f07b65f41e 100% 82MB 199.1KB/s 07:02
ok
unused data
It's possible for data to accumulate in the annex that no files in any branch point to anymore. One way it can happen is if you git rm a file without first calling git annex drop. And, when you modify an annexed file, the old content of the file remains in the annex. Another way is when migrating between key-value backends.
This might be historical data you want to preserve, so git-annex defaults to preserving it. From time to time, though, you may want to check for such data and eliminate it to save space.
# git annex unused
unused . (checking for unused data...)
Some annexed data is no longer used by any files in the repository.
NUMBER KEY
1 SHA256-s86050597--6ae2688bc533437766a48aa19f2c06be14d1bab9c70b468af445d4f07b65f41e
2 SHA1-s14--f1358ec1873d57350e3dc62054dc232bc93c2bd1
(To see where data was previously used, try: git log --stat -S'KEY')
(To remove unwanted data: git-annex dropunused NUMBER)
ok
After running git annex unused, you can follow the instructions to examine the history of files that used the data, and if you decide you don't need that data anymore, you can easily remove it:
# git annex dropunused 1
dropunused 1 ok
Hint: To drop a lot of unused data, use a command like this:
# git annex dropunused 1-1000
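The keys listed by git annex unused appear to pack several fields into one name. The sketch below splits one of the sample keys from the output above with plain shell parameter expansion. The BACKEND-sSIZE--CHECKSUM layout is an assumption read off those two sample keys, not a full description of the key format.

```shell
# One of the sample keys shown in the "git annex unused" output above.
key='SHA1-s14--f1358ec1873d57350e3dc62054dc232bc93c2bd1'

backend=${key%%-*}     # everything before the first "-"
rest=${key#*-s}        # drop the backend and the "-s" size marker
size=${rest%%--*}      # digits up to the "--" separator (assumed: size in bytes)
hash=${key##*--}       # everything after the last "--"

printf 'backend=%s size=%s hash=%s\n' "$backend" "$size" "$hash"
```

Splitting a key this way can help when deciding which numbered entries are large enough to be worth dropping.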
fsck: verifying your data
You can use the fsck subcommand to check for problems in your data. What can be checked depends on the key-value backend you've used for the data. For example, when you use the SHA1 backend, fsck will verify that the checksums of your files are good. Fsck also checks that the annex.numcopies setting is satisfied for all files.
# git annex fsck
fsck some_file (checksum...) ok
fsck my_cool_big_file (checksum...) ok
...
You can also specify the files to check. This is particularly useful if you're using sha1 and don't want to spend a long time checksumming everything.
# git annex fsck my_cool_big_file
fsck my_cool_big_file (checksum...) ok
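Conceptually, the checksum part of this check boils down to recomputing a file's hash and comparing it with the value recorded when the file was added. The sketch below shows that idea with sha1sum from GNU coreutils on a scratch file. It is only an illustration of the principle; the real fsck does more (such as the numcopies check), and the check function is a made-up helper.

```shell
tmp=$(mktemp -d)
printf 'hello annex\n' > "$tmp/file"

# Stand-in for the checksum that was recorded when the file was added.
recorded=$(sha1sum "$tmp/file" | cut -d' ' -f1)

# Hypothetical helper: recompute the hash and compare with the recorded one.
check() {
    actual=$(sha1sum "$1" | cut -d' ' -f1)
    if [ "$actual" = "$recorded" ]; then echo ok; else echo "bad content"; fi
}

first=$(check "$tmp/file")           # content untouched
printf 'bit rot\n' >> "$tmp/file"    # simulate corruption
second=$(check "$tmp/file")          # content no longer matches

echo "$first / $second"
```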
fsck: when things go wrong
Fsck never deletes possibly bad data; instead it will be moved to .git/annex/bad/ for you to recover. Here is a sample of what fsck might say about a badly messed up annex:
# git annex fsck
fsck my_cool_big_file (checksum...)
git-annex: Bad file content; moved to .git/annex/bad/SHA1:7da006579dd64330eb2456001fd01948430572f2
git-annex: ** No known copies exist of my_cool_big_file
failed
fsck important_file
git-annex: Only 1 of 2 copies exist. Run git annex get somewhere else to back it up.
failed
git-annex: 2 failed
backups
git-annex can be configured to require that more than one copy of a file exists, as a simple backup for your data. This is controlled by the "annex.numcopies" setting, which defaults to 1 copy. Let's change that to require 2 copies, and send a copy of every file to a USB drive.
# echo "* annex.numcopies=2" >> .gitattributes
# git annex copy . --to usbdrive
Now when we try to git annex drop a file, it will verify that it knows of 2 other repositories that have a copy before removing its content from the current repository.
You can also vary the number of copies needed, depending on the file name. So, if you want 3 copies of all your flac files, but only 1 copy of oggs:
# echo "*.ogg annex.numcopies=1" >> .gitattributes
# echo "*.flac annex.numcopies=3" >> .gitattributes
Or, you might want to make a directory for important stuff, and configure it so anything put in there is backed up more thoroughly:
# mkdir important_stuff
# echo "* annex.numcopies=3" > important_stuff/.gitattributes
For more details about the numcopies setting, see copies.
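To check which value ends up applying to a given path, plain git can report the attribute. This sketch rebuilds the .gitattributes lines from above in a scratch repository and queries them with git check-attr. It assumes git-annex resolves annex.numcopies the same way git resolves any attribute; the file names and the numcopies_for helper are made up for illustration.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q .
mkdir important_stuff

# The same attribute lines used in this section.
echo "* annex.numcopies=2" >> .gitattributes
echo "*.ogg annex.numcopies=1" >> .gitattributes
echo "*.flac annex.numcopies=3" >> .gitattributes
echo "* annex.numcopies=3" > important_stuff/.gitattributes

# Hypothetical helper: git prints "PATH: annex.numcopies: VALUE"; keep VALUE.
numcopies_for() {
    git check-attr annex.numcopies -- "$1" | sed 's/.*: //'
}

flac=$(numcopies_for song.flac)
ogg=$(numcopies_for song.ogg)
txt=$(numcopies_for notes.txt)
important=$(numcopies_for important_stuff/notes.txt)
echo "flac=$flac ogg=$ogg txt=$txt important=$important"
```

Within one .gitattributes file later matching lines win, and a .gitattributes in a subdirectory takes precedence over the one at the top level, which is why order matters when appending with >>.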
automatically managing content
Once you have multiple repositories, and have perhaps configured numcopies, any given file can have many more copies than is needed, or perhaps fewer than you would like. How to manage this?
The whereis subcommand can be used to see how many copies of a file are known, but then you have to decide what to get or drop. In this example, there are perhaps not enough copies of the first file, and too many of the second file.
# cd /media/usb/annex
# git annex whereis
whereis my_cool_big_file (1 copy)
0c443de8-e644-11df-acbf-f7cd7ca6210d -- laptop
whereis other_file (3 copies)
0c443de8-e644-11df-acbf-f7cd7ca6210d -- laptop
62b39bbe-4149-11e0-af01-bb89245a1e61 -- here (usb drive)
7570b02e-15e9-11e0-adf0-9f3f94cb2eaa -- backup drive
What would be handy are automated versions of get and drop that only get a file if there are not yet enough copies of it, or only drop a file if there are too many copies. Well, these exist: just use the --auto option.
# git annex get --auto --numcopies=2
get my_cool_big_file (from laptop...) ok
# git annex drop --auto --numcopies=2
drop other_file ok
With two quick commands, git-annex was able to decide for you how to work toward having two copies of your files.
# git annex whereis
whereis my_cool_big_file (2 copies)
0c443de8-e644-11df-acbf-f7cd7ca6210d -- laptop
62b39bbe-4149-11e0-af01-bb89245a1e61 -- here (usb drive)
whereis other_file (2 copies)
0c443de8-e644-11df-acbf-f7cd7ca6210d -- laptop
7570b02e-15e9-11e0-adf0-9f3f94cb2eaa -- backup drive
The --auto option can also be used with the copy command; again, this lets git-annex decide whether to actually copy content.
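The decision described in this section can be sketched as a small shell function. This is a simplified model of the behaviour as the walkthrough describes it, not git-annex's actual logic; auto_action and its arguments are made-up names, and the inputs are taken from the whereis example above.

```shell
# auto_action COPIES WANTED HERE
#   COPIES: number of known copies, WANTED: the numcopies target,
#   HERE: "yes" if this repository holds one of the copies.
# Prints get, drop, or keep.
auto_action() {
    copies=$1 wanted=$2 here=$3
    if [ "$here" = no ] && [ "$copies" -lt "$wanted" ]; then
        echo get       # too few copies and none here: fetch one
    elif [ "$here" = yes ] && [ "$copies" -gt "$wanted" ]; then
        echo drop      # more copies than needed and one is here: free the space
    else
        echo keep      # nothing to do
    fi
}

big=$(auto_action 1 2 no)       # my_cool_big_file: 1 copy, only on the laptop
other=$(auto_action 3 2 yes)    # other_file: 3 copies, one of them here
settled=$(auto_action 2 2 yes)  # already at the target
echo "my_cool_big_file: $big, other_file: $other, settled: $settled"
```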
more
So ends the walkthrough. By now you should be able to use git-annex.
Want more? See tips for lots more features and advice.
Hi,
I guess the problem is with git-annex-shell. I tried to do 'git annex get file --from name_ssh_repo', and I got the following:
bash: git-annex-shell: command not found; failed; exit code 127
The same thing happens if I try to do 'git annex whereis'
git-annex-shell is indeed installed. How can I make my shell recognize this command?
Thanks a lot!
git annex fsck complained that I had only one copy per file even though I had created my clone, already. Once I git pulled from the second repo, not getting any changes for obvious reasons, git annex fsck was happy. So I am not sure how my addition was incorrect. -- RichiH
When git annex get does nothing, it's because it doesn't know a place to get the file from.
This can happen if the git-annex branch has not propagated from the place where the file was added. For example, if on the laptop you had run git pull ssh master, that would only pull the master branch, not the git-annex branch.
An easy way to ensure the git-annex branch is kept in sync is to run git annex sync
git remote add laptop ~/annex ? this remote already exists under the name origin.
git remote add usbdrive /media/usb/annex ? because the actual repo would be in /media/usb/annex, not /media/usb?
Hi,
I could successfully clone my ssh repo's annex to my laptop, following these instructions. I'm also able to sync the repositories (laptop and ssh) when I commit new files in the ssh repo.
However, every time I try to get files from the ssh repo (using 'git annex get some_file'), nothing happens. Do you know what can be happening?
Thanks!
I may be missing something obvious, but when I copy to a remote repository, the object files are created, but no softlinks are created. When I pull everything from the remote, it pulls only files the local repo knows about already.
Moving from B to A creates no symlinks in A but the object files are moved to A. Copying back from A to B restores the object files in B and keeps them in A.
Copying from A to an empty C does not create any object files nor symlinks. Copying from C to A creates no symlinks in A but the object files are copied to A.
-- RichiH
git-annex-shell needs to be installed in the PATH on any host that will hold annexed files.
If you installed with cabal, it might be .cabal/bin/. Wherever it was installed to is apparently not on the PATH that is set when you ssh into that host.
Good spotting on the last line, fixed.
The laptop remote is indeed redundant, but it leads to clearer views of what is going on later in the walkthrough ("git pull laptop master", "(copying from laptop...)"). And if the original clone is made from a central bare repo, this reinforces that you'll want to set up remotes for other repos on the computer.
Hi,
It was already installed in PATH. In fact, I can call it from the command line, and it is recognized (e.g. calling 'git-annex-shell' gives me 'git-annex-shell: bad parameters'). However, every time I do a 'git annex whereis' or 'git annex get file --from repo', it gives me the following error:
bash: git-annex-shell: command not found
Command ssh ["-S","/Users/username/annex/.git/annex/ssh/username@example.edu","-o","ControlMaster=auto","-o","ControlPersist=yes","username@example.edu","git-annex-shell 'configlist' '/~/annex'"] failed; exit code 127
I tried to run this ssh command, but it gives me the same 'command not found' error. It seems that the problem is with the ssh repo? The ssh repo has a git-annex-shell working and installed in PATH.
git annex whereis on the file and see where it says it is.
Thanks for the quick reply!
I already did 'git annex sync', but it didn't work. The steps were: 'git clone ssh...', then 'cd annex', then 'git annex init "laptop"'
After that, I did a 'git annex sync', and tried to get the file, but nothing happens. That's why I found it weird. Any other thing that might have happened?
Thanks again!
git annex move only moves content. All symlink management is handled by git, so you have to keep repositories in sync using git as you would any other repo. When you git pull B in A, it will get whatever symlinks were added to B.
(It can be useful to use a central bare repo and avoid needing to git pull from one repo to another, then you can just always push commits to the central repo, and pull down all changes from other repos.)
Ah yes, I feel kinda stupid in hindsight.
As the central server is most likely a common use case, would you object if I added that to the walkthrough? Do you have any best practices on how to automate a push with every copy to a bare remote? AFAIK, git does not store information about bare/non-bare remotes, but this could easily be put into .git/config by git annex.
-- RichiH