@miku

miku/gi.md Secret

Created December 16, 2014 10:22
Git Internals 2014-12-16

cloning isn't just for sheep and galactic empires

2014-12-16



agenda

  1. Intro, Tools

  2. gif from the internets

  3. The repository

  4. End


About 50 slides. About 50 commands.


Disclaimer: I am just a git user.


try to forget

version control, svn or cvs for a moment.


git ecosystem

  • Git is ubiquitous. GitHub, Bitbucket and many other hosts use it. Many projects use it (Linux, Ruby, Go, Erlang, Rails, Homebrew, ...)

  • Heroku and other PaaS providers use git push based workflows.

  • Could you imagine something like git, but for writers? The current Pro Git book is written on GitHub, with git, together with a community.

  • Tools like docker borrow git's terminology when they say docker push or pull.

  • Git is used by teams of thousands of people.

  • It takes milliseconds to create a repository, even just to take a note.

  • Facebook had a 54 GB repository at one point (https://twitter.com/feross/status/459259593630433280).


no version control

Git did not start as a full version control system:

Git is a content-addressable filesystem. Great. What does that mean?

It means that

at the core of Git is a simple key-value data store. You can insert any kind of content into it, and it will give you back a key that you can use to retrieve the content again at any time.
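The key-value idea can be sketched in a few lines of Python. This is only a toy, assuming an in-memory dict stands in for the object database; the class name ToyStore and its methods are made up for illustration, and real Git hashes a small header together with the content.

```python
import hashlib


class ToyStore:
    """A toy content-addressable store: the key is derived from the content."""

    def __init__(self):
        self._objects = {}

    def put(self, content: bytes) -> str:
        # The key is the SHA-1 of the content (Git also prepends a header).
        key = hashlib.sha1(content).hexdigest()
        self._objects[key] = content
        return key

    def get(self, key: str) -> bytes:
        return self._objects[key]


store = ToyStore()
key = store.put(b"HELLOWORLD\n")
print(key, store.get(key))
```

Inserting the same content twice yields the same key, so identical content is stored only once.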


tooling

  • tooling is better now

  • IDE-Support, Netbeans 7.4+, vim, ...

Setup

It makes life easier.


SVN vs git repositories

How big is an empty SVN repository? About 100k for the server part plus 124k for workdir.

An initial git repository is 56k. Without the hooks (40k) you have:

  • 118 bytes of config (can be shortened),
  • 23 bytes of HEAD
  • 73 bytes of description (can be left blank, only used by GitWeb)
  • 240 bytes of exclude example (can be left blank)

So essentially, you have a few bytes for empty directories and about 100 bytes in configuration.


why no svn internals

$ sqlite3 wd/.svn/wc.db
SQLite version 3.7.12 2012-04-03 19:43:07
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite> .schema
CREATE TABLE ACTUAL_NODE (   wc_id  INTEGER NOT NULL REFERENCES WCROOT (id), ....
CREATE TABLE EXTERNALS (   wc_id  INTEGER NOT NULL REFERENCES WCROOT (id), ...
CREATE TABLE LOCK (   repos_id  INTEGER NOT NULL REFERENCES REPOSITORY (id), ...
CREATE TABLE NODES (   wc_id  INTEGER NOT NULL REFERENCES WCROOT (id),  ...
CREATE TABLE PRISTINE (   checksum  TEXT NOT NULL PRIMARY KEY,  ...
CREATE TABLE REPOSITORY (   id INTEGER PRIMARY KEY AUTOINCREMENT, ...
CREATE TABLE WCROOT (   id  INTEGER PRIMARY KEY AUTOINCREMENT, ...
CREATE TABLE WC_LOCK (   wc_id  INTEGER NOT NULL  REFERENCES WCROOT ...
CREATE TABLE WORK_QUEUE (   id  INTEGER PRIMARY KEY AUTOINCREMENT, ...
CREATE VIEW NODES_BASE AS   SELECT * FROM nodes   WHERE op_depth = 0;
CREATE VIEW NODES_CURRENT AS   SELECT * FROM nodes AS n   ...
CREATE INDEX I_ACTUAL_CHANGELIST ON ACTUAL_NODE (changelist);
CREATE INDEX I_ACTUAL_PARENT ON ACTUAL_NODE (wc_id, parent_relpath);
CREATE UNIQUE INDEX I_EXTERNALS_DEFINED ON EXTERNALS ...
CREATE INDEX I_EXTERNALS_PARENT ON EXTERNALS (wc_id, parent_relpath);
CREATE UNIQUE INDEX I_LOCAL_ABSPATH ON WCROOT (local_abspath);
CREATE INDEX I_NODES_PARENT ON NODES (wc_id, parent_relpath, op_depth);
...

git repository tree

$ tree .git/
.git/
├── HEAD
├── config
├── description
├── hooks
│   ├── applypatch-msg.sample
│   ├── commit-msg.sample
│   ├── post-update.sample
│   ├── pre-applypatch.sample
│   ├── pre-commit.sample
│   ├── pre-push.sample
│   ├── pre-rebase.sample
│   ├── prepare-commit-msg.sample
│   └── update.sample
├── info
│   └── exclude
├── objects
│   ├── info/
│   └── pack/
└── refs
    ├── heads/
    └── tags/

the HEAD

$ cat .git/HEAD
ref: refs/heads/master
  • a symbolic reference to the branch you are on

  • default branch is master

  • at init-time the refs/heads/master file does not exist


git config

$ cat .git/config
[core]
    repositoryformatversion = 0
    filemode = true
    bare = false
    logallrefupdates = true
  • repositoryformatversion -- is for forward compatibility.
  • filemode = true -- do not ignore executable bits
  • bare = false -- has working directory
  • logallrefupdates -- Enable the reflog

private excludes (per repository)

$ cat .git/info/exclude
# git ls-files --others --exclude-from=.git/info/exclude
# Lines that start with '#' are comments.
# For a project mostly in C, the following would be a good set of
# exclude patterns (uncomment them if you want to use them):
# *.[oa]
# *~

Like .gitignore, but local to this repository and not shared.


git repo size

No server/client distinction. This is a full repo:

$ du -h .git/
40K     .git/hooks
4.0K    .git/info
0       .git/objects/info
0       .git/objects/pack
0       .git/objects
0       .git/refs/heads
0       .git/refs/tags
0       .git/refs
56K     .git/ 

git bullet time

To understand what is going on, we will create a repository, create some files and commits.


git init

How many Git repositories could you initialize in a second? About 100!

$ time git init
Initialized empty Git repository in ~/tmp/.git/

real    0m0.010s
user    0m0.002s
sys     0m0.007s

create a file

$ echo "HELLOWORLD" > README.md

Has anything changed in the repo?

No, there is just an untracked file.

[git:master?] $ git st
On branch master

Initial commit

Untracked files:
  (use "git add <file>..." to include in what will be committed)

  README.md

nothing added to commit but untracked files present (use "git add" to track)

add the file

Now add it.

[git:master?] $ git add README.md

[git:master] $ tree .git/
.git/
├── HEAD
├── config
├── description
├── hooks
│   └── ...
├── index
├── info
│   └── exclude
├── objects
│   ├── c6
│   │   └── 3053a6310c57fae01ecfde5cdf62d6c31111ea
│   ├── info
│   └── pack
└── refs
    ├── heads
    └── tags

a first object

.git/
.
├── objects
│   ├── c6
│   │   └── 3053a6310c57fae01ecfde5cdf62d6c31111ea
.   .

Hooray, we have a first object, its full key is

  • c63053a6310c57fae01ecfde5cdf62d6c31111ea

what is it?

[git:master] $ git show c63053a6310c57fae01ecfde5cdf62d6c31111ea
HELLOWORLD

It's the content we added! Git calls this a blob.

But where does this SHA1 come from?


blob

[git:master] $ git hash-object README.md
c63053a6310c57fae01ecfde5cdf62d6c31111ea

Anyone in this solar system, running

[git:master] $ echo "HELLOWORLD" | git hash-object --stdin
c63053a6310c57fae01ecfde5cdf62d6c31111ea

will get the same key for this exact content.

It's just the sha of some header and the content:

$ python
>>> import hashlib
>>> hashlib.sha1(b"blob 11\x00HELLOWORLD\n").hexdigest()
'c63053a6310c57fae01ecfde5cdf62d6c31111ea'
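The same calculation as a small reusable function, as a sketch. The function name git_object_id is made up here; the header layout (type, space, size in bytes, NUL byte) is the one used in the REPL line above.

```python
import hashlib


def git_object_id(obj_type: str, data: bytes) -> str:
    """SHA-1 over '<type> <size>\\0' followed by the raw content."""
    header = f"{obj_type} {len(data)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + data).hexdigest()


print(git_object_id("blob", b"HELLOWORLD\n"))
```

The size in the header is counted in bytes, which is why the trailing newline written by echo matters: the content is 11 bytes, not 10.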

plumbing and porcelain

What is this

  • git hash-object?

It's from git's plumbing.

This is what Pro Git says:

... but because Git was initially a toolkit for a VCS rather than a full user-friendly VCS, it has a bunch of verbs that do low-level work and were designed to be chained together UNIX style or called from scripts. These commands are generally referred to as plumbing commands, and the more user- friendly commands are called porcelain commands.


object types

What is a blob now?

Git knows about four different kinds of object only and blob is the simplest one. It's basically file contents.

Let's see:

[git:master] $ git cat-file -t c63053a6310c57fae01ecfde5cdf62d6c31111ea
blob

It's small. Just 11 bytes.

[git:master] $ git cat-file -s c63053a6310c57fae01ecfde5cdf62d6c31111ea
11
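What sits in the file under objects/c6/ is the header plus the content, compressed with zlib. A rough Python sketch of writing and reading such a loose object, kept in memory here; the function names are made up, and packed objects are ignored entirely.

```python
import hashlib
import zlib


def encode_loose_object(obj_type: str, data: bytes):
    """Return (key, relative path, compressed bytes) for a loose object."""
    store = f"{obj_type} {len(data)}".encode("ascii") + b"\x00" + data
    key = hashlib.sha1(store).hexdigest()
    # Fan-out: the first two hex chars name the directory, the rest the file.
    path = f"objects/{key[:2]}/{key[2:]}"
    return key, path, zlib.compress(store)


def decode_loose_object(compressed: bytes):
    """Return (type, size, content), roughly what cat-file -t/-s/-p report."""
    raw = zlib.decompress(compressed)
    header, _, content = raw.partition(b"\x00")
    obj_type, size = header.decode("ascii").split(" ")
    return obj_type, int(size), content


key, path, blob = encode_loose_object("blob", b"HELLOWORLD\n")
print(path)
print(decode_loose_object(blob))
```

The type and the size come straight out of the header, so answering -t or -s never requires interpreting the content.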

object types

The help page for git cat-file gives away the other object types:

[git:master] $ git cat-file -h
usage: git cat-file (-t|-s|...) <object>
...

<type> can be one of: blob, tree, commit, tag
    -t                    show object type
    -s                    show object size

Ok, there are

  • blobs,
  • trees,
  • commits and
  • tags.

Nothing more. Let's get back to our repo, and create more objects.


git commit

[git:master] $ git ci -m "Add README"
[master (root-commit) ede7e84] Add README
 1 file changed, 1 insertion(+)
 create mode 100644 README.md

Somewhere in some man page or git book it is recommended that you write commit messages in the present tense, imperative mood ("Add README", not "Added README").

First line of commit message should not exceed 72/80 chars.

Then a blank line.

Then some detailed explanation, if necessary. People will thank you for following standards.


state of the repo

Let's look at our repo now. You know that there are only blob, tree, commit and tag objects, and that we already had a single blob object after adding the file. Can you guess how many objects we have now, and of which types?

  • certainly there will be a - single - commit object
  • something else?
[git:master] $ tree .git

.git/
├── COMMIT_EDITMSG
├── HEAD
├── config
├── description
├── hooks
│   └── ...
├── index
├── info
│   └── exclude
├── logs
│   ├── HEAD
│   └── refs
│       └── heads
│           └── master
├── objects
│   ├── c0
│   │   └── a176e993d123ff67e7d065fee438e7e3ef7a92
│   ├── c6
│   │   └── 3053a6310c57fae01ecfde5cdf62d6c31111ea
│   ├── ed
│   │   └── e7e847cf64729e721dad8aa778a12b18306806
│   ├── info
│   └── pack
└── refs
    ├── heads
    │   └── master
    └── tags

c63053a6310c57fae01ecfde5cdf62d6c31111ea is the blob.

Let's look at c0a176e993d123ff67e7d065fee438e7e3ef7a92 and ede7e847cf64729e721dad8aa778a12b18306806.

We saw ede7e84 in the commit output before. We could guess that this is the commit object. Let's see:

[git:master] $ git cat-file -t ede7e84
commit

Sure enough.

We have a single file, a commit. We haven't created any tags, so c0a176e993d123ff67e7d065fee438e7e3ef7a92 will probably be a tree:

[git:master] $ git cat-file -t c0a176e993d123ff67e7d065fee438e7e3ef7a92
tree

Easy?


Let's pretty print all of them. Let's start with the blob (again):

[git:master] $ git cat-file -p c63053a6310c57fae01ecfde5cdf62d6c31111ea
HELLOWORLD

Then the tree:

[git:master] $ git cat-file -p c0a176e993d123ff67e7d065fee438e7e3ef7a92
100644 blob c63053a6310c57fae01ecfde5cdf62d6c31111ea    README.md

Perms, type, sha, name.
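On disk the tree is not stored as this text. A sketch of the raw layout, assuming the usual encoding where each entry is the mode, a space, the name, a NUL byte, and then the 20 raw bytes of the SHA1; the helper names are made up for illustration.

```python
import hashlib


def tree_entry(mode: str, name: str, sha_hex: str) -> bytes:
    """One tree entry: '<mode> <name>\\0' followed by the 20 raw SHA-1 bytes."""
    return f"{mode} {name}".encode("utf-8") + b"\x00" + bytes.fromhex(sha_hex)


def parse_tree(body: bytes):
    """Split a raw tree body back into (mode, name, sha_hex) tuples."""
    entries, pos = [], 0
    while pos < len(body):
        nul = body.index(b"\x00", pos)
        mode, name = body[pos:nul].decode("utf-8").split(" ", 1)
        sha_hex = body[nul + 1:nul + 21].hex()
        entries.append((mode, name, sha_hex))
        pos = nul + 21
    return entries


body = tree_entry("100644", "README.md", "c63053a6310c57fae01ecfde5cdf62d6c31111ea")
tree_id = hashlib.sha1(b"tree %d\x00" % len(body) + body).hexdigest()
print(parse_tree(body))
print(tree_id)
```

If the sketch matches Git's format, the printed key should be the tree key from the repository listing, since the tree's key depends only on the modes, names and blob keys it contains.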


And the commit:

[git:master] $ git cat-file -p ede7e84
tree c0a176e993d123ff67e7d065fee438e7e3ef7a92
author Martin Czygan <martin.czygan@gmail.com> 1418248242 +0100
committer Martin Czygan <martin.czygan@gmail.com> 1418248242 +0100

Add README

Tree, author, committer, message, dates.
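A commit is plain text with the same kind of header in front. A sketch of how such a body could be assembled and keyed; the function names are made up, and for simplicity author and committer are assumed to be the same person with the same timestamp, as in the output above.

```python
import hashlib


def commit_body(tree: str, author: str, timestamp: int, tz: str,
                message: str, parents=()) -> bytes:
    """Assemble the plain-text body of a commit object."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]  # none for a root commit
    lines.append(f"author {author} {timestamp} {tz}")
    lines.append(f"committer {author} {timestamp} {tz}")
    return ("\n".join(lines) + "\n\n" + message + "\n").encode("utf-8")


def object_id(obj_type: str, data: bytes) -> str:
    header = f"{obj_type} {len(data)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + data).hexdigest()


body = commit_body(
    "c0a176e993d123ff67e7d065fee438e7e3ef7a92",
    "Martin Czygan <martin.czygan@gmail.com>", 1418248242, "+0100",
    "Add README",
)
print(body.decode("utf-8"))
print(object_id("commit", body))
```

Because the timestamps are part of the hashed content, committing the very same tree a second later gives a different commit key, while the tree and blob keys stay the same.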



something else in the tree?

[git:master] $ tree .git/
.git/
.
.
├── objects
│   └── ...
└── refs
    ├── heads
    │   └── master
    └── tags

refs/heads/master

We now have a refs/heads/master file.

What's in there?

[git:master] $ cat .git/refs/heads/master
ede7e847cf64729e721dad8aa778a12b18306806

This is the SHA1 of our first commit.

[git:master] $ git cat-file -p $(cat .git/refs/heads/master)
tree c0a176e993d123ff67e7d065fee438e7e3ef7a92
author Martin Czygan <martin.czygan@gmail.com> 1418248242 +0100
committer Martin Czygan <martin.czygan@gmail.com> 1418248242 +0100

Add README

We would get the same output for each of the following:

[git:master] $ git cat-file -p refs/heads/master
[git:master] $ git cat-file -p master

git cat-file for humans

[git:master] $ git show refs/heads/master
commit ede7e847cf64729e721dad8aa778a12b18306806
Author: Martin Czygan <martin.czygan@gmail.com>
Date:   Wed Dec 10 22:50:42 2014 +0100

    Add README

diff --git a/README.md b/README.md
new file mode 100644
index 0000000..c63053a
--- /dev/null
+++ b/README.md
@@ -0,0 +1 @@
+HELLOWORLD

show formats the date and adds colors and a diff.

Or just:

[git:master] $ git show master
...

where is my HEAD

HEAD points to the branch you are currently on, the branch your next commit will go to.

[git:master] $ cat .git/HEAD
ref: refs/heads/master

So, one last time...


.git/
├── COMMIT_EDITMSG
├── HEAD
├── config
├── description
├── hooks
│   └── ...
├── index
├── info
│   └── exclude
├── logs
│   ├── HEAD
│   └── refs
│       └── heads
│           └── master
├── objects
│   ├── c0
│   │   └── a176e993d123ff67e7d065fee438e7e3ef7a92
│   ├── c6
│   │   └── 3053a6310c57fae01ecfde5cdf62d6c31111ea
│   ├── ed
│   │   └── e7e847cf64729e721dad8aa778a12b18306806
│   ├── info
│   └── pack
└── refs
    ├── heads
    │   └── master
    └── tags

repository summary

  • config, boring stuff

  • description, boring stuff

  • info, boring stuff

  • objects, the key-value store

  • refs, branches

We still need to sort out two things we haven't talked about:

  • logs and

  • index

We won't talk about the logs today (but you can inspect them on your own). They are the reflog, some internal bookkeeping of where HEAD and the branches pointed.


index

  • sits between the working dir and the object database
  • it's the staging area
  • everything you want to commit is gathered there

Why wouldn't I just put everything I touched into a single commit, with a commit message like this:

* updated docs
* fixed that little bug
* refactored my git talk

A commit, in the best case, represents a single conceptual change to a project. Not necessarily a single file. This gives a clearer history and makes many other things easier (another talk).

Nothing you change gets committed automatically. In fact you have to add your changes each time you want to commit something.
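A toy model of that rule, nothing more: the working directory and the index are plain dicts here, and add and commit are made-up stand-ins for the real commands. A commit snapshots the index, not the working directory.

```python
import hashlib


def blob_id(data: bytes) -> str:
    return hashlib.sha1(b"blob %d\x00" % len(data) + data).hexdigest()


working_dir = {"README.md": b"HELLOWORLD\n", "notes.txt": b"scratch\n"}
index = {}  # path -> blob key: the staging area


def add(path: str) -> None:
    # Staging records the content as it is at this moment.
    index[path] = blob_id(working_dir[path])


def commit() -> dict:
    # A commit snapshots the index, not the working directory.
    return dict(index)


add("README.md")
working_dir["README.md"] += b"GIT'S STRANGE\n"  # edited after staging
snapshot = commit()

print(sorted(snapshot))
print(snapshot["README.md"] == blob_id(b"HELLOWORLD\n"))
```

The untracked notes.txt and the later edit to README.md both stay out of the snapshot until they are added.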


index

Why is the index useful?

You can edit whatever you want in your files, but still keep your commits and your project's history clean.

Can you add partial changes to a file in svn? I guess you can, somehow (http://stackoverflow.com/q/75809/89391)

Git pro tip:

$ git add --patch

update README

[git:master] $ echo "GIT'S STRANGE" >> README.md
[git:master+] $ git st
On branch master
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

    modified:   README.md

no changes added to commit (use "git add" and/or "git commit -a")   

There it is:

Changes not staged for commit

Why not? Because of the index. I am under no pressure to touch only those files that I want to commit. I could even switch to a new branch of development at this point.

Let's actually do this.


git checkout -b

[git:master+] $ git checkout -b getting-subjective
M   README.md
Switched to a new branch 'getting-subjective'

Again, let's see what the repo looks like...


[git:getting-subjective+] $ tree .git/
.git/
├── COMMIT_EDITMSG
├── HEAD
├── config
├── description
├── hooks
│   └── ...
├── index
├── info
│   └── exclude
├── logs
│   └── ...
├── objects
│   ├── c0
│   │   └── a176e993d123ff67e7d065fee438e7e3ef7a92
│   ├── c6
│   │   └── 3053a6310c57fae01ecfde5cdf62d6c31111ea
│   ├── ed
│   │   └── e7e847cf64729e721dad8aa778a12b18306806
│   ├── info
│   └── pack
└── refs
    ├── heads
    │   ├── getting-subjective
    │   └── master
    └── tags

no new object

There is no new object so far, since we haven't added README.md yet.

But there is a new ref, namely refs/heads/getting-subjective.

What is in there?

[git:getting-subjective+] $ cat .git/refs/heads/getting-subjective
ede7e847cf64729e721dad8aa778a12b18306806

This is the first commit's SHA1.

What do you guess is in HEAD? Remember, HEAD points to the branch we are currently on:

[git:getting-subjective+] $ cat .git/HEAD
ref: refs/heads/getting-subjective

Sure enough.


imagine

Ok. Imagine for a moment what will happen. We are on a new branch, because we want to keep some change separate. For now. The branch we are on is named getting-subjective.

The next commit would do the following:

It would record the change to README.md. It would record a commit and would move forward the branch pointer (in refs/heads/getting-subjective) to the new commit.

We would have three new objects. A new blob object. A new blob object implies a new tree, because the tree contains the SHA1 of the blob. And a new commit object.

Is this true?


git add again

We now add all the changes:

[git:getting-subjective+] $ git add README.md

And the tree?


[git:getting-subjective] $ tree .git/
.git/
├── COMMIT_EDITMSG
├── HEAD
.
.
├── objects
│   ├── 48
│   │   └── 21aa9297cfcd420f8efee1d8cef542ff3ca0fd
│   ├── c0
│   │   └── a176e993d123ff67e7d065fee438e7e3ef7a92
│   ├── c6
│   │   └── 3053a6310c57fae01ecfde5cdf62d6c31111ea
│   ├── ed
│   │   └── e7e847cf64729e721dad8aa778a12b18306806
│   ├── info
│   └── pack
└── refs
    ├── heads
    │   ├── getting-subjective
    │   └── master
    └── tags

welcome, 4821aa9

Ah, 4821aa9297cfcd420f8efee1d8cef542ff3ca0fd is new.

This is no surprise for us anymore:

[git:getting-subjective] $ git cat-file -t 4821aa9297cfcd420f8efee1d8cef542ff3ca0fd
blob

[git:getting-subjective] $ git cat-file -p 4821aa9297cfcd420f8efee1d8cef542ff3ca0fd
HELLOWORLD
GIT'S STRANGE

Now, let's commit that...


git commit again

[git:getting-subjective] $ git ci -m "Well, that's just your opinion, man."
[getting-subjective d0cdeeb] Well, that's just your opinion, man.
1 file changed, 1 insertion(+)

How many objects do we have now? Correct. 6.


[git:getting-subjective] $ tree .git
.git/
.
├── objects
│   ├── 2c
│   │   └── 19f5ff0194af40c6f38e2f81c4c1e6b9b1832e
│   ├── 48
│   │   └── 21aa9297cfcd420f8efee1d8cef542ff3ca0fd
│   ├── c0
│   │   └── a176e993d123ff67e7d065fee438e7e3ef7a92
│   ├── c6
│   │   └── 3053a6310c57fae01ecfde5cdf62d6c31111ea
│   ├── d0
│   │   └── cdeeb0752c14a6ff486ce478b8dc02adc33c68
│   ├── ed
│   │   └── e7e847cf64729e721dad8aa778a12b18306806
│   ├── info
│   └── pack
.

where is my HEAD?

[git:getting-subjective] $ cat .git/HEAD
ref: refs/heads/getting-subjective

[git:getting-subjective] $ cat .git/refs/heads/getting-subjective
d0cdeeb0752c14a6ff486ce478b8dc02adc33c68

Or shorter:

[git:getting-subjective] $ git rev-parse HEAD
d0cdeeb0752c14a6ff486ce478b8dc02adc33c68
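The two cat calls above amount to a tiny lookup. A simplified Python sketch of it, run against a throwaway directory that mimics the two files; resolve_head is a made-up name, and refs that live in a packed-refs file are not handled.

```python
import os
import tempfile


def resolve_head(git_dir: str) -> str:
    """Follow HEAD through a symbolic ref to a commit key."""
    with open(os.path.join(git_dir, "HEAD")) as f:
        value = f.read().strip()
    if value.startswith("ref: "):
        # Symbolic ref: the rest of the line names a file under the git dir.
        ref_path = os.path.join(git_dir, *value[len("ref: "):].split("/"))
        with open(ref_path) as f:
            value = f.read().strip()
    return value


# Build a throwaway directory that mimics the two files from the slides.
git_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(git_dir, "refs", "heads"))
with open(os.path.join(git_dir, "HEAD"), "w") as f:
    f.write("ref: refs/heads/getting-subjective\n")
with open(os.path.join(git_dir, "refs", "heads", "getting-subjective"), "w") as f:
    f.write("d0cdeeb0752c14a6ff486ce478b8dc02adc33c68\n")

print(resolve_head(git_dir))
```

If HEAD contains a key directly instead of a "ref: " line, the function returns it unchanged.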

what happened?

[git:getting-subjective] $ git log
commit d0cdeeb0752c14a6ff486ce478b8dc02adc33c68
Author: Martin Czygan <martin.czygan@gmail.com>
Date:   Wed Dec 10 23:39:53 2014 +0100

    Well, that's just your opinion, man.

commit ede7e847cf64729e721dad8aa778a12b18306806
Author: Martin Czygan <martin.czygan@gmail.com>
Date:   Wed Dec 10 22:50:42 2014 +0100

    Add README

You can see all code changes (patches) with git log -p.


more questions

What happened in getting-subjective that did not happen in master?

[git:getting-subjective] $ git log master..getting-subjective
commit d0cdeeb0752c14a6ff486ce478b8dc02adc33c68
Author: Martin Czygan <martin.czygan@gmail.com>
Date:   Wed Dec 10 23:39:53 2014 +0100

    Well, that's just your opinion, man.

And what happened in master that did not happen in getting-subjective?

[git:getting-subjective] $ git log getting-subjective..master

Yes, nothing is correct.
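The two-dot range can be read as a set difference on the commit graph. A sketch in Python, with the two commits from this repo written as a hand-made dict of parent links (abbreviated keys); the function names are made up for illustration.

```python
def ancestors(parents: dict, start: str) -> set:
    """All commits reachable from start, including start itself."""
    seen, stack = set(), [start]
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents.get(commit, []))
    return seen


def range_commits(parents: dict, a: str, b: str) -> set:
    """Commits reachable from b but not from a, i.e. what `a..b` selects."""
    return ancestors(parents, b) - ancestors(parents, a)


# The two commits from the slides, by abbreviated key.
parents = {"ede7e84": [], "d0cdeeb": ["ede7e84"]}
branches = {"master": "ede7e84", "getting-subjective": "d0cdeeb"}

print(range_commits(parents, branches["master"], branches["getting-subjective"]))
print(range_commits(parents, branches["getting-subjective"], branches["master"]))
```

The first call yields the one new commit, the second an empty set, matching the two git log outputs above.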


back to master

Ok, go back to our master.

[git:getting-subjective] $ git co master

what happened on master?

[git:master] $ git log
commit ede7e847cf64729e721dad8aa778a12b18306806
Author: Martin Czygan <martin.czygan@gmail.com>
Date:   Wed Dec 10 22:50:42 2014 +0100

    Add README

[git:master] $ cat README.md
HELLOWORLD

HEAD is back at master:

[git:master] $ cat .git/HEAD
ref: refs/heads/master

And master points to the first commit, still.

[git:master] $ cat .git/refs/heads/master
ede7e847cf64729e721dad8aa778a12b18306806

git merge

[git:master] $ git merge getting-subjective
Updating ede7e84..d0cdeeb
Fast-forward
 README.md | 1 +
 1 file changed, 1 insertion(+)

This is a so-called fast-forward.


master

[git:master] $ git log
commit d0cdeeb0752c14a6ff486ce478b8dc02adc33c68
Author: Martin Czygan <martin.czygan@gmail.com>
Date:   Wed Dec 10 23:39:53 2014 +0100

    Well, that's just your opinion, man.

commit ede7e847cf64729e721dad8aa778a12b18306806
Author: Martin Czygan <martin.czygan@gmail.com>
Date:   Wed Dec 10 22:50:42 2014 +0100

    Add README

Are there any new objects? Look at the output, what do you think?


No. Because we fast-forwarded.

[git:master] $ tree .git
.git/
.
├── objects
│   ├── 2c
│   │   └── 19f5ff0194af40c6f38e2f81c4c1e6b9b1832e
│   ├── 48
│   │   └── 21aa9297cfcd420f8efee1d8cef542ff3ca0fd
│   ├── c0
│   │   └── a176e993d123ff67e7d065fee438e7e3ef7a92
│   ├── c6
│   │   └── 3053a6310c57fae01ecfde5cdf62d6c31111ea
│   ├── d0
│   │   └── cdeeb0752c14a6ff486ce478b8dc02adc33c68
│   ├── ed
│   │   └── e7e847cf64729e721dad8aa778a12b18306806
│   ├── info
│   └── pack
.

Still only 6 objects. But refs/heads/master changed.

[git:master] $ cat .git/refs/heads/master
d0cdeeb0752c14a6ff486ce478b8dc02adc33c68
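Why no new objects were needed can be sketched like this: if the current branch's commit is an ancestor of the other branch's commit, moving the ref is enough. The dicts and function names below are made up for illustration, and real merges that need a merge commit are out of scope.

```python
def is_ancestor(parents: dict, old: str, new: str) -> bool:
    """True if old can be reached by walking parent links from new."""
    stack, seen = [new], set()
    while stack:
        commit = stack.pop()
        if commit == old:
            return True
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents.get(commit, []))
    return False


def merge_fast_forward(refs: dict, parents: dict, current: str, other: str) -> bool:
    """Move the current branch to other's commit if no merge commit is needed."""
    if is_ancestor(parents, refs[current], refs[other]):
        refs[current] = refs[other]  # only the ref changes, no new objects
        return True
    return False


parents = {"ede7e84": [], "d0cdeeb": ["ede7e84"]}
refs = {"master": "ede7e84", "getting-subjective": "d0cdeeb"}
print(merge_fast_forward(refs, parents, "master", "getting-subjective"), refs["master"])
```

In the other direction, merging master into getting-subjective before the fast-forward, there would be nothing to move either, since master's commit was already part of the branch's history.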

wrapping it up

Git has simple internals.

There are only four object types. For a simple workflow, you don't even need tags.

Many things will fall out of this.


wrapping it up

Many things will fall out of this.


What are these objects?

$ git clone git@github.com:torvalds/linux
Cloning into 'linux'...
remote: Counting objects: 3927483, done.
...

If you git fetch, git fetches the objects and the references.

Once you fetched a remote, you can run things like:

$ git log master..origin/master

wrapping it up

More things will fall out of this.


Things like reset and rebase are just tools to manipulate the DAG.

$ git reset --hard origin/master

Not necessarily trivial, but not mysterious either.

$ git rebase -i a64f821

there is more

But there are still many things to cover.

  • What is git reset really?

  • GitHub hype? Collaboration? Tarballs and patches?

  • Fixing history (reset, cherry-pick)

  • Merge strategies

  • Remotes


thanks
