Gaining advanced insights from Git repository history.

Hercules

Fast, insightful and highly customizable Git history analysis.

Overview • How To Use • Installation • Contributions • License

Overview

Hercules is an amazingly fast and highly customizable Git repository analysis engine written in Go. Batteries are included. Powered by go-git.

Notice (November 2020): the main author is back from the limbo and is gradually resuming the development. See the roadmap.

There are two command-line tools: hercules and labours. The first is a program written in Go which takes a Git repository and executes a Directed Acyclic Graph (DAG) of analysis tasks over the full commit history. The second is a Python script which shows some predefined plots over the collected data. These two tools are normally used together through a pipe. It is possible to write custom analyses using the plugin system. It is also possible to merge several analysis results together - relevant for organizations. The analyzed commit history includes branches, merges, etc.

Hercules has been successfully used for several internal projects at source{d}. There are blog posts: 1, 2 and a presentation. Please contribute by testing, fixing bugs, adding new analyses, or coding swagger!

The DAG of burndown and couples analyses with UAST diff refining. Generated with hercules --burndown --burndown-people --couples --feature=uast --dry-run --dump-dag doc/dag.dot https://github.com/src-d/hercules

torvalds/linux line burndown (granularity 30, sampling 30, resampled by year). Generated with hercules --burndown --first-parent --pb https://github.com/torvalds/linux | labours -f pb -m burndown-project in 1h 40min.

Installation

Grab the hercules binary from the Releases page. labours is installable from PyPI:

pip3 install labours

pip3 is the Python package manager.

Numpy and Scipy can be installed on Windows using http://www.lfd.uci.edu/~gohlke/pythonlibs/

Build from source

You are going to need Go (>= v1.11) and protoc.

git clone https://github.com/src-d/hercules && cd hercules
make
pip3 install -e ./python

GitHub Action

It is possible to run Hercules as a GitHub Action: Hercules on GitHub Marketplace. Please refer to the sample workflow, which demonstrates how to set it up.

Contributions

...are welcome! See CONTRIBUTING and code of conduct.

License

Apache 2.0

Usage

The most useful and reliably up-to-date command line reference:

hercules --help

Some examples:

# Use "memory" go-git backend and display the burndown plot. "memory" is the fastest but the repository's git data must fit into RAM.
hercules --burndown https://github.com/go-git/go-git | labours -m burndown-project --resample month
# Use "file system" go-git backend and print some basic information about the repository.
hercules /path/to/cloned/go-git
# Use "file system" go-git backend, cache the cloned repository to /tmp/repo-cache, use Protocol Buffers and display the burndown plot without resampling.
hercules --burndown --pb https://github.com/git/git /tmp/repo-cache | labours -m burndown-project -f pb --resample raw

# Now something fun
# Get the linear history from git rev-list, reverse it
# Pipe to hercules, produce burndown snapshots for every 30 days grouped by 30 days
# Save the raw data to cache.yaml, so that later it is possible to run labours -i cache.yaml
# Pipe the raw data to labours, set text font size to 16pt, use Agg matplotlib backend and save the plot to git.png
git rev-list HEAD | tac | hercules --commits - --burndown https://github.com/git/git | tee cache.yaml | labours -m burndown-project --font-size 16 --backend Agg --output git.png

labours -i /path/to/yaml reads hercules output that was previously saved to disk.

Caching

It is possible to store the cloned repository on disk. The subsequent analysis can run on the corresponding directory instead of cloning from scratch:

# First time - cache
hercules https://github.com/git/git /tmp/repo-cache
# Second time - use the cache
hercules --some-analysis /tmp/repo-cache

GitHub Action

The action produces the artifact named hercules_charts. Since it is currently impossible to pack several files in one artifact, all the charts and Tensorflow Projector files are packed in the inner tar archive. In order to view the embeddings, go to projector.tensorflow.org, click "Load" and choose the two TSVs. Then use UMAP or T-SNE.

Docker image

docker run --rm srcd/hercules hercules --burndown --pb https://github.com/git/git | docker run --rm -i -v $(pwd):/io srcd/hercules labours -f pb -m burndown-project -o /io/git_git.png

Built-in analyses

Project burndown

hercules --burndown
labours -m burndown-project

Line burndown statistics for the whole repository. Exactly the same as what git-of-theseus does, but much faster. Blaming is performed efficiently and incrementally using a custom RB tree tracking algorithm, and only the last modification date is recorded while running the analysis.

All burndown analyses depend on the values of granularity and sampling. Granularity is the number of days each band in the stack consists of. Sampling is the frequency with which the burndown state is snapshotted. The smaller the value, the smoother the plot, but the more work is done.
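To make the two parameters more concrete, here is a small sketch of how days since the project's birth could map onto bands and snapshot times. The function names and the exact rounding are assumptions made for illustration; they are not taken from the hercules implementation.

```python
def band_index(day, granularity):
    """Index of the band holding lines written `day` days after project birth."""
    return day // granularity


def snapshot_days(total_days, sampling):
    """Days (counted from project birth) at which a snapshot would be taken."""
    return list(range(sampling, total_days + 1, sampling))


# With granularity 30 and sampling 30 over a 90-day history:
print(band_index(45, 30))     # a line written on day 45 lands in band 1
print(snapshot_days(90, 30))  # [30, 60, 90]
```

Halving the sampling value doubles the number of snapshots in this model, which is where the extra work comes from.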

There is an option to resample the bands inside labours, so that you can define a very precise distribution and visualize it in different ways. Besides, resampling aligns the bands across periodic boundaries, e.g. months or years. Unresampled bands are not aligned and start from the project's birth date.

Files

hercules --burndown --burndown-files
labours -m burndown-file

Burndown statistics for every file in the repository which is alive in the latest revision.

Note: it will generate a separate graph for every file. You don't want to run it on a repository with many files.

People

hercules --burndown --burndown-people [--people-dict=/path/to/identities]
labours -m burndown-person

Burndown statistics for the repository's contributors. If --people-dict is not specified, the identities are discovered by the following algorithm:

We start from the root commit towards the HEAD. Emails and names are converted to lower case.
If we process an unknown email and name, record them as a new developer.
If we process a known email but unknown name, match to the developer with the matching email, and add the unknown name to the list of that developer's names.
If we process an unknown email but known name, match to the developer with the matching name, and add the unknown email to the list of that developer's emails.
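The steps above can be sketched in a few lines of Python. This is an illustration only, not the code hercules runs: the function name and the data layout are made up, and the choice to prefer the email match when both the email and the name are already known is an assumption, since the steps do not cover that case.

```python
def discover_identities(signatures):
    """Merge (name, email) pairs, ordered from the root commit to HEAD,
    into developers. Each developer is a dict of name and email sets."""
    developers = []
    by_name, by_email = {}, {}  # lowercased value -> index into developers
    for name, email in signatures:
        name, email = name.lower(), email.lower()
        if email in by_email:      # known email (assumed to take precedence)
            idx = by_email[email]
        elif name in by_name:      # unknown email, known name
            idx = by_name[name]
        else:                      # both unknown: a new developer
            idx = len(developers)
            developers.append({"names": set(), "emails": set()})
        developers[idx]["names"].add(name)
        developers[idx]["emails"].add(email)
        by_name.setdefault(name, idx)
        by_email.setdefault(email, idx)
    return developers


history = [
    ("Alice", "alice@a.com"),
    ("Alice B", "ALICE@a.com"),   # known email, new name
    ("alice", "alice@b.org"),     # known name, new email
    ("Bob", "bob@x.io"),
]
for developer in discover_identities(history):
    print(sorted(developer["names"]), sorted(developer["emails"]))
```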

If --people-dict is specified, it should point to a text file with the custom identities. The format is: every line is a single developer, it contains all the matching emails and names separated by |. The case is ignored.
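As a rough sketch of that file format, the snippet below parses identities text into a lookup table from each alias to its developer's line index. The function name is hypothetical, and trimming whitespace around each alias and skipping empty lines are assumptions that the description above does not confirm.

```python
def parse_people_dict(text):
    """Map every lowercased name or email to the index of its developer."""
    lookup = {}
    index = 0
    for line in text.splitlines():
        # Assumption: surrounding whitespace is not significant.
        aliases = [a.strip().lower() for a in line.split("|") if a.strip()]
        if not aliases:
            continue  # assumption: blank lines are ignored
        for alias in aliases:
            lookup[alias] = index
        index += 1
    return lookup


sample = "Alice|alice@a.com|alice@b.org\nBob|bob@x.io\n"
print(parse_people_dict(sample))
```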

Overwrites matrix

Wireshark top 20 devs - overwrites matrix

hercules --burndown --burndown-people [--people-dict=/path/to/identities]
labours -m overwrites-matrix

Beside the burndown information, --burndown-people collects the added and deleted line statistics per developer. Thus it can be visualized how many lines written by developer A are removed by developer B. This indicates collaboration between people and defines expertise teams.

The format is the matrix with N rows and (N+2) columns, where N is the number of developers.

First column is the number of lines the developer wrote.
Second column is how many lines were written by the developer and deleted by unidentified developers (if --people-dict is not specified, it is always 0).
The rest of the columns show how many lines were written by the developer and deleted by identified developers.

The sequence of developers is stored in the people_sequence YAML node.
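Reading such a matrix can be sketched as follows. The numbers are made up, the function name is hypothetical, and two things are assumed rather than confirmed by the description above: that the trailing N columns follow the same order as people_sequence, and that only the magnitude of each count matters (abs() is used so the sketch tolerates either sign convention).

```python
def overwrite_shares(matrix, people):
    """For each author, the fraction of their written lines that each
    identified developer deleted."""
    shares = {}
    for author, row in zip(people, matrix):
        written = abs(row[0])        # column 1: lines the developer wrote
        by_identified = row[2:]      # columns 3..N+2: deleted by each developer
        shares[author] = {
            deleter: (abs(deleted) / written if written else 0.0)
            for deleter, deleted in zip(people, by_identified)
        }
    return shares


people = ["alice", "bob"]
matrix = [
    [100, 0, 10, 25],  # alice wrote 100 lines; bob deleted 25 of them
    [40, 0, 8, 4],     # bob wrote 40 lines; alice deleted 8 of them
]
shares = overwrite_shares(matrix, people)
print(shares["alice"]["bob"])  # 0.25
print(shares["bob"]["alice"])  # 0.2
```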

Code ownership

Ember.js top 20 devs - code ownership

hercules --burndown --burndown-people [--people-dict=/path/to/identities]
labours -m ownership

--burndown-people also makes it possible to draw the code share through time as a stacked area plot. That is, how many lines are alive at the sampled moments in time for each identified developer.

Couples

torvalds/linux files' coupling in Tensorflow Projector

hercules --couples [--people-dict=/path/to/identities]
labours -m couples -o <name> [--couples-tmp-dir=/tmp]

Important: it requires Tensorflow to be installed, please follow official instructions.

The files are coupled if they are changed in the same commit. The developers are coupled if they change the same file. hercules records the number of couples throughout the whole commit history and outputs the two corresponding co-occurrence matrices. labours then trains Swivel embeddings - dense vectors which reflect the co-occurrence probability through the Euclidean distance. The training requires a working Tensorflow installation. The intermediate files are stored in the system temporary directory or --couples-tmp-dir if it is specified. The trained embeddings are written to the current working directory with the name depending on -o. The output format is TSV and matches Tensorflow Projector so that the files and people can be visualized with t-SNE implemented in TF Projector.
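The file coupling rule is easy to illustrate. The sketch below counts, for each pair of files, the number of commits that changed both. It is a toy model of the idea, not hercules' implementation: the function name and the input shape (one list of changed paths per commit) are assumptions, and the result is a sparse counter rather than a full matrix.

```python
from collections import Counter
from itertools import combinations


def file_cooccurrence(commits):
    """Count, for every unordered pair of files, the commits touching both."""
    counts = Counter()
    for files in commits:
        # sorted(set(...)) gives each pair a single canonical key per commit.
        for a, b in combinations(sorted(set(files)), 2):
            counts[(a, b)] += 1
    return counts


commits = [
    ["a.go", "b.go"],
    ["a.go", "b.go", "c.go"],
    ["c.go"],
]
counts = file_cooccurrence(commits)
print(counts[("a.go", "b.go")])  # 2
print(counts[("a.go", "c.go")])  # 1
```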

Structural hotness

      46  jinja2/compiler.py:visit_Template [FunctionDef]
      42  jinja2/compiler.py:visit_For [FunctionDef]
      34  jinja2/compiler.py:visit_Output [FunctionDef]
      29  jinja2/environment.py:compile [FunctionDef]
      27  jinja2/compiler.py:visit_Include [FunctionDef]
      22  jinja2/compiler.py:visit_Macro [FunctionDef]
      22  jinja2/compiler.py:visit_FromImport [FunctionDef]
      21  jinja2/compiler.py:visit_Filter [FunctionDef]
      21  jinja2/runtime.py:__call__ [FunctionDef]
      20  jinja2/compiler.py:visit_Block [FunctionDef]

Thanks to Babelfish, hercules is able to measure how many times each structural unit has been modified. By default, it looks at functions; refer to the Semantic UAST XPath manual to switch to something else.

hercules --shotness [--shotness-xpath-*]
labours -m shotness

Couples analysis automatically loads "shotness" data if available.

hercules --shotness --pb https://github.com/pallets/jinja | labours -m couples -f pb

Aligned commit series

tensorflow/tensorflow aligned commit series of top 50 developers by commit number.

hercules --devs [--people-dict=/path/to/identities]
labours -m devs -o <name>

We record how many commits were made, as well as lines added, removed and changed per day for each developer. We plot the resulting commit time series using a few tricks to show the temporal grouping. In other words, two adjacent commit series should look similar after normalization.

We compute the distance matrix of the commit series. Our distance metric is Dynamic Time Warping. We use the FastDTW algorithm, which has linear complexity in the length of the time series. Thus the overall complexity of computing the matrix is quadratic.
We compile the linear list of commit series with the Seriation technique. Particularly, we solve the Travelling Salesman Problem which is NP-complete. However, given the typical number of developers which is less than 1,000, there is a good chance that the solution does not take much time. We use the Google or-tools solver.
We find 1-dimensional clusters in the resulting path with the HDBSCAN algorithm and assign colors accordingly.
Time series are smoothed by convolving with the Slepian window.
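The first step can be illustrated with the textbook dynamic-programming form of Dynamic Time Warping. Note that this is plain DTW with quadratic cost per pair, swapped in for brevity; it is not the FastDTW approximation named above. The function names and the absolute-difference point cost are assumptions for the sketch.

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) Dynamic Time Warping distance."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] is the best alignment cost of a[:i] against b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a[i-1] repeats against b[j-1]
                cost[i][j - 1],      # b[j-1] repeats against a[i-1]
                cost[i - 1][j - 1],  # both series advance
            )
    return cost[n][m]


def distance_matrix(series):
    """Symmetric matrix of pairwise DTW distances between commit series."""
    size = len(series)
    matrix = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(i + 1, size):
            matrix[i][j] = matrix[j][i] = dtw_distance(series[i], series[j])
    return matrix


# Two series with the same shape but different timing are at distance 0.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))  # 0.0
```

The pairwise loop is what makes the matrix quadratic in the number of developers, regardless of which DTW variant computes each entry.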

This plot shows how the development team evolved through time. It also shows "commit flashmobs" such as Hacktoberfest. For example, here are the insights revealed by the tensorflow/tensorflow plot above:

"Tensorflow Gardener" is classified as the only outlier.
The "blue" group of developers covers the global maintainers and a few people who left (at the top).
The "red" group shows how core developers join the project or become less active.

Added vs changed lines through time

tensorflow/tensorflow added and changed lines through time.

hercules --devs [--people-dict=/path/to/identities]
labours -m old-vs-new -o <name>

--devs from the previous section also makes it possible to plot how many lines were added and how many existing lines were changed (deleted or replaced) through time. This plot is smoothed.

Efforts through time

kubernetes/kubernetes efforts through time.

hercules --devs [--people-dict=/path/to/identities]
labours -m devs-efforts -o <name>

Besides, --devs makes it possible to plot how many lines have been changed (added or removed) by each developer. The upper part of the plot is an accumulated (integrated) lower part. It is impossible to have the same scale for both parts, so the lower values are scaled, and hence there are no lower Y axis ticks. There is a difference between the efforts plot and the ownership plot, although changing lines correlates with owning lines.

Sentiment (positive and negative comments)

It can be clearly seen that Django comments were positive/optimistic in the beginning, but later became negative/pessimistic.
hercules --sentiment --pb https://github.com/django/django | labours -m sentiment -f pb

We extract new and changed comments from source code on every commit, apply the BiDiSentiment general purpose sentiment recurrent neural network and plot the results. Requires libtensorflow. E.g. "sadly, we need to hide the rect from the documentation finder for now" is negative and "Theano has a built-in optimization for logsumexp (...) so we can just write the expression directly" is positive. Don't expect too much though - as was written, the sentiment model is general purpose and the code comments have a different nature, so there is no magic (for now).

Hercules must be built with "tensorflow" tag - it is not by default:

make TAGS=tensorflow

Such a build requires libtensorflow.

Everything in a single pass

hercules --burndown --burndown-files --burndown-people --couples --shotness --devs [--people-dict=/path/to/identities]
labours -m all

Plugins

Hercules has a plugin system which makes it possible to run custom analyses. See PLUGINS.md.

Merging

hercules combine is the command which joins several analysis results in Protocol Buffers format together.

hercules --burndown --pb https://github.com/go-git/go-git > go-git.pb
hercules --burndown --pb https://github.com/src-d/hercules > hercules.pb
hercules combine go-git.pb hercules.pb | labours -f pb -m burndown-project --resample M

Bad unicode errors

YAML does not support the whole range of Unicode characters and the parser on the labours side may raise exceptions. Filter the output from hercules through fix_yaml_unicode.py to discard such offending characters.

hercules --burndown --burndown-people https://github.com/... | python3 fix_yaml_unicode.py | labours -m people
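For a sense of what such a filter involves, here is a minimal sketch that drops every character outside the printable set defined by the YAML 1.2 specification. It is not the contents of fix_yaml_unicode.py, which may work differently; the function name is hypothetical.

```python
import re

# Anything outside the YAML 1.2 "c-printable" character set.
_NOT_YAML_PRINTABLE = re.compile(
    r"[^\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD"
    r"\U00010000-\U0010FFFF]"
)


def strip_non_yaml_chars(text):
    """Remove characters that a strict YAML parser would reject."""
    return _NOT_YAML_PRINTABLE.sub("", text)


print(repr(strip_non_yaml_chars("ok\x00\x07 text\n")))  # 'ok text\n'
```

Applied line by line to standard input, a function like this could sit between hercules and labours in the pipe in the same position as the script above.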

Plotting

These options affect all plots:

labours [--style=white|black] [--backend=] [--size=Y,X]

--style sets the general style of the plot (see labours --help).
--background changes the plot background to be either white or black.
--backend chooses the Matplotlib backend.
--size sets the size of the figure in inches. The default is 12,9.

(required in macOS) you can pin the default Matplotlib backend with

echo "backend: TkAgg" > ~/.matplotlib/matplotlibrc

These options are effective in burndown charts only:

labours [--text-size] [--relative]

--text-size changes the font size, --relative activates the stretched burndown layout.

Custom plotting backend

It is possible to output all the information needed to draw the plots in JSON format. Simply append .json to the output (-o) and you are done. The data format is not fully specified and depends on the Python code which generates it. Each JSON file should contain "type" which reflects the plot kind.
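A custom backend would typically start by dispatching on that "type" value. The sketch below shows one way to do it; the handler table, the function name, the sample type value "burndown" and the "data" key are all hypothetical, and "type" is assumed to be a top-level key of the JSON document.

```python
import json


def dispatch_plot(document, handlers):
    """Pass a parsed labours JSON document to the handler for its plot kind."""
    kind = document.get("type")
    handler = handlers.get(kind)
    if handler is None:
        raise ValueError("no handler for plot type %r" % (kind,))
    return handler(document)


# Hypothetical document; only the "type" key is described in the text above.
document = json.loads('{"type": "burndown", "data": [1, 2, 3]}')
handlers = {"burndown": lambda doc: len(doc["data"])}
print(dispatch_plot(document, handlers))  # 3
```

Because the rest of the format is not fully specified, each handler should treat any other key as optional.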

Caveats

Processing all the commits may fail in some rare cases. If you get an error similar to #106, please report there and specify --first-parent as a workaround.
Burndown collection may fail with an Out-Of-Memory error. See the Burndown Out-Of-Memory section below for the workarounds.
Parsing YAML in Python is slow when the number of internal objects is big. hercules' output for the Linux kernel in "couples" mode is 1.5 GB and takes more than an hour / 180GB RAM to be parsed. However, most of the repositories are parsed within a minute. Try using Protocol Buffers instead (hercules --pb and labours -f pb).

To speed up yaml parsing

# Debian, Ubuntu
apt install libyaml-dev
# macOS
brew install yaml-cpp libyaml
# you might need to re-install pyyaml for changes to take effect
pip uninstall pyyaml
pip --no-cache-dir install pyyaml

Burndown Out-Of-Memory

If the analyzed repository is big and extensively uses branching, the burndown stats collection may fail with an OOM. You should try the following:

Read the repo from disk instead of cloning into memory.
Use --skip-blacklist to avoid analyzing the unwanted files. It is also possible to constrain the --language.
Use the hibernation feature: --hibernation-distance 10 --burndown-hibernation-threshold=1000. Play with those two numbers to start hibernating right before the OOM.
Hibernate on disk: --burndown-hibernation-disk --burndown-hibernation-dir /path.
--first-parent, you win.

Roadmap

Switch from src-d/go-git to go-git/go-git. Upgrade the codebase to be compatible with the latest Go version.
Update the docs regarding the copyrights and such.
Fix the reported bugs.
Remove the dependency on Babelfish for parsing the code. It is abandoned and a better alternative should be found.
Remove the ad-hoc analyses added while source{d} was agonizing.