# pgme
PGME is a GPU metrics exporter that leverages the `nvidia-smi` binary. The initial work and key metric-gathering code is derived from:
The `nvidia-smi` command used to gather metrics:

```
nvidia-smi --query-gpu=name,index,temperature.gpu,utilization.gpu,utilization.memory,memory.total,memory.free,memory.used --format=csv,noheader,nounits
```
I have added the following in an attempt to make it a more robust service:
- configuration via environment variables
- Makefile for local builds
- liveness HTTP probe for Kubernetes (k8s)
- graceful shutdown of the HTTP server
- exporter details at http://[[ip of server]]:[[port]]/
- integration with AWS CodeBuild, publishing to Docker Hub or AWS ECR via different buildspec files
Working On:
- Kubernetes service and helm configuration
Local Mac Build (generates a binary that works on macOS/OSX systems)

```
git clone https://github.com/chhibber/pgme.git
cd pgme
make build-mac
```
Local Linux Build (generates a binary that works on Linux systems)

```
git clone https://github.com/chhibber/pgme.git
cd pgme
make build
```
Local Docker Build (generates a Docker image)

```
git clone https://github.com/chhibber/pgme.git
cd pgme
make docker-build IMAGE_REPO_NAME=[[ repo_name/app_name ]] IMAGE_TAG=[[ version info ]]
```

Example run:

```
nvidia-docker run -p 9101:9101 chhibber/pgme
2018/01/05 21:32:31 Starting the service...
2018/01/05 21:32:31 - PORT set to 9101. If environment variable PORT is not set the default is 9101
2018/01/05 21:32:31 The service is listening on 9101...
```
- The default port is 9101. You can change it by defining the environment variable PORT in front of the binary:

```
PORT=9101 ./pgme
```
To run a published image instead:

```
nvidia-docker run -p 9101:9101 chhibber/pgme:2017.01
```
Available metrics - http://localhost:9101/metrics
```
temperature_gpu{gpu="TITAN X (Pascal)[0]"} 41
utilization_gpu{gpu="TITAN X (Pascal)[0]"} 0
utilization_memory{gpu="TITAN X (Pascal)[0]"} 0
memory_total{gpu="TITAN X (Pascal)[0]"} 12189
memory_free{gpu="TITAN X (Pascal)[0]"} 12189
memory_used{gpu="TITAN X (Pascal)[0]"} 0
temperature_gpu{gpu="TITAN X (Pascal)[1]"} 78
utilization_gpu{gpu="TITAN X (Pascal)[1]"} 95
utilization_memory{gpu="TITAN X (Pascal)[1]"} 59
memory_total{gpu="TITAN X (Pascal)[1]"} 12189
memory_free{gpu="TITAN X (Pascal)[1]"} 1738
memory_used{gpu="TITAN X (Pascal)[1]"} 10451
temperature_gpu{gpu="TITAN X (Pascal)[2]"} 83
utilization_gpu{gpu="TITAN X (Pascal)[2]"} 99
utilization_memory{gpu="TITAN X (Pascal)[2]"} 82
memory_total{gpu="TITAN X (Pascal)[2]"} 12189
memory_free{gpu="TITAN X (Pascal)[2]"} 190
memory_used{gpu="TITAN X (Pascal)[2]"} 11999
temperature_gpu{gpu="TITAN X (Pascal)[3]"} 84
utilization_gpu{gpu="TITAN X (Pascal)[3]"} 97
utilization_memory{gpu="TITAN X (Pascal)[3]"} 76
memory_total{gpu="TITAN X (Pascal)[3]"} 12189
memory_free{gpu="TITAN X (Pascal)[3]"} 536
memory_used{gpu="TITAN X (Pascal)[3]"} 11653
```
Prometheus scrape configuration:

```
- job_name: "gpu_exporter"
  static_configs:
    - targets: ['localhost:9101']
```