80+ DevOps & Data CLI Tools - AWS, GCP, GCF Python Cloud Functions, Log Anonymizer, Spark, Hadoop, HBase, Hive, Impala, Linux, Docker, Spark Data Converters & Validators (Avro/Parquet/JSON/CSV/INI/XML/YAML), Travis CI, AWS CloudFormation, Elasticsearch, Solr etc.
HariSekhon/DevOps-Python-tools
DevOps, Cloud, Big Data, NoSQL, Python & Linux tools. All programs have `--help`.
Hari Sekhon
Cloud & Big Data Contractor, United Kingdom
(you're welcome to connect with me on LinkedIn)
Make sure you run `make update` if updating and not just `git pull`, as you will often need the latest library submodule and possibly new upstream libraries.
All programs and their pre-compiled dependencies can be found ready to run on DockerHub.
List all programs:
docker run harisekhon/pytools
Run any given program:
docker run harisekhon/pytools <program> <args>
The following installs git and make, pulls the repo and builds the dependencies:
curl -L https://git.io/python-bootstrap | sh
or manually:
git clone https://github.com/HariSekhon/DevOps-Python-tools pytools
cd pytools
make
To install only the pip dependencies for a single script, run make with the script's filename, using a `.pyc` extension instead of `.py`:
make anonymize.pyc
Make sure to read Detailed Build Instructions further down for more information.
Some Hadoop tools require Jython, see Jython for Hadoop Utils for details.
All programs come with a `--help` switch which includes a program description and the list of command line options.
Environment variables are supported for convenience and also to hide credentials from being exposed in the process list, eg. `$PASSWORD`, `$TRAVIS_TOKEN`. These are indicated in the `--help` descriptions in brackets next to each option and often have more specific overrides with higher precedence, eg. `$AMBARI_HOST` and `$HBASE_HOST` take priority over `$HOST`.
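The precedence rule described above can be illustrated with a minimal sketch. This is not the code the tools use; `resolve_setting` and the example hostnames are hypothetical names chosen for illustration, and only the variable names `$HOST` and `$AMBARI_HOST` come from the text above.

```python
import os


def resolve_setting(generic, specific=None, env=None, default=None):
    """Return the value of the most specific environment variable that is set.

    Hypothetical helper: the specific name (eg. AMBARI_HOST) is checked
    before the generic one (eg. HOST), mirroring the documented precedence.
    """
    env = os.environ if env is None else env
    for name in (specific, generic):
        if name and env.get(name):
            return env[name]
    return default


if __name__ == "__main__":
    # Example environment, passed explicitly so the demo is reproducible
    demo_env = {"HOST": "generic.example.com", "AMBARI_HOST": "ambari.example.com"}
    print(resolve_setting("HOST", "AMBARI_HOST", env=demo_env))  # specific override wins
    print(resolve_setting("HOST", "HBASE_HOST", env=demo_env))   # falls back to HOST
```

An explicit command line option would normally still take priority over either variable; that step is left out of the sketch.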
- Linux:
  - `anonymize.py` - anonymizes your configs / logs from files or stdin (for pasting to Apache Jira tickets or mailing lists)
    - anonymizations include these and more:
      - hostnames / domains / FQDNs
      - email addresses
      - IP + MAC addresses
      - AWS Access Keys, Secret Keys, ARNs, STS tokens
      - Kerberos principals
      - LDAP sensitive fields (eg. CN, DN, OU, UID, sAMAccountName, member, memberOf...)
      - Cisco & Juniper ScreenOS configurations passwords, shared keys and SNMP strings
    - `anonymize_custom.conf` - put regex of your Name/Company/Project/Database/Tables to anonymize to `<custom>`
    - placeholder tokens indicate what was stripped out (eg. `<fqdn>`, `<password>`, `<custom>`)
    - `--ip-prefix` leaves the last IP octet to aid in cluster debugging, so you can still see differentiated nodes communicating with each other to compare configs and log communications
    - `--hash-hostnames` - hashes hostnames to look like Docker temporary container ID hostnames so that vendors' support teams can differentiate hosts in clusters
    - `anonymize_parallel.sh` - splits files into multiple parts and runs `anonymize.py` on each part in parallel before re-joining back into a file of the same name with a `.anonymized` suffix. Preserves order of evaluation, which is important for anonymization rules, as well as maintaining file content order. On servers this parallelization can result in a 30x speed up for large log files
  - `find_duplicate_files.py` - finds duplicate files in one or more directory trees via multiple methods including file basename, size, MD5 comparison of same sized files, or bespoke regex capture of partial file basename
  - `find_active_server.py` - finds fastest responding healthy server or active master in high availability deployments, useful for scripting against clustered technologies (eg. Elasticsearch, Hadoop, HBase, Cassandra etc). Multi-threaded for speed and highly configurable - socket, http, https, ping, url and/or regex content match. See further down for more details and sub-programs that simplify usage for many of the most common cluster technologies
  - `welcome.py` - cool spinning welcome message greeting your username and showing last login time and user, to put in your shell's `.profile` (there is also a perl version in my DevOps Perl Tools repo)
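The "MD5 comparison of same sized files" method mentioned for `find_duplicate_files.py` can be sketched roughly as follows. This is an illustrative approximation rather than the program's actual implementation; the function names `md5sum` and `find_duplicates` are assumptions made for this example.

```python
import hashlib
import os
from collections import defaultdict


def md5sum(path, chunk_size=1024 * 1024):
    """Return the MD5 hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(*directories):
    """Return sorted groups of paths with identical content (hypothetical helper).

    Files are grouped by size first, so only files sharing a size are hashed.
    """
    by_size = defaultdict(list)
    for directory in directories:
        for root, _dirs, files in os.walk(directory):
            for name in files:
                path = os.path.join(root, name)
                if os.path.isfile(path) and not os.path.islink(path):
                    by_size[os.path.getsize(path)].append(path)
    duplicates = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[md5sum(path)].append(path)
        duplicates.extend(sorted(group) for group in by_hash.values() if len(group) > 1)
    return sorted(duplicates)
```

Grouping by size before hashing avoids reading files that cannot possibly match, which is where most of the time would otherwise go on large trees.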
- Amazon Web Services:
  - `aws_users_access_key_age.py` - lists all users' access keys, status, date of creation and age in days. Optionally filters for active keys and keys older than N days (for key rotation governance)
  - `aws_users_unused_access_keys.py` - lists users' access keys that haven't been used in the last N days or that have never been used (these should generally be removed/disabled). Optionally filters for only active keys
  - `aws_users_last_used.py` - lists all users and their days since last use across both passwords and access keys. Optionally filters for users not used in the last N days to find old accounts to remove
  - `aws_users_pw_last_used.py` - lists all users and dates since their passwords were last used. Optionally filters for users with passwords not used in the last N days
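The age filter described for `aws_users_access_key_age.py` can be sketched as below. This is only a sketch under stated assumptions: it does not call AWS, the function names are hypothetical, and the input is assumed to be a list of dictionaries shaped like IAM access key metadata (`UserName`, `AccessKeyId`, `Status`, `CreateDate` as a timezone-aware datetime).

```python
from datetime import datetime, timezone


def key_age_days(create_date, now=None):
    """Age of a key in whole days, given a timezone-aware creation datetime."""
    now = now or datetime.now(timezone.utc)
    return (now - create_date).days


def old_keys(keys, max_age_days, now=None, only_active=False):
    """Return (user, key id, status, age in days) for keys older than max_age_days.

    Hypothetical helper; `keys` is assumed to be already fetched from IAM.
    """
    matches = []
    for key in keys:
        if only_active and key["Status"] != "Active":
            continue
        age = key_age_days(key["CreateDate"], now)
        if age > max_age_days:
            matches.append((key["UserName"], key["AccessKeyId"], key["Status"], age))
    return matches
```

Passing `now` explicitly keeps the calculation deterministic for testing; in normal use it defaults to the current UTC time.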
- Google Cloud Platform:
  - GCF - Google Cloud Functions written in Python:
    - gcp_cloud_function_sql_export/ - runs Cloud SQL export backups to GCS, subscribing to a PubSub topic that is triggered by Cloud Scheduler
      - see the DevOps Bash tools repo for several related GCP SQL scripts to set up service account permissions and Cloud Scheduler jobs
    - gcp_cloud_function_ifconfig/ - debug your cloud function public networking by determining its public IP address - use this to test your VPC connector public routing, comparison with firewall rules etc.
    - gcp_cloud_function_proxy/ - debug your cloud function networking by querying a given URL to check its accessibility, returning the HTTP status code and content. Use this to validate access through firewall rules via VPC connector routing
  - `gcp_service_account_credential_keys.py` - lists all GCP service account credential keys for a given project with their age and expiry details, optionally filtering by non-expiring, already expired, or will expire within N days
- Docker:
  - `docker_registry_show_tags.py` / `dockerhub_show_tags.py` / `quay_show_tags.py` - shows tags for docker repos in a docker registry or on DockerHub or Quay.io - Docker CLI doesn't support this yet but it's a very useful thing to be able to see live on the command line or use in shell scripts (use `-q` / `--quiet` to return only the tags for easy shell scripting). You can use this to pre-download all tags of a docker image before running tests across versions in a simple bash for loop, eg. `docker_pull_all_tags.sh`
  - `dockerhub_search.py` - search DockerHub with a configurable number of returned results (older official `docker search` was limited to only 25 results), using `--verbose` will also show you how many results were returned to the terminal and how many DockerHub has in total (use `-q` / `--quiet` to return only the image names for easy shell scripting). This can be used to download all of my DockerHub images in a simple bash for loop, eg. `docker_pull_all_images.sh`, and can be chained with `dockerhub_show_tags.py` to download all tagged versions for all docker images, eg. `docker_pull_all_images_all_tags.sh`
  - `dockerfiles_check_git*.py` - check Git tags & branches align with the containing Dockerfile's `ARG *_VERSION`
- Spark & Data Format Converters:
  - `spark_avro_to_parquet.py` - PySpark Avro => Parquet converter
  - `spark_parquet_to_avro.py` - PySpark Parquet => Avro converter
  - `spark_csv_to_avro.py` - PySpark CSV => Avro converter, supports both inferred and explicit schemas
  - `spark_csv_to_parquet.py` - PySpark CSV => Parquet converter, supports both inferred and explicit schemas
  - `spark_json_to_avro.py` - PySpark JSON => Avro converter
  - `spark_json_to_parquet.py` - PySpark JSON => Parquet converter
  - `xml_to_json.py` - XML to JSON converter
  - `json_to_xml.py` - JSON to XML converter
  - `json_to_yaml.py` - JSON to YAML converter
  - `json_docs_to_bulk_multiline.py` - converts json files to bulk multi-record one-line-per-json-document format for pre-processing and loading to big data systems like Hadoop and MongoDB. Can recurse directory trees and mix json-doc-per-file / bulk-multiline-json / directories / standard input. Combines all json documents and outputs bulk-one-json-document-per-line to standard output for convenient command line chaining and redirection. Optionally continues on error and collects broken records to standard error for logging and later reprocessing for bulk batch jobs. Even supports single quoted json, which while not technically valid json is used by MongoDB, and even handles embedded double quotes in 'single quoted json'
  - `yaml_to_json.py` - YAML to JSON converter (because some APIs like GitLab CI Validation API require JSON)
  - see also `validate_*.py` further down for all these formats and more
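The one-line-per-json-document output described for `json_docs_to_bulk_multiline.py` can be sketched in a few lines. This is a simplified illustration, not the program itself: `to_bulk_lines` is a hypothetical name, each input is assumed to be a string holding one JSON document, and the single quoted json handling mentioned above is not attempted.

```python
import json
import sys


def to_bulk_lines(documents, continue_on_error=False, errors=sys.stderr):
    """Yield each JSON document re-serialized on a single line.

    Hypothetical helper: broken records are written to `errors` when
    continue_on_error is set, otherwise the parse error is raised.
    """
    for raw in documents:
        try:
            parsed = json.loads(raw)
        except ValueError:
            if not continue_on_error:
                raise
            # keep the broken record on one line for later reprocessing
            print(" ".join(raw.split()), file=errors)
            continue
        yield json.dumps(parsed, separators=(",", ":"))


if __name__ == "__main__":
    docs = ['{\n  "a": 1,\n  "b": [1, 2]\n}', '{"c": "d"}']
    for line in to_bulk_lines(docs):
        print(line)
```

Writing good records to standard output and broken ones to standard error keeps the two streams separable with ordinary shell redirection.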
- Hadoop ecosystem & NoSQL:
  - `ambari_blueprints.py` - Blueprint cluster templating and deployment tool using Ambari API
    - list blueprints
    - fetch all blueprints or a specific blueprint to local json files
    - blueprint an existing cluster
    - create a new cluster using a blueprint
    - sorts and prettifies the resulting JSON template for deterministic config and line-by-line diff necessary for proper revision control
    - optionally strips out the excessive and overly specific configs to create generic more reusable templates
    - see the `ambari_blueprints/` directory for a variety of Ambari blueprint templates generated by and deployable using this tool
  - `ambari_ams_*.sh` - query the Ambari Metrics Collector API for a given metric, list all metrics or hosts
  - `ambari_cancel_all_requests.sh` - cancel all ongoing operations using the Ambari API
  - `ambari_trigger_service_checks.py` - trigger service checks using the Ambari API
  - Hadoop HDFS:
    - `hdfs_find_replication_factor_1.py` - finds HDFS files with replication factor 1, optionally resetting them to replication factor 3 to avoid missing block alerts during datanode maintenance windows
    - `hdfs_time_block_reads.jy` - HDFS per-block read timing debugger with datanode and rack locations for a given file or directory tree. Reports the slowest Hadoop datanodes in descending order at the end. Helps find cluster data layer bottlenecks such as slow datanodes, faulty hardware or misconfigured top-of-rack switch ports.
    - `hdfs_files_native_checksums.jy` - fetches native HDFS checksums for quicker file comparisons (about 100x faster than doing `hdfs dfs -cat | md5sum`)
    - `hdfs_files_stats.jy` - fetches HDFS file stats. Useful to generate a list of all files in a directory tree showing block size, replication factor, underfilled blocks and small files
  - `hive_schemas_csv.py` / `impala_schemas_csv.py` - dumps all databases, tables, columns and types out in CSV format to standard output
  - The following programs can all optionally filter by database / table name regex:
    - `hive_foreach_table.py` / `impala_foreach_table.py` - execute any query or statement against every Hive / Impala table
    - `hive_tables_row_counts.py` / `impala_tables_row_counts.py` - outputs tables row counts. Useful for reconciliation between cluster migrations
    - `hive_tables_column_counts.py` / `impala_tables_column_counts.py` - outputs tables column counts. Useful for finding unusually wide tables
    - `hive_tables_row_column_counts.py` / `impala_tables_row_column_counts.py` - outputs tables row and column counts. Useful for finding unusually big tables
    - `hive_tables_row_counts_any_nulls.py` / `impala_tables_row_counts_any_nulls.py` - outputs tables row counts where any field is NULL. Useful for reconciliation between cluster migrations or catching data quality problems or subtle ETL bugs
    - `hive_tables_null_columns.py` / `impala_tables_null_columns.py` - outputs tables columns containing only NULLs. Useful for catching data quality problems or subtle ETL bugs
    - `hive_tables_null_rows.py` / `impala_tables_null_rows.py` - outputs tables row counts where all fields contain NULLs. Useful for catching data quality problems or subtle ETL bugs
    - `hive_tables_metadata.py` / `impala_tables_metadata.py` - outputs for each table the matching regex metadata DDL property from describe table
    - `hive_tables_locations.py` / `impala_tables_locations.py` - outputs for each table its data location
  - `hbase_generate_data.py` - inserts randomly generated data into a given HBase table, with optional skew support with configurable skew percentage. Useful for testing region splitting, balancing, CI tests etc. Outputs stats for number of rows written, time taken, rows per sec and volume per sec written.
  - `hbase_show_table_region_ranges.py` - dumps HBase table region ranges information, useful when pre-splitting tables
  - `hbase_table_region_row_distribution.py` - calculates the distribution of rows across regions in an HBase table, giving per region row counts and % of total rows for the table as well as median and quartile row counts per region
  - `hbase_table_row_key_distribution.py` - calculates the distribution of row keys by configurable prefix length in an HBase table, giving per prefix row counts and % of total rows for the table as well as median and quartile row counts per prefix
  - `hbase_compact_tables.py` - compacts HBase tables (for off-peak compactions). Defaults to finding and iterating on all tables or takes an optional regex and compacts only matching tables.
  - `hbase_flush_tables.py` - flushes HBase tables. Defaults to finding and iterating on all tables or takes an optional regex and flushes only matching tables.
  - `hbase_regions_by_*size.py` - queries given RegionServers' JMX to list topN regions by storeFileSize or memStoreSize, ascending or descending
  - `hbase_region_requests.py` - calculates requests per second per region across all given RegionServers or average since RegionServer startup, configurable intervals and count, can filter to any combination of reads / writes / total requests per second. Useful for watching more granular region stats to detect region hotspotting
  - `hbase_regionserver_requests.py` - calculates requests per second per RegionServer across all given RegionServers or average since RegionServer(s) startup(s), configurable interval and count, can filter to any combination of read, write, total, rpcScan, rpcMutate, rpcMulti, rpcGet, blocked per second. Useful for watching more granular RegionServer stats to detect RegionServer hotspotting
  - `hbase_regions_least_used.py` - finds topN biggest/smallest regions across given RegionServers that have received the least requests (requests below a given threshold)
  - `opentsdb_import_metric_distribution.py` - calculates metric distribution in bulk import file(s) to find data skew and help avoid HBase region hotspotting
  - `opentsdb_list_metrics*.sh` - lists OpenTSDB metric names, tagk or tagv via OpenTSDB API or directly from HBase tables with optionally their created date, sorted ascending
  - `pig-text-to-elasticsearch.pig` - bulk index unstructured files in Hadoop to Elasticsearch
  - `pig-text-to-solr.pig` - bulk index unstructured files in Hadoop to Solr / SolrCloud clusters
  - `pig_udfs.jy` - Pig Jython UDFs for Hadoop
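The per-prefix counting that `hbase_table_row_key_distribution.py` is described as doing above can be illustrated without an HBase cluster. The sketch below is an assumption-laden simplification: `prefix_distribution` is a hypothetical name and the row keys are taken from a plain Python iterable rather than scanned from a table.

```python
from collections import Counter
from statistics import median


def prefix_distribution(row_keys, prefix_length=1):
    """Return sorted (prefix, row count, % of total rows) tuples.

    Hypothetical helper operating on an in-memory iterable of row key strings.
    """
    counts = Counter(key[:prefix_length] for key in row_keys)
    total = sum(counts.values())
    return [
        (prefix, count, 100.0 * count / total)
        for prefix, count in sorted(counts.items())
    ]


if __name__ == "__main__":
    keys = ["aa1", "aa2", "ab1", "ba1"]
    distribution = prefix_distribution(keys, prefix_length=1)
    for prefix, count, percent in distribution:
        print(f"{prefix}\t{count}\t{percent:.2f}%")
    # median row count per prefix, one of the summary stats mentioned above
    print("median:", median(count for _prefix, count, _pct in distribution))
```

A heavily uneven percentage column for short prefix lengths is the kind of skew that tends to show up later as region hotspotting.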
- `find_active_server.py` - returns first available healthy server or active master in high availability deployments, useful for chaining with single argument tools. Configurable tests include socket, http, https, ping, url and/or regex content match, multi-threaded for speed. Designed to extend tools that only accept a single `--host` option but for which the technology has later added multi-master support or active-standby masters (eg. Hadoop, HBase) or where you want to query cluster wide information available from any online peer (eg. Elasticsearch)
  - The following are simplified specialisations of the above program, just pass host arguments, all the details have been baked in, no switches required
    - `find_active_hadoop_namenode.py` - returns active Hadoop Namenode in HDFS HA
    - `find_active_hadoop_resource_manager.py` - returns active Hadoop Resource Manager in Yarn HA
    - `find_active_hbase_master.py` - returns active HBase Master in HBase HA
    - `find_active_hbase_thrift.py` - returns first available HBase Thrift Server (run multiple of these for load balancing)
    - `find_active_hbase_stargate.py` - returns first available HBase Stargate rest server (run multiple of these for load balancing)
    - `find_active_apache_drill.py` - returns first available Apache Drill node
    - `find_active_cassandra.py` - returns first available Apache Cassandra node
    - `find_active_impala*.py` - returns first available Impala node of either Impalad, Catalog or Statestore
    - `find_active_presto_coordinator.py` - returns first available Presto Coordinator
    - `find_active_kubernetes_api.py` - returns first available Kubernetes API server
    - `find_active_oozie.py` - returns first active Oozie server
    - `find_active_solrcloud.py` - returns first available Solr / SolrCloud node
    - `find_active_elasticsearch.py` - returns first available Elasticsearch node
  - see also: Advanced HAProxy configurations which are part of the Advanced Nagios Plugins Collection
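The socket test that `find_active_server.py` is described as supporting can be approximated with the standard library alone. This is a rough sketch and not the program's implementation: `port_open` and `first_responding` are hypothetical names, and only a plain TCP connect is checked, with none of the http, ping or regex content matching.

```python
import socket
from concurrent.futures import ThreadPoolExecutor, as_completed


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def first_responding(hosts, port, timeout=2.0):
    """Return the first host seen accepting connections, or None.

    Hypothetical helper: hosts are probed in parallel, one thread each.
    Leaving the `with` block waits for the remaining probes, so the call
    takes at most roughly `timeout` seconds plus name resolution time.
    """
    hosts = list(hosts)
    if not hosts:
        return None
    with ThreadPoolExecutor(max_workers=len(hosts)) as pool:
        futures = {pool.submit(port_open, host, port, timeout): host for host in hosts}
        for future in as_completed(futures):
            if future.result():
                return futures[future]
    return None
```

The returned hostname can then be passed to a tool that only accepts a single `--host` argument, which is the chaining use case described above.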
- Travis CI:
  - `travis_last_log.py` - fetches Travis CI latest running / completed / failed build log for given repo - useful for quickly getting the log of the last failed build when CCMenu or BuildNotify applets turn red
  - `travis_debug_session.py` - launches a Travis CI interactive debug build session via Travis API, tracks session creation and drops user straight into the SSH shell on the remote Travis build, very convenient one shot debug launcher for Travis CI
- `selenium_hub_browser_test.py` - checks Selenium Grid Hub / Selenoid is working by calling browsers such as Chrome and Firefox to fetch a given URL and content/regex match the result
- Data Validation (useful in CI):
  - `validate_*.py` - validate files, directory trees and/or standard input streams
    - supports the following file formats:
      - Avro
      - CSV
      - INI / Java Properties (also detects duplicate sections and duplicate keys within sections)
      - JSON (both normal and json-doc-per-line bulk / big data format as found in MongoDB and Hadoop json data files)
      - LDAP LDIF
      - Parquet
      - XML
      - YAML
    - directories are recursed, testing any files with relevant matching extensions (`.avro`, `.csv`, `.json`, `.parquet`, `.ini` / `.properties`, `.ldif`, `.xml`, `.yml` / `.yaml`)
    - used for Continuous Integration tests of various adjacent Spark data converters as well as configuration files for things like Presto, Ambari, Apache Drill etc found in my DockerHub images' Dockerfiles master repo, which contains docker builds and configurations for many open source Big Data & Linux technologies
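The recursive behaviour described for the `validate_*.py` programs, where directories are walked and only files with a matching extension are tested, can be sketched for the JSON case as below. This is an illustrative sketch only; `validate_json_tree` is a hypothetical name and the json-doc-per-line bulk format is not handled.

```python
import json
import os


def validate_json_tree(directory, extensions=(".json",)):
    """Return a {path: is_valid} mapping for matching files under directory.

    Hypothetical helper: files with other extensions are skipped entirely.
    """
    results = {}
    for root, _dirs, files in os.walk(directory):
        for name in sorted(files):
            if not name.lower().endswith(extensions):
                continue
            path = os.path.join(root, name)
            try:
                with open(path, encoding="utf-8") as handle:
                    json.load(handle)
                results[path] = True
            except (ValueError, OSError):
                # ValueError covers both JSON syntax and decoding errors
                results[path] = False
    return results
```

In a CI job, a non-zero exit code when any value in the mapping is False is usually enough to fail the build.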
The automated build will use 'sudo' to install required Python PyPI libraries to the system unless running as root or it detects being inside a VirtualEnv. If you want to install some of the common Python libraries using your OS packages instead of installing from PyPI then follow the Manual Build section below.
Enter the pytools directory and run git submodule init and git submodule update to fetch my library repo:
git clone https://github.com/HariSekhon/DevOps-Python-tools pytools
cd pytools
git submodule init
git submodule update
sudo pip install -r requirements.txt
Download the DevOps Python Tools and Pylib git repos as zip files:
https://github.com/HariSekhon/DevOps-Python-tools/archive/master.zip
https://github.com/HariSekhon/pylib/archive/master.zip
Unzip both and move Pylib to the `pylib` folder under DevOps Python Tools.
unzip devops-python-tools-master.zip
unzip pylib-master.zip
mv -v devops-python-tools-master pytools
mv -v pylib-master pylib
mv -vf pylib pytools/
Proceed to install PyPI modules for whichever programs you want to use using your usual procedure - usually an internal mirror or proxy server to PyPI, or rpms / debs (some libraries are packaged by Linux distributions).
All PyPI modules are listed in the `requirements.txt` and `pylib/requirements.txt` files.
Internal Mirror example (JFrog Artifactory or similar):
sudo pip install --index https://host.domain.com/api/pypi/repo/simple --trusted host.domain.com -r requirements.txt
Proxy example:
sudo pip install --proxy hari:mypassword@proxy-host:8080 -r requirements.txt
The automated build also works on Mac OS X but you'll need to install Apple Xcode (on recent Macs just typing `git` is enough to trigger Xcode install).
I also recommend you get HomeBrew to install other useful tools and libraries you may need like OpenSSL for development headers and tools such as wget (these are installed automatically if Homebrew is detected on Mac OS X):
bash-tools/install/install_homebrew.sh
brew install openssl wget
If failing to build an OpenSSL lib dependency, just prefix the build command like so:
sudo OPENSSL_INCLUDE=/usr/local/opt/openssl/include OPENSSL_LIB=/usr/local/opt/openssl/lib ...
You may get errors trying to install to Python library paths even as root on newer versions of Mac. Sometimes this is caused by pip 10 vs pip 9 and downgrading will work around it:
sudo pip install --upgrade pip==9.0.1
make
sudo pip install --upgrade pip
make
The 3 Hadoop utility programs listed below require Jython (as well as Hadoop to be installed and correctly configured):
hdfs_time_block_reads.jy
hdfs_files_native_checksums.jy
hdfs_files_stats.jy
Run like so:
jython -J-cp $(hadoop classpath) hdfs_time_block_reads.jy --help
The `-J-cp $(hadoop classpath)` part dynamically inserts the current Hadoop java classpath required to use the Hadoop APIs.
See below for procedure to install Jython if you don't already have it.
This will download and install jython to /opt/jython-2.7.0:
make jython
Jython is a simple download and unpack and can be fetched from http://www.jython.org/downloads.html
Then add the Jython install bin directory to the $PATH or specify the full path to the `jython` binary, eg:
/opt/jython-2.7.0/bin/jython hdfs_time_block_reads.jy ...
Strict validations of hosts / domains / FQDNs use TLDs populated from the official IANA list. This is done via my PyLib library submodule - see there for details on configuring this to permit custom TLDs like `.local`, `.intranet`, `.vm`, `.cloud` etc. (all already included in there because they're common across companies' internal environments).
If you end up with an error like:
./dockerhub_show_tags.py centos ubuntu
[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:765)
It can be caused by an issue with the underlying Python + libraries due to changes in OpenSSL and certificates. One quick fix is to do the following:
sudo pip uninstall -y certifi && sudo pip install certifi==2015.04.28
Run:
make update
This will git pull and then git submodule update which is necessary to pick up corresponding library updates.
If you update often and want to just quickly git pull + submodule update but skip rebuilding all those dependencies each time, then run `make update-no-recompile` (this will miss new library dependencies - do a full `make update` if you encounter issues).
Continuous Integration is run on this repo with tests for success and failure scenarios:
- unit tests for the custom supporting python library
- integration tests of the top level programs using the libraries for things like option parsing
- functional tests for the top level programs using local test data and Docker containers
To trigger all tests run:
make test
which will start with the underlying libraries, then move on to top level integration tests and functional tests using docker containers if docker is available.
Patches, improvements and even general feedback are welcome in the form of GitHub pull requests and issue tickets.
You might also be interested in the following really nice Jupyter notebook for HDFS space analysis created by another Hortonworks guy, Jonas Straub:
https://github.com/mr-jstraub/HDFSQuota/blob/master/HDFSQuota.ipynb
The rest of my original source repos are here.
Pre-built Docker images are available on my DockerHub.