dhondta/webgrep (public repository)


Grep Web pages with extra features like JS deobfuscation and OCR

License

GPL-3.0 license


WebGrep

Grep Web pages and their resources.

This self-contained tool relies on the well-known grep tool for grepping Web pages. It binds nearly every option of the original tool and adds features such as deobfuscating JavaScript or applying OCR to images before grepping the downloaded resources.
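To illustrate the idea of delegating the matching to grep, here is a minimal sketch that pipes already-downloaded text into the system grep and forwards a few options. The function name `grep_text` is hypothetical and this is not webgrep's actual code; it only assumes a `grep` binary is available on the PATH.

```python
import subprocess

def grep_text(pattern, text, options=()):
    """Pipe `text` into the system grep and return the matching lines.

    Hypothetical helper for illustration; webgrep's own wrapper may differ.
    """
    proc = subprocess.run(
        ["grep", *options, "--", pattern],  # "--" ends option parsing
        input=text,
        capture_output=True,
        text=True,
    )
    # grep exits with 0 on a match, 1 when nothing matched, >1 on error
    if proc.returncode > 1:
        raise RuntimeError(proc.stderr.strip())
    return proc.stdout.splitlines()

html = "<h1>Welcome home</h1>\n<p>nothing here</p>\n"
matches = grep_text("welcome", html, options=["-i", "-n"])
print(matches)
```

Because the options are passed through untouched, flags such as `-i`, `-n` or `-E` keep the meaning they have in grep itself.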

$ pip install webgrep-tool

⏩ Quick Start

Help

$ webgrep --help
usage: webgrep [OPTION]... PATTERN [URL]...

Search for PATTERN in each input URL and its related resources
(images, scripts and style sheets).
By default,
- resources are NOT downloaded
- response HTTP headers are NOT included in grepping ; use '--include-headers'
- PATTERN is a basic regular expression (BRE) ; use '-E' for extended (ERE)

Important note: webgrep does not handle recursion (in other words, it does not
               spider additional web pages).

Examples:
  webgrep example http://www.example.com     # will only grep on HTML code
  webgrep -r example http://www.example.com  # will only grep on LOCAL images, ...
  webgrep -R example http://www.example.com  # will only grep on ALL images, ...

Regexp selection and interpretation:
  -e REGEXP, --regexp REGEXP
                        use PATTERN for matching
  -f FILE, --file FILE  obtain PATTERN from FILE
  -E, --extended-regexp
                        PATTERN is an extended regular expression (ERE)
  -F, --fixed-strings   PATTERN is a set of newline-separated fixed strings
  -G, --basic-regexp    PATTERN is a basic regular expression (BRE)
  -P, --perl-regexp     PATTERN is a Perl regular expression
  -i, --ignore-case     ignore case distinctions
  -w, --word-regexp     force PATTERN to match only whole words
  -x, --line-regexp     force PATTERN to match only whole lines
  -z, --null-data       a data line ends in 0 byte, not newline

Miscellaneous:
  -s, --no-messages     suppress error messages
  -v, --invert-match    select non-matching lines
  -V, --version         print version information and exit
  --help                display this help and exit
  --verbose             verbose mode
  --keep-files          keep temporary files in the temporary directory
  --temp-dir TMP        define the temporary directory (default: /tmp/webgrep)

Output control:
  -m NUM, --max-count NUM
                        stop after NUM matches
  -b, --byte-offset     print the byte offset with output lines
  -n, --line-number     print line number with output lines
  --line-buffered       flush output on every line
  -H, --with-filename   print the file name for each match
  -h, --no-filename     suppress the file name prefix on output
  --label LABEL         use LABEL as the standard input filename prefix
  -o, --only-matching   show only the part of a line matching PATTERN
  -q, --quiet, --silent
                        suppress all normal output
  --binary-files TYPE   assume that binary files are TYPE;
                        TYPE is 'binary', 'text', or 'without-match'
  -a, --text            equivalent to --binary-files=text
  -I                    equivalent to --binary-files=without-match
  -L, --files-without-match
                        print only names of FILEs containing no match
  -l, --files-with-match
                        print only names of FILEs containing matches
  -c, --count           print only a count of matching lines per FILE
  -T, --initial-tab     make tabs line up (if needed)
  -Z, --null            print 0 byte after FILE name

Context control:
  -B NUM, --before-context NUM
                        print NUM lines of leading context
  -A NUM, --after-context NUM
                        print NUM lines of trailing context
  -C NUM, --context NUM
                        print NUM lines of output context

Web options:
  -r, --local-resources
                        also grep local resources (same-origin)
  -R, --all-resources   also grep all resources (even non-same-origin)
  --include-headers     also grep HTTP headers
  --cookie COOKIE       use a session cookie in the HTTP headers
  --referer REFERER     provide the referer in the HTTP headers

Proxy settings (by default, system proxy settings are used):
  -d, --disable-proxy   manually disable proxy
  --http-proxy HTTP     manually set the HTTP proxy
  --https-proxy HTTPS   manually set the HTTPS proxy

Please report bugs on GitHub: https://github.com/dhondta/webgrep

Example

$ ./webgrep -R Welcome https://github.com
      Welcome home, <br>developers
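The difference between `-r` (same-origin resources only) and `-R` (all resources) can be sketched as follows. This is an illustration under assumptions, not webgrep's implementation: the names `origin` and `select_resources` and the `mode` values are hypothetical, and origins are compared naively on scheme, host and explicit port.

```python
from urllib.parse import urljoin, urlsplit

def origin(url):
    """Return the (scheme, host, port) triple used to compare origins."""
    parts = urlsplit(url)
    return (parts.scheme, parts.hostname, parts.port)

def select_resources(page_url, links, mode="none"):
    """Pick the resource URLs to grep.

    mode is 'none' (default: page only), 'local' (like -r) or 'all' (like -R).
    """
    if mode == "none":
        return []
    # Resolve relative links against the page URL first
    absolute = [urljoin(page_url, link) for link in links]
    if mode == "all":
        return absolute
    return [u for u in absolute if origin(u) == origin(page_url)]

page = "http://www.example.com/index.html"
links = ["/img/logo.png", "style.css", "https://cdn.example.net/app.js"]
print(select_resources(page, links, mode="local"))
print(select_resources(page, links, mode="all"))
```

Note that this naive comparison treats an explicit default port (for example `:80`) as different from an omitted one; a more careful version would normalise ports first.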

📌 Resource Handlers

Definitions:

Resource (what is being processed): Web page, images, Javascript, CSS
Handler (how a resource is processed): CSS unminifying, OCR, deobfuscation, EXIF data retrieval, ...
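One way to picture the resource/handler relationship is a small registry that maps each resource type to the handlers applied to it. The sketch below is purely illustrative: the `HANDLERS` table, the `handler` decorator, the `process` function and the toy handler are all hypothetical names, not taken from webgrep's code.

```python
from collections import defaultdict

# resource type -> list of handler functions (hypothetical registry)
HANDLERS = defaultdict(list)

def handler(resource_type):
    """Decorator registering a function as a handler for a resource type."""
    def register(func):
        HANDLERS[resource_type].append(func)
        return func
    return register

@handler("style")
def drop_blank_lines(text):
    # Toy handler standing in for something like CSS unminifying
    return "\n".join(line for line in text.splitlines() if line.strip())

def process(resource_type, content):
    """Return the raw content followed by each handler's output."""
    results = [content]
    for func in HANDLERS.get(resource_type, []):
        results.append(func(content))
    return results

print(process("style", "a\n\nb"))
print(process("page", "<html></html>"))
```

Each element of the returned list could then be grepped in turn, so a pattern hidden in the raw form of a resource may still be found in a processed form.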

The handlers are defined in the # --...-- HANDLERS SECTION --...-- of the code. Currently available handlers:

Images

EXIF: using exiftool
Steganography: using steghide (with a blank password)
Strings: using strings
OCR: using tesseract
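As a rough sketch, the four image handlers above could correspond to external command lines like the ones built below. The exact arguments webgrep passes are an assumption here; the function `image_commands` and the output file name `stego.out` are hypothetical, and the commands are only constructed, not executed.

```python
import shlex

def image_commands(path, stego_out="stego.out"):
    """Build one plausible command line per image handler (not executed)."""
    return {
        # Dump the image metadata
        "exif": ["exiftool", path],
        # Try to extract hidden data with an empty passphrase
        "steganography": ["steghide", "extract", "-sf", path,
                          "-p", "", "-xf", stego_out, "-f"],
        # Print the printable character sequences found in the file
        "strings": ["strings", path],
        # Run OCR and write the recognised text to standard output
        "ocr": ["tesseract", path, "stdout"],
    }

for name, argv in image_commands("photo.jpg").items():
    print(f"{name}: {shlex.join(argv)}")
```

Each list can be handed to `subprocess.run` as-is once the corresponding tool is installed; the text each tool prints is what would then be grepped.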

Scripts

JavaScript beautifying and deobfuscation: using jsbeautifier
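A minimal sketch of the beautifying step, assuming the third-party `jsbeautifier` package (`pip install jsbeautifier`) is available; the wrapper `beautify_js` is a hypothetical name, and it simply returns the source unchanged when the package is missing. The package also ships unpackers for some common packers, which is presumably what the deobfuscation above relies on.

```python
try:
    import jsbeautifier  # third-party package, may not be installed
except ImportError:
    jsbeautifier = None

def beautify_js(source):
    """Return a beautified copy of `source`, or `source` itself as a fallback."""
    if jsbeautifier is None:
        return source
    return jsbeautifier.beautify(source)

print(beautify_js("var a=1;var b=2;"))
```

Putting each statement on its own line matters for grep, which matches line by line: on minified code, a single match would otherwise print the whole script.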

Styles

Unminifying: using regular expressions

Note: images found in the CSS files are also processed.
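The following sketch shows what regex-based unminifying and the extraction of image references from CSS might look like. The expressions are illustrative assumptions, not the ones used by webgrep, and the names `unminify_css`, `css_image_urls` and `URL_RE` are hypothetical. The approach is deliberately naive: a semicolon inside a `url(...)` value (for example a data URI) would be split incorrectly.

```python
import re

def unminify_css(css):
    """Put each rule and declaration on its own line using regexes only."""
    css = re.sub(r"\s*\{\s*", " {\n    ", css)  # open the block, indent
    css = re.sub(r";\s*", ";\n    ", css)        # one declaration per line
    css = re.sub(r"\s*\}\s*", "\n}\n", css)      # close the block
    return css

# Matches url(x), url('x') and url("x"), capturing x
URL_RE = re.compile(r"url\(\s*['\"]?([^'\")]+)['\"]?\s*\)")

def css_image_urls(css):
    """Return the targets of every url(...) reference in a stylesheet."""
    return URL_RE.findall(css)

minified = "a{color:red;background:url(img/bg.png)}"
print(unminify_css(minified))
print(css_image_urls(minified))
```

The URLs returned by `css_image_urls` are relative to the stylesheet, so they would need to be resolved against the stylesheet's own URL before being downloaded and passed to the image handlers.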

👏 Supporters


Languages

Python 100.0%
