sammgithub/pythoncode-tutorials (Public)

forked from x4nth055/pythoncode-tutorials


Commit 67b14ec

added extract links from pdf tutorial

1 parent 4d3c786, commit 67b14ec

7 files changed, +44 -0 lines changed

`README.md`

Lines changed: 1 addition & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -88,6 +88,7 @@ This is a repository of all the tutorials of [The Python Code](https://www.thepy`
`88`	`88`	`- [How to Extract and Submit Web Forms from a URL using Python](https://www.thepythoncode.com/article/extracting-and-submitting-web-page-forms-in-python). ([code](web-scraping/extract-and-fill-forms))`
`89`	`89`	`- [How to Get Domain Name Information in Python](https://www.thepythoncode.com/article/extracting-domain-name-information-in-python). ([code](web-scraping/get-domain-info))`
`90`	`90`	`- [How to Extract YouTube Comments in Python](https://www.thepythoncode.com/article/extract-youtube-comments-in-python). ([code](web-scraping/youtube-comments-extractor))`
	`91`	`+- [How to Extract All PDF Links in Python](https://www.thepythoncode.com/article/extract-pdf-links-with-python). ([code](web-scraping/pdf-url-extractor))`
`91`	`92`
`92`	`93`	`- ### [Python Standard Library](https://www.thepythoncode.com/topic/python-standard-library)`
`93`	`94`	`- [How to Transfer Files in the Network using Sockets in Python](https://www.thepythoncode.com/article/send-receive-files-using-sockets-python). ([code](general/transfer-files/))`
`93`	`94`	`-[How to Transfer Files in the Network using Sockets in Python](https://www.thepythoncode.com/article/send-receive-files-using-sockets-python). ([code](general/transfer-files/))`

`web-scraping/pdf-url-extractor/1710.05006.pdf`

5.09 MB

Binary file not shown.

`web-scraping/pdf-url-extractor/1810.04805.pdf`

757 KB

Binary file not shown.

`web-scraping/pdf-url-extractor/README.md`

Lines changed: 4 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -0,0 +1,4 @@`
	`1`	`+# [How to Extract All PDF Links in Python](https://www.thepythoncode.com/article/extract-pdf-links-with-python)`
	`2`	`+To run this:`
	`3`	+- `pip3 install -r requirements.txt`
	`4`	+- Use `pdf_link_extractor.py` to get clickable links, and `pdf_link_extractor_regex.py` to get links that are in text form.

`web-scraping/pdf-url-extractor/pdf_link_extractor.py`

Lines changed: 15 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -0,0 +1,15 @@`
	`1`	`+import pikepdf # pip3 install pikepdf`
	`2`	`+`
	`3`	`+file = "1810.04805.pdf"`
	`4`	`+# file = "1710.05006.pdf"`
	`5`	`+pdf_file = pikepdf.Pdf.open(file)`
	`6`	`+urls = []`
	`7`	`+# iterate over PDF pages`
	`8`	`+for page in pdf_file.pages:`
	`9`	`+    for annots in page.get("/Annots"):`
	`10`	`+        uri = annots.get("/A").get("/URI")`
	`11`	`+        if uri is not None:`
	`12`	`+            print("[+] URL Found:", uri)`
	`13`	`+            urls.append(uri)`
	`14`	`+`
	`15`	`+print("[*] Total URLs extracted:", len(urls))`
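The committed loop assumes that every page carries an `/Annots` array and that every annotation has an `/A` action. If `page.get("/Annots")` returns `None`, iterating over it would raise a `TypeError`. The following is a minimal defensive sketch, not part of the commit: the helper names `extract_uris` and `extract_uris_from_file` are hypothetical, and the demo pages are plain dictionaries standing in for pikepdf objects, since the helper only relies on `.get()` lookups.

```python
def extract_uris(pages):
    """Collect /URI targets from link annotations.

    Pages without /Annots and annotations without /A are skipped
    instead of raising an error.
    """
    urls = []
    for page in pages:
        annotations = page.get("/Annots")
        if annotations is None:
            continue  # page has no annotations at all
        for annotation in annotations:
            action = annotation.get("/A")
            if action is None:
                continue  # annotation without an action, e.g. a text note
            uri = action.get("/URI")
            if uri is not None:
                urls.append(str(uri))
    return urls


def extract_uris_from_file(path):
    """Hypothetical wrapper for a real PDF; requires pikepdf."""
    import pikepdf  # pip3 install pikepdf

    with pikepdf.Pdf.open(path) as pdf_file:
        return extract_uris(pdf_file.pages)


# Stand-in data (an assumption for illustration, not read from a PDF):
# one page with a link and a note, and one page with no annotations.
fake_pages = [
    {"/Annots": [
        {"/A": {"/URI": "https://arxiv.org/abs/1810.04805"}},
        {"/Subtype": "/Text"},
    ]},
    {},
]
print(extract_uris(fake_pages))  # prints ['https://arxiv.org/abs/1810.04805']
```

With pikepdf installed, `extract_uris_from_file("1810.04805.pdf")` would be the equivalent of the committed script, returning the list rather than printing each entry.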

`web-scraping/pdf-url-extractor/pdf_link_extractor_regex.py`

Lines changed: 22 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -0,0 +1,22 @@`
	`1`	`+import fitz # pip install PyMuPDF`
	`2`	`+import re`
	`3`	`+`
	`4`	`+# a regular expression of URLs`
	`5`	`+url_regex = r"https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)"`
	`6`	`+# extract raw text from pdf`
	`7`	`+# file = "1710.05006.pdf"`
	`8`	`+file = "1810.04805.pdf"`
	`9`	`+# open the PDF file`
	`10`	`+with fitz.open(file) as pdf:`
	`11`	`+    text = ""`
	`12`	`+    for page in pdf:`
	`13`	`+        # extract text of each PDF page`
	`14`	`+        text += page.getText()`
	`15`	`+urls = []`
	`16`	`+# extract all urls using the regular expression`
	`17`	`+for match in re.finditer(url_regex, text):`
	`18`	`+    url = match.group()`
	`19`	`+    print("[+] URL Found:", url)`
	`20`	`+    urls.append(url)`
	`21`	`+print("[*] Total URLs extracted:", len(urls))`
	`22`	`+`
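To see what the committed pattern matches without needing a PDF or PyMuPDF, the regular expression can be tried on a plain string. This is a small sketch under stated assumptions: the helper name `find_urls` and the sample sentence are made up for illustration; only `url_regex` is copied from the commit.

```python
import re

# Pattern copied verbatim from pdf_link_extractor_regex.py
url_regex = r"https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)"


def find_urls(text):
    """Return every full URL match in the given text, in order."""
    # finditer + group() yields the whole match; re.findall would return
    # tuples of the two capture groups instead, because the pattern
    # contains parenthesised groups.
    return [match.group() for match in re.finditer(url_regex, text)]


# Hypothetical sample text standing in for text extracted from a PDF
sample_text = (
    "Code is at https://github.com/google-research/bert and "
    "see http://www.example.com/a?b=1 too."
)
for url in find_urls(sample_text):
    print("[+] URL Found:", url)
```

Two caveats when adapting the committed script: the pattern accepts `.` inside the path, so a URL that ends a sentence keeps its trailing period, and recent PyMuPDF releases spell the text extraction method `page.get_text()`, with `getText()` being the older camelCase name.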

`web-scraping/pdf-url-extractor/requirements.txt`

Lines changed: 2 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -0,0 +1,2 @@`
	`1`	`+pikepdf`
	`2`	`+PyMuPDF`
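Note that `requirements.txt` lists pip distribution names, while the committed regex script imports PyMuPDF under the module name `fitz`. A quick way to check that both dependencies are importable before running the scripts could look like the sketch below; the names `REQUIRED` and `missing_packages` are hypothetical and not part of the commit.

```python
import importlib.util

# pip distribution name -> module name imported by the committed scripts
REQUIRED = {"pikepdf": "pikepdf", "PyMuPDF": "fitz"}


def missing_packages(required=REQUIRED):
    """Return the pip names whose modules cannot be found."""
    return [
        pip_name
        for pip_name, module_name in required.items()
        if importlib.util.find_spec(module_name) is None
    ]


missing = missing_packages()
if missing:
    print("Missing packages, try: pip3 install " + " ".join(missing))
else:
    print("All requirements are importable.")
```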

0 commit comments
