Commit 67b14ec

added extract links from pdf tutorial
1 parent 4d3c786 commit 67b14ec

7 files changed (+44, -0 lines changed)


README.md

Lines changed: 1 addition & 0 deletions
@@ -88,6 +88,7 @@ This is a repository of all the tutorials of [The Python Code](https://www.thepy
 - [How to Extract and Submit Web Forms from a URL using Python](https://www.thepythoncode.com/article/extracting-and-submitting-web-page-forms-in-python). ([code](web-scraping/extract-and-fill-forms))
 - [How to Get Domain Name Information in Python](https://www.thepythoncode.com/article/extracting-domain-name-information-in-python). ([code](web-scraping/get-domain-info))
 - [How to Extract YouTube Comments in Python](https://www.thepythoncode.com/article/extract-youtube-comments-in-python). ([code](web-scraping/youtube-comments-extractor))
+- [How to Extract All PDF Links in Python](https://www.thepythoncode.com/article/extract-pdf-links-with-python). ([code](web-scraping/pdf-url-extractor))
 
 - ### [Python Standard Library](https://www.thepythoncode.com/topic/python-standard-library)
 - [How to Transfer Files in the Network using Sockets in Python](https://www.thepythoncode.com/article/send-receive-files-using-sockets-python). ([code](general/transfer-files/))
5.09 MB
Binary file not shown.
757 KB
Binary file not shown.
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
+# [How to Extract All PDF Links in Python](https://www.thepythoncode.com/article/extract-pdf-links-with-python)
+To run this:
+- `pip3 install -r requirements.txt`
+- Use `pdf_link_extractor.py` to get clickable links, and `pdf_link_extractor_regex.py` to get links that are in text form.
Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
+import pikepdf  # pip3 install pikepdf
+
+file = "1810.04805.pdf"
+# file = "1710.05006.pdf"
+pdf_file = pikepdf.Pdf.open(file)
+urls = []
+# iterate over PDF pages; a page without annotations yields an empty list
+for page in pdf_file.pages:
+    for annots in page.get("/Annots", []):
+        uri = annots.get("/A", {}).get("/URI")  # None when the annotation is not a URI link
+        if uri is not None:
+            print("[+] URL Found:", uri)
+            urls.append(uri)
+
+print("[*] Total URLs extracted:", len(urls))
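The annotation walk above can also be wrapped in a reusable function. The following is a minimal sketch, not part of the commit: the name `collect_uris` and the sample data are hypothetical, and the stand-in pages are plain dicts so the example runs without a PDF. It assumes the page and annotation objects expose a dict-style `.get(key, default)`, as pikepdf objects do, and it additionally deduplicates the results.

```python
def collect_uris(pages):
    """Return unique link targets, in first-seen order, from page-like mappings."""
    seen = set()
    uris = []
    for page in pages:
        # A page without an /Annots entry contributes nothing.
        for annot in page.get("/Annots", []):
            action = annot.get("/A", {})
            uri = action.get("/URI")
            if uri is None:
                continue  # e.g. an internal "go to page" link
            uri = str(uri)  # normalise to a plain str before comparing
            if uri not in seen:
                seen.add(uri)
                uris.append(uri)
    return uris


# Hypothetical stand-in pages built from plain dicts.
sample_pages = [
    {"/Annots": [{"/A": {"/URI": "https://example.com/a"}},
                 {"/A": {"/S": "/GoTo"}}]},   # action without a /URI
    {},                                        # page without annotations
    {"/Annots": [{"/A": {"/URI": "https://example.com/a"}},
                 {"/A": {"/URI": "https://example.com/b"}}]},
]

print(collect_uris(sample_pages))
# ['https://example.com/a', 'https://example.com/b']
```

With a real file, the same function would presumably be called as `collect_uris(pikepdf.Pdf.open(file).pages)`; that usage is an assumption and has not been checked against a specific pikepdf version.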
Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
+import fitz  # pip install PyMuPDF
+import re
+
+# a regular expression of URLs
+url_regex = r"https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)"
+# extract raw text from pdf
+# file = "1710.05006.pdf"
+file = "1810.04805.pdf"
+# open the PDF file
+with fitz.open(file) as pdf:
+    text = ""
+    for page in pdf:
+        # extract text of each PDF page (get_text() is the current name; getText() is the deprecated camelCase alias)
+        text += page.get_text()
+urls = []
+# extract all urls using the regular expression
+for match in re.finditer(url_regex, text):
+    url = match.group()
+    print("[+] URL Found:", url)
+    urls.append(url)
+print("[*] Total URLs extracted:", len(urls))
+
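The regular expression in the script above can be tried without a PDF. The sketch below is illustrative only: `find_urls` and `sample_text` are hypothetical names, and the trailing-punctuation cleanup is an addition that the committed script does not perform. It is included because the pattern's final character class accepts `.`, so a URL at the end of a sentence is matched together with the full stop.

```python
import re

# Same pattern as in the script above.
url_regex = r"https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)"


def find_urls(text):
    """Return every URL matched in text, with trailing sentence punctuation removed."""
    found = []
    for match in re.finditer(url_regex, text):
        # Strip a trailing '.', ',' or ';' that belongs to the sentence, not the URL.
        found.append(match.group().rstrip(".,;"))
    return found


# Hypothetical stand-in for text extracted from PDF pages.
sample_text = (
    "BERT code: https://github.com/google-research/bert\n"
    "Docs at http://www.example.com/docs?id=1.\n"
    "No link here."
)

print(find_urls(sample_text))
# ['https://github.com/google-research/bert', 'http://www.example.com/docs?id=1']
```

Stripping punctuation this way is a trade-off: a URL that genuinely ends in one of those characters would be shortened as well.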
Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+pikepdf
+PyMuPDF
