Commit 83c4743

added pdf image extractor tutorial

1 parent 79a41f5, commit 83c4743

+47 -0 lines changed

Lines changed: 1 addition & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -90,6 +90,7 @@ This is a repository of all the tutorials of [The Python Code](https://www.thepy`
`90`	`90`	`- [How to Get Domain Name Information in Python](https://www.thepythoncode.com/article/extracting-domain-name-information-in-python). ([code](web-scraping/get-domain-info))`
`91`	`91`	`- [How to Extract YouTube Comments in Python](https://www.thepythoncode.com/article/extract-youtube-comments-in-python). ([code](web-scraping/youtube-comments-extractor))`
`92`	`92`	`- [How to Extract All PDF Links in Python](https://www.thepythoncode.com/article/extract-pdf-links-with-python). ([code](web-scraping/pdf-url-extractor))`
	`93`	`+- [How to Extract Images from PDF in Python](https://www.thepythoncode.com/article/extract-pdf-images-in-python). ([code](web-scraping/pdf-image-extractor))`
`93`	`94`
`94`	`95`	`- ### [Python Standard Library](https://www.thepythoncode.com/topic/python-standard-library)`
`95`	`96`	`- [How to Transfer Files in the Network using Sockets in Python](https://www.thepythoncode.com/article/send-receive-files-using-sockets-python). ([code](general/transfer-files/))`

5.09 MB

Binary file not shown.

Lines changed: 15 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -0,0 +1,15 @@`
	`1`	`+# [How to Extract Images from PDF in Python](https://www.thepythoncode.com/article/extract-pdf-images-in-python)`
	`2`	`+To run this:`
	`3`	+- `pip3 install -r requirements.txt`
	`4`	+- To extract and save all images of the `1710.05006.pdf` PDF file, run:
	`5`	+```
	`6`	`+ python pdf_image_extractor.py 1710.05006.pdf`
	`7`	+ ```
	`8`	`+ This will save all available images in the current directory and output:`
	`9`	+ ```
	`10`	`+ [!] No images found on page 0`
	`11`	`+ [+] Found a total of 3 images in page 1`
	`12`	`+ [+] Found a total of 3 images in page 2`
	`13`	`+ [!] No images found on page 3`
	`14`	`+ [!] No images found on page 4`
	`15`	+ ```

Lines changed: 30 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -0,0 +1,30 @@`
	`1`	`+import fitz # PyMuPDF`
	`2`	`+import io`
	`3`	`+from PIL import Image`
	`4`	`+`
	`5`	`+# file path you want to extract images from`
	`6`	`+file = "1710.05006.pdf"`
	`7`	`+# open the file`
	`8`	`+pdf_file = fitz.open(file)`
	`9`	`+# iterate over PDF pages`
	`10`	`+for page_index in range(len(pdf_file)):`
	`11`	`+    # get the page itself`
	`12`	`+    page = pdf_file[page_index]`
	`13`	`+    image_list = page.getImageList()`
	`14`	`+    # printing number of images found in this page`
	`15`	`+    if image_list:`
	`16`	`+        print(f"[+] Found a total of {len(image_list)} images in page {page_index}")`
	`17`	`+    else:`
	`18`	`+        print("[!] No images found on page", page_index)`
	`19`	`+    for image_index, img in enumerate(page.getImageList(), start=1):`
	`20`	`+        # get the XREF of the image`
	`21`	`+        xref = img[0]`
	`22`	`+        # extract the image bytes`
	`23`	`+        base_image = pdf_file.extractImage(xref)`
	`24`	`+        image_bytes = base_image["image"]`
	`25`	`+        # get the image extension`
	`26`	`+        image_ext = base_image["ext"]`
	`27`	`+        # load it to PIL`
	`28`	`+        image = Image.open(io.BytesIO(image_bytes))`
	`29`	`+        # save it to local disk`
	`30`	`+        image.save(open(f"image{page_index+1}_{image_index}.{image_ext}", "wb"))`
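
The script above hardcodes the input path, while the README invokes it with a path argument. The following is a minimal sketch of a function-based variant, not the committed code. It assumes a recent PyMuPDF release, where the snake_case names `get_images` and `extract_image` replace `getImageList` and `extractImage`. The names `image_filename`, `extract_images`, and `out_dir` are hypothetical and introduced only for illustration. Because `extract_image` already returns the bytes in the format named by `"ext"`, the sketch writes them to disk directly instead of round-tripping through PIL.

```python
from pathlib import Path


def image_filename(page_index: int, image_index: int, ext: str) -> str:
    """Build the same name pattern as the script: image<page>_<n>.<ext>."""
    # PyMuPDF pages are 0-based; the file name uses a 1-based page number.
    return f"image{page_index + 1}_{image_index}.{ext}"


def extract_images(pdf_path: str, out_dir: str = ".") -> int:
    """Save every embedded image of pdf_path into out_dir; return the count."""
    # Imported lazily so image_filename() is usable without PyMuPDF installed.
    import fitz  # PyMuPDF

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = 0
    pdf_file = fitz.open(pdf_path)
    try:
        for page_index in range(len(pdf_file)):
            # Assumption: snake_case API of recent PyMuPDF versions.
            image_list = pdf_file[page_index].get_images()
            if image_list:
                print(f"[+] Found a total of {len(image_list)} images in page {page_index}")
            else:
                print("[!] No images found on page", page_index)
            for image_index, img in enumerate(image_list, start=1):
                # img[0] is the XREF of the image object.
                base_image = pdf_file.extract_image(img[0])
                name = image_filename(page_index, image_index, base_image["ext"])
                # The extracted bytes are already encoded in the "ext" format.
                (out / name).write_bytes(base_image["image"])
                saved += 1
    finally:
        pdf_file.close()
    return saved
```

A caller could pass `sys.argv[1]` as `pdf_path` to match the README invocation. Writing the bytes directly also avoids keeping an unclosed file handle around, which the `image.save(open(...))` call in the script does.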

Lines changed: 1 addition & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -0,0 +1 @@`
	`1`	`+PyMuPDF`
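
The requirements file above lists only PyMuPDF, while the script also imports `PIL`, which is provided by the Pillow package. As a rough sketch, a pre-flight check along these lines could report which modules are missing before the script runs. The helper name `missing_modules` is hypothetical and not part of the commit.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if find_spec(name) is None]


# "fitz" is installed by PyMuPDF and "PIL" by Pillow.
missing = missing_modules(["fitz", "PIL"])
if missing:
    print("Missing modules:", ", ".join(missing))
```

If `PIL` is reported as missing, installing Pillow alongside the listed requirement should resolve it.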
