# PHPScraper

A universal web-util for PHP.
For full documentation, visit phpscraper.de.
PHPScraper is a versatile web-utility for PHP. Its primary objective is to streamline the process of extracting information from websites, allowing you to focus on accomplishing tasks without getting caught up in the complexities of selectors, data structure preparation, and conversion.
Under the hood, it uses:

- BrowserKit (formerly Goutte) to access the web
- League/URI to process URLs
- donatello-za/rake-php-plus to extract and analyze keywords

See composer.json for more details.
Here are a few impressions of how the library works. More examples are on the project website.
All scraping functionality can be accessed either as a function call or a property call. For example, the title can be accessed in two ways:
```php
// Prep
$web = new \Spekulatius\PHPScraper\PHPScraper;
$web->go('https://google.com');

// Returns "Google"
echo $web->title;

// Also returns "Google"
echo $web->title();
```
Many common use cases are covered already. You can find prepared extractors for various HTML tags, including interesting attributes. You can filter and combine these to your needs. In some cases there is an option to get a simple or detailed version, here in the case of `linksWithDetails`:

```php
$web = new \Spekulatius\PHPScraper\PHPScraper;

// Contains:
// <a href="https://placekitten.com/456/500" rel="ugc">
//   <img src="https://placekitten.com/456/400">
//   <img src="https://placekitten.com/456/300">
// </a>
$web->go('https://test-pages.phpscraper.de/links/image-urls.html');

// Get the first link on the page and print the result
print_r($web->linksWithDetails[0]);
// [
//     'url' => 'https://placekitten.com/456/500',
//     'protocol' => 'https',
//     'text' => '',
//     'title' => null,
//     'target' => null,
//     'rel' => 'ugc',
//     'image' => [
//         'https://placekitten.com/456/400',
//         'https://placekitten.com/456/300'
//     ],
//     'isNofollow' => false,
//     'isUGC' => true,
//     'isSponsored' => false,
//     'isMe' => false,
//     'isNoopener' => false,
//     'isNoreferrer' => false,
// ]
```
If there aren't any matching elements (here: links) on the page, an empty array will be returned. If a method normally returns a string, it might return `null` instead. Details such as `follow_redirects`, etc. are optional configuration parameters (see below).
Most of the DOM should be covered using these methods:

- several meta tags and other `<head>` information
- social media information such as Twitter Cards and Facebook Open Graph
- content: headings, outline, texts and lists
- images
- links
- keywords
A full list of methods with example code can be found on phpscraper.de. Further examples are in the tests.
Besides processing the content on the page itself, you can download files using `fetchAsset`:
```php
// Absolute URL
$csvString = $web->fetchAsset('https://test-pages.phpscraper.de/test.csv');

// Relative URL after navigation
$csvString = $web
    ->go('https://test-pages.phpscraper.de/meta/lorem-ipsum.html')
    ->fetchAsset('/test.csv');
```
You will only need to write the content into a file or cloud storage.
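As a minimal sketch of that last step (the target path is an arbitrary example; only PHP's standard `file_put_contents` is used on top of the call shown above):

```php
<?php

require __DIR__ . '/vendor/autoload.php';

$web = new \Spekulatius\PHPScraper\PHPScraper;

// Fetch the asset as a string ...
$csvString = $web->fetchAsset('https://test-pages.phpscraper.de/test.csv');

// ... and persist it to disk (example path).
file_put_contents(__DIR__ . '/test.csv', $csvString);
```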
PHPScraper can assist in collecting feeds such as RSS feeds, `sitemap.xml` entries and static search indexes. This can be useful when deciding on the next page to crawl or when building up a list of pages on a website.

Here we are processing the sitemap into a set of `FeedEntry` DTOs:
```php
(new \Spekulatius\PHPScraper\PHPScraper)
    ->go('https://phpscraper.de')
    ->sitemap

// array(131) {
//   [0]=>
//   object(Spekulatius\PHPScraper\DataTransferObjects\FeedEntry)#165 (3) {
//     ["title"]=>
//     string(0) ""
//     ["description"]=>
//     string(0) ""
//     ["link"]=>
//     string(22) "https://phpscraper.de/"
//   }
//   [1]=>
//   ...
```
Whenever post-processing is applied, you can fall back to the underlying `*Raw` methods.
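For example, to compare the processed entries with the underlying data (the `sitemapRaw` name here is an assumption based on the `*Raw` naming pattern):

```php
<?php

require __DIR__ . '/vendor/autoload.php';

$web = new \Spekulatius\PHPScraper\PHPScraper;
$web->go('https://phpscraper.de');

// Post-processed: an array of FeedEntry DTOs.
$entries = $web->sitemap;

// Underlying data without the DTO post-processing (name assumed).
$raw = $web->sitemapRaw;
```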
PHPScraper comes out of the box with file/URL processing methods for CSV, XML and JSON:

- `parseJson`
- `parseXml`
- `parseCsv`
- `parseCsvWithHeader` (generates an associative array using the first row as header)

Each method can process both strings and URLs:
```php
// Parse JSON into an array:
$json = $web->parseJson('[{"title": "PHP Scraper: a web utility for PHP", "url": "https://phpscraper.de"}]');
// [
//     'title' => 'PHP Scraper: a web utility for PHP',
//     'url' => 'https://phpscraper.de'
// ]

// Fetch and parse CSV into a simple array:
$csv = $web->parseCsv('https://test-pages.phpscraper.de/test.csv');
// [
//     ['date', 'value'],
//     ['1945-02-06', 4.20],
//     ['1952-03-11', 42],
// ]

// Fetch and parse CSV with the first row as header into an associative array structure:
$csv = $web->parseCsvWithHeader('https://test-pages.phpscraper.de/test.csv');
// [
//     ['date' => '1945-02-06', 'value' => 4.20],
//     ['date' => '1952-03-11', 'value' => 42],
// ]
```
Additional CSV parsing parameters such as the separator, enclosure and escape characters are supported.
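As a sketch of how a custom separator could be passed, assuming the extra parameters follow the order `parseCsv($csvStringOrUrl, $separator, $enclosure, $escape)`:

```php
<?php

require __DIR__ . '/vendor/autoload.php';

$web = new \Spekulatius\PHPScraper\PHPScraper;

// Parse a semicolon-separated CSV string; the separator is
// passed as the second argument (parameter order assumed).
$csv = $web->parseCsv("date;value\n1945-02-06;4.20", ';');
```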
There are plenty of examples on the PHPScraper website and in the tests.
Check the `playground.php` if you prefer learning by doing. You can get it up and running with:
```bash
git clone git@github.com:spekulatius/PHPScraper.git && composer update
```
The future development is organized into milestones. Releases follow semver.
- Improve documentation and examples.
- Organize code better (move websites into separate repos, etc.)
- Add support for feeds and some typical file types.
- Switch from Goutte to Symfony BrowserKit. Goutte has been archived.
- Expand to parse a wider range of types, elements, embeds, etc.
- Improve performance with caching and concurrent fetching of assets
- Minor improvements for parsing methods
TBC.
PHPScraper is sponsored by:
With your support, PHPScraper can become the PHP Swiss Army knife for the web. If you find PHPScraper useful in your work, please consider a sponsorship or donation. Thank you 💪
If needed, you can use the following configuration options:
You can set the browser agent using `setConfig`:

```php
$web->setConfig(['agent' => 'Mozilla/5.0 (X11; Linux x86_64; rv:107.0) Gecko/20100101 Firefox/107.0']);
```

It defaults to `Mozilla/5.0 (compatible; PHP Scraper/1.x; +https://phpscraper.de)`.
You can configure proxy support with `setConfig`:

```php
$web->setConfig(['proxy' => 'http://user:password@127.0.0.1:3128']);
```
You can set the `timeout` using `setConfig`:

```php
$web->setConfig(['timeout' => 15]);
```
Setting the timeout to zero will disable it.
While not recommended, it might be required to disable SSL checks. You can do so using:

```php
$web->setConfig(['disable_ssl' => true]);
```
You can call `setConfig` multiple times. It stores the config and merges new values with the previous settings. Keep this in mind in the unlikely use case of unsetting values.
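To illustrate the merging behaviour (the values below are arbitrary examples):

```php
<?php

require __DIR__ . '/vendor/autoload.php';

$web = new \Spekulatius\PHPScraper\PHPScraper;

// First call: sets both options.
$web->setConfig(['agent' => 'My Example Agent', 'timeout' => 15]);

// Second call: only 'timeout' is overridden; 'agent' is kept from the first call.
$web->setConfig(['timeout' => 30]);
```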
```bash
composer require spekulatius/phpscraper
```
After the installation, the package will be picked up by the Composer autoloader. If you are using a common PHP application or framework such as Laravel or Symfony, you can start scraping now 🚀
If not, or if you are building a standalone scraper, please include the autoloader from `vendor/` at the top of your file:
```php
<?php

require __DIR__ . '/vendor/autoload.php';

// ...
```
Now you can use any of the examples on the documentation website or from the `tests/` folder.
Please consider supporting PHPScraper with a star or sponsorship:

```bash
composer thanks
```
Thank you 💪
The library comes with a PHPUnit test suite. To run the tests, run the following command from the project folder:
```bash
composer test
```
You can find the tests here. The test pages are publicly available.