HTML::SimpleLinkExtor - Extract links from HTML

SYNOPSIS

    use HTML::SimpleLinkExtor;

    my $extor = HTML::SimpleLinkExtor->new();

    $extor->parse_file($filename);

    #--or--

    $extor->parse($html);

    $extor->parse_file($other_file); # get more links

    $extor->clear_links; # reset the link list

    #extract all of the links
    @all_links   = $extor->links;

    #extract the img links
    @img_srcs    = $extor->img;

    #extract the frame links
    @frame_srcs  = $extor->frame;

    #extract the hrefs
    @area_hrefs  = $extor->area;
    @a_hrefs     = $extor->a;
    @base_hrefs  = $extor->base;
    @hrefs       = $extor->href;

    #extract the body background link
    @body_bg     = $extor->body;
    @background  = $extor->background;

    @links       = $extor->schemes( 'http' );

DESCRIPTION

This is a simple HTML link extractor designed for the person who does not want to deal with the intricacies of HTML::Parser or the de-referencing needed to get links out of HTML::LinkExtor.

You can extract all the links or some of the links (based on the HTML tag name or attribute name). If a <BASE HREF> tag is found, all of the relative URLs will be resolved according to that reference.
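As a minimal sketch of the BASE handling described above, the following parses a made-up document and collects the A and IMG links. The example.com URLs and file names are assumptions for illustration only. The values are stringified on the assumption that resolved links may be returned as URI objects rather than plain strings.

```perl
use strict;
use warnings;
use HTML::SimpleLinkExtor;

# Hypothetical document: the BASE tag names the location that
# relative links should be resolved against.
my $html = <<'HTML';
<html>
<head><base href="http://www.example.com/docs/"></head>
<body>
<a href="intro.html">Intro</a>
<img src="images/logo.png">
</body>
</html>
HTML

# No base argument, so the BASE tag in the document is used.
my $extor = HTML::SimpleLinkExtor->new;
$extor->parse($html);

# Select links by tag name. Stringify each value in case it is a
# URI object (an assumption about the return type).
my @a_hrefs  = map { "$_" } $extor->a;
my @img_srcs = map { "$_" } $extor->img;

print "$_\n" for @a_hrefs, @img_srcs;
```

Both relative links should come back as absolute URLs under http://www.example.com/docs/.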

This module is simply a subclass of HTML::LinkExtor, so it can only parse what that module can handle. Invalid HTML or XHTML may cause problems.

If you parse multiple files, the link list grows and contains the aggregate list of links for all of the files parsed. If you want to reset the link list between files, use the clear_links method.
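The aggregation described above can be sketched as follows, using short in-memory strings with parse instead of files. The HTML snippets and file names are invented for illustration. This sketch reads the list only after both parses are done, then resets it before parsing the next document.

```perl
use strict;
use warnings;
use HTML::SimpleLinkExtor;

my $extor = HTML::SimpleLinkExtor->new;

# Two hypothetical documents parsed with the same object.
$extor->parse('<p><a href="one.html">One</a></p>');
$extor->parse('<p><a href="two.html">Two</a></p>');

# The list now holds the links from both documents.
my @both = $extor->links;
print "aggregate: @both\n";

# Reset before moving on to an unrelated document.
$extor->clear_links;

$extor->parse('<p><a href="three.html">Three</a></p>');
my @fresh = $extor->links;
print "after reset: @fresh\n";
```

Without the clear_links call, the links from the first two documents would be expected to remain in the list alongside the new one.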

Class Methods

  • $extor = HTML::SimpleLinkExtor->new()

    Create the link extractor object.

  • $extor = HTML::SimpleLinkExtor->new($base)

    Create the link extractor object and resolve the relative URLs according to the supplied base URL. The supplied base URL overrides any other base URL found in the HTML.

  • $extor = HTML::SimpleLinkExtor->new('')

    Create the link extractor object and do not resolve relative links.

  • HTML::SimpleLinkExtor->ua;

    Returns the internal user agent, an LWP::UserAgent object.

  • HTML::SimpleLinkExtor->add_tags( TAG [, TAG ] )

    HTML::SimpleLinkExtor keeps an internal list of HTML tags (such as 'a' and 'img') that have URLs as values. If you run into another tag that this module doesn't handle, please send it to me and I'll add it. Until then you can add that tag to the internal list. This affects the entire class, including previously created objects.

  • HTML::SimpleLinkExtor->add_attributes( ATTR [, ATTR] )

    HTML::SimpleLinkExtor keeps an internal list of HTML tag attributes (such as 'href' and 'src') that have URLs as values. If you run into another attribute that this module doesn't handle, please send it to me and I'll add it. Until then you can add that attribute to the internal list. This affects the entire class, including previously created objects.

  • can()

    A smarter can that can tell which attributes are also methods.

  • HTML::SimpleLinkExtor->remove_tags( TAG [, TAG ] )

    Take tags out of the internal list that HTML::SimpleLinkExtor uses to extract URLs. This affects the entire class, including previously created objects.

  • HTML::SimpleLinkExtor->remove_attributes( ATTR [, ATTR] )

    Takes attributes out of the internal list that HTML::SimpleLinkExtor uses to extract URLs. This affects the entire class, including previously created objects.

  • HTML::SimpleLinkExtor->attribute_list

    Returns a list of the attributes HTML::SimpleLinkExtor pays attention to.

  • HTML::SimpleLinkExtor->tag_list

    Returns a list of the tags HTML::SimpleLinkExtor pays attention to. These tags have convenience methods.
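A rough sketch of the class methods above: it contrasts the two base arguments to new and then adds a tag to the class-wide list. The HTML, the example.com base URL, and the choice of the 'link' tag are assumptions made for illustration. Whether a 'link' convenience method becomes available after add_tags is inferred from the tag_list description, so treat that call as unverified.

```perl
use strict;
use warnings;
use HTML::SimpleLinkExtor;

# Hypothetical input with one relative A link and one LINK tag.
my $html = '<link href="style.css" rel="stylesheet">'
         . '<a href="docs/index.html">Docs</a>';

# A supplied base URL is used to resolve relative links.
my $resolving = HTML::SimpleLinkExtor->new('http://www.example.com/');
$resolving->parse($html);
my @resolved = map { "$_" } $resolving->a;

# An empty string asks for relative links to be left alone.
my $raw = HTML::SimpleLinkExtor->new('');
$raw->parse($html);
my @unresolved = map { "$_" } $raw->a;

print "resolved:   @resolved\n";
print "unresolved: @unresolved\n";

# Add a tag to the class-wide list, then check that it is listed.
HTML::SimpleLinkExtor->add_tags('link');
my $has_link_tag = grep { $_ eq 'link' } HTML::SimpleLinkExtor->tag_list;

# Assumption: listed tags get a convenience method of the same name.
my @stylesheets = $raw->link;
print "stylesheets: @stylesheets\n";

# Undo the change, since it affects every object of the class.
HTML::SimpleLinkExtor->remove_tags('link');
my $still_listed = grep { $_ eq 'link' } HTML::SimpleLinkExtor->tag_list;
```

Because add_tags and remove_tags act on the whole class, restoring the list afterwards keeps the change from leaking into other code that uses the module.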

Object Methods

  • $extor->parse_file( $filename )

    Parse the file for links. Inherited from HTML::Parser.

  • $extor->parse_url( $url [, $ua] )

    Fetch URL and parse its content for links.

  • $extor->parse( $data )

    Parse the HTML in $data. Inherited from HTML::Parser.

  • $extor->clear_links

    Clear the link list. This way, you can use the same parser for another file.

  • $extor->links

    Return a list of the links.

  • $extor->img

    Return a list of the links from all the SRC attributes of the IMG tags.

  • $extor->frame

    Return a list of all the links from all the SRC attributes of the FRAME tags.

  • $extor->iframe

    Return a list of all the links from all the SRC attributes of the IFRAME tags.

  • $extor->frames

    Returns the combined list from frame and iframe.

  • $extor->src

    Return a list of the links from all the SRC attributes of any tag.

  • $extor->a

    Return a list of the links from all the HREF attributes of the A tags.

  • $extor->area

    Return a list of the links from all the HREF attributes of the AREA tags.

  • $extor->base

    Return a list of the links from all the HREF attributes of the BASE tags. There should only be one.

  • $extor->href

    Return a list of the links from all the HREF attributes of any tag.

  • $extor->body, $extor->background

    Return the link from the BODY tag's BACKGROUND attribute.

  • $extor->script

    Return the link from the SCRIPT tag's SRC attribute.

  • $extor->schemes( SCHEME, [ SCHEME, ... ] )

    Return the links that use any of the given schemes. These must be absolute URLs (which might include those converted to absolute URLs by specifying a base). SCHEME is case-insensitive. You can specify more than one scheme.

    In list context it returns the links. In scalar context it returns the count of the matching links.

  • $extor->absolute_links

    Returns the absolute URLs (which might include those converted to absolute URLs by specifying a base).

    In list context it returns the links. In scalar context it returns the count of the matching links.

  • $extor->relative_links

    Returns the relative URLs (which might exclude those converted to absolute URLs by specifying a base or having a base in the document).

    In list context it returns the links. In scalar context it returns the count of the matching links.
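The list and scalar behaviour of schemes, absolute_links, and relative_links can be sketched like this. The four links are invented for illustration, and the object is created with an empty base so that the relative link stays relative.

```perl
use strict;
use warnings;
use HTML::SimpleLinkExtor;

# Hypothetical document: three absolute links and one relative link.
my $html = <<'HTML';
<a href="http://www.example.com/a.html">http</a>
<a href="https://www.example.com/b.html">https</a>
<a href="ftp://ftp.example.com/c.txt">ftp</a>
<a href="local/d.html">relative</a>
HTML

# Empty base: do not resolve relative links.
my $extor = HTML::SimpleLinkExtor->new('');
$extor->parse($html);

# List context returns the matching links.
my @http_links = $extor->schemes('http');

# Scalar context returns a count. Scheme names are case-insensitive
# and more than one may be given.
my $web_count = $extor->schemes('HTTP', 'https');

my $absolute_count = $extor->absolute_links;   # scalar context
my @relative       = $extor->relative_links;   # list context

print "http links: @http_links\n";
print "web count: $web_count\n";
print "absolute count: $absolute_count\n";
print "relative links: @relative\n";
```

Had the object been created with a base URL, the relative link would be expected to be counted among the absolute links instead.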

TO DO

This module doesn't handle all of the HTML tags that might have links. If someone wants those, I'll add them, or you can edit %AUTO_METHODS in the source.

CREDITS

Will Crain, who identified a problem with IMG links that had a USEMAP attribute.

AUTHORS

brian d foy, <bdfoy@cpan.org>

Maintained by Nigel Horne, <njh at bandsman.co.uk>

COPYRIGHT AND LICENSE

Copyright © 2004-2019, brian d foy <bdfoy@cpan.org>. All rights reserved.

This program is free software; you can redistribute it and/or modify it under the terms of the Artistic License 2.0.
