
📰 Build RSS 2.0 feeds from websites (and JSON APIs) automatically or with a few CSS selectors.


html2rss/html2rss


html2rss is a Ruby gem that generates RSS 2.0 feeds from websites.

Its auto_source scraper finds items for the RSS feed automatically. 🧙🏼

Additionally, you can use the selectors scraper to control the information extraction. It takes plain old CSS selectors and extracts the information with help from extractors and chainable post processors. It also supports scraping JSON responses.

To scrape websites that require JavaScript, html2rss can request them using a headless browser (Puppeteer / browserless.io). Independently of the request strategy used, you can set HTTP request headers.

🤩 Like it? Star it! ⭐️
😍 Endorse it? Sponsor it! 💓

Tip

Want to retrieve your RSS feeds via HTTP? Check out html2rss-web!

Getting started

Install Ruby (the latest version is recommended) on your machine and run gem install html2rss in your terminal.

After the installation has finished, html2rss help will print usage information.

use automatic generation

html2rss offers an automatic RSS generation feature. Try it on CLI with:

html2rss auto https://unmatchedstyle.com/

creating a feed config file and using it

If the results are not to your satisfaction, you can create a feed config file.

Create a file called my_config_file.yml with this sample content:

    channel:
      url: https://unmatchedstyle.com
    selectors:
      items:
        selector: "article[id^='post-']"
        enhance: true
    # auto_source: {} # Enables auto_source additionally when uncommented

Build the feed from this config with: html2rss feed ./my_config_file.yml.

The feed config and its options

html2rss is configured using channel, selectors, strategy, headers, stylesheets and auto_source. The possible options of each are explained below.

Good to know:

Alright, let's dive in.

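The channel

| attribute   |          | type    | default        | remark                                            |
| ----------- | -------- | ------- | -------------- | ------------------------------------------------- |
| url         | required | String  |                |                                                   |
| title       | optional | String  | auto-generated |                                                   |
| description | optional | String  | auto-generated | Retrieved from meta description tags              |
| author      | optional | String  | blank          | Format: email (Name)                              |
| ttl         | optional | Integer | auto-generated | Response's max-age, falls back to 360 (minutes)   |
| language    | optional | String  | auto-generated | Determined by lang attribute                      |
| time_zone   | optional | String  | 'UTC'          | TimeZone name                                     |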
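As a rough illustration of the ttl default, the sketch below derives a TTL in minutes from a Cache-Control header value and falls back to 360. The helper name sketch_ttl, the header parsing, and the seconds-to-minutes conversion are assumptions made for this example; they are not taken from html2rss's implementation.

```ruby
# Hypothetical helper, not part of html2rss.
# max-age is expressed in seconds, while the RSS ttl is expressed in minutes.
def sketch_ttl(cache_control, fallback: 360)
  seconds = cache_control.to_s[/max-age=(\d+)/, 1]
  seconds ? seconds.to_i / 60 : fallback
end

puts sketch_ttl('public, max-age=86400') # => 1440
puts sketch_ttl(nil)                     # => 360
```

Setting ttl explicitly in the channel avoids depending on whatever the server sends.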

The scraper auto_source: automatically find the items

The auto_source scraper finds items automatically. To find them, its scrapers search for:

  1. schema: parses <script type="application/ld+json"> tags which contain Schema.org objects like Article.
  2. semantic_html: looks for semantic HTML tags.
  3. html: tries to find articles by selecting frequently occurring selectors.

It's a good idea to give auto_source a try before starting to configure the selectors scraper.

You can fine-tune the scraper settings like this:

    channel:
      url: https://example.com
    auto_source:
      scraper:
        schema:
          enabled: false # default: true
        semantic_html:
          enabled: false # default: true
        html:
          enabled: true
          minimum_selector_frequency: 3 # default: 2
      cleanup:
        keep_different_domain: false # default: true
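One way to picture a frequency threshold such as minimum_selector_frequency is sketched below: count how often each selector path occurs and keep only those that reach the minimum. This is an illustration under assumptions. The helper sketch_frequent_selectors and the sample paths are hypothetical, and the html scraper's real heuristics may differ.

```ruby
# Hypothetical helper, not part of html2rss: keeps only the selector
# paths that occur at least `minimum_selector_frequency` times.
def sketch_frequent_selectors(paths, minimum_selector_frequency: 2)
  paths.tally
       .select { |_path, count| count >= minimum_selector_frequency }
       .keys
end

paths = ['main > article', 'main > article', 'main > article',
         'nav > a', 'nav > a', 'footer > p']

p sketch_frequent_selectors(paths)
# => ["main > article", "nav > a"]
p sketch_frequent_selectors(paths, minimum_selector_frequency: 3)
# => ["main > article"]
```

Raising the threshold trades recall for precision: fewer navigation or footer links slip in, but short item lists may be missed.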

The scraper selectors: more control

Info

To build a valid RSS 2.0 item, you need at least a title or a description in your item. You can, of course, have both.

The selectors scraper lets you specify CSS selectors, which gives you full control over the extraction.

You must give an items selector hash, which contains the CSS selector. The items selector selects a collection of HTML tags from which the RSS feed items are built. Except for the items selector, all other keys are scoped to each item of the collection.

Having an items and a title selector is enough to build a simple feed:

    channel:
      url: "https://example.com"
    selectors:
      items:
        selector: ".article"
      title:
        selector: "h1"

Automatically enhance items

Specifying the title, url or image selector in every config quickly becomes cumbersome. Therefore, html2rss enhances every item automatically. However, if you specify a selector, its value will be used.

    channel:
      url: "https://example.com"
    selectors:
      items:
        selector: ".article"
        enhance: true # default: true

Selectors which will be included in the RSS feed

Your selectors hash can contain arbitrarily named selectors, but only a few will make it into the RSS feed (due to the RSS 2.0 specification):

| RSS 2.0 tag | name in html2rss | remark                                   |
| ----------- | ---------------- | ---------------------------------------- |
| title       | title            |                                          |
| description | description      | Will be sanitized when it contains HTML. |
| link        | url              | A URL.                                   |
| author      | author           |                                          |
| category    | categories       | See notes below.                         |
| guid        | guid             | Generated automatically. See notes below. |
| enclosure   | enclosure        | See notes below.                         |
| pubDate     | published_at     | An instance of Time.                     |
| comments    | comments         | A URL.                                   |
| source      | source           | Not yet supported.                       |

A selector and its options

Every named selector (e.g. title, description, see above) in your selectors can have these attributes:

| name         | value                                                   |
| ------------ | ------------------------------------------------------- |
| selector     | The CSS selector to select the tag with the information. |
| extractor    | Name of the extractor. See notes below.                 |
| post_process | An array. See notes below.                              |

Using extractors

Extractors help with extracting the information from the selected HTML tag.

  • The default extractor is text, which returns the tag's inner text.
  • The html extractor returns the tag's outer HTML.
  • The href extractor returns a URL from the tag's href attribute and corrects relative ones to absolute ones.
  • The attribute extractor returns the value of that tag's attribute.
  • The static extractor returns the configured static value (it doesn't extract anything).
  • See file list of extractors.

Extractors might need extra attributes on the selector hash. 👉 Read their docs for usage examples.
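To make the "corrects relative ones to absolute ones" behaviour of the href extractor concrete, here is a minimal sketch using Ruby's standard URI library. The helper sketch_absolute_url is hypothetical and only illustrates the idea of resolving against the channel URL; the extractor's own code may handle more edge cases.

```ruby
require 'uri'

# Hypothetical helper, not part of html2rss: resolves an href against
# the channel URL. Absolute hrefs are returned unchanged.
def sketch_absolute_url(href, channel_url)
  URI.join(channel_url, href).to_s
end

puts sketch_absolute_url('/posts/1', 'https://example.com/blog')
# => https://example.com/posts/1
puts sketch_absolute_url('https://example.org/a', 'https://example.com')
# => https://example.org/a
```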

See a Ruby example

    Html2rss.feed(
      channel: {},
      selectors: {
        url: { selector: 'a', extractor: 'href' }
      }
    )

See a YAML feed config example

    channel:
      # ... omitted
    selectors:
      # ... omitted
      url:
        selector: "a"
        extractor: "href"

Using post processors

Extracted information can be further manipulated with post processors. You can specify one or more post processors, and they will run in the order given.

| name             | description                                                                        |
| ---------------- | ---------------------------------------------------------------------------------- |
| gsub             | Allows global substitution operations on Strings (Regexp or simple pattern).       |
| html_to_markdown | Converts HTML to Markdown, using reverse_markdown.                                 |
| markdown_to_html | Converts Markdown to HTML, using kramdown.                                         |
| parse_time       | Parses a String containing a time in a time zone.                                  |
| parse_uri        | Parses a String as URL.                                                            |
| sanitize_html    | Strips unsafe and unneeded HTML and adds security related attributes.              |
| substring        | Cuts a part off of a String, starting at a position.                               |
| template         | Based on a template, it creates a new String filled with other selectors' values.  |

⚠️ Always make use of the sanitize_html post processor for HTML content. Never trust the internet! ⚠️

If the description contains HTML, it will be sanitized automatically.

YAML example: build the description from a template String (in Markdown) and convert that Markdown to HTML

    channel:
      # ... omitted
    selectors:
      # ... omitted
      price:
        selector: '.price'
      description:
        selector: '.section'
        post_process:
          - name: template
            string: |
              # %{self}
              Price: %{price}
          - name: markdown_to_html

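The ordering of post processors can be pictured as a simple fold: each step receives the output of the previous one. The sketch below is a stand-in written for this explanation, not html2rss's own classes. The helper sketch_post_process and its two lambdas are hypothetical and only mimic gsub and template.

```ruby
# Hypothetical stand-ins for two post processors, not html2rss's classes.
def sketch_post_process(value, steps, item = {})
  processors = {
    'gsub' => ->(current, step) { current.gsub(step[:pattern], step[:replacement]) },
    # %{self} is filled with the current value, other names with item values.
    'template' => ->(current, step) { format(step[:string], item.merge(self: current)) }
  }

  # Each step receives the result of the previous step.
  steps.reduce(value) do |current, step|
    processors.fetch(step[:name]).call(current, step)
  end
end

steps = [
  { name: 'gsub', pattern: /^\d+\.\s/, replacement: '' },
  { name: 'template', string: '%{self} rated with: %{user_rating}' }
]

puts sketch_post_process('1. Inception', steps, { user_rating: '9.0' })
# => Inception rated with: 9.0
```

Swapping the two steps would change the result, because the template would then be filled before the leading number is removed.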
Post processor gsub

The post processor gsub makes use of Ruby's gsub method.

| key         | type   | required | note                     |
| ----------- | ------ | -------- | ------------------------ |
| pattern     | String | yes      | Can be Regexp or String. |
| replacement | String | yes      | Can be a backreference.  |

See a Ruby example

    Html2rss.feed(
      channel: {},
      selectors: {
        title: {
          selector: 'a',
          post_process: [
            { name: 'gsub', pattern: 'foo', replacement: 'bar' }
          ]
        }
      }
    )

See a YAML feed config example

    channel:
      # ... omitted
    selectors:
      # ... omitted
      title:
        selector: "a"
        post_process:
          - name: "gsub"
            pattern: "foo"
            replacement: "bar"

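Since the post processor builds on Ruby's String#gsub, it can help to try a pattern in plain Ruby first. The sketch below shows a simple String pattern and a Regexp with backreferences. The helper sketch_gsub is hypothetical, and how html2rss turns a configured pattern String into a Regexp is not shown here.

```ruby
# Hypothetical helper wrapping Ruby's String#gsub.
def sketch_gsub(value, pattern, replacement)
  value.gsub(pattern, replacement)
end

# Simple String pattern, as in the config above.
puts sketch_gsub('foo fighters', 'foo', 'bar')
# => bar fighters

# Regexp with capture groups; '\1' refers back to the first group.
puts sketch_gsub('2024-05-01', /(\d{4})-(\d{2})-(\d{2})/, '\3.\2.\1')
# => 01.05.2024
```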
Adding <category> tags to an item

The categories selector takes an array of selector names. Each value of those selectors will become a <category> on the RSS item.

See a Ruby example

    Html2rss.feed(
      channel: {},
      selectors: {
        genre: {
          # ... omitted
          selector: '.genre'
        },
        branch: { selector: '.branch' },
        categories: %i[genre branch]
      }
    )

See a YAML feed config example

    channel:
      # ... omitted
    selectors:
      # ... omitted
      genre:
        selector: ".genre"
      branch:
        selector: ".branch"
      categories:
        - genre
        - branch

Custom item GUID

By default, html2rss generates a stable GUID automatically, based on the item's url or, as a last resort, on its title or description.

If this is not stable (i.e. your RSS reader frequently shows already read articles as new/unread), you can choose from which attributes the GUID will be built. The principle is the same as for the categories: pass an array of selector names.

See a Ruby example

    Html2rss.feed(
      channel: {},
      selectors: {
        title: {
          # ... omitted
          selector: 'h1'
        },
        url: { selector: 'a', extractor: 'href' },
        guid: %i[url]
      }
    )

See a YAML feed config example

    channel:
      # ... omitted
    selectors:
      # ... omitted
      title:
        selector: "h1"
      url:
        selector: "a"
        extractor: "href"
      guid:
        - url

In all cases, the GUID is eventually encoded as a base-36 CRC32 checksum.
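A base-36 CRC32 checksum can be reproduced with Ruby's standard library, as sketched below. The helper sketch_guid and the way it joins the attribute values are assumptions for illustration; the exact input html2rss feeds into the checksum may differ, so the values will not necessarily match real feeds.

```ruby
require 'zlib'

# Hypothetical helper, not part of html2rss: joins the chosen attribute
# values and encodes their CRC32 checksum in base 36.
def sketch_guid(*values)
  Zlib.crc32(values.join).to_s(36)
end

# The same input always yields the same short identifier.
puts sketch_guid('https://example.com/articles/1')
puts sketch_guid('https://example.com/articles/1')
```

Because the checksum depends only on the chosen values, picking attributes that never change for an item (typically its URL) keeps the GUID stable.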

Adding an <enclosure> tag to an item

An enclosure can be any file, e.g. an image, audio or video (think podcast).

The enclosure selector needs to return a URL of the content to enclose. If the extracted URL is relative, it will be converted to an absolute one using the channel's URL as base.

Since html2rss does no further inspection of the enclosure, its support comes with trade-offs:

  1. The content-type is guessed from the file extension of the URL, unless one is specified in content_type.
  2. If the content-type guessing fails, it will default to application/octet-stream.
  3. The content-length will always be undetermined and therefore stated as 0 bytes.

Read the RSS 2.0 spec for further information on enclosing content.
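The three trade-offs listed above can be summarised in a small sketch. It is an illustration only: the helper sketch_enclosure and its tiny extension table are hypothetical, and html2rss may rely on a fuller MIME type database.

```ruby
require 'uri'

# Hypothetical helper, not part of html2rss.
def sketch_enclosure(url, channel_url:, content_type: nil)
  # Small illustrative table; a real implementation would know more types.
  known_types = {
    '.mp3' => 'audio/mpeg',
    '.mp4' => 'video/mp4',
    '.jpg' => 'image/jpeg',
    '.png' => 'image/png'
  }

  absolute = URI.join(channel_url, url)
  extension = File.extname(absolute.path).downcase
  guessed = known_types.fetch(extension, 'application/octet-stream')

  # The length is not determined, hence always 0.
  { url: absolute.to_s, type: content_type || guessed, length: 0 }
end

p sketch_enclosure('/audio/episode-1.mp3', channel_url: 'https://example.com/podcast')
p sketch_enclosure('/download?id=1', channel_url: 'https://example.com')
```

The second call has no file extension to guess from, so it ends up with application/octet-stream unless content_type is passed explicitly.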

See a Ruby example

    Html2rss.feed(
      channel: {},
      selectors: {
        enclosure: {
          selector: 'audio',
          extractor: 'attribute',
          attribute: 'src',
          content_type: 'audio/mp3'
        }
      }
    )

See a YAML feed config example

    channel:
      # ... omitted
    selectors:
      # ... omitted
      enclosure:
        selector: "audio"
        extractor: "attribute"
        attribute: "src"
        content_type: "audio/mp3"

See the more complex formatting options of the sprintf method.

Scraping and handling JSON responses

When the requested website returns a response with the application/json content type (e.g. because you sent an Accept: application/json header in the request), the selectors scraper naively converts that JSON to XML. You can then query that XML using CSS selectors.

Note

The JSON response must be an Array or Hash for this to work.

See example of a converted JSON object

This JSON object:

    {
      "data": [{ "title": "Headline", "url": "https://example.com" }]
    }

converts to:

    <object>
      <data>
        <array>
          <object>
            <title>Headline</title>
            <url>https://example.com</url>
          </object>
        </array>
      </data>
    </object>

Your items selector would be array > object, the item's URL selector would be url.

See example of a converted JSON array

This JSON array:

    [{ "title": "Headline", "url": "https://example.com" }]

converts to:

    <array>
      <object>
        <title>Headline</title>
        <url>https://example.com</url>
      </object>
    </array>

Your items selector would be array > object, the item's URL selector would be url.
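The conversion shown in the two examples can be approximated with a short recursive function. This is a sketch that reproduces the structure above (Hashes become <object>, Arrays become <array>, keys become tag names); the helper sketch_json_to_xml is hypothetical and is not the converter html2rss ships. It also does not guard against keys that are invalid XML tag names.

```ruby
require 'json'
require 'cgi'

# Hypothetical helper, not part of html2rss.
def sketch_json_to_xml(value)
  case value
  when Hash
    children = value.map do |key, child|
      "<#{key}>#{sketch_json_to_xml(child)}</#{key}>"
    end
    "<object>#{children.join}</object>"
  when Array
    "<array>#{value.map { |child| sketch_json_to_xml(child) }.join}</array>"
  else
    # Scalars become escaped text nodes.
    CGI.escapeHTML(value.to_s)
  end
end

json = '{"data": [{"title": "Headline", "url": "https://example.com"}]}'
puts sketch_json_to_xml(JSON.parse(json))
# => <object><data><array><object><title>Headline</title><url>https://example.com</url></object></array></data></object>
```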

See a Ruby example

    Html2rss.feed(
      headers: { Accept: 'application/json' },
      channel: { url: 'http://domainname.tld/whatever.json' },
      selectors: { title: { selector: 'foo' } }
    )

See a YAML feed config example

    headers:
      Accept: application/json
    channel:
      url: "http://domainname.tld/whatever.json"
    selectors:
      title:
        selector: "foo"

The strategy: customization of how requests to the channel URL are sent

By default, html2rss issues a naive HTTP request and extracts information from the response. That is performant and works for many websites. Under the hood, the faraday gem is used, which gives the default strategy its name: faraday.

Modern websites often do not render much HTML on the server, but evaluate JavaScript on the client to create the HTML. Because the default strategy does not execute any JavaScript, the faraday strategy will not find the "juicy content". For this scenario, try the browserless strategy.

You can write your own custom strategy and make use of it. Consult the docs of Html2rss::RequestService.register_strategy().

strategy: browserless: Browserless.io

You can use Browserless.io to run a headless Chrome browser and return the website's source code after the website generated it. For this, you can either run your own Browserless.io instance (Docker image available; read their license!) or pay them for a hosted instance.

To run a local Browserless.io instance, you can use the following Docker command:

    docker run \
      --rm \
      -p 3000:3000 \
      -e "CONCURRENT=10" \
      -e "TOKEN=6R0W53R135510" \
      ghcr.io/browserless/chromium

To make html2rss use your instance, specify the browserless strategy.

    # auto:
    BROWSERLESS_IO_WEBSOCKET_URL="ws://127.0.0.1:3000" BROWSERLESS_IO_API_TOKEN="6R0W53R135510" \
      html2rss auto --strategy=browserless https://example.com

    # feed:
    BROWSERLESS_IO_WEBSOCKET_URL="ws://127.0.0.1:3000" BROWSERLESS_IO_API_TOKEN="6R0W53R135510" \
      html2rss feed --strategy=browserless the_feed_config.yml

Tip

When running locally with the commands above, you can skip setting the environment variables, as they match the default values.

In your config, set strategy: browserless.

See a YAML feed config example

    strategy: browserless
    headers:
      User-Agent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
    channel:
      url: https://www.imdb.com/user/ur67728460/ratings
      ttl: 1440
    selectors:
      items:
        selector: "li.ipc-metadata-list-summary-item"
      title:
        selector: ".ipc-title__text"
        post_process:
          - name: gsub
            pattern: "/^(\\d+.)\\s/"
            replacement: ""
          - name: template
            string: "%{self} rated with: %{user_rating}"
      url:
        selector: "a.ipc-title-link-wrapper"
        extractor: "href"
      user_rating:
        selector: "[data-testid='ratingGroup--other-user-rating'] > .ipc-rating-star--rating"

The headers: Set any HTTP request header

To set HTTP request headers, add them to headers. This is useful, for example, for APIs that require an Authorization header, or when you'd like to send Accept: application/json.

    headers:
      Authorization: "Bearer YOUR_TOKEN"
      Accept: application/json
    channel:
      url: "https://example.com/api/resource"
    selectors:
      # ... omitted

Or for setting a User-Agent:

    headers:
      User-Agent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
    channel:
      url: "https://example.com"
    selectors:
      # ... omitted
    auto_source: {}

Dynamic parameters in channel and headers attributes

Sometimes there are structurally similar pages with different URLs, or you need to pass some values into the headers. In such cases, you can add dynamic parameters to the channel and headers values.

Example of a dynamic parameter id in the channel URL:

    channel:
      url: "http://domainname.tld/whatever/%<id>s.html"
    headers:
      X-Something: "%<foo>s"

Command line usage example:

    html2rss feed the_feed_config.yml --params id:42 foo:bar

See a Ruby example

    Html2rss.feed(
      channel: { url: 'http://domainname.tld/whatever/%<id>s.html' },
      headers: { 'X-Something': '%<foo>s' },
      params: { id: 42, foo: 'bar' }
    )
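The %<id>s placeholders follow the named-reference syntax of Ruby's format (also known as sprintf). The sketch below shows how such a template is filled in plain Ruby; the helper sketch_fill_params is hypothetical and stands in for whatever html2rss does internally with the params.

```ruby
# Hypothetical helper, not part of html2rss.
# A placeholder without a matching key raises a KeyError.
def sketch_fill_params(template, params)
  format(template, params)
end

params = { id: 42, foo: 'bar' }

puts sketch_fill_params('http://domainname.tld/whatever/%<id>s.html', params)
# => http://domainname.tld/whatever/42.html
puts sketch_fill_params('%<foo>s', params)
# => bar
```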

The stylesheets: Display the RSS feed nicely in a web browser

To display RSS feeds nicely in a web browser, you can:

  • add a plain old CSS stylesheet, or
  • use XSLT (eXtensible Stylesheet Language Transformations).

A web browser will apply these stylesheets and show the contents as described.

In a CSS stylesheet, you'd use element selectors to apply styles.

If you want to do more, you need to create an XSLT. XSLT allows you to use an HTML template and to freely design the information of the RSS, including using JavaScript and external resources.

You can add as many stylesheets and types as you like. Just add them to your global configuration.

Ruby: a stylesheet config example

    Html2rss.feed(
      stylesheets: [
        { href: '/relative/base/path/to/style.xls', media: :all, type: 'text/xsl' },
        { href: 'http://example.com/rss.css', media: :all, type: 'text/css' }
      ],
      channel: {},
      selectors: {}
    )

YAML: a stylesheet config example

    stylesheets:
      - href: "/relative/base/path/to/style.xls"
        media: "all"
        type: "text/xsl"
      - href: "http://example.com/rss.css"
        media: "all"
        type: "text/css"
    feeds:
      # ... omitted
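Browsers pick up stylesheets for XML documents through xml-stylesheet processing instructions placed before the root element. The sketch below builds such an instruction from one stylesheet entry. It is an assumption that the generated feed carries the entries in this form; the helper sketch_stylesheet_pi is hypothetical and the exact markup html2rss emits may differ.

```ruby
require 'cgi'

# Hypothetical helper, not part of html2rss: renders one stylesheet
# entry as an xml-stylesheet processing instruction.
def sketch_stylesheet_pi(href:, type:, media: 'all')
  attributes = { href: href, type: type, media: media }.map do |name, value|
    %(#{name}="#{CGI.escapeHTML(value.to_s)}")
  end

  "<?xml-stylesheet #{attributes.join(' ')}?>"
end

puts sketch_stylesheet_pi(href: 'http://example.com/rss.css', type: 'text/css')
# => <?xml-stylesheet href="http://example.com/rss.css" type="text/css" media="all"?>
```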

Recommended further readings:

Store feed configuration in YAML file

This step is not required to work with this gem, but it is helpful when you plan to use the CLI or html2rss-web.

First, create a YAML file, e.g. feeds.yml. This file will contain your feed configs under the key feeds. Everything you specify outside of this key will be applied to every feed you're building.

Example:

    headers:
      "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_1 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.0 Mobile/14E304 Safari/602.1"
      "Accept": "text/html"
    feeds:
      myfeed:
        channel:
        selectors:
        auto_source:
      myotherfeed:
        headers:
        strategy:
        channel:
        selectors:

Your feed configs go below feeds.

Find a full example of a feeds.yml at spec/fixtures/feeds.test.yml.
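How global settings and a single feed config might combine can be sketched with Ruby's YAML library. This is an assumption-laden illustration: the helper sketch_feed_config is hypothetical and uses a shallow merge in which the feed's own keys win, while Html2rss.config_from_yaml_file may merge differently (for example, by combining headers).

```ruby
require 'yaml'

# Hypothetical helper, not part of html2rss: applies everything outside
# of `feeds` to the named feed, letting the feed's own keys win.
def sketch_feed_config(yaml, feed_name)
  config = YAML.safe_load(yaml)
  global = config.reject { |key, _value| key == 'feeds' }
  global.merge(config.fetch('feeds').fetch(feed_name))
end

yaml = <<~YAML
  headers:
    Accept: text/html
  feeds:
    myfeed:
      channel:
        url: https://example.com
YAML

p sketch_feed_config(yaml, 'myfeed')
```

The resulting Hash contains both the global headers and the feed's channel, which is the shape a single feed config has.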

If you prefer to have a single feed defined in a YAML file, just omit the feeds key. Check out single.test.yml for an example.

Now you can build your feeds like this:

Build feeds in Ruby

    require 'html2rss'

    myfeed = Html2rss.config_from_yaml_file('feeds.yml', 'myfeed')
    Html2rss.feed(myfeed)

    myotherfeed = Html2rss.config_from_yaml_file('feeds.yml', 'myotherfeed')
    Html2rss.feed(myotherfeed)

    single = Html2rss.config_from_yaml_file('single.test.yml')
    Html2rss.feed(single)

Build feeds on the command line

    html2rss feed feeds.yml myfeed
    html2rss feed feeds.yml myotherfeed
    html2rss feed single.test.yml

Generating a feed with Ruby

You can also install it as a dependency in your Ruby project:

  1. Add this line to your Gemfile: gem 'html2rss'
  2. Then execute: bundle
  3. In your code: require 'html2rss'

Here's a minimal working example using Ruby:

    require 'html2rss'

    rss = Html2rss.feed(
      channel: { url: 'https://stackoverflow.com/questions' },
      auto_source: {}
    )

    puts rss

Alternatively, instead of auto_source, provide selectors (you can also use both simultaneously):

    require 'html2rss'

    rss = Html2rss.feed(
      channel: { url: 'https://stackoverflow.com/questions' },
      selectors: {
        items: { selector: '#hot-network-questions > ul > li' },
        title: { selector: 'a' },
        url: { selector: 'a', extractor: 'href' }
      }
    )

    puts rss

Gotchas and tips & tricks

  • Check that the channel URL does not redirect to a mobile page with a different markup structure.
  • Do not rely on your web browser's developer console when using the standard strategy, because that strategy does not execute JavaScript. In such cases, fiddling with curl and pup to find the selectors seems efficient (curl URL | pup).
  • CSS selectors are versatile. Here's an overview.

Contributing

Find ideas for what to contribute in:

  1. https://github.com/orgs/html2rss/discussions
  2. the issues tracker: https://github.com/html2rss/html2rss/issues

To submit changes:

  1. Fork this repo (https://github.com/html2rss/html2rss/fork)
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Implement and commit your changes (git commit -am 'feat: add XYZ')
  4. Push to the branch (git push origin my-new-feature)
  5. Create a new Pull Request using the GitHub web UI

Development Helpers

  1. bin/setup: installs dependencies and sets up the development environment.
  2. For a modern Ruby development experience: install ruby-lsp and integrate it into your IDE.

For example: Ruby in Visual Studio Code.

