hatena/extract-content-javascript

ExtractContentJS

A JavaScript library for extracting the main content of a web page.

What it can do

Main content extraction
Tag suggestion

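Files

In most cases, loading the following files in this order is enough:

lib/lib.js: shared code
lib/extract-content.js: main content extraction

Running make package in the repository root generates extract-content-all.js, which is these files concatenated.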
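
The load order above can be illustrated with a small concatenation sketch. This is not the Makefile's recipe, which is not shown here; concatSources and SOURCES are hypothetical names, and the file reader is passed in as a parameter so the sketch does not assume a particular runtime.

```javascript
// Load order from the README: shared code first, then the extractor.
var SOURCES = ['lib/lib.js', 'lib/extract-content.js'];

// Hypothetical helper, not part of the repository.
// Reads each path with the supplied readFile function and joins the
// results in the given order, separated by newlines.
function concatSources(paths, readFile) {
    return paths.map(function (p) {
        return readFile(p);
    }).join('\n');
}
```

Under Node.js, passing a reader built on fs.readFileSync(p, 'utf8') would presumably give a result similar to extract-content-all.js, although the real build may differ in details.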

For a closer look at actual usage, see:

sketch/extract-content.test.js: main content extraction test
lib/scoring-words.js: tag scoring (sample)

Usage

Content extraction interface

Use this when you only want to extract the main content, or when you want to choose the handlers yourself.

ExtractContentJS.LayeredExtractor

  var ex = new ExtractContentJS.LayeredExtractor();
  //ex.addHandler( ex.factory.getHandler('Description') );
  //ex.addHandler( ex.factory.getHandler('Scraper'));
  //ex.addHandler( ex.factory.getHandler('GoogleAdsence') );
  ex.addHandler( ex.factory.getHandler('Heuristics') );
  var res = ex.extract(document);
  if (res.isSuccess) {
      res.url;     // URL string
      res.title;   // title string
      res.engine;  // the handler that performed the extraction
      res.content; // instance of the content class (described below)
  }

At present, Heuristics is the only handler that has been implemented.
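
The calls above could be wrapped in a small helper that also handles the failure case. This is a sketch, not part of the library: extractMainText is a hypothetical name, and the library namespace and the document are passed in as parameters so the function does not depend on browser globals. It uses only the members listed in this README.

```javascript
// Hypothetical helper, not part of ExtractContentJS.
// lib is the ExtractContentJS namespace, doc is the document to analyse.
function extractMainText(lib, doc) {
    var ex = new lib.LayeredExtractor();
    // Heuristics is the only handler the README reports as implemented.
    ex.addHandler(ex.factory.getHandler('Heuristics'));
    var res = ex.extract(doc);
    if (!res.isSuccess) {
        return null; // extraction failed, nothing to report
    }
    return {
        url: res.url,
        title: res.title,
        text: res.content.toString()
    };
}
```

In a page that has loaded the library, the call would be extractMainText(ExtractContentJS, document).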

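Content class

  content.asLeaves();       // returns an array of leaf class instances (described below) for the leaf nodes judged to be main content
  content.asNode();         // returns the deepest common ancestor of all the leaf nodes
  content.asTextFragment(); // returns the concatenated text of the nodes in asLeaves()
  content.toString();       // returns the textContent of asNode()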
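
To make "deepest common ancestor" concrete, here is one way such a node could be found by comparing ancestor chains. This is an illustration of the concept only, not the library's implementation; ancestorsOf and deepestCommonAncestor are hypothetical names, and the only thing assumed about a node is a parentNode property, as on DOM nodes.

```javascript
// Illustration only, not taken from ExtractContentJS.
// Returns the chain from the root down to the node itself.
function ancestorsOf(node) {
    var chain = [];
    for (var n = node; n; n = n.parentNode) {
        chain.unshift(n); // root first, the node itself last
    }
    return chain;
}

// Returns the deepest node that appears in every node's ancestor chain,
// or null when the list is empty or the nodes share no ancestor.
function deepestCommonAncestor(nodes) {
    if (nodes.length === 0) {
        return null;
    }
    var common = ancestorsOf(nodes[0]);
    for (var i = 1; i < nodes.length; i++) {
        var chain = ancestorsOf(nodes[i]);
        var j = 0;
        while (j < common.length && j < chain.length && common[j] === chain[j]) {
            j++;
        }
        common = common.slice(0, j); // keep only the shared prefix
    }
    return common.length ? common[common.length - 1] : null;
}
```

In this sketch a single node counts as its own ancestor, so a one-element list returns that element; whether asNode() behaves the same way is not stated in the README.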

Leaf class

  leaf.node;  // the leaf node
  leaf.depth; // depth of the node, counted from body
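
As a usage sketch, the leaves of a content object could be ordered by depth, for example to inspect the shallowest matches first. leavesByDepth is a hypothetical helper, not part of the library; it relies only on asLeaves() and on the depth property described above.

```javascript
// Hypothetical helper, not part of ExtractContentJS.
// Returns a new array of leaves ordered from shallowest to deepest;
// the array returned by asLeaves() is left untouched.
function leavesByDepth(content) {
    return content.asLeaves().slice().sort(function (a, b) {
        return a.depth - b.depth;
    });
}
```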

AUTHOR

INA Lintaro

Copyright

Copyright of the original implementation

labs.cybozu.co.jp/blog/nakatani/2007/09/web_1.html

LICENCE

MIT License
