This repository was archived by the owner on Jun 27, 2023. It is now read-only.

appifyhub/goosePublic archive

forked fromGravityLabs/goose

NotificationsYou must be signed in to change notification settings
Fork7
Star34

HTML Content (Article) Extractor for Android - fork of Goose by Gravity Labs

License

Apache-2.0 license

34 stars 320 forks Branches Tags Activity

Star

Notifications

You must be signed in to change notification settings

Branches Tags

Folders and files

Name		Name	Last commit message	Last commit date
Latest commit History 235 Commits
app		app
goose		goose
gradle		gradle
screenshots		screenshots
.gitignore		.gitignore
LICENSE		LICENSE
NOTICE		NOTICE
README.md		README.md
build.gradle		build.gradle
gradle.properties		gradle.properties
gradlew		gradlew
gradlew.bat		gradlew.bat
settings.gradle		settings.gradle

Repository files navigation

Goose for Android

What's this?

Goose is an article extractor for web pages. This means that the algorithm is capable of determining where to look for relevant article information from a website,properly extracts "interesting" data, picks out the best images from the page and determines a confidence factor for the top-picked image based on various metainformation from the given page.
The command line Goose library (written in Scala) is licensed by Gravity.com under Apache 2.0 license, see the LICENSE file for more details.

About Android specific stuff

Android implementation addresses the issues Scala imposes on Android - such as using external programs (image-magick), using outdated HTTP libraries (Apache),downloading images to any location on the disk (no open disk on Android), managing cache, SD cards, battery consumption, network issues, redirects, etc.Since version1.5.0, Goose for Android also uses the officialHttpURLConnection that makes it compliant with Google's requests about deprecating outdatedSSL libraries such as OpenSSL.
A sample of how to use Goose for Android can be found inDemoActivity.java in theapp folder source.

How it works

Document cleaning

When you pass a URL to Goose, the first thing it starts to do isclean up the document to make it easier to parse. It will go through the whole document and remove comments, common social network sharing elements, convertem and other tags to plain text nodes, try to convertdivs used as text nodes to paragraphs, as well as do a general document cleanup (spaces, new lines, quotes, encoding, etc).

Content / Images Extraction

When dealing with random article links you're bound to come across the craziest of HTML files. Some sites even like to include 2 or more HTML files per site. Goose uses a scoring system based on clustering of English stop words and other factors that you can find in the code. Goose also doesdescending scoring so as the nodes move down - the lower their scores become. The goal is to find the strongest grouping of text nodes inside a parent container and assume that's the relevant group of content as long as it's high enough (up) on the page.

Image extraction is the one that takes the longest. Trying to find the most important image on a page proved to be challenging and required to download all the images to manually inspect them using external tools (not all images are considered, Goose checks mime types, dimensions, byte sizes, compression quality, etc). Java's Image functions were just too unreliable and inaccurate. On Android, Goose uses theBitmapFactory class, it is well documented, tested, and is fast and accurate. Images are analyzed from the top node that Goose finds the content in, then comes a recursive run outwards trying to find good images - Goose also checks if those images are ads, banners or author logos, and ignores them if so.

Output Formatting

Once Goose has the top node where we think the content is, Goose will try to format the content of that node for the output. For example, for NLP-type applications, Goose's output formatter will just suck all the text and ignore everything else, and other (custom) extractors can be built to offer a more Flipboardy-type experience.

Requirements

Android 4.0 or later (Minimum SDK level 14)
Android Studio (to compile and use)
Eclipse is not supported

Getting Started

Download Android Studio
Launch Android Studio
Start your new project
Open your project's main Gradle file, in root directory (/build.gradle)
Make sure you are usingjcenter() in the repository block (mavenCentral() should work too)
Open your app module's Gradle file, for example/app/build.gradle
Independencies block, add the following line:compile 'me.angrybyte.goose:goose:1.8.0'
ClickTools/Android/Sync Project with Gradle Files or click on theSync icon in the top toolbar
ClickRun/Run 'app' to see if it's resolved correctly

This will run the app on your device. You may need to download a newer version of Gradle, which will be available in the Android Studio UI if compile fails.

Sample usage

Make sure you do this on a background thread (or in anAsyncTask). Some of the results are in thescreenshots directory in the project root.

Configurationconfig =newConfiguration(getCacheDir().getAbsolutePath());ContentExtractorextractor =newContentExtractor(config);Articlearticle =extractor.extractContent(url);if (article ==null) {Log.e(TAG,"Couldn't load the article, is your URL correct, is your Internet working?");return;}Stringdetails =article.getCleanedArticleText();if (details ==null) {Log.w(TAG,"Couldn't load the article text, the page is messy. Trying with page description...");details =article.getMetaDescription();}Bitmapphoto =null;if (article.getTopImage() !=null) {// topImage value holds an absolute URLphoto =GooseDownloader.getPhoto(article.getTopImage().getImageSrc());}

Data you might be interested in is always available in theArticle object:

Title: Page title, defined in the<title> tag
Meta Description: Page description, defined in the meta description tag of the<head> parent
Meta Keywords: Page keywords, defined in the meta keywords tag of the<head> parent
Top Image: A best-guess photo for the article. This may be an image from OpenGraph, meta, promo, or just the visually closest image to the relevant text
Cleaned Article Text: Parsed and prepared, HTML-cleaned article text from the given webpage. Usually the longest text contained in an<article> tag, but Goose checks for good<div> and<p> tags as well
Publish Date: A parsed date of publishing for the relevant article. You need a special date extraction in order for this to work. Check source for more information

Support

If you found an error while using the library, pleasefile an issue.All patches are encouraged, and may be submitted byforking this project andsubmitting a pull request through GitHub.Some more help can be found here:

StackOverflowhere orhere
Onmy blog

Help wanted

This is still not very clean and nice. It works, but it is not clean. Some of the things to address:

Comment code and fix javadoc issues. Some comments are... useless, obsolete or just boring
Remove code that produces warnings; the base repo had a lot of them and almost none were fixed
Generally clean up the codebase. Remove obsolete references and make the API readable
Add a task to export the library tojar archive

Tip: to comment all logs use:
find . -name "*\.java" | xargs grep -l 'Log\.' | xargs sed -i 's/Log\./;\/\/ Log\./g'
Tip: to uncomment all logs use:
find . -name "*\.java" | xargs grep -l 'Log\.' | xargs sed -i 's/;\/\/ Log\./Log\./g'

Any help is appreciated.

About

HTML Content (Article) Extractor for Android - fork of Goose by Gravity Labs

Releases4

Android, Version 1.6.1 Latest

Oct 5, 2016

+ 3 releases

Packages

No packages published

Languages

Java100.0%

Movatterモバイル変換

Navigation Menu

Search code, repositories, users, issues, pull requests...

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

License

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

Goose for Android

What's this?

About Android specific stuff

How it works

Requirements

Getting Started

Sample usage

Support

Help wanted

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases4

Packages

Languages

Movatterモバイル変換

License

appifyhub/goose

Folders and files

Latest commit

History

Repository files navigation

Goose for Android

What's this?

About Android specific stuff

How it works

Requirements

Getting Started

Sample usage

Support

Help wanted

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases4

Packages0

Languages

Packages