Ksoup is a Kotlin Multiplatform library for working with HTML and XML. It's a port of the renowned Java library Jsoup.
fleeksoft/ksoup
Ksoup is a Kotlin Multiplatform library for working with real-world HTML and XML. It's a port of the renowned Java library, jsoup, and offers an easy-to-use API for URL fetching, data parsing, extraction, and manipulation using DOM and CSS selectors.
Ksoup implements the WHATWG HTML5 specification, parsing HTML to the same DOM as modern browsers do, but with support for Android, JVM, and native platforms.
- Scrape and parse HTML from a URL, file, or string
- Find and extract data using DOM traversal or CSS selectors
- Manipulate HTML elements, attributes, and text
- Clean user-submitted content against a safe-list to prevent XSS attacks
- Output tidy HTML
Ksoup is adept at handling all varieties of HTML found in the wild.
Include the dependency in commonMain, using the latest released version in place of <version>.
Ksoup is published in four variants. Pick the one that suits your needs and start building!
Lightweight variant: Use this if you only need to parse HTML from a string.
implementation("com.fleeksoft.ksoup:ksoup:<version>")
This variant uses kotlinx-io for I/O and Ktor 3 for networking.
// Ksoup.parseFile, Ksoup.parseSource
implementation("com.fleeksoft.ksoup:ksoup-kotlinx:<version>")
// Optional: include only if you need network request functions such as
// Ksoup.parseGetRequest, Ksoup.parseSubmitRequest, and Ksoup.parsePostRequest
implementation("com.fleeksoft.ksoup:ksoup-network:<version>")
This variant uses korlibs-io for I/O and networking.
// Ksoup.parseFile, Ksoup.parseStream
implementation("com.fleeksoft.ksoup:ksoup-korlibs:<version>")
// Optional: include only if you need network request functions such as
// Ksoup.parseGetRequest, Ksoup.parseSubmitRequest, and Ksoup.parsePostRequest
implementation("com.fleeksoft.ksoup:ksoup-network-korlibs:<version>")
This variant uses kotlinx-io for I/O and Ktor 2 for networking.
// Ksoup.parseFile, Ksoup.parseSource
implementation("com.fleeksoft.ksoup:ksoup-kotlinx:<version>")
// Optional: include only if you need network request functions such as
// Ksoup.parseGetRequest, Ksoup.parseSubmitRequest, and Ksoup.parsePostRequest
implementation("com.fleeksoft.ksoup:ksoup-network-ktor2:<version>")
This variant uses okio for I/O and Ktor 2 for networking.
implementation("com.fleeksoft.ksoup:ksoup-okio:<version>")
// Optional: include only if you need network request functions such as
// Ksoup.parseGetRequest, Ksoup.parseSubmitRequest, and Ksoup.parsePostRequest
implementation("com.fleeksoft.ksoup:ksoup-network-ktor2:<version>")
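As a rough sketch of where these lines go, the fragment below shows one way to declare a variant in the commonMain source set of a build.gradle.kts file. It assumes the Kotlin Multiplatform Gradle plugin is already applied and that the Kotlin version is recent enough to offer the commonMain.dependencies shorthand; the artifact shown is the lightweight variant, and <version> remains a placeholder.

```kotlin
// build.gradle.kts (fragment, not a complete build script)
// Assumption: the Kotlin Multiplatform plugin is applied elsewhere in this file.
kotlin {
    sourceSets {
        commonMain.dependencies {
            // Swap in any other variant listed above if you need file or network parsing.
            implementation("com.fleeksoft.ksoup:ksoup:<version>")
        }
    }
}
```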
Ksoup supports Charsets
- Standard charsets are already supported by Ksoup IO, but for extended charsets, please add com.fleeksoft.charset:charset-ext. For more details, visit the Charsets Documentation.
For API documentation, you can check Jsoup. Most of the APIs work without any changes.
val html = "<html><head><title>One</title></head><body>Two</body></html>"
val doc: Document = Ksoup.parse(html = html)
println("title => ${doc.title()}") // One
println("bodyText => ${doc.body().text()}") // Two
This snippet demonstrates how to use Ksoup.parse for parsing an HTML string and extracting the title and body text.
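Beyond the title and body, the same parsed document can be queried with CSS selectors. The following is a minimal sketch, assuming the jsoup-style select, text, and attr calls carry over unchanged and that Ksoup lives in the com.fleeksoft.ksoup package; the helper name linkSummary and the sample markup are made up for illustration.

```kotlin
import com.fleeksoft.ksoup.Ksoup

// Hypothetical helper: lists every link in the markup as "text -> href".
fun linkSummary(html: String): List<String> {
    val doc = Ksoup.parse(html = html)
    // "a[href]" matches anchor elements that carry an href attribute.
    return doc.select("a[href]").map { link ->
        "${link.text()} -> ${link.attr("href")}"
    }
}

fun main() {
    val html = """<p>See <a href="https://example.com/docs">the docs</a> and <a href="/faq">FAQ</a>.</p>"""
    linkSummary(html).forEach(::println)
}
```

Returning plain strings keeps the sketch easy to check; real code would more likely return the matched elements or a small data class.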
// Please note that the com.fleeksoft.ksoup:ksoup-network library is required for Ksoup.parseGetRequest.
val doc: Document = Ksoup.parseGetRequest(url = "https://en.wikipedia.org/") // suspend function
// or
val doc: Document = Ksoup.parseGetRequestBlocking(url = "https://en.wikipedia.org/")
println("title: ${doc.title()}")
val headlines: Elements = doc.select("#mp-itn b a")
headlines.forEach { headline: Element ->
    val headlineTitle = headline.attr("title")
    val headlineLink = headline.absUrl("href")
    println("$headlineTitle => $headlineLink")
}
val doc: Document = Ksoup.parse(xml, parser = Parser.xmlParser())
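To make the XML case more concrete, here is a small sketch that parses an XML string with Parser.xmlParser() and reads values back with a selector. It assumes Parser is importable from com.fleeksoft.ksoup.parser, mirroring jsoup's package layout; the function name itemTitles and the sample feed are hypothetical.

```kotlin
import com.fleeksoft.ksoup.Ksoup
import com.fleeksoft.ksoup.parser.Parser

// Hypothetical helper: collects the text of every <title> directly inside an <item>.
fun itemTitles(xml: String): List<String> {
    // The XML parser keeps the input structure as-is instead of applying HTML rules.
    val doc = Ksoup.parse(html = xml, parser = Parser.xmlParser())
    return doc.select("item > title").map { it.text() }
}

fun main() {
    val xml = "<feed><item><title>First</title></item><item><title>Second</title></item></feed>"
    println(itemTitles(xml))
}
```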
// Please note that the com.fleeksoft.ksoup:ksoup-network library is required for Ksoup.parseGetRequest.
val doc: Document = Ksoup.parseGetRequest(url = "https://en.wikipedia.org/") // suspend function
val metadata: Metadata = Ksoup.parseMetaData(element = doc) // suspend function
// or
val metadata: Metadata = Ksoup.parseMetaData(html = HTML)
println("title: ${metadata.title}")
println("description: ${metadata.description}")
println("ogTitle: ${metadata.ogTitle}")
println("ogDescription: ${metadata.ogDescription}")
println("twitterTitle: ${metadata.twitterTitle}")
println("twitterDescription: ${metadata.twitterDescription}")
// Check com.fleeksoft.ksoup.model.MetaData for more fields
In the Wikipedia headline example above, Ksoup.parseGetRequest fetches and parses HTML content from Wikipedia, extracting and printing news headlines and their corresponding links.
- Ksoup.parse(html: String, baseUri: String = ""): Document
- Ksoup.parse(html: String, parser: Parser, baseUri: String = ""): Document
- Ksoup.parse(reader: Reader, parser: Parser, baseUri: String = ""): Document
- Ksoup.clean( bodyHtml: String, safelist: Safelist = Safelist.relaxed(), baseUri: String = "", outputSettings: Document.OutputSettings? = null): String
- Ksoup.isValid(bodyHtml: String, safelist: Safelist = Safelist.relaxed()): Boolean
- Ksoup.parseInput(input: InputStream, baseUri: String, charsetName: String? = null, parser: Parser = Parser.htmlParser()) from (ksoup-io, ksoup-okio, ksoup-kotlinx, ksoup-korlibs)
- Ksoup.parseFile from (ksoup-okio, ksoup-kotlinx, ksoup-korlibs)
- Ksoup.parseSource from (ksoup-okio, ksoup-kotlinx)
- Ksoup.parseStream from (ksoup-korlibs)
- Suspend functions
  - Ksoup.parseGetRequest
  - Ksoup.parseSubmitRequest
  - Ksoup.parsePostRequest
- Blocking functions
  - Ksoup.parseGetRequestBlocking
  - Ksoup.parseSubmitRequestBlocking
  - Ksoup.parsePostRequestBlocking
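Of the functions listed above, Ksoup.clean and Ksoup.isValid are the ones aimed at user-submitted content. The sketch below shows one plausible way to use them, assuming Safelist is importable from com.fleeksoft.ksoup.safety and that Safelist.basic() was ported from jsoup; the helper name sanitizeComment is made up for illustration.

```kotlin
import com.fleeksoft.ksoup.Ksoup
import com.fleeksoft.ksoup.safety.Safelist

// Hypothetical helper: strips anything the basic safe-list does not allow.
fun sanitizeComment(untrusted: String): String =
    Ksoup.clean(bodyHtml = untrusted, safelist = Safelist.basic())

fun main() {
    val untrusted = "<p>Hi<script>alert(1)</script></p>"

    // isValid reports whether the input already satisfies the safe-list.
    println(Ksoup.isValid(bodyHtml = untrusted, safelist = Safelist.basic()))

    // clean returns a sanitized copy; the script element is not part of the basic safe-list.
    println(sanitizeComment(untrusted))
}
```

Passing the safe-list explicitly, rather than relying on the Safelist.relaxed() default shown in the signatures, makes the allowed tag set visible at the call site.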
For further documentation, please check here: Jsoup
Ksoup vs. Jsoup Benchmarks: Parsing & Selecting 448KB HTML File test.tx
Ksoup is an open source project, a Kotlin Multiplatform port of jsoup, distributed under the Apache License, Version 2.0. The source code of Ksoup is available on GitHub.
For questions about usage and general inquiries, please refer to GitHub Discussions.
If you wish to contribute, please read the Contributing Guidelines.
To report any issues, visit our GitHub issues. Please check for duplicates before submitting a new issue.
Copyright 2024 FLEEK SOFT

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.