patternhelloworld/url-knifePublic

NotificationsYou must be signed in to change notification settings
Fork15
Star355

Extract and decompose (fuzzy) URLs (including emails, which are conceptually a part of URLs) in texts with Area-Pattern-based modularity

License

MIT, Unknown licenses found

Licenses found

355 stars 15 forks Branches Tags Activity

Star

Notifications

You must be signed in to change notification settings

Branches Tags

Folders and files

Name		Name	Last commit message	Last commit date
Latest commit History 78 Commits
dist		dist
public		public
src		src
test		test
.babelrc		.babelrc
.gitignore		.gitignore
.nvmrc		.nvmrc
.prettierrc		.prettierrc
.travis.yml		.travis.yml
LICENSE		LICENSE
README.md		README.md
gulpfile.babel.js		gulpfile.babel.js
license.txt		license.txt
package-lock.json		package-lock.json
package.json		package.json
tsconfig.json		tsconfig.json
webpack.config.babel.js		webpack.config.babel.js

Repository files navigation

Url-knife

Overview

Extract and decompose (fuzzy) URLs (including emails, which are conceptually a part of URLs) in texts withArea-Pattern-based modularity.

This library is currently being refactored into TypeScript, as it was originally developed in JavaScript.

URL knife

LIVE DEMO

Area-Pattern-Based Modularity

TheArea represents a designated section of content, such as general text, XML (HTML) areas, URL areas, or EMAIL areas. EachArea is associated with a specific set ofPatterns (regular expressions) tailored to its context.

Example:

In aTextArea (general plain text), the system applies a URL-specific regular expression to extract potential URLs.
Once the area is narrowed down to contain URLs,UrlArea logic is used, applying URL-specific patterns to decompose the URL into its components (e.g., protocol, domain, path, query parameters).

Enhanced Accuracy with Regular Expression Indexes:

To further improve accuracy, the system leverages theindex (oroffset) values from regular expressions. These indexes help pinpoint exact locations of matches within the text, ensuring precise extraction and minimizing false positives.

For example:

If aCommentArea is processed using its specific patterns, the system identifies indexes for matches within that area.
These indexes can then be used to exclude matched URLs from a broaderTextArea, ensuring only relevant URLs are processed and avoiding redundant or incorrect extractions.

Key Benefits:

This modular approach ensures that eachArea is processed efficiently with the most relevant and optimized regular expressions. By incorporating index-based matching, it enables robust, scalable, and highly accurate parsing for various content types while preventing conflicts between overlapping patterns.

Installation

For ES5 users, refer topublic/index.html.

<html><body><scriptsrc="../dist/url-knife.bundle.js"></script><--!OR!--><scriptsrc="https://cdn.jsdelivr.net/gh/patternknife/url-knife@4.1.6/dist/url-knife.bundle.min.js"></script></body></html>

For ES6 npm users, run 'npm install --save url-knife' in the console.(Requred Node v18.20.4)

import {TextArea, UrlArea, XmlArea} from 'url-knife';

For ES5, add Pattern before usage:

Pattern.UrlArea...

Syntax & Usage

Chapter 1. Normalize or parse one URL

Chapter 2. Extract all URLs or emails

Chapter 3. Extract URIs with certain names

Chapter 4. Extract all URLs in raw HTML or XML

Chapter 1. Normalize or parse one URL

The following two methods should be used for processing a single URL, not for multiple URLs within a text.(For handling multiple URLs, refer to Chapters 2 and 4.)

normalizeUrl vs parseUrl

If you need to parse a standard URL without any typos, it is safe to useparseUrl. However,normalizeUrl is designed to handle URLs that may contain human errors.

RunnormalizeUrl

/***@brief* Normalize an url with potential human errors (Intranet urls are not allowed.)*/varsample1=Pattern.UrlArea.normalizeUrl("htp/:/abcgermany.,def;:9094 #park//noon??abc=retry")varsample2=Pattern.UrlArea.normalizeUrl("'://abc.jppp:9091 /park/noon'")varsample3=Pattern.UrlArea.normalizeUrl("ss hd : /university,.acd. ;jpkp: 9091/adc??abc=.com")

Results

{"url":"htp/:/abcgermany.,def;:9094 #park//noon??abc=retry","normalizedUrl":"http://abcgermany.de:9094#park/noon?abc=retry","removedTailOnUrl":"","protocol":"http","onlyDomain":"abcgermany.de","onlyParams":"?abc=retry","onlyUri":"#park/noon","onlyUriWithParams":"#park/noon?abc=retry","onlyParamsJsn":{"abc":"retry"},"type":"domain","port":"9094"}{"url":"'://abc.jppp:9091 /park/noon'","normalizedUrl":"abc.jp:9091/park/noon","removedTailOnUrl":"'","protocol":null,"onlyDomain":"abc.jp","onlyParams":null,"onlyUri":"/park/noon'","onlyUriWithParams":"/park/noon'","onlyParamsJsn":null,"type":"domain","port":"9091"}{"url":"ss hd : /university,.acd. ;jpkp로 접속","normalizedUrl":"ssh://university.ac.jp","removedTailOnUrl":"","protocol":"ssh","onlyDomain":"university.ac.jp","onlyParams":null,"onlyUri":null,"onlyUriWithParams":null,"onlyParamsJsn":null,"type":"domain","port":null}

RunparseUrl

/***@brief* Parse an url with no potential human errors*/varurl=Pattern.UrlArea.parseUrl("xtp://gooppalgo.com/park/tree/?abc=1")

console.log()

{"url":"xtp://gooppalgo.com/park/tree/?abc=1","removedTailOnUrl":"","protocol":"xtp (unknown protocol)","onlyDomain":"gooppalgo.com","onlyParams":"?abc=1","onlyUri":"/park/tree/","onlyUriWithParams":"/park/tree/?abc=1","onlyParamsJsn":{"abc":"1"},"type":"domain","port":null}

Chapter 2. Extract all URLs or emails

The following methods are recommended to use in most cases.

extractAllUrls

vartextStr='http://[::1]:8000에서 http ://www.example.com/wpstyle/?p=364 is ok \n'+'HTTP://foo.com/blah_blah_(wikipedia) https://www.google.com/maps/place/USA/@36.2218457,... tnae1ver.com:8000on the internet  Asterisk\n '+'the packed1book.net. fakeshouldnotbedetected.url?abc=fake s5houl７十七日dbedetected.jp?japan=go&html=<span>가나다@pacbook.net</span>; abc.com/ad/fg/?kk=5 abc@daum.net'+'Have you visited http://goasidaio.ac.kr?abd=5안녕하세요?5...,.&kkk=5rk.,, '+'http://✪df.ws/123\n'+'http://142.42.1.1:8080/\n'+'http://-.~_!$&\'()*+,;=:%40:80%2f::::::@example.com '+'Have <b>you</b> visited goasidaio.ac.kr?abd=5hell0?5...&kkk=5rk.,. ';/*** @brief*Distillallurlsfromnormaltext* @authorAndrewKang* @paramtextStrstringrequired* @paramnoProtocolJsnobject*default :{'ipV4' :false,'ipV6' :false,'localhost' :false,'intranet' :false}varurls=Pattern.TextArea.extractAllUrls(textStr,{'ipV4' :true,'ipV6' :false,'localhost' :false,'intranet' :true})

extractAllEmails

        /**         * @brief         * Distill all emails from normal text         * @author Andrew Kang         * @param textStr string required         * @param prefixSanitizer boolean (default : false)         * @return array         */  var emails = Pattern.TextArea.extractAllEmails(textStr, true)

console.log()

You may be wondering what the 'pass' property below means. If 'pass' is true, that is the email pattern is strictly true following RFC rules.

 [{"value": {"email":"가나다@apacbook.ac.kr","removedTailOnEmail":null,"type":"domain"    },"area":"text","index": {"start":222,"end":240    },"pass":false  },  {"value": {"email":"adssd@asdasd.ac.jp","removedTailOnEmail":null,"type":"domain","removedTailOnUrl":"..."    },"area":"text","index": {"start":242,"end":263    },"pass":true  }]

LIVE DEMO

Chapter 3. Extract URIs with certain names

varsampleText='https://google.com/abc/777?a=5&b=7 abc/def 333/kak abc/55에서 abc/53 abc/533/ka abc/53a/ka /123a/abc/556/dd /abc/123?a=5&b=tkt /xyj/asff'+'a333/kak  nice/guy/ bad/or/nice/guy ssh://nice.guy.com/?a=dkdfl';/**     *@brief     * Distill uris with certain names from normal text     *@author Andrew Kang     *@param textStr string required     *@param uris array required     * for example, [['a','b'], ['c','d']]     * If you use {number}, this means 'only number' ex) [['a','{number}'], ['c','d']]     *@param endBoundary boolean (default : false)     *@return array     */varuris=Pattern.TextArea.extractCertainUris(sampleText,[['{number}','kak'],['nice','guy'],['abc','{number}']],true)// 'If endBoundary is set to false, more uris are detected.'// This detects all URIs containing '{number}/kak' or nice/guy' or 'abc/{number}'

console.log()

[{"uriDetected":{"value":{"url":"/abc/777?a=5&b=7","removedTailOnUrl":"","protocol":null,"onlyDomain":"","onlyParams":"?a=5&b=7","onlyUri":"/abc/777","onlyUriWithParams":"/abc/777?a=5&b=7","onlyParamsJsn":{"a":"5","b":"7"},"type":"domain","port":null},"area":"text","index":{"start":18,"end":34}},"inWhatUrl":{"value":{"url":"https://google.com/abc/777?a=5&b=7","removedTailOnUrl":"","protocol":"https","onlyDomain":"google.com","onlyParams":"?a=5&b=7","onlyUri":"/abc/777","onlyUriWithParams":"/abc/777?a=5&b=7","onlyParamsJsn":{"a":"5","b":"7"},"type":"domain","port":null},"area":"text","index":{"start":0,"end":34}}},{"uriDetected":{"value":{"url":"333/kak","removedTailOnUrl":"","protocol":null,"onlyDomain":null,"onlyParams":null,"onlyUri":"333/kak","onlyUriWithParams":"333/kak","onlyParamsJsn":null,"type":"uri","port":null},"area":"text","index":{"start":43,"end":51}},"inWhatUrl":undefined},{"uriDetected":{"value":{"url":"abc/53","removedTailOnUrl":"","protocol":null,"onlyDomain":null,"onlyParams":null,"onlyUri":"abc/53","onlyUriWithParams":"abc/53","onlyParamsJsn":null,"type":"uri","port":null},"area":"text","index":{"start":60,"end":67}},"inWhatUrl":undefined},{"uriDetected":{"value":{"url":"abc/533/ka","removedTailOnUrl":"","protocol":null,"onlyDomain":null,"onlyParams":null,"onlyUri":"abc/533/ka","onlyUriWithParams":"abc/533/ka","onlyParamsJsn":null,"type":"uri","port":null},"area":"text","index":{"start":67,"end":77}},"inWhatUrl":undefined},{"uriDetected":{"value":{"url":"/123a/abc/556/dd","removedTailOnUrl":"","protocol":null,"onlyDomain":null,"onlyParams":null,"onlyUri":"/123a/abc/556/dd","onlyUriWithParams":"/123a/abc/556/dd","onlyParamsJsn":null,"type":"uri","port":null},"area":"text","index":{"start":89,"end":105}},"inWhatUrl":undefined},{"uriDetected":{"value":{"url":"/abc/123?a=5&b=tkt","removedTailOnUrl":"","protocol":null,"onlyDomain":null,"onlyParams":"?a=5&b=tkt","onlyUri":"/abc/123","onlyUriWithParams":"/abc/123?a=5&b=tkt","onlyParamsJsn":{"a":"5","b":"tkt"},"type":"uri","port":null},"area":"text","index":{"start":106,"end":124}},"inWhatUrl":undefined},{"uriDetected":{"value":{"url":"nice/guy","removedTailOnUrl":"/","protocol":null,"onlyDomain":null,"onlyParams":null,"onlyUri":"nice/guy","onlyUriWithParams":"nice/guy","onlyParamsJsn":null,"type":"uri","port":null},"area":"text","index":{"start":144,"end":153}},"inWhatUrl":undefined},{"uriDetected":{"value":{"url":"/or/nice/guy","removedTailOnUrl":"","protocol":null,"onlyDomain":null,"onlyParams":null,"onlyUri":"/or/nice/guy","onlyUriWithParams":"/or/nice/guy","onlyParamsJsn":null,"type":"uri","port":null},"area":"text","index":{"start":157,"end":170}},"inWhatUrl":null}]

Chapter 4. Extract all URLs in raw HTML or XML

// The sample of 'XML (HTML)'varxmlStr='en.wikipedia.org/wiki/Wikipedia:About\n'+'<body><p>packed1book.net?user[name][first]=tj&user[name][last]=holowaychuk</p>\n'+'fakeshouldnotbedetected.url?abc=fake -s5houl７十七日dbedetected.jp?japan=go- '+'plus.google.co.kr0에서.., \n'+'https://plus.google.com/+google\n'+'https://www.google.com/maps/place/USA/@36.2218457,...'+'<img style=\' = > float : none ; height: 200px;max-width: 50%;margin-top : 3%\' alt="undefined" src="http://www.aaa가가.com/image/showWorkOrderImg?fileName=12345.png"/>\n'+'<!--how about adackedbooked.co.kr-the site?  请发邮件给我abc件给@navered.com ssh://www.aaa가.com" <p >--邮件给aa件给@daum.net</p> www.naver.com\n  <p></p>-->  "abc@daum.net"로 보내주세요. '+'-gigi.dau.ac.kr?mac=10 -dau.ac.kr?mac=10 <p>abcd@daum.co.kr에서 가나다@pacbook.net<span>Please align the paper to the left.</span>&nbsp;</p>\n'+'<p> 구루.com <img src="/image/showWorkOrderImg?fileName=123456.png" alt="undefined" abc/></p>\n'+'http: //ne1ver.com:8000?abc=1&dd=5 localhost:80 estonia.ee/ estonia.ee? <p> https://flaviocopes.com/how-to-inspect-javascript-object/ ※Please ask 203.35.33.555:8000 if you have any issues! ※&nbsp;&nbsp;&nbsp;&nbsp;</p></body> Have you visited goasidaioaaa.ac.kr';varurls=PatternExtractor.XmlArea.extractAllUrls(xmlStr);

console.log()

[// Not all listed{"value":{"url":"packed1book.net?user[name][first]=tj&user[name][last]=holowaychuk","removedTailOnUrl":"","protocol":null,"onlyDomain":"packed1book.net","onlyParams":"?user[name][first]=tj&user[name][last]=holowaychuk","onlyUri":null,"onlyUriWithParams":"?user[name][first]=tj&user[name][last]=holowaychuk","onlyParamsJsn":{"user":{"name":{"first":"tj","last":"holowaychuk"}}},"type":"domain","port":null},"area":"text"},{"value":{"url":"adackedbooked.co.kr","removedTailOnUrl":"","protocol":null,"onlyDomain":"adackedbooked.co.kr","onlyParams":null,"onlyUri":null,"onlyUriWithParams":null,"onlyParamsJsn":null,"type":"domain","port":null},"area":"comment"}.....]

About

Extract and decompose (fuzzy) URLs (including emails, which are conceptually a part of URLs) in texts with Area-Pattern-based modularity

Topics

url-parsing url-parser email-parsing uri-template url-validation email-extractor email-parser pre-processing url-extractor url-normalization url-normalizer

Resources

Readme

License

MIT, Unknown licenses found

Releases4

[v4.1.6] - Area-Pattern-based modularity Latest

Jan 26, 2025

+ 3 releases

Packages

No packages published

Movatterモバイル変換

License

Licenses found

patternhelloworld/url-knife

Folders and files

Latest commit

History

Repository files navigation

Url-knife

Overview

URL knife

Area-Pattern-Based Modularity

Example:

Enhanced Accuracy with Regular Expression Indexes:

Key Benefits:

Installation

Syntax & Usage

Chapter 1. Normalize or parse one URL

normalizeUrl vs parseUrl

RunnormalizeUrl

Results

RunparseUrl

console.log()

Chapter 2. Extract all URLs or emails

The following methods are recommended to use in most cases.

extractAllUrls

extractAllEmails

console.log()

You may be wondering what the 'pass' property below means. If 'pass' is true, that is the email pattern is strictly true following RFC rules.

Chapter 3. Extract URIs with certain names

console.log()

Chapter 4. Extract all URLs in raw HTML or XML

console.log()

About

Topics

Resources

License

Licenses found

Uh oh!

Stars

Watchers

Forks

Releases4

Packages0

Uh oh!

Contributors3

Uh oh!

Languages

Run`normalizeUrl`

Run`parseUrl`

Packages