Thursday, June 30, 2011

PHP regex to match any link in text

// Pattern outline:
// 1. optionally match http:// or https://
// 2. match zero or more subdomains: xyz.
// 3. match the domain name, then a dot
// 4. then match one of:
//    - a two-part country extension: .com.my, .org.my, .co.sg, ...
//    - a generic TLD: .com, .org, .net, ...
//    - a two-letter country code: .us, .my, ...
// 5. match the URI, for example: /xxx
// 6. also try to match a trailing /
// 7. also try to match a final name: /xxx/lastnamehere
$sSubDomainSet = "[^.,\:\+\s]*\."; // xyz.
$sPatSubDomain = "(" . $sSubDomainSet . ")*";
$sPatCountryDomain = "((com|org|net|name|mil|gov|co|info)\.[a-z]{2})";
$sPatTLD = "(com|org|net|name|mil|gov|info)"; //top level domain tld
$sPatCountryTLD = "([a-z]{2})";
$sURIChar = "[a-z0-9\.\?\=\&\(\&amp\;)\-\_\+\%]";
$sPatURI = "([\/]*" . $sURIChar . "*)?";
$sPatURISlash = "([\/])?";
//match last uri name, example domain.com/email/xxx (match xxx)
$sPatLastName = "(" . $sURIChar . "*)";
$sPattern = "/(" . "((http:\/\/)|(https:\/\/))?" . $sPatSubDomain . $sSubDomainSet .
    "(" . $sPatCountryDomain . "|" . $sPatTLD . "|" . $sPatCountryTLD . ")" .
    $sPatURI . $sPatURISlash . $sPatLastName .
    ")/is";
preg_match_all($sPattern, $sContent, $matches, PREG_PATTERN_ORDER);
$sResult = ""; // stays empty when $sContent contains no link
if (count($matches[1]) > 0) {
    $sResult = $matches[1][0]; // first link found in $sContent
}
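
To try the snippet above on its own, here is a minimal sketch that wraps the same pattern pieces in a function. The name findFirstLink is made up for this example and is not part of the original code, and the ?string return type assumes PHP 7.1 or newer. It uses preg_match instead of preg_match_all, since only the first link is wanted.

```php
<?php
// Hypothetical helper (name invented for this sketch): returns the first
// link found in $sContent, or null when the text contains no link.
function findFirstLink(string $sContent): ?string
{
    // Same pieces as the snippet above, in single quotes so the
    // backslashes reach the regex engine unchanged.
    $sSubDomainSet     = '[^.,\:\+\s]*\.'; // xyz.
    $sPatSubDomain     = '(' . $sSubDomainSet . ')*';
    $sPatCountryDomain = '((com|org|net|name|mil|gov|co|info)\.[a-z]{2})';
    $sPatTLD           = '(com|org|net|name|mil|gov|info)';
    $sPatCountryTLD    = '([a-z]{2})';
    $sURIChar          = '[a-z0-9\.\?\=\&\(\&amp\;)\-\_\+\%]';
    $sPatURI           = '([\/]*' . $sURIChar . '*)?';
    $sPatURISlash      = '([\/])?';
    $sPatLastName      = '(' . $sURIChar . '*)';

    $sPattern = '/(' . '((http:\/\/)|(https:\/\/))?' . $sPatSubDomain . $sSubDomainSet .
        '(' . $sPatCountryDomain . '|' . $sPatTLD . '|' . $sPatCountryTLD . ')' .
        $sPatURI . $sPatURISlash . $sPatLastName .
        ')/is';

    // preg_match stops at the first match; group 1 is the whole link.
    if (preg_match($sPattern, $sContent, $aMatch) === 1) {
        return $aMatch[1];
    }
    return null;
}

$aSamples = array(
    'this is a link https://www.yahoo.com/',
    'this is a link yahoo.com/email/compose,gotta love it',
    'this is a link yahoo.co.sg',
    'no link in this text',
);
foreach ($aSamples as $sSample) {
    $sLink = findFirstLink($sSample);
    echo 'Original: ' . $sSample . ', result: ' . ($sLink === null ? '(none)' : $sLink) . "\n";
}
```

Returning null for "no link" keeps the caller from reading an unset variable, which the bare snippet would do when nothing matches.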

Example Results:
Original: this is a link https://www.yahoo.com/, result: https://www.yahoo.com/
Original: this is a link https://yahoo.com, result: https://yahoo.com
Original: this is a link http://yahoo.com, result: http://yahoo.com
Original: this is a link www.yahoo.com, result: www.yahoo.com
Original: this is a link yahoo.com, result: yahoo.com
Original: this is a link www.yahoo.org, result: www.yahoo.org
Original: this is a link www.whitehouse.gov, result: www.whitehouse.gov
Original: this is a link www.whitehouse.mil, result: www.whitehouse.mil
Original: this is a link www.yahoo.net, result: www.yahoo.net
Original: this is a link yahoo.com.my, result: yahoo.com.my
Original: this is a link yahoo.org.my, result: yahoo.org.my
Original: this is a link yahoo.co.sg, result: yahoo.co.sg
Original: this is a link yahoo.co, result: yahoo.co
Original: this is a link yahoo.my, result: yahoo.my
Original: this is a link yahoo.us, result: yahoo.us
Original: this is a link xyz.123.www.yahoo.com, result: xyz.123.www.yahoo.com
Original: this is a link yahoo.com/news is this right?, result: yahoo.com/news
Original: this is a link yahoo.com/ i think so, result: yahoo.com/
Original: this is a link yahoo.com/aboutus.php, result: yahoo.com/aboutus.php
Original: this is a link yahoo.com/email/ maybe its right, result: yahoo.com/email/
Original: this is a link yahoo.com/email/compose,gotta love it, result: yahoo.com/email/compose
Original: this is a link yahoo.COM/email/compose.html?search=233 haha, result: yahoo.COM/email/compose.html?search=233
Original: this is a link yahoo.com/email/compose.html?search=233&xx=22 haha, result: yahoo.com/email/compose.html?search=233&xx=22
Original: this is a link yahoo.com/email/compose.html?search=233&uu=232 hehe, result: yahoo.com/email/compose.html?search=233&uu=232
Original: this is a link yahoo.com/email/compose.html?search=233&xx=2?ehh, result: yahoo.com/email/compose.html?search=233&xx=2?ehh
Original: this is a link yahoo.com/email/com-pose.html, result: yahoo.com/email/com-pose.html
Original: this is a link yahoo.com/email/com-pose_.html?a=1, result: yahoo.com/email/com-pose_.html?a=1
Original: this is a link yahoo.com/email/com+20%pose.html?a=1, result: yahoo.com/email/com+20%pose.html?a=1


P.S. Yes, I know the result for the line with two ? characters is incorrect; I was too lazy to fix it :)

Full pattern string:
/(((http:\/\/)|(https:\/\/))?([^.,\:\+\s]*\.)*[^.,\:\+\s]*\.(((com|org|net|name|mil|gov|co|info)\.[a-z]{2})|(com|org|net|name|mil|gov|info)|([a-z]{2}))([\/]*[a-z0-9\.\?\=\&\(\&amp\;)\-\_\+\%]*)?([\/])?([a-z0-9\.\?\=\&\(\&amp\;)\-\_\+\%]*))/is
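
The full pattern string can also be pasted in directly to pull every link out of a text, not just the first one. This is only a sketch: the sample sentence and the variable name $aLinks are made up for the example.

```php
<?php
// The full pattern string from above, in single quotes so PHP leaves
// the backslashes alone.
$sPattern = '/(((http:\/\/)|(https:\/\/))?([^.,\:\+\s]*\.)*[^.,\:\+\s]*\.(((com|org|net|name|mil|gov|co|info)\.[a-z]{2})|(com|org|net|name|mil|gov|info)|([a-z]{2}))([\/]*[a-z0-9\.\?\=\&\(\&amp\;)\-\_\+\%]*)?([\/])?([a-z0-9\.\?\=\&\(\&amp\;)\-\_\+\%]*))/is';

// Made-up sample text containing three links.
$sContent = 'read http://yahoo.com/news then www.whitehouse.gov and yahoo.co.sg/email/';

// With PREG_PATTERN_ORDER, $matches[1] holds group 1 (the whole link)
// for every match, in the order the links appear in the text.
preg_match_all($sPattern, $sContent, $matches, PREG_PATTERN_ORDER);
$aLinks = $matches[1];

foreach ($aLinks as $sLink) {
    echo $sLink . "\n";
}
// prints:
// http://yahoo.com/news
// www.whitehouse.gov
// yahoo.co.sg/email/
```

Keep in mind that the last alternative, ([a-z]{2}), accepts any two letters after a dot, so ordinary text such as a file name can also be picked up as a link.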

Updated Aug 28: names can no longer contain +, comma or :, and http:// and https:// prefixes are now supported.
