Scott Donnelly

Email Validation - via Polymorphic PHP / JS hybrid script, internationalized CC TLD ready

I was looking around for a RegEx to match an email address last night and headed straight to what must be the RegEx Mecca: http://www.regular-expressions.info. The page discussing email validation is very comprehensive and informative. The section that got my mind going was under the heading “Trade-Offs in Validating Email Addresses”, in particular the lengthy RegEx and its associated critique:

^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.(?:[A-Z]{2}|com|org|net|edu|gov|mil|biz|info|mobi|name|aero|asia|jobs|museum)$ could be used to allow any two-letter country code top level domain, and only specific generic top level domains. By the time you read this, the list might already be out of date. If you use this regular expression, I recommend you store it in a global constant in your application, so you only have to update it in one place. You could list all country codes in the same manner, even though there are almost 200 of them.
So I thought to myself: what if you could automatically update the list of valid TLDs, generate the RegEx from that, and then just use a simple function to test the RegEx? JS alone would be out of the window for the auto-update, unless you wanted the client to make an extra HTTP request on every visit to a page using your function. But how about a PHP script that updates the TLD list direct from IANA, say once per day, and caches the result? Combine the JS with the PHP by having the PHP script generate the JS programmatically, setting the Content-Type header of its response to application/x-javascript. And for fun, and because I never get to write any self-modifying code, have the PHP script store its updates from IANA in itself, so that it only needs to refresh its TLD list when it checks its own file modification time and finds itself unmodified for over a day.
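To make the "generate the RegEx from the list" step concrete, here is a minimal sketch in JS. The helper name buildEmailRegex and the four sample TLDs are made up for illustration; the local-part and domain-label patterns are the ones the script further down uses, and the real list would come from IANA rather than being typed in by hand.

```javascript
// Hypothetical helper: build an email-matching RegExp from a list of TLDs.
// The two pattern fragments are copied from the script in this post.
function buildEmailRegex(tlds) {
    var local = "[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*";
    var labels = "(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\\.)+";
    // Lower-case every TLD and join them into one alternation: com|org|uk|...
    var alternation = tlds.map(function (tld) {
        return tld.toLowerCase();
    }).join("|");
    return new RegExp(local + "@" + labels + "(?:" + alternation + ")$");
}

// Sample list for illustration only; IANA publishes the names in upper case.
var sampleRE = buildEmailRegex(["COM", "ORG", "UK", "XN--P1AI"]);
console.log(sampleRE.test("someone@example.com"));        // true
console.log(sampleRE.test("someone@example.invalidtld")); // false
```

Adding or removing a TLD then only means changing the array, never the pattern itself.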

Well, here is the result (download link at bottom of post):

<?php
/*****************************************************
* valEmail.js.php
* 
* @author Scott Donnelly 
* @version 1.0
* Released under the LGPL V3 license.
* @license http://www.gnu.org/licenses/lgpl-3.0.txt
*
* See http://scott.donnel.ly/archive/140 for info
* thanks to http://www.regular-expressions.info
******************************************************/

// TLDs last updated: Never
$TLDs = Array();

// update list of top-level domains from IANA if current list is too old.
if (empty($TLDs) || filemtime('./valEmail.js.php') < time() - 86400) {

    // get the file from IANA; gracefully fall back to current data on fail
    if ($iana_TLD_file = file_get_contents('http://data.iana.org/TLD/tlds-alpha-by-domain.txt')) {

        $TLDs = Array();
        $iana_TLD_file = explode("\n", $iana_TLD_file);

        // parse all TLDs into the TLD array, ignoring comments and too-short lines
        foreach($iana_TLD_file as $line)
            if (strlen(trim($line)) > 1 && $line[0] != "#")
                array_push($TLDs, trim($line));

        // self-modify to update cached TLD Array
        if ($this_file = file_get_contents('valEmail.js.php')) {

            // rip the document block at the start of the file.
            $start_of_file = substr($this_file, 0, strpos($this_file, "// TLDs last updated"));

            // synthesize the first 3 lines of the file after the docblock...
            $start_of_file .= "// TLDs last updated: "
                . date('D/n/Y G:i:s T') . "\n"
                . '$TLDs = Array(';

            $first = true;
            foreach($TLDs as $TLD) {
                if ($first) $first = false; else $start_of_file .= ", ";
                $start_of_file .= "'$TLD'";
            }
            $start_of_file .= ");\n";

            // ... append the rest of the file as per current file...
            $this_file = $start_of_file . substr($this_file, strpos($this_file, ");\n") + 3);

            // .. and write it out to the file system.
            if ($fp = fopen('./valEmail.js.php', 'wt')) {
                fwrite($fp, $this_file);
                fclose($fp);
            }
        }
    }
}

// output the email check function as a JS file.
header('Content-type: application/x-javascript');

echo "function valEmail(email) {   
        emailRE = new RegExp('[a-z0-9!#$%\&\'*+/=?^_`{|}~-]+(?:\\\\.[a-z0-9!#$%\&\'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\\\\.)+(?:";

$first = true;
foreach($TLDs as $TLD) {
    if ($first) $first = false; else echo "|";    
    echo(strtolower($TLD));
}

echo ")$')\n\treturn emailRE.test(email);\n}\n";
?>
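The parsing and freshness rules in the script above can be sketched on their own, here in JS so they can be tried without a server. The function names parseTldList and needsRefresh are hypothetical, and the sample text is made up; it only imitates the one-name-per-line layout with "#" comment lines that the PHP loop expects.

```javascript
// Hypothetical helper mirroring the PHP foreach loop: keep a line when it is
// longer than one character after trimming and does not start with "#".
function parseTldList(text) {
    var tlds = [];
    text.split("\n").forEach(function (line) {
        if (line.trim().length > 1 && line.charAt(0) !== "#") {
            tlds.push(line.trim());
        }
    });
    return tlds;
}

// Hypothetical helper: true when the cached list is empty or the file was
// last modified more than 86400 seconds (one day) ago, as in the PHP check.
function needsRefresh(tlds, lastModifiedSeconds, nowSeconds) {
    return tlds.length === 0 || lastModifiedSeconds < nowSeconds - 86400;
}

// Made-up sample input: one comment line, three names, one blank line.
var sampleText = "# sample comment line\nAC\nAD\nAERO\n\n";
console.log(parseTldList(sampleText)); // the three names AC, AD and AERO
console.log(needsRefresh([], 0, 0));   // true: an empty list always refreshes
```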

Executing the code for the first time triggers the script to modify itself, rewriting the “TLDs last updated” comment and the $TLDs line below it to:

// TLDs last updated: Mon/9/2010 21:43:00 BST
$TLDs = Array('AC', 'AD', 'AE', 'AERO', 'AF', 'AG', 'AI', 'AL', 'AM', 'AN', 'AO', 'AQ', 'AR', 'ARPA', 'AS', 'ASIA', 'AT', 'AU', 'AW', 'AX', 'AZ', 'BA', 'BB', 'BD', 'BE', 'BF', 'BG', 'BH', 'BI', 'BIZ', 'BJ', 'BM', 'BN', 'BO', 'BR', 'BS', 'BT', 'BV', 'BW', 'BY', 'BZ', 'CA', 'CAT', 'CC', 'CD', 'CF', 'CG', 'CH', 'CI', 'CK', 'CL', 'CM', 'CN', 'CO', 'COM', 'COOP', 'CR', 'CU', 'CV', 'CX', 'CY', 'CZ', 'DE', 'DJ', 'DK', 'DM', 'DO', 'DZ', 'EC', 'EDU', 'EE', 'EG', 'ER', 'ES', 'ET', 'EU', 'FI', 'FJ', 'FK', 'FM', 'FO', 'FR', 'GA', 'GB', 'GD', 'GE', 'GF', 'GG', 'GH', 'GI', 'GL', 'GM', 'GN', 'GOV', 'GP', 'GQ', 'GR', 'GS', 'GT', 'GU', 'GW', 'GY', 'HK', 'HM', 'HN', 'HR', 'HT', 'HU', 'ID', 'IE', 'IL', 'IM', 'IN', 'INFO', 'INT', 'IO', 'IQ', 'IR', 'IS', 'IT', 'JE', 'JM', 'JO', 'JOBS', 'JP', 'KE', 'KG', 'KH', 'KI', 'KM', 'KN', 'KP', 'KR', 'KW', 'KY', 'KZ', 'LA', 'LB', 'LC', 'LI', 'LK', 'LR', 'LS', 'LT', 'LU', 'LV', 'LY', 'MA', 'MC', 'MD', 'ME', 'MG', 'MH', 'MIL', 'MK', 'ML', 'MM', 'MN', 'MO', 'MOBI', 'MP', 'MQ', 'MR', 'MS', 'MT', 'MU', 'MUSEUM', 'MV', 'MW', 'MX', 'MY', 'MZ', 'NA', 'NAME', 'NC', 'NE', 'NET', 'NF', 'NG', 'NI', 'NL', 'NO', 'NP', 'NR', 'NU', 'NZ', 'OM', 'ORG', 'PA', 'PE', 'PF', 'PG', 'PH', 'PK', 'PL', 'PM', 'PN', 'PR', 'PRO', 'PS', 'PT', 'PW', 'PY', 'QA', 'RE', 'RO', 'RS', 'RU', 'RW', 'SA', 'SB', 'SC', 'SD', 'SE', 'SG', 'SH', 'SI', 'SJ', 'SK', 'SL', 'SM', 'SN', 'SO', 'SR', 'ST', 'SU', 'SV', 'SY', 'SZ', 'TC', 'TD', 'TEL', 'TF', 'TG', 'TH', 'TJ', 'TK', 'TL', 'TM', 'TN', 'TO', 'TP', 'TR', 'TRAVEL', 'TT', 'TV', 'TW', 'TZ', 'UA', 'UG', 'UK', 'US', 'UY', 'UZ', 'VA', 'VC', 'VE', 'VG', 'VI', 'VN', 'VU', 'WF', 'WS', 'XN--0ZWM56D', 'XN--11B5BS3A9AJ6G', 'XN--80AKHBYKNJ4F', 'XN--9T4B11YI5A', 'XN--DEBA0AD', 'XN--FIQS8S', 'XN--FIQZ9S', 'XN--FZC2C9E2C', 'XN--G6W251D', 'XN--HGBK6AJ7F53BBA', 'XN--HLCJ6AYA9ESC7A', 'XN--J6W193G', 'XN--JXALPDLP', 'XN--KGBECHTV', 'XN--KPRW13D', 'XN--KPRY57D', 'XN--MGBAAM7A8H', 'XN--MGBAYH7GPA', 'XN--MGBERP4A5D4AR', 'XN--O3CW4H', 'XN--P1AI', 'XN--PGBS0DH', 
'XN--WGBH1C', 'XN--XKC2AL3HYE2A', 'XN--YGBI2AMMX', 'XN--ZCKZAH', 'YE', 'YT', 'ZA', 'ZM', 'ZW');
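The rewrite itself is just string splicing: keep everything before the "// TLDs last updated" marker, synthesize the two cache lines, and append everything after the first ");\n". A minimal sketch of that step in JS, operating on a string rather than a real file; the name spliceTldCache, the tiny stand-in file text and the placeholder timestamp are all made up for illustration.

```javascript
// Hypothetical helper: return the file text with a fresh cache block, using
// the same two search markers as the PHP script.
function spliceTldCache(source, tlds, stamp) {
    var marker = "// TLDs last updated";
    // Everything before the marker (the docblock) is kept untouched.
    var head = source.substring(0, source.indexOf(marker));
    var quoted = tlds.map(function (tld) {
        return "'" + tld + "'";
    }).join(", ");
    var cache = marker + ": " + stamp + "\n$TLDs = Array(" + quoted + ");\n";
    // The first ");\n" in the file is assumed to end the $TLDs line, so the
    // text after it (skipping those 3 characters) is the unchanged remainder.
    var tail = source.substring(source.indexOf(");\n") + 3);
    return head + cache + tail;
}

// Stand-in for the script's own source, shortened for the example.
var before = "/* docblock */\n\n// TLDs last updated: Never\n$TLDs = Array();\n\n// rest of script\n";
var after = spliceTldCache(before, ["AC", "AD"], "some timestamp");
console.log(after);
```

The trick only works while nothing before the $TLDs line contains ");" followed by a newline, which holds for the docblock in this script.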

Anyway, the script produces (at the time of writing!) the following JavaScript function:

function valEmail(email) {    
        emailRE = new RegExp('[a-z0-9!#$%\&\'*+/=?^_`{|}~-]+(?:\\.[a-z0-9!#$%\&\'*+/=?^_`{|}~-]+)*@
(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\\.)+(?:ac|ad|ae|aero|af|ag|ai|al|am|an|ao|aq|ar|arpa|as|asia|at|au|
aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|biz|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cat|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|
com|coop|cr|cu|cv|cx|cy|cz|de|dj|dk|dm|do|dz|ec|edu|ee|eg|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|
gl|gm|gn|gov|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|info|int|io|iq|ir|is|it|je|jm|jo|jobs|jp|ke|
kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mil|mk|ml|mm|mn|mo|mobi|
mp|mq|mr|ms|mt|mu|museum|mv|mw|mx|my|mz|na|name|nc|ne|net|nf|ng|ni|nl|no|np|nr|nu|nz|om|org|pa|pe|
pf|pg|ph|pk|pl|pm|pn|pr|pro|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|sk|sl|sm|sn|so|sr|st|su|sv|
sy|sz|tc|td|tel|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|travel|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|
xn--0zwm56d|xn--11b5bs3a9aj6g|xn--80akhbyknj4f|xn--9t4b11yi5a|xn--deba0ad|xn--fiqs8s|xn--fiqz9s|
xn--fzc2c9e2c|xn--g6w251d|xn--hgbk6aj7f53bba|xn--hlcj6aya9esc7a|xn--j6w193g|xn--jxalpdlp|xn--kgbechtv|
xn--kprw13d|xn--kpry57d|xn--mgbaam7a8h|xn--mgbayh7gpa|xn--mgberp4a5d4ar|xn--o3cw4h|xn--p1ai|
xn--pgbs0dh|xn--wgbh1c|xn--xkc2al3hye2a|xn--ygbi2ammx|xn--zckzah|ye|yt|za|zm|zw)$')
    return emailRE.test(email);
}

Not pretty, but “fire-and-forget”: you will never need to look at this ugly JS, as it will never need editing. (These sound like famous last words… no more than 640K, anyone?! :-) )
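Two properties of the pattern are worth knowing before wiring it into a form: it only accepts lower-case input, and it has no ^ anchor, so any string that merely ends in a valid address passes. The sketch below shows both with a shortened stand-in for the generated function (three TLDs instead of the full list), plus a hypothetical stricter variant. The names valEmailShort and valEmailStrict, the ^ anchor and the "i" flag are my assumptions for illustration, not something the script above emits.

```javascript
// Shortened stand-in for the generated function: same pattern, three TLDs.
function valEmailShort(email) {
    var emailRE = new RegExp("[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\\.)+(?:com|org|uk)$");
    return emailRE.test(email);
}

// Hypothetical stricter variant: ^ anchors the start, "i" ignores case.
function valEmailStrict(email) {
    var emailRE = new RegExp("^[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\\.)+(?:com|org|uk)$", "i");
    return emailRE.test(email);
}

console.log(valEmailShort("someone@example.co.uk"));          // true
console.log(valEmailShort("Someone@Example.COM"));            // false (upper case)
console.log(valEmailShort("not valid someone@example.com"));  // true (no ^ anchor)
console.log(valEmailStrict("Someone@Example.COM"));           // true
console.log(valEmailStrict("not valid someone@example.com")); // false
```

Lower-casing the input before calling the generated function would be another way to get the case-insensitive behaviour without touching the pattern.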

Anyways, you can give it a try here:


You can download the zipped valEmail.js.php file using this link.