Wednesday, October 16, 2013

Email::Valid Peculiarities

Over here at Dada Mail, we do a lot of verification of email addresses when confirming subscriptions. Is the form of the email correct? Has this address been used to try to subscribe with before? Has the user themselves confirmed their subscription? What happens when there's an error? There's a lot to it (sadly) and if something in one of these steps goes wrong, a subscription doesn't happen, or worse yet, a mailing list is tarnished with not-so-valid addresses. This little tale is about invalid in form addresses getting on people's mailing lists.

A user posted on the Dada Mail support forums: "There's an address with a space in it, that's entered my list! It's causing problems with delivery of the ENTIRE list? How did that happen?"

A lot of the time developing a very complicated piece of software is supporting it (I think, anyways). When a question like this comes down the tubes, I have no great answer, except it must have been an outside thing. So I ask, "Was the address added directly into the backend?" That's something that happens from time to time, although there are many pitfalls: importing data can be done incorrectly - and even if it's done correctly once, the design of the backend can change, and then it's not done correctly anymore.

That's why, if it's an option, it's a good idea to use the API that presented to the developer, rather than try to retinker it, yourself. 

When the user replied back with, "Nope!" I knew I was really stuck.  "But I have tests!"  I protested to my Macbook Pro. "Coverage!" "I didn't roll my own email validator!" "I did everything you're supposed to do!"

 So, current tests be-damned, I had to start from square one: Attempt to recreate the problem, in the simplest way I could. So, I tried to join a list, with an address with a space in it, like this:

myaddress@example .com

And, wouldn't you know it, the app took the address like nothing was wrong. Now,  totally believing my user, but being in total disbelief myself, it was time to hit the code.  We use Email::Valid in Dada Mail, to test the validity of an email address, like so:



#!/usr/bin/perl 

use Email::Valid; 

my $address = 'someone@example.com'; 

if(Email::Valid->address($address)){ 
 print "Valid!"; 
}
else { 
 print "NOPE!"; 
} 
# prints, Valid! 


This is wrong.  Observe:


#!/usr/bin/perl 

use Email::Valid; 

my $address = 'someone@example .com'; # notice the space! 

if(Email::Valid->address($address)){ 
 print "Valid!"; 
}
else { 
 print "NOPE!"; 
}
# Prints, Valid!

And, there's my problem, in a pretty simple case.

So, what's going on? Basically my code had assumed that Email::Valid's address method returned a boolean value. It does not. From Email::Valid's docs (emphasis is mine):
This is the primary method which determines whether an email address is valid. It's behavior is modified by the values of mxcheck(), tldcheck(), local_rules(), fqdn(), and fudge(). If the address passes all checks, the (possibly modified) address is returned as a string. Otherwise, the undefined value is returned. In a list context, the method also returns an instance of the Mail::Address class representing the email address.
Big difference! So let's try that code snippet, again:

 
#!/usr/bin/perl 

use Email::Valid; 

my $address = 'someone@example .com'; # notice the space! 

my $validated_address = undef; 
if($validated_address = Email::Valid->address($address)){ 
 print "Valid!: $validated_address"; 
}
else { 
 print "NOPE!"; 
}
# prints, Valid!: someone@example.com

Ah-ha: Email::Valid will take it upon itself to modify the email address for you - but only in certain circumstances - spaces in certain places of the address. So, this still returns undef:


#!/usr/bin/perl 

use Email::Valid; 

my $address = 'some one@example.com'; # notice the space! 

my $validated_address = undef; 
if($validated_address = Email::Valid->address($address)){ 
 print "Valid!: $validated_address"; 
}
else { 
 print "NOPE!"; 
}
# prints, NOPE!

This certainly makes things confusing. The reason that the email address gets modified, is that under the hood, Email::Valid turns the address you give it, into a Mail::Address object (it'll also just accept an Mail::Address object, and Mail::Address itself will take out that pesky space in the domain part of our email address, when it's own address() method returns a string:


#!/usr/bin/perl 

use Mail::Address;

my    $address  = 'someone@example .com'; # notice the space!
my    $modified = ( Mail::Address->parse($address) )[0]->address;
print $modified;

# prints,  someone@example.com

This makes it difficult to patch up Email::Valid, since then I'd have to delve into Mail::Address - and perhaps Mail::Address has a perfectly good reason to do this. Since both Email::Valid and Mail::Address have been around for so long, and many programs have been written with their API the way it is, it's a pretty good bet they'll stay this way.

So, what to do? Since my app's validation system is a little more than, "is the form correct?" it's difficult in the  workflow to take in account of the changes Email::Valid makes to the address I pass - I'm not sure if I even like the idea: getting a boolean back in a validation method seems a better idea - most of my own valiation methods in the app have the pattern of, return a boolean status, plus a hashref saying what may be wrong,

 
# In a perfect world: 
my ($status, $errors) = $validater->check_out_this_address($address); 
if($status == 1){ 
    # Good to go!
}
else { 
    print "Dude, PROBLEMS:\n"; 
    for (keys %$errors){ 
        print '* ' . $_ . "\n"; 
    }
}


So, for example, say I'm importing a whole address book of addresses - thousands of addresses! And before I do that, I sort out the duplicate addresses, as the validation steps are many, and cpu consuming. Using Email::Valid the correct way could introduce subtle bugs into the system. Simplifying things:

 
#!/usr/bin/perl 

use Email::Valid;

# Say, I've already sorted out dupes: 
my @check_these = 
(
'one@foobar.com',
'another@foobar.com',
'another@foobar .com',
'two@foobar.com',
'blech'
); 

my @validated = (); 
my @invalid   = (); 

for(@check_these){ 
    my $address; 
    if($address = Email::Valid->address($_)) { 
        push(@validated, $address); 
    }
    else { 
        push(@invalid, $_); 
    } 
}

print "Good! Addresses:\n";
print "* $_\n" for @validated;

print "BAD! Addresses:\n";
print "* $_\n" for @invalid;
This prints out,
 
Good! Addresses:
* one@foobar.com
* another@foobar.com
* another@foobar.com
* two@foobar.com
BAD! Addresses:
* blech


As you can see, some weird things happen:
  • The number of addresses total that are returned isn't the same as what gets passed
  • All the invalid address are not reported
  • my once unique list now isn't so unique anymore. 
 
I'm not sure yet what the best thing to do is. For now, sadly, I have to do something hacky in my own email (in form) validator - I just look for a space, if there's a space, it's automatically invalid:



#!/usr/bin/perl 

use Email::Valid; 

my $address = 'someone@example. com'; # notice the space! 

if(is_valid_email($address)){ 
 print "Valid!"; 
}
else { 
 print "Nope!"; 
}

sub is_valid_email { 
 my $a = shift;
 
 if($a =~ m/\s/){ 
  return 0; 
 } 
 if(Email::Valid->address($a)){ 
  return 1; 
 }
 else { 
  return 0; 
 }
}

# prints, Nope!

So many miscalculations on my part, which! means much to learn from my own dumb self:

Firstly, I couldn't believe it was a bug, since the way things have worked hadn't been changed in eons.

Next: way back when I first wrote put this validation stuff in the app yarns ago, I was thinking that Email::Valid was nothing but a nice wrapper around a very, very complex regex created by Jeffrey Friedl back in the day for a book on Regular Expressions that he wrote. It looks a little like this. It's a doozy.

Another miscalculation I had made was, "I have tests! How could this fail!" and I made the assumption the user was in fact on the wrong. So, I looked at my tests ,and whatayaknow: 
 

TODO: {
    local $TODO = 'This test fails.';
    ok(valid_email_sub('test@example .com')           == 0); 
};

So I actually knew about this problem before, and forgot - there's actually a link to a bug tracker in the actual test, which means I had handed off the problem to someone else to fix. Lazy!

Things to learn: 

  • Read the docs! This very peculiarity is documented in Email::Valid's own docs. Don't assume!

Let's see an example of how the address may be modified:
$addr = Email::Valid->address('Alfred Neuman ');
print "$addr\n"; # prints Neuman@foo.bar

  • Just because you think everything's working, doesn't mean there aren't weird little edge cases -  it's probably a given.
  • Tests are good! Skipping tests that don't pass doesn't always help you. 
  • Reporting bugs in other modules is good! Trying to fix them? Probably better. 

My thanks to rjbs for being patient as I stumble around a module he has been nice enough to adopt, warts and all.
 

2 comments:

Christopher Causer said...

Could the is_valid_email sub look something like (untested):

sub is_valid_email {
my $address = shift;
my $valid_address = Email::Valid->address($address);
return ( $valid_address and $valid_address eq $address );
}

?

This would have the benefit that any modifications, not just white-space removal would return false.

Paul Seamons said...

It is good you are doing your own searching for a space. RFC totally allows for a valid space in the email address:

my $address = '"some one"@example.com';


Valid!: "some one"@example.com

While it may be valid by the RFC, in practice I have not seen anybody actually use a valid email address with a space during the past 16 years. So - kudos to Email::Valid for doing the entire valid RFC email checking - but perhaps it is time for somebody to write an Email::Valid::Sane module (or perhaps you already can set options in Email::Valid to remove some of the esoteric types of valid addresses - checking the docs and trying -local_rules didn't help).