Implementation of an email parser.

#include "libtld/tld.h"
#include <stdio.h>
#include <string.h>
#include <memory>
#include <iostream>
#include <algorithm>

Functions
struct tld_email_list *	tld_email_alloc ()
	Allocate a list of emails object.

int	tld_email_count (struct tld_email_list *list)
	Return the number of emails found after a parse.

void	tld_email_free (struct tld_email_list *list)
	Free the list of emails.

int	tld_email_next (struct tld_email_list *list, struct tld_email *e)
	Retrieve the next email.

tld_result	tld_email_parse (struct tld_email_list *list, char const *emails, int flags)
	Parse a list of emails in the email list object.

void	tld_email_rewind (struct tld_email_list *list)
	Rewind the reading of the emails.

Detailed Description

This file includes all the functions available in the C library of libtld. The format of emails is described in RFC 5322, section 3.4, which uses the ABNF defined in RFC 5234. We limit our implementation to reading a line of email addresses, not a full email buffer. In other words, we handle the content of a field such as the "To:" field. We support emails that are written as:

username@domain.tld
"First & Last Name" <username@domain.tld>

We also support lists thereof (emails separated by commas).
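
To illustrate why a list cannot simply be split on every comma, here is a minimal sketch that only splits on commas found outside quoted strings and comments. It is not how libtld does it; `split_address_list` is a hypothetical name, and trimming and validation of each entry are left out.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of libtld): split a field value on the
// commas that are outside of quoted strings and comments.
std::vector<std::string> split_address_list(std::string const & field)
{
    std::vector<std::string> result;
    std::string current;
    bool in_quotes(false);
    int comment_depth(0);
    for(std::size_t i(0); i < field.length(); ++i)
    {
        char const c(field[i]);
        if(c == '\\' && (in_quotes || comment_depth > 0) && i + 1 < field.length())
        {
            // quoted-pair: copy the backslash and the escaped character as is
            current += c;
            current += field[++i];
            continue;
        }
        if(in_quotes)
        {
            if(c == '"') in_quotes = false;
        }
        else if(comment_depth > 0)
        {
            // comments may nest, so count the parentheses
            if(c == '(') ++comment_depth;
            else if(c == ')') --comment_depth;
        }
        else if(c == '"') in_quotes = true;
        else if(c == '(') comment_depth = 1;
        else if(c == ',')
        {
            // top-level comma: end of the current entry
            result.push_back(current);
            current.clear();
            continue;
        }
        current += c;
    }
    result.push_back(current);
    return result;
}
```

With this sketch, a display name such as "Doe, John" or a comment such as (work, main) stays attached to its address. Entries keep their surrounding white space.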

Emails may also include internationalized characters (Unicode). Since our systems make use of UTF-8, the input can be considered to be UTF-8, in which case we simply accept all characters from 0xA0 to 0x10FFFF (the end of the Unicode range). However, we also support the Q and B encodings to directly support email fields. The B encoding is base64 of UTF-8 data, which works in 7-bit ASCII. The Q encoding is ASCII in which other bytes are written as an equal sign followed by two hexadecimal digits. This works well when all the characters fit in one character set. Note that all characters can be represented because more than one encoding can be used within a phrase, although it is unlikely to be used that way.
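
As a rough illustration of the Q encoding, the sketch below decodes the encoded-text part of an encoded word into raw bytes. This is not the libtld implementation: the names `hex_value` and `decode_q_text` are hypothetical, the treatment of "_" as a space follows RFC 1522, and the sketch does not check that the input is free of spaces and question marks.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: value of one hexadecimal digit, or -1 if invalid.
static int hex_value(char c)
{
    if(c >= '0' && c <= '9') return c - '0';
    if(c >= 'A' && c <= 'F') return c - 'A' + 10;
    if(c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

// Decode the encoded-text of a Q encoded word into raw bytes.
// Returns false when an "=" is not followed by two hexadecimal digits.
bool decode_q_text(std::string const & in, std::string & out)
{
    out.clear();
    for(std::size_t i(0); i < in.length(); ++i)
    {
        char const c(in[i]);
        if(c == '_')
        {
            out += ' ';          // "_" stands for a space (0x20)
        }
        else if(c == '=')
        {
            if(i + 2 >= in.length()) return false;
            int const hi(hex_value(in[i + 1]));
            int const lo(hex_value(in[i + 2]));
            if(hi < 0 || lo < 0) return false;
            out += static_cast<char>(hi * 16 + lo);
            i += 2;
        }
        else
        {
            out += c;            // any other character stands for itself
        }
    }
    return true;
}
```

The output is a byte string in whatever charset the encoded word names, so a caller would still need to convert it to UTF-8 when the charset is something else.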

Text versions:

http://www.ietf.org/rfc/rfc5322.txt
http://www.ietf.org/rfc/rfc5234.txt
http://www.ietf.org/rfc/rfc1522.txt

HTML versions (with links):

http://tools.ietf.org/html/rfc5322
http://tools.ietf.org/html/rfc5234
http://tools.ietf.org/html/rfc1522

Note: At this point we do not foresee offering group capabilities. Therefore the code does not support groups. It will certainly be added later.

Note that the parser skips all white space, including comments. This means that once parsed, all white space and comments are lost.

The following grammar comes from a mix of versions, starting with RFC 2822 (http://www.ietf.org/rfc/rfc2822.txt), which still accepted all control characters everywhere. Now only white space characters are allowed in most places (\r, \n, \t, and the space \x20). We also do not allow control characters everywhere because they are unlikely to be valid.

(this part is not implemented, it just shows what is expected to be used for such
and such field.)
from            =       "From:" (mailbox-list / address-list) CRLF
sender          =       "Sender:" (mailbox / address) CRLF
reply-to        =       "Reply-To:" address-list CRLF
to              =       "To:" address-list CRLF
cc              =       "Cc:" address-list CRLF
bcc             =       "Bcc:" (address-list / [CFWS]) CRLF
 
address         =       mailbox / group
mailbox         =       name-addr / addr-spec
name-addr       =       [display-name] angle-addr
angle-addr      =       [CFWS] "<" addr-spec ">" [CFWS] / obs-angle-addr
group           =       display-name ":" [mailbox-list / CFWS] ";" [CFWS]
display-name    =       phrase
mailbox-list    =       (mailbox *("," mailbox)) / obs-mbox-list
address-list    =       (address *("," address)) / obs-addr-list
addr-spec       =       local-part "@" domain
local-part      =       dot-atom / quoted-string / obs-local-part
domain          =       dot-atom / domain-literal / obs-domain
domain-literal  =       [CFWS] "[" *([FWS] dcontent) [FWS] "]" [CFWS]
dcontent        =       dtext / quoted-pair
dtext           =       NO-WS-CTL /     ; Non white space controls
                        %d33-90 /       ; The rest of the US-ASCII
                        %d94-126        ;  characters not including "[",
                                        ;  "]", or "\"
NO-WS-CTL       =       %d1-8 /         ; US-ASCII control characters
                        %d11 /          ;  that do not include the
                        %d12 /          ;  carriage return, line feed,
                        %d14-31 /       ;  and white space characters
                        %d127
text            =       %d1-9 /         ; Characters excluding CR and LF
                        %d11 /
                        %d12 /
                        %d14-127 /
                        obs-text
specials        =       "(" / ")" /     ; Special characters used in
                        "<" / ">" /     ;  other parts of the syntax
                        "[" / "]" /
                        ":" / ";" /
                        "@" / "\" /
                        "," / "." /
                        DQUOTE
DQUOTE          =       %x22
ALPHA           =       %x41-5A / %x61-7A   ; A-Z / a-z
DIGIT           =       %x30-39             ; 0-9
SP              =       %x20
HTAB            =       %x09
WSP             =       SP / HTAB
CR              =       %x0D
LF              =       %x0A
CRLF            =       CR LF
FWS             =       ([*WSP CRLF] 1*WSP) /   ; Folding white space
                        obs-FWS
quoted-pair     =       ("\" text) / obs-qp
ctext           =       NO-WS-CTL /     ; Non white space controls
                        %d33-39 /       ; The rest of the US-ASCII
                        %d42-91 /       ;  characters not including "(",
                        %d93-126        ;  ")", or "\"
ccontent        =       ctext / quoted-pair / comment / encoded-word
comment         =       "(" *([FWS] ccontent) [FWS] ")"
CFWS            =       *([FWS] comment) (([FWS] comment) / FWS)
atext           =       ALPHA / DIGIT / ; Any character except controls,
                        "!" / "#" /     ;  SP, and specials.
                        "$" / "%" /     ;  Used for atoms
                        "&" / "'" /
                        "*" / "+" /
                        "-" / "/" /
                        "=" / "?" /
                        "^" / "_" /
                        "`" / "{" /
                        "|" / "}" /
                        "~"
atom            =       [CFWS] 1*atext [CFWS]
dot-atom        =       [CFWS] dot-atom-text [CFWS]
dot-atom-text   =       1*atext *("." 1*atext)
qtext           =       NO-WS-CTL /     ; Non white space controls
                        %d33 /          ; The rest of the US-ASCII
                        %d35-91 /       ;  characters not including "\"
                        %d93-126        ;  or the quote character
qcontent        =       qtext / quoted-pair
quoted-string   =       [CFWS]
                        DQUOTE *([FWS] qcontent) [FWS] DQUOTE
                        [CFWS]
word            =       atom / quoted-string
phrase          =       1*word / obs-phrase
 
# Added by RFC-1522
encoded-word    =       "=?" charset "?" encoding "?" encoded-text "?="
charset         =       token
encoding        =       token
token           =       1*<Any CHAR except SPACE, CTLs, and especials>
                        ; equivalent to:
                        ; 1*(%d33 / %d35-39 / %d42-43 / %d45 / %d48-57 /
                        ; %d65-90 / %d92 / %d94-126)
especials       =       "(" / ")" / "<" / ">" / "@" / "," / ";" / ":" /
                        DQUOTE / "/" / "[" / "]" / "?" / "." / "="
encoded-text    =       1*<Any printable ASCII character other than "?" or SPACE>
                        ; %d33-62 / %d64-126
 
# Obsolete syntax "extensions"
obs-from        =       "From" *WSP ":" mailbox-list CRLF
obs-sender      =       "Sender" *WSP ":" mailbox CRLF
obs-reply-to    =       "Reply-To" *WSP ":" mailbox-list CRLF
obs-to          =       "To" *WSP ":" address-list CRLF
obs-cc          =       "Cc" *WSP ":" address-list CRLF
obs-bcc         =       "Bcc" *WSP ":" (address-list / [CFWS]) CRLF
obs-qp          =       "\" (%d0-127)
obs-text        =       *LF *CR *(obs-char *LF *CR)
obs-char        =       %d0-9 / %d11 /          ; %d0-127 except CR and
                        %d12 / %d14-127         ;  LF
obs-utext       =       obs-text
obs-phrase      =       word *(word / "." / CFWS)
obs-phrase-list =       phrase / 1*([phrase] [CFWS] "," [CFWS]) [phrase]
obs-FWS         =       1*WSP *(CRLF 1*WSP)
obs-angle-addr  =       [CFWS] "<" [obs-route] addr-spec ">" [CFWS]
obs-route       =       [CFWS] obs-domain-list ":" [CFWS]
obs-domain-list =       "@" domain *(*(CFWS / "," ) [CFWS] "@" domain)
obs-local-part  =       word *("." word)
obs-domain      =       atom *("." atom)
obs-mbox-list   =       1*([mailbox] [CFWS] "," [CFWS]) [mailbox]
obs-addr-list   =       1*([address] [CFWS] "," [CFWS]) [address]
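
To make the atext and dot-atom-text rules above more concrete, here is a minimal sketch of the two checks. The function names are hypothetical and this is not the libtld code. It only handles ASCII, whereas the description above says the actual parser also accepts UTF-8 characters.

```cpp
#include <cstring>
#include <string>

// atext = ALPHA / DIGIT / a fixed set of punctuation characters
bool is_atext(char c)
{
    if((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9'))
    {
        return true;
    }
    // strchr() also matches the terminating '\0', so exclude it explicitly
    return c != '\0' && std::strchr("!#$%&'*+-/=?^_`{|}~", c) != nullptr;
}

// dot-atom-text = 1*atext *("." 1*atext)
bool is_dot_atom_text(std::string const & s)
{
    bool need_atext(true);   // true at the start and right after a "."
    for(char const c : s)
    {
        if(is_atext(c))
        {
            need_atext = false;
        }
        else if(c == '.' && !need_atext)
        {
            need_atext = true;
        }
        else
        {
            return false;    // not atext, or a misplaced "."
        }
    }
    return !need_atext;      // reject empty input and a trailing "."
}
```

For example, "first.last" is accepted while "first..last", ".first", and "first." are rejected, because each period must sit between two non-empty runs of atext.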

The ABNF is a bit complicated to use as is, so here is a lex and yacc version, which I find easier to implement:

(lex part)
[-A-Za-z0-9!#$%&'*+/=?^_`{|}~]+                                          atom_text_repeat (ALPHA+DIGIT+some other characters)
([\x09\x0A\x0D\x20-\x27\x2A-\x5B\x5D-\x7E]|\\[\x09\x20-\x7E])+           comment_text_repeat
([\x21-\x5A\x5E-\x7E])+                                                  domain_text_repeat
([\x21\x23-\x5B\x5D-\x7E]|\\[\x09\x20-\x7E])+                            quoted_text_repeat
\x22                                                                     DQUOTE
[\x20\x09]*\x0D\x0A[\x20\x09]+                                           FWS
.                                                                        any other character
 
(lex definitions merged in more complex lex definitions)
[\x01-\x08\x0B\x0C\x0E-\x1F\x7F]                                         NO_WS_CTL
[()<>[\]:;@\\,.]                                                         specials
[\x01-\x09\x0B\x0C\x0E-\x7F]                                             text
\\[\x09\x20-\x7E]                                                        quoted_pair ('\\' text)
[A-Za-z]                                                                 ALPHA
[0-9]                                                                    DIGIT
[\x20\x09]                                                               WSP
\x20                                                                     SP
\x09                                                                     HTAB
\x0D\x0A                                                                 CRLF
\x0D                                                                     CR
\x0A                                                                     LF
 
(yacc part)
address_list: address
            | address ',' address_list
address: mailbox
       | group
mailbox_list: mailbox
            | mailbox ',' mailbox_list
mailbox: name_addr
       | addr_spec
group: display_name ':' mailbox_list ';' CFWS
     | display_name ':' CFWS ';' CFWS
name_addr: angle_addr
         | display_name angle_addr
display_name: phrase
angle_addr: CFWS '<' addr_spec '>' CFWS
addr_spec: local_part '@' domain
local_part: dot_atom
          | quoted_string
domain: dot_atom
      | domain_literal
domain_literal: CFWS '[' FWS domain_text_repeat FWS ']' CFWS
phrase: word
      | word phrase
word: atom
    | quoted_string
atom: CFWS atom_text_repeat CFWS
dot_atom: CFWS dot_atom_text CFWS
dot_atom_text: atom_text_repeat
             | atom_text_repeat '.' dot_atom_text
quoted_string: CFWS DQUOTE quoted_text_repeat DQUOTE CFWS
CFWS: <empty>
    | FWS comment
    | CFWS comment FWS
comment: '(' comment_content ')'
comment_content: comment_text_repeat
               | comment
               | comment_content comment_content
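
The comment rule above is recursive, so comments may nest. The sketch below shows one way white space and nested comments could be skipped in a single pass. It is an assumption-laden illustration, not the libtld parser: `skip_cfws` is a hypothetical name, and the sketch accepts any mix of space, tab, CR and LF rather than enforcing the exact FWS rule.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of libtld): skip white space and possibly
// nested comments starting at pos. Returns the index of the first character
// that is neither, or std::string::npos if a comment is left unterminated.
std::size_t skip_cfws(std::string const & s, std::size_t pos)
{
    int depth(0);   // current comment nesting level
    while(pos < s.length())
    {
        char const c(s[pos]);
        if(depth > 0)
        {
            if(c == '\\')
            {
                // quoted-pair: the escaped character is never a delimiter
                if(pos + 1 >= s.length()) break;   // dangling backslash
                pos += 2;
                continue;
            }
            if(c == '(') ++depth;
            else if(c == ')') --depth;
            ++pos;
        }
        else if(c == '(')
        {
            depth = 1;
            ++pos;
        }
        else if(c == ' ' || c == '\t' || c == '\r' || c == '\n')
        {
            ++pos;
        }
        else
        {
            return pos;   // first significant character
        }
    }
    return depth == 0 ? pos : std::string::npos;
}
```

Because the skipped characters are simply discarded, this also shows why comments and white space are lost once the input has been parsed.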

Definition in file tld_emails.cpp.

Function Documentation

◆ tld_email_alloc()

struct tld_email_list * tld_email_alloc ( )

This function allocates a list of emails object that can then be used to parse a string representing a list of emails and retrieve those emails with the use of the tld_email_next() function.

Note: The object is a C++ class.

Returns: A pointer to a list of emails object.

See also: tld_email_next()

Definition at line 1480 of file tld_emails.cpp.

References tld_email_list::tld_email_list().

Referenced by PHP_FUNCTION().

◆ tld_email_count()

int tld_email_count ( struct tld_email_list * list )

This function returns the number of emails that were found in the list of emails passed to the tld_email_parse() function.

Parameters

[in] list The email list object.

Returns: The number of emails defined in the object; it may be zero.

Definition at line 1525 of file tld_emails.cpp.

References list().

◆ tld_email_free()

void tld_email_free ( struct tld_email_list * list )

This function frees the list of emails allocated by tld_email_alloc(). Afterward, the list pointer is no longer valid.

Parameters

[in] list The list to be freed.

Definition at line 1493 of file tld_emails.cpp.

References list().

Referenced by PHP_FUNCTION().

◆ tld_email_next()

int tld_email_next	(	struct tld_email_list *	list,
		struct tld_email *	e
	)

This function retrieves the next email found when parsing the emails passed to the tld_email_parse() function. It returns 1 when another email was defined. It returns 0 when no more emails exist, in which case the e parameter is not set. The function can be called any number of times after it has returned zero (0).

Parameters

[in]	list	The list from which the email is to be read.
[out]	e	The buffer where the email is to be written.

Returns: 0 if the end of the list was reached; 1 if e was defined with the next email.

See also: tld_email_parse()

Definition at line 1559 of file tld_emails.cpp.

References list().

Referenced by PHP_FUNCTION().

◆ tld_email_parse()

tld_result tld_email_parse	(	struct tld_email_list *	list,
		char const *	emails,
		int	flags
	)

This function parses the emails listed in the emails parameter and saves the result, as a list of emails, in the list object.

Parameters

[in]	list	The list of emails object.
[in]	emails	The list of emails to be parsed.
[in]	flags	The flags are used to change the behavior of the parser.

Returns: TLD_RESULT_SUCCESS if the emails were parsed successfully; another TLD_RESULT_... value when an error is detected.

Definition at line 1511 of file tld_emails.cpp.

References list(), and tld_email_list::parse().

Referenced by PHP_FUNCTION().

◆ tld_email_rewind()

void tld_email_rewind ( struct tld_email_list * list )

This function resets the position to the start of the list. The next call to the tld_email_next() function will return the first email again.

Parameters

[in] list The list of email object to reset.

Definition at line 1538 of file tld_emails.cpp.

References list().
