LCOV - code coverage report

Current view:	top level - src - tld.c (source / functions)		Hit	Total	Coverage
Test:	coverage.info	Lines:	236	236	100.0 %
Date:	2021-05-08 12:27:55	Functions:	7	7	100.0 %

          Line data    Source code

       1             : /* TLD library -- TLD, domain name, and sub-domain extraction
       2             :  * Copyright (c) 2011-2021  Made to Order Software Corp.  All Rights Reserved
       3             :  *
       4             :  * Permission is hereby granted, free of charge, to any person obtaining a
       5             :  * copy of this software and associated documentation files (the
       6             :  * "Software"), to deal in the Software without restriction, including
       7             :  * without limitation the rights to use, copy, modify, merge, publish,
       8             :  * distribute, sublicense, and/or sell copies of the Software, and to
       9             :  * permit persons to whom the Software is furnished to do so, subject to
      10             :  * the following conditions:
      11             :  *
      12             :  * The above copyright notice and this permission notice shall be included
      13             :  * in all copies or substantial portions of the Software.
      14             :  *
      15             :  * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
      16             :  * OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
      17             :  * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
      18             :  * IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
      19             :  * CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
      20             :  * TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
      21             :  * SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
      22             :  */
      23             : 
      24             : /** \file
      25             :  * \brief Implementation of the TLD parser library.
      26             :  *
      27             :  * This file includes all the functions available in the C library
      28             :  * of libtld that pertain to the parsing of URIs and extraction of
      29             :  * TLDs.
      30             :  */
      31             : 
      32             : #include "libtld/tld.h"
      33             : #include "tld_data.h"
      34             : #if defined(MO_DARWIN)
      35             : #   include <malloc/malloc.h>
      36             : #endif
      37             : #if !defined(MO_DARWIN) && !defined(MO_FREEBSD)
      38             : #include <malloc.h>
      39             : #endif
      40             : #include <stdlib.h>
      41             : #include <limits.h>
      42             : #include <string.h>
      43             : #include <ctype.h>
      44             : 
      45             : #ifdef WIN32
      46             : #define strncasecmp _strnicmp
      47             : #endif
      48             : 
      49             : /** \mainpage
      50             :  *
      51             :  * \section introduction The libtld Library
      52             :  *
      53             :  * The libtld project is a library that gives you the capability to
      54             :  * determine the TLD part of any Internet URI or email address.
      55             :  *
      56             :  * The main function of the library, tld(), takes a URI string and a
      57             :  * tld_info structure. From that information it computes the position
      58             :  * where the TLD starts in the URI. For email addresses (see the
      59             :  * tld_email_list C++ object, or the tld_email.cpp file for the C
      60             :  * functions,) it breaks down a full list of emails verifying the
      61             :  * syntax as defined in RFC 5322.
      62             :  *
      63             :  * \section c_programmers For C Programmers
      64             :  *
      65             :  * The C functions that you are expected to use are listed here:
      66             :  *
      67             :  * \li tld_version() -- return a string representing the TLD library version
      68             :  * \li tld() -- find the position of the TLD of any URI
      69             :  * \li tld_domain_to_lowercase() -- force lowercase on the domain name before
      70             :  *                                  calling other tld functions
      71             :  * \li tld_check_uri() -- verify a full URI, with scheme, path, etc.
      72             :  * \li tld_clear_info() -- reset a tld_info structure for use with tld()
      73             :  * \li tld_email_alloc() -- allocate a tld_email_list object
      74             :  * \li tld_email_free() -- free a tld_email_list object
      75             :  * \li tld_email_parse() -- parse a list of email addresses
      76             :  * \li tld_email_count() -- number of emails found by tld_email_parse()
      77             :  * \li tld_email_rewind() -- go back to the start of the list of emails
      78             :  * \li tld_email_next() -- read the next email from the list of emails
      79             :  *
      80             :  * \section cpp_programmers For C++ Programmers
      81             :  *
      82             :  * For C++ users, please make use of these tld classes:
      83             :  *
      84             :  * \li tld_object
      85             :  * \li tld_email_list
      86             :  *
      87             :  * In C++, you may also make use of tld_version() to check the current
      88             :  * version of the library.
      89             :  *
      90             :  * To check whether the version is valid for your tool, you may look at the
      91             :  * version handling of the libdebpackages library of the wpkg project. The
      92             :  * libtld version is always a Debian compatible version.
      93             :  *
      94             :  * http://windowspackager.org/documentation/implementation-details/debian-version-api
      95             :  *
      96             :  * \section php_programmers For PHP Programmers
      97             :  *
      98             :  * At this point I do not have a very good environment to recompile everything
      99             :  * for PHP. The main reason is that the library is compiled with cmake
     100             :  * as opposed to the automake toolchain that Zend expects.
     101             :  *
     102             :  * This being said, the php directory includes all you need to make use of the
     103             :  * library under PHP. It works like a charm for me and there should be no reason
     104             :  * for you not to be able to do the same with the library.
     105             :  *
     106             :  * The way I rebuild everything for PHP:
     107             :  *
     108             :  * \code
     109             :  * # from within the libtld directory:
     110             :  * mkdir ../BUILD
     111             :  * (cd ../BUILD; cmake ../libtld)
     112             :  * make -C ../BUILD
     113             :  * cd php
     114             :  * ./build
     115             :  * \endcode
     116             :  *
     117             :  * The build script will copy the resulting php_libtld.so file where it
     118             :  * needs to go using sudo. Your system (Red Hat, Mandrake, etc.) may use
     119             :  * su instead. Update the script as required.
     120             :  *
     121             :  * Note that the libtld will be linked statically inside the php_libtld.so
     122             :  * so you do not have to actually install the libtld environment to make
     123             :  * everything work as expected.
     124             :  *
     125             :  * The resulting functions added to PHP via this extension are:
     126             :  *
     127             :  * \li %check_tld()
     128             :  * \li %check_uri()
     129             :  * \li %check_email()
     130             :  *
     131             :  * For information about these functions, check out the php/php_libtld.c
     132             :  * file which describes each function, its parameters, and its results
     133             :  * in great detail.
     134             :  *
     135             :  * \section not_linux Compiling on Other Platforms
     136             :  *
     137             :  * We can successfully compile the library under MS-Windows with cygwin
     138             :  * and the Microsoft IDE. To do so, we use the CMakeLists.txt file found
     139             :  * under the dev directory. Overwrite the CMakeLists.txt file in the
     140             :  * main directory before configuring and you'll get a library without
     141             :  * having to first compile Qt4.
     142             :  *
     143             :  * \code
     144             :  * cp dev/libtld-only-CMakeLists.txt CMakeLists.txt
     145             :  * \endcode
     146             :  *
     147             :  * At this point this configuration only compiles the library. It gives
     148             :  * you a shared (.DLL) and a static (.lib) version. With the IDE you may
     149             :  * create a debug and a release version.
     150             :  *
     151             :  * Later we'll look into having a single CMakeLists.txt so you do not
     152             :  * have to make this copy.
     153             :  *
     154             :  * \section example Example
     155             :  *
     156             :  * We offer a file named example.c that shows you how to use the
     157             :  * library in C. It is very simple, one main() function so it is
     158             :  * very easy to get started with libtld.
     159             :  *
     160             :  * For a C++ example, check out the src/validate_tld.cpp tool which was
     161             :  * created as a command line tool coming with the libtld library.
     162             :  *
     163             :  * \include example.c
     164             :  *
     165             :  * \section dev Programmers & Maintainers
     166             :  *
     167             :  * If you want to work on the library, there are certainly things to
     168             :  * enhance. We could for example offer more offsets in the info
     169             :  * string, or functions to clearly define each part of the URI.
     170             :  *
     171             :  * However, the most important part of this library is the XML file
     172             :  * which defines all the TLDs. Maintaining that file is what will
     173             :  * help the most. It includes all the TLDs known at this point
     174             :  * (as defined in different places such as Wikipedia and each
     175             :  * different authority in that area.) The file is easy to read so
     176             :  * you can easily find whether your extension is defined and if not
     177             :  * you can let us know.
     178             :  *
     179             :  * \section requirements Library Requirements
     180             :  *
     181             :  * \li Usage
     182             :  *
     183             :  * The library doesn't need anything special. It's a few C functions.
     184             :  *
     185             :  * The library also offers C++ classes. You do not need a C++ compiler
     186             :  * to use the library, but if you do program in C++, you can use the
     187             :  * tld_object and tld_email_list instead of the C functions. It makes
     188             :  * things a lot easier!
     189             :  *
     190             :  * Also if you are programming using PHP, the library includes a PHP
     191             :  * extension so you can check URIs and emails directly from PHP without
     192             :  * trying to create crazy regular expressions (that most often do not work
     193             :  * right!)
     194             :  *
     195             :  * \li Compiling
     196             :  *
     197             :  * To compile the library, you'll need CMake, a C++ compiler for different
     198             :  * parts and the Qt library as we use the QtXml and QtCore (Qt4). The QtXml
     199             :  * library is used to parse the XML file (tld_data.xml) which defines all
     200             :  * the TLDs, worldwide.
     201             :  *
     202             :  * To regenerate the documentation we use Doxygen. It is optional, though.
     203             :  *
     204             :  * \li PHP
     205             :  *
     206             :  * In order to recompile the PHP extension the Zend environment is required.
     207             :  * Under a Debian or Ubuntu system you can install the php5-dev package.
     208             :  *
     209             :  * \section tests Tests Coming with the Library
     210             :  *
     211             :  * We have the following tests at this time:
     212             :  *
     213             :  * \li tld_test.c
     214             :  *
     215             :  * \par
     216             :  * This test checks the tld() function the way end users of the
     217             :  * library would call it. It checks all the existing TLDs, a few
     218             :  * unknown TLDs, and invalid TLDs.
     219             :  *
     220             :  * \li tld_test_object.cpp
     221             :  *
     222             :  * \par
     223             :  * This test verifies that the tld_object works as expected. It is not
     224             :  * exhaustive in regard to the tld library itself, only of the tld_object.
     225             :  *
     226             :  * \li tld_internal_test.c
     227             :  *
     228             :  * \par
     229             :  * This test includes the tld.c directly so it can check each
     230             :  * internal function directly. This test checks the cmp() and
     231             :  * search() functions, with full coverage.
     232             :  *
     233             :  * \li tld_test_domain_lowercase.c
     234             :  *
     235             :  * \par
     236             :  * This test runs 100% coverage of the tld_domain_to_lowercase() function.
     237             :  * This includes conversion of %XX encoded characters and UTF-8 to wide
     238             :  * characters that can be case folded and saved back as encoded %XX
     239             :  * characters. The test verifies that all characters are properly
     240             :  * supported and that errors are properly handled.
     241             :  *
     242             :  * \li tld_test_tld_names.cpp
     243             :  *
     244             :  * \par
     245             :  * The Mozilla foundation offers a file with a complete list of all the
     246             :  * domain names defined throughout the world. This test reads that list
     247             :  * and checks all the TLDs against the libtld system. Some TLDs may be
     248             :  * checked in multiple ways. We support the TLDs that start with an
     249             :  * asterisk (*) and those that start with an exclamation mark (!) which
     250             :  * means all the TLDs are now being checked out as expected.
     251             :  * This test reads the public_suffix_list.dat file which has to be
     252             :  * available in your current directory.
     253             :  *
     254             :  * \par
     255             :  * A copy of the Mozilla file is included with each version of the TLD
     256             :  * library. It is named tests/public_suffix_list.dat and should be
     257             :  * up to date when we produce a new version for download on
     258             :  * SourceForge.net.
     259             :  *
     260             :  * \li tld_test_full_uri.c
     261             :  *
     262             :  * \par
     263             :  * The library includes an advanced function that checks the validity
     264             :  * of complete URIs making it very simple to test such in any software.
     265             :  * The URI must include a scheme (often called protocol), fully qualified
     266             :  * domain (sub-domains, domain, TLD), an absolute path, variables (after
     267             :  * the question mark,) and an anchor. The test ensures that all the
     268             :  * checks the parser uses are working as expected and allow valid URIs
     269             :  * while it forbids any invalid URIs.
     270             :  *
     271             :  * \li tld_test_emails.cpp
     272             :  *
     273             :  * \par
     274             :  * The libtld supports verifying and breaking up emails in different
     275             :  * parts. This is done to make sure users enter valid emails (although
     276             :  * it doesn't mean that the email address exists, it at least allows
     277             :  * us to know when an email is definitely incorrect and
     278             :  * should be immediately rejected.) The test ensures that all the
     279             :  * different types of invalid emails are properly being caught (i.e.
     280             :  * emails with control characters, invalid domain name, missing parts,
     281             :  * etc.)
     282             :  *
     283             :  * \li tld_test_versions.c
     284             :  *
     285             :  * \par
     286             :  * This test checks that the versions in all the files (two
     287             :  * CMakeLists.txt and the changelog) are equal. If one of those
     288             :  * does not match, then the test fails.
     289             :  *
     290             :  * \li tld_test_xml.sh
     291             :  *
     292             :  * \par
     293             :  * Shell script to run against the tld_data.xml file to ensure its validity.
     294             :  * This is a good idea any time you make changes to the file. It runs with
     295             :  * the xmllint tool. If you do not have the tool, it won't work. The tool
     296             :  * is part of the libxml2-utils package under Ubuntu.
     297             :  */
     298             : 
     299             : 
     300             : 
     301             : 
     302             : /** \brief Compare two strings, one of which is limited by length.
     303             :  * \internal
     304             :  *
     305             :  * This internal function was created to handle a simple string
     306             :  * (no locale) comparison with one string being limited in length.
     307             :  *
     308             :  * The comparison does not require locale since all characters are
     309             :  * ASCII (a URI with Unicode characters encodes them in UTF-8 and
     310             :  * replaces each of those bytes with %XX.)
     311             :  *
     312             :  * The length applies to the string in \p b. This allows us to make
     313             :  * use of the input string all the way down to the cmp() function.
     314             :  * In other words, we avoid a copy of the string.
     315             :  *
     316             :  * The string in \p a is 'nul' (\0) terminated. This means \p a
     317             :  * may be longer or shorter than \p b. In other words, the function
     318             :  * is capable of returning the correct result with a single call.
     319             :  *
     320             :  * The "*" wildcard is not matched here; search() handles it before calling cmp().
     321             :  *
     322             :  * \param[in] a  The pointer in an f_tld field of the tld_descriptions.
     323             :  * \param[in] b  Pointer directly referencing the user domain string.
     324             :  * \param[in] n  The number of characters that can be checked in \p b.
     325             :  *
     326             :  * \return -1 if a < b, 0 when a == b, and 1 when a > b
     327             :  */
     328     1021033 : static int cmp(const char *a, const char *b, int n)
     329             : {
     330             :     /* if `a == "*"` then we have a bug in the table
     331             :     if(a[0] == '*'
     332             :     && a[1] == '\0')
     333             :     {
     334             :         return 0;
     335             :     }
     336             :     */
     337             : 
     338             :     /* n represents the maximum number of characters to check in b */
     339     3333894 :     while(n > 0 && *a != '\0')
     340             :     {
     341     2128108 :         if(*a < *b)
     342             :         {
     343      417441 :             return -1;
     344             :         }
     345     1710667 :         if(*a > *b)
     346             :         {
     347      418839 :             return 1;
     348             :         }
     349     1291828 :         ++a;
     350     1291828 :         ++b;
     351     1291828 :         --n;
     352             :     }
     353      184753 :     if(*a == '\0')
     354             :     {
     355      148212 :         if(n > 0)
     356             :         {
     357             :             /* in this case n > 0 so b is larger */
     358        4295 :             return -1;
     359             :         }
     360      143917 :         return 0;
     361             :     }
     362             :     /* in this case n == 0 so a is larger */
     363       36541 :     return 1;
     364             : }
     365             : 
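The comparison described above can be tried in isolation. The following is a minimal sketch, not the library's code: the name bounded_cmp() is hypothetical, and it assumes, as cmp() does, that at least n bytes are readable in b. It shows how a NUL-terminated table entry can be compared against a slice of the caller's string without copying that slice.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the static cmp() listed above.
 * `a` is NUL-terminated; only the first `n` bytes of `b` are examined,
 * so `b` may point into the middle of a longer string. */
static int bounded_cmp(const char *a, const char *b, int n)
{
    assert(a != NULL && b != NULL && n >= 0);

    /* walk both strings while b still has bytes and a has not ended */
    while(n > 0 && *a != '\0')
    {
        if(*a != *b)
        {
            return *a < *b ? -1 : 1;
        }
        ++a;
        ++b;
        --n;
    }
    if(*a != '\0')
    {
        /* the slice of b is exhausted but a still has characters */
        return 1;
    }
    /* a ended: equal if b's slice ended too, otherwise b is longer */
    return n > 0 ? -1 : 0;
}
```

For instance, with the string "example.co.uk", calling bounded_cmp("co", uri + 8, 2) compares the entry "co" against the two bytes of the ".co" label in place, and the trailing ".uk" is never looked at.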
     366             : 
     367             : /** \brief Search for the specified domain.
     368             :  * \internal
     369             :  *
     370             :  * This function executes one search for one domain. The
     371             :  * search is binary, which means the tld_descriptions are
     372             :  * expected to be 100% in order at all levels.
     373             :  *
     374             :  * The \p i and \p j parameters represent the boundaries
     375             :  * of the current level to be checked. Know that for a
     376             :  * given TLD, there is a start and end boundary that is
     377             :  * used to define \p i and \p j. So except for the top
     378             :  * level, the bounds are limited to one TLD, sub-TLD, etc.
     379             :  * (for example, .uk has a sub-layer with .co, .ac, etc.
     380             :  * and that ground is limited to the second level entries
     381             :  * accepted within the .uk TLD.)
     382             :  *
     383             :  * This search does one search at one level. If sub-levels
     384             :  * are available for that TLD, then it is the responsibility
     385             :  * of the caller to call the function again to find out whether
     386             :  * one of those sub-domain names is in use.
     387             :  *
     388             :  * When the TLD cannot be found, the function returns -1.
     389             :  *
     390             :  * \param[in] i  The start point of the search (included.)
     391             :  * \param[in] j  The end point of the search (excluded.)
     392             :  * \param[in] domain  The domain name to search.
     393             :  * \param[in] n  The length of the domain name.
     394             :  *
     395             :  * \return The offset of the domain found, or -1 when not found.
     396             :  */
     397      159730 : int search(int i, int j, const char *domain, int n)
     398             : {
     399      159730 :     int auto_match = -1, p, r;
     400             :     const struct tld_description *tld;
     401             : 
     402      159730 :     if(i < j)
     403             :     {
     404             :         /* the "*" breaks the binary search, we have to handle it specially */
     405      149685 :         tld = tld_descriptions + i;
     406      149685 :         if(tld->f_tld[0] == '*' && tld->f_tld[1] == '\0')
     407             :         {
     408        1167 :             auto_match = i;
     409        1167 :             ++i;
     410             :         }
     411             : 
     412     1176448 :         while(i < j)
     413             :         {
     414     1020991 :             p = (j - i) / 2 + i;
     415     1020991 :             tld = tld_descriptions + p;
     416     1020991 :             r = cmp(tld->f_tld, domain, n);
     417     1020991 :             if(r < 0)
     418             :             {
     419             :                 /* eliminate the first half */
     420      421726 :                 i = p + 1;
     421             :             }
     422      599265 :             else if(r > 0)
     423             :             {
     424             :                 /* eliminate the second half */
     425      455352 :                 j = p;
     426             :             }
     427             :             else
     428             :             {
     429             :                 /* match */
     430      143913 :                 return p;
     431             :             }
     432             :         }
     433             :     }
     434             : 
     435       15817 :     return auto_match;
     436             : }
     437             : 
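The wildcard handling in search() is easier to follow on a small table. The sketch below is an illustration under assumptions, not the library's implementation: g_labels, label_cmp() and search_level() are hypothetical names, and the table is a made-up level of five entries. It relies on '*' sorting before letters in ASCII, so a wildcard entry, when present, is the first entry of its level.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical sorted level; "*" (0x2A) sorts before any letter. */
static const char * const g_labels[] = { "*", "ac", "co", "gov", "org" };

/* Compare NUL-terminated `a` with the first `n` bytes of `b`. */
static int label_cmp(const char *a, const char *b, int n)
{
    for(; n > 0 && *a != '\0'; ++a, ++b, --n)
    {
        if(*a != *b)
        {
            return *a < *b ? -1 : 1;
        }
    }
    if(*a != '\0')
    {
        return 1;
    }
    return n > 0 ? -1 : 0;
}

/* Binary search in table[i..j) for a label of length n.
 * Returns the index found, the wildcard index, or -1. */
static int search_level(const char * const *table, int i, int j,
                        const char *label, int n)
{
    int auto_match = -1;

    /* the wildcard would break the ordering test, so set it aside */
    if(i < j && strcmp(table[i], "*") == 0)
    {
        auto_match = i;
        ++i;
    }
    while(i < j)
    {
        int p = i + (j - i) / 2;
        int r = label_cmp(table[p], label, n);
        if(r < 0)
        {
            i = p + 1;      /* eliminate the first half */
        }
        else if(r > 0)
        {
            j = p;          /* eliminate the second half */
        }
        else
        {
            return p;       /* exact match wins over the wildcard */
        }
    }
    return auto_match;      /* -1 when the level has no wildcard */
}
```

An exact entry takes priority: looking up "co" in g_labels returns its own index, while a label that is not listed falls back to the index of "*". Searching a range that does not start with "*" returns -1 for an unknown label.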
     438             : 
     439             : /** \brief Clear the info structure.
     440             :  *
     441             :  * This function initializes the info structure with defaults.
     442             :  * The different TLD functions that make use of this structure
     443             :  * will generally call this function first to represent a
     444             :  * failure case.
     445             :  *
     446             :  * Note that by default the category and status are set to
     447             :  * undefined (TLD_CATEGORY_UNDEFINED and TLD_STATUS_UNDEFINED).
     448             :  * Also the country and tld pointers are set to NULL and thus
     449             :  * they cannot be used as strings.
     450             :  *
     451             :  * \param[out] info  The tld_info structure to clear.
     452             :  */
     453       62702 : void tld_clear_info(struct tld_info *info)
     454             : {
     455       62702 :     info->f_category = TLD_CATEGORY_UNDEFINED;
     456       62702 :     info->f_status = TLD_STATUS_UNDEFINED;
     457       62702 :     info->f_country = (const char *) 0;
     458       62702 :     info->f_tld = (const char *) 0;
     459       62702 :     info->f_offset = -1;
     460       62702 : }
     461             : 
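The "clear first, fill on success" pattern can be sketched as follows. Everything here is hypothetical and only mirrors the field names shown above: the sketch_* enums, struct sketch_info and both functions are stand-ins, not the declarations from libtld/tld.h. The point is that a caller can recognize a failed lookup from the defaults alone.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the library's enums and structure. */
enum sketch_category { SKETCH_CATEGORY_UNDEFINED, SKETCH_CATEGORY_COUNTRY };
enum sketch_status   { SKETCH_STATUS_UNDEFINED, SKETCH_STATUS_VALID };

struct sketch_info
{
    enum sketch_category f_category;
    enum sketch_status   f_status;
    const char *         f_country;
    const char *         f_tld;
    int                  f_offset;
};

/* Reset every field to its "nothing found" default. */
static void sketch_clear_info(struct sketch_info *info)
{
    assert(info != NULL);
    info->f_category = SKETCH_CATEGORY_UNDEFINED;
    info->f_status = SKETCH_STATUS_UNDEFINED;
    info->f_country = NULL;
    info->f_tld = NULL;
    info->f_offset = -1;
}

/* Return 1 when the structure still holds only the defaults. */
static int sketch_info_is_cleared(const struct sketch_info *info)
{
    return info->f_category == SKETCH_CATEGORY_UNDEFINED
        && info->f_status == SKETCH_STATUS_UNDEFINED
        && info->f_country == NULL
        && info->f_tld == NULL
        && info->f_offset == -1;
}
```

Because the pointers are NULL after a clear, code that prints or copies f_country or f_tld should check them first instead of treating them as strings.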
     462             : 
     463             : /** \brief Get information about the TLD for the specified URI.
     464             :  *
     465             :  * The tld() function searches for the specified URI in the TLD
     466             :  * descriptions. The results are saved in the info parameter for
     467             :  * later interpretation (i.e. extraction of the domain name,
     468             :  * sub-domains and the exact TLD.)
     469             :  *
     470             :  * The function extracts the last \em extension of the URI. For
     471             :  * example, in the following:
     472             :  *
     473             :  * \code
     474             :  * example.co.uk
     475             :  * \endcode
     476             :  *
     477             :  * the function first extracts ".uk". With that \em extension, it
     478             :  * searches the list of official TLDs. If not found, an error is
     479             :  * returned and the info parameter is set to \em unknown.
     480             :  *
     481             :  * When found, the function checks whether that TLD (".uk" in our
     482             :  * previous example) accepts sub-TLDs (second, third, fourth and
     483             :  * fifth level TLDs.) If so, it extracts the next TLD entry (the
     484             :  * ".co" in our previous example) and searches for that second
     485             :  * level TLD. If found, it again tries with the third level, etc.
     486             :  * until all the possible TLDs were exhausted. At that point, it
     487             :  * returns the last TLD it found. In case of ".co.uk", it returns
     488             :  * the information of the ".co" TLD, second-level domain name.
     489             :  *
     490             :  * All the comparisons are done in lowercase. This is because
     491             :  * all the data is saved in lowercase and we expect the input
     492             :  * of the tld() function to already be in lowercase. If you
     493             :  * have a doubt and your input may actually be in uppercase,
     494             :  * make sure to call the tld_domain_to_lowercase() function
     495             :  * first. That function makes a duplicate of your domain name
     496             :  * in lowercase. It understands the %XX characters (since the
     497             :  * URI is expected to still be encoded) and properly handles
     498             :  * UTF-8 characters in order to define the lowercase characters
     499             :  * of the input. Note that the function returns a newly
     500             :  * allocated pointer that you are responsible for freeing once
     501             :  * you are done with it.
     502             :  *
     503             :  * \warning
     504             :  * If you call tld() with the pointer returned by
     505             :  * tld_domain_to_lowercase(), keep in mind that the tld()
     506             :  * function saves pointers of the input string directly in
     507             :  * the tld_info structure. In other words, you want to free()
     508             :  * that string AFTER you are done with the tld_info structure.
     509             :  *
     510             :  * The \p info structure includes:
     511             :  *
     512             :  * \li f_category -- the category of TLD, unless set to
     513             :  * TLD_CATEGORY_UNDEFINED, it is considered valid
     514             :  * \li f_status -- the status of the TLD, unless set to
     515             :  * TLD_STATUS_UNDEFINED, it was defined from the tld_data.xml file;
     516             :  * however, only those marked as TLD_STATUS_VALID are considered to
     517             :  * currently be in use, all the other statuses can be used by your
     518             :  * software, one way or another, but it should not be accepted as
     519             :  * valid in a URI
     520             :  * \li f_country -- if the category is set to TLD_CATEGORY_COUNTRY
     521             :  * then this pointer is set to the name of the country
     522             :  * \li f_tld -- is set to the full TLD of your domain name; this is
     523             :  * a pointer WITHIN your uri string so make sure you keep your URI
     524             :  * string valid if you intend to use this f_tld string
     525             :  * \li f_offset -- the offset to the first period within the domain
     526             :  * name TLD (i.e. in our previous example, it would be the offset to
     527             :  * the first period in ".co.uk", so in "example.co.uk" the offset would
     528             :  * be 7. Assuming you prepend "www." to have the URI "www.example.co.uk"
     529             :  * then the offset would be 11.)
     530             :  *
     531             :  * \note
     532             :  * In our previous example, the ".uk" TLD is properly used: it includes
     533             :  * a second level domain name (".co".) The URI "example.uk" should have
     534             :  * returned TLD_RESULT_INVALID since .uk by itself was not supposed to be
     535             :  * acceptable. This changed a few years ago. The good thing is that it
     536             :  * resolves some problems as some companies were given a simple ".uk"
     537             :  * TLD and these were exceptions the library does not need to support
     538             :  * anymore. There are still some countries, such as ".bd", which do not
     539             :  * accept registrations directly at the second level, so "example.bd"
     540             :  * does return an \em error (TLD_RESULT_INVALID).
     541             :  *
     542             :  * Assuming that you always get valid URIs, you should get one of these
     543             :  * results:
     544             :  *
     545             :  * \li TLD_RESULT_SUCCESS -- success! the URI is valid and the TLD was
     546             :  * properly determined; use the f_tld or f_offset to extract the TLD
     547             :  * domain and sub-domains
     548             :  * \li TLD_RESULT_INVALID -- known TLD, but not currently valid; this
     549             :  * result is returned when we know that the TLD is not to be accepted
     550             :  *
     551             :  * Other results are returned when the input string is considered invalid.
     552             :  *
     553             :  * \note
     554             :  * The function only accepts a bare URI, in other words: no protocol, no
     555             :  * path, no anchor, no query string, and still URI encoded. Also, it
     556             :  * should not start and/or end with a period or you are likely to get
     557             :  * an invalid response. (i.e. don't use any of ".example.co.uk.",
     558             :  * "example.co.uk.", nor ".example.co.uk")
     559             :  *
     560             :  * \include example.c
     561             :  *
     562             :  * \param[in] uri  The URI to be checked.
     563             :  * \param[out] info  A pointer to a tld_info structure to save the result.
     564             :  *
     565             :  * \return One of the TLD_RESULT_... enumeration values.
     566             :  */
     567       62434 : enum tld_result tld(const char *uri, struct tld_info *info)
     568             : {
     569       62434 :     const char *end = uri;
     570             :     const char **level_ptr;
     571       62434 :     int level = 0, start_level, i, r, p, offset;
     572             :     enum tld_result result;
     573             : 
     574             :     /* set defaults in the info structure */
     575       62434 :     tld_clear_info(info);
     576             : 
     577       62434 :     if(uri == (const char *) 0 || uri[0] == '\0')
     578             :     {
     579           3 :         return TLD_RESULT_NULL;
     580             :     }
     581             : 
     582       62431 :     level_ptr = malloc(sizeof(const char *) * tld_max_level);
     583             : 
     584     3169926 :     while(*end != '\0')
     585             :     {
     586     3045066 :         if(*end == '.')
     587             :         {
     588      362668 :             if(level >= tld_max_level)
     589             :             {
     590             :                 /* At this point the maximum number of levels in the
     591             :                  * TLDs is 5
     592             :                  */
     593      742700 :                 for(i = 1; i < tld_max_level; ++i)
     594             :                 {
     595      594160 :                     level_ptr[i - 1] = level_ptr[i];
     596             :                 }
     597      148540 :                 level_ptr[tld_max_level - 1] = end;
     598             :             }
     599             :             else
     600             :             {
     601      214128 :                 level_ptr[level] = end;
     602      214128 :                 ++level;
     603             :             }
     604      362668 :             if(level >= 2 && level_ptr[level - 2] + 1 == level_ptr[level - 1])
     605             :             {
     606             :                 /* two periods one after another */
     607           2 :                 free(level_ptr);
     608           2 :                 return TLD_RESULT_BAD_URI;
     609             :             }
     610             :         }
     611     3045064 :         ++end;
     612             :     }
     613             :     /* if level is not at least 1 then there are no periods */
     614       62429 :     if(level == 0)
     615             :     {
     616             :         /* no TLD */
     617          10 :         free(level_ptr);
     618          10 :         return TLD_RESULT_NO_TLD;
     619             :     }
     620             : 
     621       62419 :     start_level = level;
     622       62419 :     --level;
     623      124838 :     r = search(tld_start_offset, tld_end_offset,
     624      124838 :                 level_ptr[level] + 1, (int) (end - level_ptr[level] - 1));
     625       62419 :     if(r == -1)
     626             :     {
     627             :         /* unknown */
     628          17 :         free(level_ptr);
     629          17 :         return TLD_RESULT_NOT_FOUND;
     630             :     }
     631             : 
     632             :     /* check for the next level if there is one */
     633       62402 :     p = r;
     634      196480 :     while(level > 0 && tld_descriptions[r].f_start_offset != USHRT_MAX)
     635             :     {
     636      225744 :         r = search(tld_descriptions[r].f_start_offset,
     637       75248 :                 tld_descriptions[r].f_end_offset,
     638       75248 :                 level_ptr[level - 1] + 1,
     639       75248 :                 (int) (level_ptr[level] - level_ptr[level - 1] - 1));
     640       75248 :         if(r == -1)
     641             :         {
     642             :             /* we are done, return the previous level */
     643        3572 :             break;
     644             :         }
     645       71676 :         p = r;
     646       71676 :         --level;
     647             :     }
     648       62402 :     offset = (int) (level_ptr[level] - uri);
     649             : 
     650             :     /* if there are exceptions we may need to search those now if level is 0 */
     651       62402 :     if(level == 0)
     652             :     {
     653       23084 :         r = search(tld_descriptions[p].f_start_offset,
     654       11542 :                 tld_descriptions[p].f_end_offset,
     655             :                 uri,
     656       11542 :                 (int) (level_ptr[0] - uri));
     657       11542 :         if(r != -1)
     658             :         {
     659         346 :             p = r;
     660         346 :             offset = 0;
     661             :         }
     662             :     }
     663             : 
     664       62402 :     info->f_status = tld_descriptions[p].f_status;
     665       62402 :     switch(info->f_status)
     666             :     {
     667       59911 :     case TLD_STATUS_VALID:
     668       59911 :         result = TLD_RESULT_SUCCESS;
     669       59911 :         break;
     670             : 
     671         109 :     case TLD_STATUS_EXCEPTION:
     672             :         /* return the actual TLD and not the exception
     673             :          * i.e. "nacion.ar" is valid and the TLD is just ".ar"
     674             :          * even though top level ".ar" is forbidden by default
     675             :          */
     676         109 :         p = tld_descriptions[p].f_exception_apply_to;
     677         109 :         level = start_level - tld_descriptions[p].f_exception_level;
     678         109 :         offset = (int) (level_ptr[level] - uri);
     679         109 :         info->f_status = TLD_STATUS_VALID;
     680         109 :         result = TLD_RESULT_SUCCESS;
     681         109 :         break;
     682             : 
     683        2382 :     default:
     684        2382 :         result = TLD_RESULT_INVALID;
     685        2382 :         break;
     686             : 
     687             :     }
     688             : 
     689       62402 :     info->f_category = tld_descriptions[p].f_category;
     690       62402 :     info->f_country = tld_descriptions[p].f_country;
     691       62402 :     info->f_tld = level_ptr[level];
     692       62402 :     info->f_offset = offset;
     693             : 
     694       62402 :     free(level_ptr);
     695             : 
     696       62402 :     return result;
     697             : }
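
The f_offset description above can be illustrated with a small, self-contained sketch. It does not call libtld: the offset (11 for "www.example.co.uk") is passed in by hand, and domain_start() and copy_range() are hypothetical helper names introduced only for this example. It assumes a bare host name, as tld() expects.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of libtld: given a bare host name and the
 * offset of the first period of its TLD (the value documented for
 * f_offset), return the index where the domain label just before the TLD
 * starts.
 */
static int domain_start(const char *uri, int tld_offset)
{
    int i = tld_offset;

    /* walk backward until the previous period or the start of the string */
    while(i > 0 && uri[i - 1] != '.')
    {
        --i;
    }
    return i;
}

/* Hypothetical helper: copy uri[from..to) into out, truncating if needed
 * so that out is always null terminated (size must be at least 1).
 */
static void copy_range(char *out, size_t size, const char *uri, int from, int to)
{
    size_t len;

    if(to < from)
    {
        to = from;
    }
    len = (size_t) (to - from);
    if(len >= size)
    {
        len = size - 1;
    }
    memcpy(out, uri + from, len);
    out[len] = '\0';
}
```

With the offset 11, uri + 11 is the TLD ".co.uk", the range returned by domain_start() up to the offset is the domain "example", and everything before that is the sub-domain part.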
     698             : 
     699             : 
     700             : /** \brief Internal function used to transform %XX values.
     701             :  *
     702             :  * This function transforms a hexadecimal (h) character to (2) a
     703             :  * decimal number (d).
     704             :  *
     705             :  * \param[in] c  The hexadecimal character to transform
     706             :  *
     707             :  * \return The number the hexadecimal character represents (0 to 15)
     708             :  */
     709           4 : static int h2d(int c)
     710             : {
     711           4 :     if(c >= 'a')
     712             :     {
     713           1 :         return c - 'a' + 10;
     714             :     }
     715           3 :     if(c >= 'A')
     716             :     {
     717           1 :         return c - 'A' + 10;
     718             :     }
     719           2 :     return c - '0';
     720             : }
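
Note that h2d() relies on its caller having already checked that the character is a hexadecimal digit; a character such as 'g' would yield 16. As a rough sketch of a variant that validates its own input, the following could be used. The names hex_digit_value() and decode_percent() are hypothetical and are not part of libtld.

```c
/* Hypothetical variant of h2d(): return the value of a hexadecimal
 * digit (0 to 15), or -1 when c is not a hexadecimal digit.
 */
static int hex_digit_value(int c)
{
    if(c >= '0' && c <= '9')
    {
        return c - '0';
    }
    if(c >= 'a' && c <= 'f')
    {
        return c - 'a' + 10;
    }
    if(c >= 'A' && c <= 'F')
    {
        return c - 'A' + 10;
    }
    return -1;
}

/* Hypothetical helper: decode the "%XX" sequence starting at s.
 * Returns the byte value (0 to 255) or -1 if s does not start with a
 * well formed sequence. A string ending early is rejected because
 * '\0' is not a hexadecimal digit, so s[2] is never read past the end.
 */
static int decode_percent(const char *s)
{
    int hi, lo;

    if(s[0] != '%')
    {
        return -1;
    }
    hi = hex_digit_value((unsigned char) s[1]);
    if(hi < 0)
    {
        return -1;
    }
    lo = hex_digit_value((unsigned char) s[2]);
    if(lo < 0)
    {
        return -1;
    }
    return hi * 16 + lo;
}
```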
     721             : 
     722             : 
     723             : /** \brief Check that a URI is valid.
     724             :  *
     725             :  * This function very quickly parses a URI to determine whether it
     726             :  * is valid.
     727             :  *
     728             :  * Note that it does not (currently) support local naming conventions
     729             :  * which means that a host such as "localhost" will fail the test.
     730             :  *
     731             :  * The \p protocols variable can be set to a list of protocol names
     732             :  * that are considered valid. For example, for HTTP protocol one
     733             :  * could use "http,https". To accept any protocol use an asterisk
     734             :  * as in: "*". The protocol must include only letters, digits, or
     735             :  * underscores ([0-9A-Za-z_]+) and it must be at least one character.
     736             :  *
     737             :  * The flags can be set to the following values; OR them together to set
     738             :  * multiple flags at the same time:
     739             :  *
     740             :  * \li VALID_URI_ASCII_ONLY -- refuse characters outside the 7-bit
     741             :  * ASCII range (we expect the URI to be UTF-8 encoded and any
     742             :  * byte with bit 7 set is considered invalid if this flag is set,
     743             :  * including encoded bytes such as %A0)
     744             :  * \li VALID_URI_NO_SPACES -- refuse spaces whether they are encoded
     745             :  * with + or %20 or verbatim.
     746             :  *
     747             :  * The return value is generally TLD_RESULT_BAD_URI when an invalid
     748             :  * character is found in the URI string. The TLD_RESULT_NULL is
     749             :  * returned if the URI is a NULL pointer or an empty string.
     750             :  * Other results may be returned by the tld() function. If a result
     751             :  * other than TLD_RESULT_SUCCESS is returned then the info structure
     752             :  * may or may not be updated.
     753             :  *
     754             :  * \param[in] uri  The URI whose validity is being checked.
     755             :  * \param[out] info  The resulting information about the URI domain and TLD.
     756             :  * \param[in] protocols  List of comma separated protocols accepted.
     757             :  * \param[in] flags  A set of flags to tell the function what is valid/invalid.
     758             :  *
     759             :  * \return The result of the operation, TLD_RESULT_SUCCESS if the URI is
     760             :  * valid.
     761             :  *
     762             :  * \sa tld()
     763             :  */
     764         268 : enum tld_result tld_check_uri(const char *uri, struct tld_info *info, const char *protocols, int flags)
     765             : {
     766             :     const char      *p, *q, *username, *password, *host, *port, *n, *a, *query_string;
     767             :     char            domain[256];
     768             :     int             protocol_length, length, valid, c, i, j, anchor;
     769             :     enum tld_result result;
     770             : 
     771             :     /* set defaults in the info structure */
     772         268 :     tld_clear_info(info);
     773             : 
     774         268 :     if(uri == NULL || uri[0] == '\0')
     775             :     {
     776           2 :         return TLD_RESULT_NULL;
     777             :     }
     778             : 
     779             :     /* check the protocol: [0-9A-Za-z_]+ */
     780        1337 :     for(p = uri; *uri != '\0' && *uri != ':'; ++uri)
     781             :     {
     782        1072 :         if((*uri < 'a' || *uri > 'z')
     783           5 :         && (*uri < 'A' || *uri > 'Z')
     784           1 :         && (*uri < '0' || *uri > '9')
     785           1 :         && *uri != '_')
     786             :         {
     787           1 :             return TLD_RESULT_BAD_URI;
     788             :         }
     789             :     }
     790         265 :     valid = 0;
     791         265 :     protocol_length = (int) (uri - p);
     792         265 :     c = tolower(*p);
     793        4304 :     for(q = protocols; *q != '\0';)
     794             :     {
     795        4037 :         if(q[0] == '*' && (q[1] == '\0' || q[1] == ','))
     796             :         {
     797           1 :             valid = 1;
     798           1 :             break;
     799             :         }
     800        4036 :         if(tolower(*q) == c)
     801             :         {
     802         273 :             if(strncasecmp(p, q, protocol_length) == 0
     803         262 :             && (q[protocol_length] == '\0' || q[protocol_length] == ','))
     804             :             {
     805         262 :                 valid = 1;
     806         262 :                 break;
     807             :             }
     808             :         }
     809             :         /* move to the next protocol */
     810        3774 :         for(; *q != '\0' && *q != ','; ++q);
     811        3774 :         for(; *q == ','; ++q);
     812             :     }
     813         265 :     if(valid == 0)
     814             :     {
     815           2 :         return TLD_RESULT_BAD_URI;
     816             :     }
     817         263 :     if(uri[1] != '/' || uri[2] != '/')
     818             :     {
     819           3 :         return TLD_RESULT_BAD_URI;
     820             :     }
     821         260 :     uri += 3; /* skip the '://' */
     822             : 
     823             :     /* extract the complete domain name with sub-domains, etc. */
     824         260 :     username = NULL;
     825         260 :     host = uri;
     826        4671 :     for(; *uri != '/' && *uri != '\0'; ++uri)
     827             :     {
     828        4419 :         if((unsigned char) *uri < ' ')
     829             :         {
     830             :             /* forbid control characters in domain name */
     831           1 :             return TLD_RESULT_BAD_URI;
     832             :         }
     833        4418 :         if(*uri == '@')
     834             :         {
     835           7 :             if(username != NULL)
     836             :             {
     837             :                 /* two '@' signs is not possible */
     838           1 :                 return TLD_RESULT_BAD_URI;
     839             :             }
     840           6 :             username = host;
     841           6 :             host = uri + 1;
     842             :         }
     843        4411 :         else if(*uri & 0x80)
     844             :         {
     845           1 :             if(flags & VALID_URI_ASCII_ONLY)
     846             :             {
     847             :                 /* only ASCII allowed by caller */
     848           1 :                 return TLD_RESULT_BAD_URI;
     849             :             }
     850             :         }
     851        4410 :         else if(*uri == ' ' || *uri == '+')
     852             :         {
     853             :             /* spaces not allowed in domain name */
     854           2 :             return TLD_RESULT_BAD_URI;
     855             :         }
     856        4408 :         else if(*uri == '%')
     857             :         {
     858             :             /* the next two digits must be hex
     859             :              * note that the first digit must be at least 2 because
     860             :              * we do not allow control characters
     861             :              */
     862           5 :             if(((uri[1] < '2' || uri[1] > '9')
     863           2 :              && (uri[1] < 'a' || uri[1] > 'f')
     864           2 :              && (uri[1] < 'A' || uri[1] > 'F'))
     865           4 :             || ((uri[2] < '0' || uri[2] > '9')
     866           2 :              && (uri[2] < 'a' || uri[2] > 'f')
     867           1 :              && (uri[2] < 'A' || uri[2] > 'F')))
     868             :             {
     869           1 :                 return TLD_RESULT_BAD_URI;
     870             :             }
     871           4 :             if(uri[1] == '2' && uri[2] == '0')
     872             :             {
     873             :                 /* spaces not allowed in domain name */
     874           1 :                 return TLD_RESULT_BAD_URI;
     875             :             }
     876           3 :             if(uri[1] >= '8' && (flags & VALID_URI_ASCII_ONLY))
     877             :             {
     878             :                 /* only ASCII allowed by caller */
     879           1 :                 return TLD_RESULT_BAD_URI;
     880             :             }
     881             :             /* skip the two digits right away */
     882           2 :             uri += 2;
     883             :         }
     884             :     }
     885         252 :     if(username != NULL)
     886             :     {
     887           5 :         password = username;
     888           5 :         for(; *password != '@' && *password != ':'; ++password);
     889           5 :         if(*password == ':')
     890             :         {
     891           4 :             if((host - 1) - (password + 1) <= 0)
     892             :             {
     893             :                 /* an empty password is not acceptable */
     894           2 :                 return TLD_RESULT_BAD_URI;
     895             :             }
     896             :         }
     897           3 :         if(password - username - 1 <= 0)
     898             :         {
     899             :             /* username cannot be empty */
     900           2 :             return TLD_RESULT_BAD_URI;
     901             :         }
     902             :     }
     903         248 :     for(port = host; *port != ':' && port < uri; ++port);
     904         248 :     if(*port == ':')
     905             :     {
     906             :         /* we have a port, it must be digits [0-9]+ */
     907           6 :         for(n = port + 1; *n >= '0' && *n <= '9'; ++n);
     908           6 :         if(n != uri || n == port + 1)
     909             :         {
     910             :             /* port is empty or includes invalid characters */
     911           3 :             return TLD_RESULT_BAD_URI;
     912             :         }
     913             :     }
     914             : 
     915             :     /* check the address really quick */
     916         245 :     query_string = NULL;
     917         245 :     anchor = 0;
     918         774 :     for(a = uri; *a != '\0'; ++a)
     919             :     {
     920         544 :         if((unsigned char) *a < ' ')
     921             :         {
     922             :             /* no control characters allowed */
     923           2 :             return TLD_RESULT_BAD_URI;
     924             :         }
     925         542 :         else if(*a == '+' || *a == ' ') /* old space encoding */
     926             :         {
     927           2 :             if(flags & VALID_URI_NO_SPACES)
     928             :             {
     929             :                 /* spaces not allowed by caller */
     930           2 :                 return TLD_RESULT_BAD_URI;
     931             :             }
     932             :         }
     933         540 :         else if(*a == '?')
     934             :         {
     935           7 :             query_string = a + 1;
     936             :         }
     937         533 :         else if(*a == '&' && anchor == 0)
     938             :         {
     939           4 :             if(query_string == NULL)
     940             :             {
     941             :                 /* & must be encoded if used before ? */
     942           1 :                 return TLD_RESULT_BAD_URI;
     943             :             }
     944           3 :             query_string = a + 1;
     945             :         }
     946         529 :         else if(*a == '=')
     947             :         {
     948          10 :             if(query_string != NULL && a - query_string == 0)
     949             :             {
     950             :                 /* a query string variable name cannot be empty */
     951           3 :                 return TLD_RESULT_BAD_URI;
     952             :             }
     953             :         }
     954         519 :         else if(*a == '#')
     955             :         {
     956           1 :             query_string = NULL;
     957           1 :             anchor = 1;
     958             :         }
     959         518 :         else if(*a == '%')
     960             :         {
     961             :             /* the next two digits must be hex
     962             :              * note that the first digit must be at least 2 because
     963             :              * we do not allow control characters
     964             :              */
     965           7 :             if(((a[1] < '2' || a[1] > '9')
     966           3 :              && (a[1] < 'a' || a[1] > 'f')
     967           3 :              && (a[1] < 'A' || a[1] > 'F'))
     968           4 :             || ((a[2] < '0' || a[2] > '9')
     969           3 :              && (a[2] < 'a' || a[2] > 'f')
     970           1 :              && (a[2] < 'A' || a[2] > 'F')))
     971             :             {
     972           4 :                 return TLD_RESULT_BAD_URI;
     973             :             }
     974           3 :             if(a[1] == '2' && a[2] == '0' && (flags & VALID_URI_NO_SPACES))
     975             :             {
     976             :                 /* spaces not allowed by caller */
     977           1 :                 return TLD_RESULT_BAD_URI;
     978             :             }
     979           2 :             if(a[1] >= '8' && (flags & VALID_URI_ASCII_ONLY))
     980             :             {
     981             :                 /* only ASCII allowed by caller */
     982           1 :                 return TLD_RESULT_BAD_URI;
     983             :             }
     984             :             /* skip the two digits right away */
     985           1 :             a += 2;
     986             :         }
     987         511 :         else if(*a & 0x80)
     988             :         {
     989           3 :             if(flags & VALID_URI_ASCII_ONLY)
     990             :             {
     991             :                 /* only ASCII allowed by caller */
     992           1 :                 return TLD_RESULT_BAD_URI;
     993             :             }
     994             :         }
     995             :     }
     996             : 
     997             :     /* check the domain */
     998             : 
     999             : /** \todo
    1000             :  * The following is WRONG:
    1001             :  * \li the domain \%XX are not being checked properly, as it stands the
    1002             :  *     characters following % can be anything!
    1003             :  * \li the tld() function must be called with the characters still
    1004             :  *     encoded; if you look at the data, you will see that I kept
    1005             :  *     the data encoded (i.e. with the \%XX characters)
    1006             :  * \li what could be checked (which I guess could be for the entire
    1007             :  *     domain name) is whether the entire string represents valid
    1008             :  *     UTF-8; I don't think I'm currently doing so here. (I have
    1009             :  *     such functions in the tld_domain_to_lowercase() now)
    1010             :  */
    1011             : 
    1012         230 :     length = (int) (port - host);
    1013         230 :     if(length >= (int) (sizeof(domain) / sizeof(domain[0])))
    1014             :     {
    1015             :         /* sub-domains + domain + TLD is more than 255 characters?!
    1016             :          * note that the host may include many %XX characters but
    1017             :          * we ignore the fact here at this time; we could move this
    1018             :          * test in the for() loop below though.
    1019             :          */
    1020           1 :         return TLD_RESULT_BAD_URI;
    1021             :     }
    1022         229 :     if(length == 0)
    1023             :     {
    1024             :         /* although we could return TLD_RESULT_NULL it would not be
    1025             :          * valid here because "http:///blah.com" is invalid, not NULL
    1026             :          */
    1027           1 :         return TLD_RESULT_BAD_URI;
    1028             :     }
    1029        3787 :     for(i = 0, j = 0; i < length; ++i, ++j)
    1030             :     {
    1031        3559 :         if(host[i] == '%') {
    1032           2 :             domain[j] = (char) (h2d(host[i + 1]) * 16 + h2d(host[i + 2]));
    1033           2 :             i += 2; /* skip the 2 digits */
    1034             :         }
    1035             :         else
    1036             :         {
    1037        3557 :             domain[j] = host[i];
    1038             :         }
    1039             :         /* TODO: check that characters are acceptable in a domain name */
    1040             :     }
    1041         228 :     domain[j] = '\0';
    1042         228 :     result = tld(domain, info);
    1043         228 :     if(info->f_tld != NULL)
    1044             :     {
    1045             :         /* define the TLD inside the source string which "unfortunately"
    1046             :          * is not null terminated by '\0'; also fix the offset since in
    1047             :          * the complete URI the TLD is a bit further away
    1048             :          */
    1049         227 :         info->f_tld = host + info->f_offset;
    1050         227 :         info->f_offset = (int) (info->f_tld - p);
    1051             :     }
    1052         228 :     return result;
    1053             : }
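
The protocol check at the top of tld_check_uri() matches the URI scheme against a comma separated list such as "http,https", with "*" accepting anything. A minimal standalone sketch of that idea, written with only standard C calls, could look as follows. protocol_allowed() is a hypothetical name, and this is an illustration of the technique rather than the library's own code.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of libtld: return 1 when the protocol
 * proto[0..len) appears in the comma separated list, comparing without
 * regard to case; a "*" entry in the list accepts any protocol.
 */
static int protocol_allowed(const char *proto, size_t len, const char *list)
{
    const char *q = list;

    if(len == 0)
    {
        /* a protocol must be at least one character */
        return 0;
    }
    while(*q != '\0')
    {
        const char *end = strchr(q, ',');
        size_t n = end == NULL ? strlen(q) : (size_t) (end - q);
        size_t i;

        if(n == 1 && q[0] == '*')
        {
            return 1;
        }
        if(n == len)
        {
            for(i = 0; i < n; ++i)
            {
                if(tolower((unsigned char) proto[i]) != tolower((unsigned char) q[i]))
                {
                    break;
                }
            }
            if(i == n)
            {
                return 1;
            }
        }
        if(end == NULL)
        {
            break;
        }
        /* move to the next protocol in the list */
        q = end + 1;
    }
    return 0;
}
```

Comparing the entry length first avoids accepting a prefix, so "htt" is not matched by "http".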
    1054             : 
    1055             : 
    1056             : /** \brief Return the version of the library.
    1057             :  *
    1058             :  * This function returns the version of this library. The version
    1059             :  * is defined with three numbers: \<major>.\<minor>.\<patch>.
    1060             :  *
    1061             :  * You should be able to use the libversion to compare different
    1062             :  * libtld versions and know which one is the newest version.
    1063             :  *
    1064             :  * \return A constant string with the version of the library.
    1065             :  */
    1066           9 : const char *tld_version()
    1067             : {
    1068           9 :     return LIBTLD_VERSION;
    1069             : }
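
Since the version is documented as three numbers, two version strings are better compared number by number than with strcmp(), which would order "1.5.10" before "1.5.9". The following is a minimal sketch under that assumption; compare_versions() is a hypothetical helper, not a libtld function, and it expects strings of the form major.minor.patch.

```c
#include <stdio.h>

/* Hypothetical helper, not part of libtld: compare two version strings
 * of the form "<major>.<minor>.<patch>".
 * Returns -1 if a is older than b, 0 if equal, 1 if a is newer, and
 * -2 if either string does not start with three numbers.
 */
static int compare_versions(const char *a, const char *b)
{
    int va[3] = { 0, 0, 0 };
    int vb[3] = { 0, 0, 0 };
    int i;

    if(sscanf(a, "%d.%d.%d", &va[0], &va[1], &va[2]) != 3
    || sscanf(b, "%d.%d.%d", &vb[0], &vb[1], &vb[2]) != 3)
    {
        return -2;
    }
    for(i = 0; i < 3; ++i)
    {
        if(va[i] != vb[i])
        {
            return va[i] < vb[i] ? -1 : 1;
        }
    }
    return 0;
}
```

In a program linked against the library, the two arguments would typically be LIBTLD_VERSION and the string returned by tld_version().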
    1070             : 
    1071             : 
    1072             : /** \def LIBTLD_EXPORT
    1073             :  * \brief The export API used by MS-Windows DLLs.
    1074             :  *
    1075             :  * This definition is used to mark functions and classes as exported
    1076             :  * from the library. This allows other programs to automatically use
    1077             :  * functions defined in the library.
    1078             :  *
    1079             :  * The LIBTLD_EXPORT may be set to dllexport or dllimport depending
    1080             :  * on whether you compile the library or you intend to link against it.
    1081             :  */
    1082             : 
    1083             : /** \def LIBTLD_VERSION
    1084             :  * \brief The version of the library as a string.
    1085             :  *
    1086             :  * This definition represents the version of the libtld header you
    1087             :  * are compiling against. You can compare it to the returned value
    1088             :  * of the tld_version() function to make sure that everything is
    1089             :  * compatible (i.e. if the version is not the same, then the
    1090             :  * tld_info structure may have changed.)
    1091             :  */
    1092             : 
    1093             : /** \def LIBTLD_VERSION_MAJOR
    1094             :  * \brief The major version as a number.
    1095             :  *
    1096             :  * This definition represents the major version of the libtld header
    1097             :  * you are compiling against.
    1098             :  */
    1099             : 
    1100             : /** \def LIBTLD_VERSION_MINOR
    1101             :  * \brief The minor version as a number.
    1102             :  *
    1103             :  * This definition represents the minor version of the libtld header
    1104             :  * you are compiling against.
    1105             :  */
    1106             : 
    1107             : /** \def LIBTLD_VERSION_PATCH
    1108             :  * \brief The patch version as a number.
    1109             :  *
    1110             :  * This definition represents the patch version of the libtld header
    1111             :  * you are compiling against. Some people call this number the release
    1112             :  * number.
    1113             :  */
    1114             : 
    1115             : /** \def VALID_URI_ASCII_ONLY
    1116             :  * \brief Whether to check that the URI only includes ASCII.
    1117             :  *
    1118             :  * By default the tld_check_uri() function accepts any extended character
    1119             :  * (i.e. bytes 0x80 and over). This flag can be used to refuse such
    1120             :  * characters.
    1121             :  */
    1122             : 
    1123             : /** \def VALID_URI_NO_SPACES
    1124             :  * \brief Whether to check that the URI does not include any spaces.
    1125             :  *
    1126             :  * By default the tld_check_uri() function accepts spaces as valid
    1127             :  * characters in a URI (whether they are explicit " ", or written as
    1128             :  * "+" or "%20".) This flag can be used to refuse all spaces (i.e.
    1129             :  * this means the "+" and "%20" are also refused.)
    1130             :  */
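As a rough illustration of what the VALID_URI_NO_SPACES flag is described as rejecting, the sketch below detects the three space forms named above. The function has_space_form() is a hypothetical helper written for this example and is not part of libtld; how tld_check_uri() performs the check internally is not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not a libtld function): returns 1 when the URI
 * contains a space in any of the three forms listed in the comment
 * above (an explicit " ", a "+", or the escape "%20"), 0 otherwise.
 */
int has_space_form(const char *uri)
{
    assert(uri != NULL);

    return strchr(uri, ' ') != NULL
        || strchr(uri, '+') != NULL
        || strstr(uri, "%20") != NULL;
}
```

A caller that wants the stricter behavior could refuse any URI for which this helper returns 1.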
    1131             : 
    1132             : /** \enum tld_category
    1133             :  * \brief The list of categories for the different TLDs.
    1134             :  *
    1135             :  * Defines the category of the TLD. The most well known categories
    1136             :  * are International TLDs (such as .com and .info) and the country
    1137             :  * TLDs (such as .us, .uk, .fr, etc.)
    1138             :  *
    1139             :  * IANA offers and is working on other extensions such as .pro for
    1140             :  * professionals, and .arpa for their internal infrastructure.
    1141             :  */
    1142             : 
    1143             : /** \var TLD_CATEGORY_INTERNATIONAL
    1144             :  * \brief International TLDs
    1145             :  *
    1146             :  * This category represents TLDs that can be used by anyone anywhere
    1147             :  * in the world. In some cases, these have some limits (i.e. only a
    1148             :  * museum can register a .museum TLD.) However, the most well known
    1149             :  * international extension is .com and this one has absolutely no
    1150             :  * restrictions.
    1151             :  */
    1152             : 
    1153             : /** \var TLD_CATEGORY_PROFESSIONALS
    1154             :  * \brief Professional TLDs
    1155             :  *
    1156             :  * This category is offered to professionals. Some countries already
    1157             :  * offer second-level domain name registrations for professionals and
    1158             :  * either way they are not used very much. These are reserved for people
    1159             :  * such as accountants, attorneys, and doctors.
    1160             :  *
    1161             :  * Only people who have a license with a government can register a .pro
    1162             :  * domain name.
    1163             :  */
    1164             : 
    1165             : /** \var TLD_CATEGORY_LANGUAGE
    1166             :  * \brief Language specific TLDs
    1167             :  *
    1168             :  * At time of writing, there is one language extension: .cat for the
    1169             :  * Catalan language. The idea of the language extensions is to offer
    1170             :  * a language, rather than a country, a way to have a website that
    1171             :  * all the people on the Earth can read in their language.
    1172             :  */
    1173             : 
    1174             : /** \var TLD_CATEGORY_GROUPS
    1175             :  * \brief Groups specific TLDs
    1176             :  *
    1177             :  * The concept of groups is similar to the language grouping, but in
    1178             :  * this case it may refer to a specific group of people (but not
    1179             :  * based on anything such as ethnicity.)
    1180             :  *
    1181             :  * Examples of groups are Kids, Gay people, Ecologists, etc. This is
    1182             :  * only proposed at this point.
    1183             :  */
    1184             : 
    1185             : /** \var TLD_CATEGORY_REGION
    1186             :  * \brief Region specific TLDs
    1187             :  *
    1188             :  * It has been proposed, like the .eu, to have extensions based on
    1189             :  * well defined regions such as .asia for all of Asia. We currently
    1190             :  * also have .aq for Antarctica. Some proposed regions are .africa
    1191             :  * and city names such as .paris and .wien.
    1192             :  *
    1193             :  * Old TLDs that were for countries but are no longer assigned to those
    1194             :  * because the country \em disappeared (i.e. in general was split in
    1195             :  * two and both new countries have different names,) and future
    1196             :  * regions appear in this category.
    1197             :  *
    1198             :  * We keep old TLDs because it is not unlikely that such will be
    1199             :  * used every now and then and they can, in this way, cleanly be
    1200             :  * refused by your software.
    1201             :  */
    1202             : 
    1203             : /** \var TLD_CATEGORY_TECHNICAL
    1204             :  * \brief Technical extensions are considered internal.
    1205             :  *
    1206             :  * These are likely valid (i.e. the .arpa is valid) but are used for
    1207             :  * technical reasons and not for regular URIs. So they are present
    1208             :  * but must certainly be ignored by your software.
    1209             :  *
    1210             :  * To avoid returning TLD_RESULT_SUCCESS when a TLD with such a
    1211             :  * category is found, we mark these with the
    1212             :  * TLD_STATUS_INFRASTRUCTURE.
    1213             :  */
    1214             : 
    1215             : /** \var TLD_CATEGORY_COUNTRY
    1216             :  * \brief A country extension.
    1217             :  *
    1218             :  * Most of the extensions are country extensions. Country extensions
    1219             :  * are generally further broken down with second-level domain names.
    1220             :  * Some countries even have third, fourth, and fifth level domain
    1221             :  * names.
    1222             :  */
    1223             : 
    1224             : /** \var TLD_CATEGORY_ENTREPRENEURIAL
    1225             :  * \brief A private extension.
    1226             :  *
    1227             :  * Some private companies and individuals purchased domains that they
    1228             :  * then use as a TLD reselling sub-domains from that main domain name.
    1229             :  *
    1230             :  * For example, the ".blogspot.com" domain is offered by blogspot as
    1231             :  * a TLD to their users. This gives the users the capability to
    1232             :  * define a cookie at the ".blogspot.com" level but not directly
    1233             :  * under ".com". In other words, two distinct sites such as:
    1234             :  *
    1235             :  * \li "a.blogspot.com", and
    1236             :  * \li "b.blogspot.com"
    1237             :  *
    1238             :  * cannot share their cookies. Yet, ".com" by itself is also a
    1239             :  * top-level domain name that anyone can use.
    1240             :  */
    1241             : 
    1242             : /** \var TLD_CATEGORY_BRAND
    1243             :  * \brief The TLD is owned and represents a brand.
    1244             :  *
    1245             :  * This category is used to mark top level domain names that are
    1246             :  * specific to one company. Note that certain TLDs are owned by
    1247             :  * companies now, but they are not automatically marked as a
    1248             :  * brand (i.e. ".lol").
    1249             :  */
    1250             : 
    1251             : /** \var TLD_CATEGORY_UNDEFINED
    1252             :  * \brief The TLD was not found.
    1253             :  *
    1254             :  * This category is used to initialize the information structure and
    1255             :  * is used to show that the TLD was not found.
    1256             :  */
    1257             : 
    1258             : /** \enum tld_status
    1259             :  * \brief Defines the current status of the TLD.
    1260             :  *
    1261             :  * Each TLD has a status. By default, it is generally considered valid,
    1262             :  * however, many TLDs are either proposed or deprecated.
    1263             :  *
    1264             :  * Proposed TLDs are not yet officially accepted by the official entities
    1265             :  * taking care of those TLDs. They should be refused, but may become
    1266             :  * available later.
    1267             :  *
    1268             :  * Deprecated TLDs were in use before but got dropped. They may be dropped
    1269             :  * because a country doesn't follow up on their Internet TLD, or because
    1270             :  * the extension is found to be \em boycotted.
    1271             :  */
    1272             : 
    1273             : /** \var TLD_STATUS_VALID
    1274             :  * \brief The TLD is currently valid.
    1275             :  *
    1276             :  * This status represents a TLD that is currently fully valid and supported
    1277             :  * by the owners.
    1278             :  *
    1279             :  * These can be part of URIs representing valid resources.
    1280             :  */
    1281             : 
    1282             : /** \var TLD_STATUS_PROPOSED
    1283             :  * \brief The TLD was proposed but not yet accepted.
    1284             :  *
    1285             :  * The TLD is nearly considered valid; at least, it is in the process of
    1286             :  * being accepted. The TLD will not work until officially accepted.
    1287             :  *
    1288             :  * No valid URIs can include this TLD until it becomes TLD_STATUS_VALID.
    1289             :  */
    1290             : 
    1291             : /** \var TLD_STATUS_DEPRECATED
    1292             :  * \brief The TLD was once in use.
    1293             :  *
    1294             :  * This status is used by TLDs that were valid (TLD_STATUS_VALID) at some point
    1295             :  * in time and were changed to another TLD rendering that one useless (or
    1296             :  * \em incorrect in the case of a country name change.)
    1297             :  *
    1298             :  * This status means such URIs are not to be considered valid. However, it may
    1299             :  * be possible to emit a 301 (in terms of HTTP protocol) to fix the problem.
    1300             :  */
    1301             : 
    1302             : /** \var TLD_STATUS_UNUSED
    1303             :  * \brief The TLD was officially assigned but not put to use.
    1304             :  *
    1305             :  * This special status is used for all the TLDs that were assigned to a specific
    1306             :  * entity, but never actually put to use. Many smaller countries (especially
    1307             :  * islands) are assigned this status.
    1308             :  *
    1309             :  * Unused TLDs are not valid in any URI until marked valid.
    1310             :  */
    1311             : 
    1312             : /** \var TLD_STATUS_RESERVED
    1313             :  * \brief The TLD is reserved so no one can use it.
    1314             :  *
    1315             :  * This special case forces the specified TLDs into a "do not use" list. Such
    1316             :  * TLDs may be used by people who wish they were official, but they are not
    1317             :  * considered \em legal.
    1318             :  *
    1319             :  * A reserved TLD may represent a second TLD that was assigned to a specific
    1320             :  * country or other category. It may be possible to do a transfer from that
    1321             :  * TLD to the official TLD (i.e. Great Britain was assigned .gb, but instead
    1322             :  * uses .uk; URIs with .gb could be transformed with .uk and checked for
    1323             :  * validity.)
    1324             :  */
    1325             : 
    1326             : /** \var TLD_STATUS_INFRASTRUCTURE
    1327             :  * \brief These TLDs are reserved for the Internet infrastructure.
    1328             :  *
    1329             :  * These TLDs cannot be used with standard URIs. These are used to make the
    1330             :  * Internet functional instead.
    1331             :  *
    1332             :  * All URIs for standard resources must refuse these URIs.
    1333             :  */
    1334             : 
    1335             : /** \var TLD_STATUS_UNDEFINED
    1336             :  * \brief Special status to indicate we did not find the TLD.
    1337             :  *
    1338             :  * The info structure is returned with an \em undefined status whenever the
    1339             :  * TLD could not be found in the list of existing TLDs. This means the URI
    1340             :  * is completely invalid. (The only exception would be if you support some
    1341             :  * internal TLDs.)
    1342             :  *
    1343             :  * URIs that cannot get a TLD_STATUS_VALID should all be considered invalid.
    1344             :  * But those marked as TLD_STATUS_UNDEFINED are completely invalid. This
    1345             :  * being said, you may want to make sure you passed the correct string.
    1346             :  * The URI must be just and only the set of sub-domains, the domain, and
    1347             :  * the TLDs. No protocol, slashes, colons, paths, query strings, anchors
    1348             :  * are accepted in the URI.
    1349             :  */
    1350             : 
    1351             : /** \var TLD_STATUS_EXCEPTION
    1352             :  * \brief Special status to indicate an exception which is not directly a TLD.
    1353             :  *
    1354             :  * When a NIC decides to change their setup it can generate exceptions. For
    1355             :  * example, the UK first made use of .uk and as such allowed a few customers
    1356             :  * to use .uk. Later they decided to only offer second level domain names
    1357             :  * such as the .co.uk and .ac.uk. This generates a few exceptions on the .uk
    1358             :  * domain name. For example, the police.uk domain is still in use and thus
    1359             :  * it is an exception. We reference it as ".police.uk" in our XML data file
    1360             :  * yet the TLD in that case is just ".uk".
    1361             :  */
    1362             : 
    1363             : 
    1364             : /** \enum tld_result
    1365             :  * \brief The result returned by tld().
    1366             :  *
    1367             :  * This enumeration defines all the possible results of the tld() function.
    1368             :  *
    1369             :  * Only the TLD_RESULT_SUCCESS is considered to represent a valid result.
    1370             :  *
    1371             :  * The TLD_RESULT_INVALID represents a TLD that was found but is not currently
    1372             :  * marked as valid (it may be deprecated or proposed, for example.)
    1373             :  */
    1374             : 
    1375             : /** \var TLD_RESULT_SUCCESS
    1376             :  * \brief Success! The TLD of the specified URI is valid.
    1377             :  *
    1378             :  * This result is returned when the URI includes a valid TLD. The function
    1379             :  * further includes valid results in the tld_info structure.
    1380             :  *
    1381             :  * You can accept this URI as valid.
    1382             :  */
    1383             : 
    1384             : /** \var TLD_RESULT_INVALID
    1385             :  * \brief The TLD was found, but it is marked as invalid.
    1386             :  *
    1387             :  * This result represents a TLD that is not valid as is for a URI, but it
    1388             :  * was defined in the TLD data. The function includes further information
    1389             :  * in the tld_info structure. There you can check the category, status,
    1390             :  * and other parameters to determine what the TLD really represents.
    1391             :  *
    1392             :  * It may be possible to use such a TLD, although as far as web addresses
    1393             :  * are concerned, these are not considered valid. As mentioned in the
    1394             :  * statuses, some may mean that the TLD can be changed for another and
    1395             :  * work (i.e. a country name that changed.)
    1396             :  */
    1397             : 
    1398             : /** \var TLD_RESULT_NULL
    1399             :  * \brief The input URI is empty.
    1400             :  *
    1401             :  * The tld() function returns this value whenever the input URI pointer is
    1402             :  * NULL or the empty string (""). Obviously, no TLD is found in this case.
    1403             :  */
    1404             : 
    1405             : /** \var TLD_RESULT_NO_TLD
    1406             :  * \brief The input URI has no TLD defined.
    1407             :  *
    1408             :  * Whenever the URI does not include at least one period (.), this error
    1409             :  * is returned. Local URIs are considered valid and don't generally include
    1410             :  * a period (i.e. "localhost", "my-computer", "johns-computer", etc.) We
    1411             :  * expect that the tld() function would not be called with such URIs.
    1412             :  *
    1413             :  * A valid Internet URI must include a TLD.
    1414             :  */
    1415             : 
    1416             : /** \var TLD_RESULT_BAD_URI
    1417             :  * \brief The URI includes characters that are not accepted by the function.
    1418             :  *
    1419             :  * This value is returned if a character is found to be incompatible or a
    1420             :  * sequence of characters is found incompatible.
    1421             :  *
    1422             :  * At this time, tld() returns this error if two periods (.) are found one
    1423             :  * after another. More checks will be added over time to detect invalid
    1424             :  * characters (anything outside of [-a-zA-Z0-9.%].)
    1425             :  *
    1426             :  * Note that the URI should not start or end with a period. This error will
    1427             :  * also be returned (at some point) when the function detects such problems.
    1428             :  */
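The checks described above (two consecutive periods, a leading or trailing period, and characters outside [-a-zA-Z0-9.%]) can be sketched as a stand-alone pre-check. The function uri_chars_ok() is a hypothetical name used only for this example; it is not part of libtld, and tld() itself may not perform all of these checks yet. Rejecting the empty string here is an assumption made for the sketch (tld() reports that case as TLD_RESULT_NULL).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical pre-check (not a libtld function). Returns 1 when `uri`
 * only uses characters from [-a-zA-Z0-9.%], has no two periods in a
 * row, and neither starts nor ends with a period; returns 0 otherwise.
 */
int uri_chars_ok(const char *uri)
{
    size_t i;

    assert(uri != NULL);

    /* refuse the empty string and a leading period */
    if(uri[0] == '\0' || uri[0] == '.')
    {
        return 0;
    }

    for(i = 0; uri[i] != '\0'; ++i)
    {
        char const c = uri[i];

        /* two periods one after another */
        if(c == '.' && uri[i + 1] == '.')
        {
            return 0;
        }

        /* anything outside of [-a-zA-Z0-9.%] */
        if(!((c >= 'a' && c <= 'z')
          || (c >= 'A' && c <= 'Z')
          || (c >= '0' && c <= '9')
          || c == '-' || c == '.' || c == '%'))
        {
            return 0;
        }
    }

    /* i is at least 1 here; refuse a trailing period */
    return uri[i - 1] != '.';
}
```

Running such a check before calling tld() is one option when stricter validation is needed than the library currently applies.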
    1429             : 
    1430             : /** \var TLD_RESULT_NOT_FOUND
    1431             :  * \brief The URI has a TLD that could not be determined.
    1432             :  *
    1433             :  * The TLD of the URI was searched in the TLD data and could not be found
    1434             :  * there. This means the TLD is not a valid Internet TLD.
    1435             :  */
    1436             : 
    1437             : 
    1438             : /** \struct tld_info
    1439             :  * \brief Set of information returned by the tld() function.
    1440             :  *
    1441             :  * This structure is used by the tld() function to define the results to
    1442             :  * return to the caller.
    1443             :  *
    1444             :  * Remember that this is a C structure. By default, the fields are undefined.
    1445             :  * The tld() function will first define these fields, before returning any
    1446             :  * result.
    1447             :  *
    1448             :  * It is acceptable to clear the structure before calling the tld() function
    1449             :  * but it is not required.
    1450             :  */
    1451             : 
    1452             : /** \var enum tld_category tld_info::f_category;
    1453             :  * \brief The category of the TLD.
    1454             :  *
    1455             :  * This represents the category of the TLD. One of the tld_category enumeration
    1456             :  * values can be found in this field.
    1457             :  *
    1458             :  * \sa enum tld_category
    1459             :  */
    1460             : 
    1461             : /** \var enum tld_status tld_info::f_status;
    1462             :  * \brief The status of the TLD.
    1463             :  *
    1464             :  * This value defines the current status of the TLD. Most of the TLDs we define
    1465             :  * are valid, but some are either deprecated, unused, or proposed.
    1466             :  *
    1467             :  * Only a TLD marked as TLD_STATUS_VALID should be considered valid, although
    1468             :  * others may be accepted in some circumstances.
    1469             :  *
    1470             :  * \sa enum tld_status
    1471             :  */
    1472             : 
    1473             : /** \var const char *tld_info::f_country;
    1474             :  * \brief The country where this TLD is used.
    1475             :  *
    1476             :  * When the f_category is set to TLD_CATEGORY_COUNTRY then this field is a
    1477             :  * pointer to the name of the country in English (although some may include
    1478             :  * accents, the strings are in UTF-8.)
    1479             :  *
    1480             :  * This field is set to NULL if the category is not Country or the TLD was
    1481             :  * not found.
    1482             :  *
    1483             :  * \sa tld_info::f_category
    1484             :  * \sa enum tld_category
    1485             :  */
    1486             : 
    1487             : /** \var const char *tld_info::f_tld;
    1488             :  * \brief Pointer to the TLD in the URI string you supplied.
    1489             :  *
    1490             :  * This is a pointer to the TLD section that the tld() function found in
    1491             :  * your URI. Note that it is valid only as long as your URI string pointer is.
    1492             :  *
    1493             :  * It is also possible to make use of the tld_info::f_offset value to
    1494             :  * extract the TLD, domain, or sub-domains.
    1495             :  *
    1496             :  * If the TLD is not found, this field is NULL.
    1497             :  */
    1498             : 
    1499             : /** \var int tld_info::f_offset;
    1500             :  * \brief The offset to the TLD in the URI string you supplied.
    1501             :  *
    1502             :  * This offset, when added to the URI string pointer, gets you to the
    1503             :  * TLD of that URI. The offset can also be used to start searching
    1504             :  * for the beginning of the domain name by searching for the previous
    1505             :  * period from that offset minus one. In effect, this gives you a
    1506             :  * way to determine the list of sub-domains.
    1507             :  */
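The backward search described above can be sketched as follows. The helper domain_start() is hypothetical and not part of libtld. It assumes that the character at the given offset is the period that begins the TLD, which is how the "offset minus one" wording above reads.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a libtld function): given the URI that was
 * passed to tld() and the value reported in tld_info::f_offset, return
 * a pointer to the first character of the domain name, that is, the
 * label found just before the TLD.
 *
 * Assumption: uri[tld_offset] is the period that starts the TLD.
 */
const char *domain_start(const char *uri, int tld_offset)
{
    int i;

    assert(uri != NULL && tld_offset >= 0);

    /* start one character before the TLD and walk back to the
     * previous period, or to the start of the string */
    i = tld_offset - 1;
    while(i >= 0 && uri[i] != '.')
    {
        --i;
    }

    /* i is -1 (no sub-domains) or the index of the separating period */
    return uri + i + 1;
}
```

Everything before the returned pointer, if anything, is the list of sub-domains; repeating the same backward search from there splits them one label at a time.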
    1508             : 
    1509             : /** \struct tld_description
    1510             :  * \brief [internal] The description of one TLD.
    1511             :  * \internal
    1512             :  *
    1513             :  * The XML data is transformed in an array of TLD description saved in this
    1514             :  * structure.
    1515             :  *
    1516             :  * This structure is internal to the database. You never are given direct
    1517             :  * access to it. However, some of the constant pointers (i.e. country names)
    1518             :  * do point to that data.
    1519             :  */
    1520             : 
    1521             : /** \var tld_description::f_category
    1522             :  * \brief The category of this entry.
    1523             :  *
    1524             :  * The XML data must define the different TLDs inside categorized area
    1525             :  * tags. This variable represents that category.
    1526             :  */
    1527             : 
    1528             : /** \var tld_description::f_country
    1529             :  * \brief The name of the country owning this TLD.
    1530             :  *
    1531             :  * The name of the country owning this entry. Many TLDs do not have a
    1532             :  * country attached to them (i.e. .com and .info, for example, do not have
    1533             :  * a country attached to them) in which case this pointer is NULL.
    1534             :  */
    1535             : 
    1536             : /** \var tld_description::f_start_offset
    1537             :  * \brief The first offset of a list of TLDs.
    1538             :  *
    1539             :  * This offset represents the start of a list of TLDs. The start offset is
    1540             :  * inclusive so that very offset IS included in the list.
    1541             :  *
    1542             :  * The TLDs being referenced from this TLD are those between f_start_offset
    1543             :  * and f_end_offset - 1 also written:
    1544             :  *
    1545             :  * [f_start_offset, f_end_offset)
    1546             :  */
    1547             : 
    1548             : /** \var tld_description::f_end_offset
    1549             :  * \brief The last offset of a list of TLDs.
    1550             :  *
    1551             :  * This offset represents the end of a list of TLDs. The end offset is
    1552             :  * exclusive so that very offset is NOT included in the list.
    1553             :  *
    1554             :  * The TLDs being referenced from this TLD are those between f_start_offset
    1555             :  * and f_end_offset - 1 also written:
    1556             :  *
    1557             :  * [f_start_offset, f_end_offset)
    1558             :  */
    1559             : 
    1560             : /** \var tld_description::f_exception_apply_to
    1561             :  * \brief This TLD is an exception of the "apply to" TLD.
    1562             :  *
    1563             :  * With time, some TLDs were expected to have or not have certain sub-domains
    1564             :  * and when removal of those was partial (i.e. did not force existing owners
    1565             :  * to lose their domain) then we have exceptions. This variable holds the
    1566             :  * necessary information to support such exceptions.
    1567             :  *
    1568             :  * The "apply to" is only defined if the entry is an exception (see f_status.)
    1569             :  * The f_exception_apply_to value is an offset to the very TLD we want to
    1570             :  * return when we get this exception.
    1571             :  */
    1572             : 
    1573             : /** \var tld_description::f_exception_level
    1574             :  * \brief This entry is an exception representing a TLD at this specified level.
    1575             :  *
    1576             :  * When we find an exception, it may be more than 1 level below the TLD it uses
    1577             :  * (a.b.c.d may be viewed as part of TLD .d thus .a has to be bumped 3 levels
    1578             :  * up.) In most cases, this is equal to this TLD level - 1.
    1579             :  */
    1580             : 
    1581             : /** \var tld_description::f_status
    1582             :  * \brief The status of this TLD.
    1583             :  *
    1584             :  * The status of a TLD is TLD_STATUS_VALID by default. Using the different
    1585             :  * tags available in the XML file we can define other statuses such as the
    1586             :  * TLD_STATUS_DEPRECATED status.
    1587             :  *
    1588             :  * In the TLD table the status can be TLD_STATUS_EXCEPTION.
    1589             :  */
    1590             : 
    1591             : /** \var tld_description::f_tld
    1592             :  * \brief The actual TLD of this entry.
    1593             :  *
    1594             :  * In this table, the TLD is actually just one name and no period. Other
    1595             :  * parts of a multi-part TLD are found at the [f_start_offset, f_end_offset).
    1596             :  *
    1597             :  * The TLD is built by starting a search at the top level which is defined as 
    1598             :  * [tld_start_offset, tld_end_offset). These offsets are global variables defined
    1599             :  * in the tld_data.c file.
    1600             :  */
    1601             : 
    1602             : /* vim: ts=4 sw=4 et
    1603             :  */
