libtld 2.0.14
A library to determine the Top-Level Domain name of any Internet URI.
Implementation of the TLD parser library. More...
#include "libtld/tld.h"
#include "libtld/tld_data.h"
#include "libtld/tld_file.h"
#include <sstream>
#include <malloc.h>
#include <stdlib.h>
#include <limits.h>
#include <string.h>
#include <ctype.h>
Functions | |
static int | cmp (const char *a, int l, const char *b, int n) |
Compare two strings, one of which is limited by length. | |
static int | search (int i, int j, char const *domain, int n) |
Search for the specified domain. | |
enum tld_result | tld (char const *uri, struct tld_info *info) |
Get information about the TLD for the specified URI. | |
enum tld_result | tld_check_uri (const char *uri, struct tld_info *info, const char *protocols, int flags) |
Check that a URI is valid. | |
void | tld_clear_info (struct tld_info *info) |
Clear the info structure. | |
void | tld_free_tlds () |
Clear the allocated TLD file. | |
uint32_t | tld_get_static_tlds_buffer_size () |
Get the size of the TLDs static buffer. | |
enum tld_result | tld_get_tag (struct tld_info *info, int tag_idx, struct tld_tag_definition *tag) |
const struct tld_file * | tld_get_tlds () |
Return a pointer to the current list of TLDs. | |
enum tld_result | tld_load_tlds (char const *filename, int fallback) |
Load a TLDs file as the file to be used by the tld() function. | |
static enum tld_result | tld_load_tlds_if_not_loaded () |
Load the TLDs if not yet loaded. | |
enum tld_result | tld_next_tld (struct tld_enumeration_state *state, struct tld_info *info) |
Read the next TLD and return its info. | |
int | tld_tag_count (struct tld_info *info) |
const char * | tld_version () |
Return the version of the library. | |
Variables | |
static struct tld_file * | g_tld_file = nullptr |
The TLD file currently loaded or NULL. | |
This file includes all the functions available in the C library of libtld that pertain to the parsing of URIs and extraction of TLDs.
Definition in file tld.cpp.
static int cmp (const char *a, int l, const char *b, int n)
This internal function was created to handle a simple string (no locale) comparison with one string being limited in length.
The comparison does not require a locale since all characters are ASCII (a URI with Unicode characters encodes them in UTF-8 and replaces each of those bytes with %XX.)
The l length applies to the string in a. The TLD data does not include null terminated strings. Instead we have one superstring with lengths pre-calculated.
The n length applies to the string in b. This allows us to make use of the input string all the way down to the cmp() function without making useless copies.
If parameter a is "*", then it always matches b. However, it is expected that this function never gets called when a == "*".
[in] | a | The pointer in an f_tld field of the tld_descriptions. |
[in] | l | The number of characters that can be checked in a. |
[in] | b | Pointer directly referencing the user domain string. |
[in] | n | The number of characters that can be checked in b. |
Definition at line 499 of file tld.cpp.
Referenced by search().
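The length-limited comparison described above can be sketched as follows. This is a minimal illustration, not the library's cmp(): the name bounded_compare and the choice to ignore the "*" special case are assumptions made for the example.

```cpp
#include <cassert>

// Hypothetical helper (not the library's cmp()): compares the first `l`
// bytes of `a` with the first `n` bytes of `b`. Neither string needs to be
// null terminated. Returns <0, 0, or >0 like strcmp().
static int bounded_compare(const char *a, int l, const char *b, int n)
{
    assert(a != nullptr && b != nullptr && l >= 0 && n >= 0);

    int const shortest = l < n ? l : n;
    for(int idx = 0; idx < shortest; ++idx)
    {
        // compare as unsigned bytes so the result does not depend on
        // the signedness of char
        unsigned char const ca = static_cast<unsigned char>(a[idx]);
        unsigned char const cb = static_cast<unsigned char>(b[idx]);
        if(ca != cb)
        {
            return ca < cb ? -1 : 1;
        }
    }

    // the common prefix is equal: the shorter string sorts first
    if(l == n)
    {
        return 0;
    }
    return l < n ? -1 : 1;
}
```

Because both lengths are explicit, the caller can point b directly inside the input string without copying the label first.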
static int search (int i, int j, char const *domain, int n)
This function executes one search for one domain. The search is binary, which means the tld_descriptions are expected to be 100% in order at all levels.
The i and j parameters represent the boundaries of the current level to be checked. For a given TLD, there is a start and an end boundary that are used to define i and j. So except for the top level, the bounds are limited to one TLD, sub-TLD, etc. (for example, .uk has a sub-layer with .co, .ac, etc. and that range is limited to the second level entries accepted within the .uk TLD.)
This function searches one level only. If sub-levels are available for that TLD, then it is the responsibility of the caller to call the function again to find out whether one of those sub-domain names is in use.
When the TLD cannot be found, the function returns -1.
[in] | i | The start point of the search (included.) |
[in] | j | The end point of the search (excluded.) |
[in] | domain | The domain name to search. |
[in] | n | The length of the domain name. |
Definition at line 569 of file tld.cpp.
References cmp(), g_tld_file, tld(), tld_load_tlds_if_not_loaded(), and TLD_RESULT_SUCCESS.
Referenced by tld().
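A binary search over one level, with the start included and the end excluded, might look like the sketch below. The label_entry structure, the g_labels superstring, and the function name search_level are hypothetical stand-ins; the real function works on the loaded TLD file.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical, simplified stand-in for one TLD description: a label stored
// in a shared "superstring" as an offset plus a length (no null terminator).
struct label_entry
{
    int f_offset;
    int f_length;
};

// "ac", "co", "gov", "ltd" packed back to back, in sorted order
static char const g_labels[] = "accogovltd";
static label_entry const g_entries[] = { {0, 2}, {2, 2}, {4, 3}, {7, 3} };

// Binary search of `domain` (n bytes) within entries [i, j).
// Returns the index of the matching entry or -1 when not found.
int search_level(int i, int j, char const *domain, int n)
{
    assert(i >= 0 && i <= j && domain != nullptr && n >= 0);

    while(i < j)
    {
        int const p = i + (j - i) / 2;
        label_entry const & e = g_entries[p];
        int const shortest = e.f_length < n ? e.f_length : n;
        int r = std::memcmp(g_labels + e.f_offset, domain, shortest);
        if(r == 0)
        {
            // equal prefix: the shorter string sorts first
            r = e.f_length - n;
        }
        if(r == 0)
        {
            return p;
        }
        if(r < 0)
        {
            i = p + 1;  // entry sorts before domain, search the upper half
        }
        else
        {
            j = p;      // entry sorts after domain, search the lower half
        }
    }

    return -1;
}
```

The search only works if the entries in [i, j) are sorted, which is why the descriptions must be in order at every level.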
enum tld_result tld (char const *uri, struct tld_info *info)
The tld() function searches for the specified URI in the TLD descriptions. The results are saved in the info parameter for later interpretation (i.e. extraction of the domain name, sub-domains and the exact TLD.)
The function extracts the last extension of the URI. For example, given a host name ending in ".co.uk", the function first extracts ".uk". With that extension, it searches the list of official TLDs. If not found, an error is returned and the info parameter is set to unknown.
When found, the function checks whether that TLD (".uk" in our previous example) accepts sub-TLDs (second, third, fourth and fifth level TLDs.) If so, it extracts the next TLD entry (the ".co" in our previous example) and searches for that second level TLD. If found, it tries again with the third level, etc. until all the possible TLDs are exhausted. At that point, it returns the last TLD it found. In the case of ".co.uk", it returns the information of the ".co" TLD, a second-level domain name.
All the comparisons are done in lowercase. This is because all the data is saved in lowercase and we expect the input of the tld() function to already be in lowercase. If your input may be in uppercase, make sure to call the tld_domain_to_lowercase() function first. That function makes a duplicate of your domain name in lowercase. It understands the %XX characters (since the URI is expected to still be encoded) and properly handles UTF-8 characters in order to define the lowercase characters of the input. Note that the tld_domain_to_lowercase() function returns a newly allocated pointer that you are responsible for freeing once you are done with it.
The info structure includes:
Assuming that you always get valid URIs, you should get one of those results:
Other results are returned when the input string is considered invalid.
[in] | uri | The URI to be checked. |
[out] | info | A pointer to a tld_info structure to save the result. |
Definition at line 1113 of file tld.cpp.
References tld_info::f_offset, tld_info::f_status, tld_info::f_tld, g_tld_file, search(), tld(), tld_clear_info(), tld_load_tlds_if_not_loaded(), TLD_RESULT_BAD_URI, TLD_RESULT_INVALID, TLD_RESULT_NO_TLD, TLD_RESULT_NOT_FOUND, TLD_RESULT_NULL, TLD_RESULT_SUCCESS, TLD_STATUS_EXCEPTION, and TLD_STATUS_VALID.
Referenced by tld_email_list::tld_email_t::parse(), PHP_FUNCTION(), search(), tld_object::set_domain(), tld(), tld_check_uri(), tld_file_to_json(), and tld_next_tld().
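The right-to-left walk described above (top level first, then the next label if the top level accepts sub-TLDs) can be illustrated with a small self-contained sketch. The table g_known and the function find_tld_offset are hypothetical and limited to two levels; the real function uses the loaded TLD file and handles deeper levels.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Hypothetical two-level table: top-level label -> accepted second-level
// labels. The real library uses its compiled TLD file instead.
static std::map<std::string, std::set<std::string>> const g_known =
{
    { "uk",  { "co", "ac" } },
    { "com", { } },
};

// Returns the offset of the '.' that starts the longest known TLD in `host`,
// or std::string::npos when the last label is not a known top-level name.
std::string::size_type find_tld_offset(std::string const & host)
{
    // last label, e.g. "uk" in "www.example.co.uk"
    std::string::size_type const last_dot = host.rfind('.');
    if(last_dot == std::string::npos || last_dot == 0)
    {
        return std::string::npos;
    }
    auto const top = g_known.find(host.substr(last_dot + 1));
    if(top == g_known.end())
    {
        return std::string::npos;
    }

    // previous label, e.g. "co"; keep the top level if there is none
    std::string::size_type const prev_dot = host.rfind('.', last_dot - 1);
    if(prev_dot == std::string::npos)
    {
        return last_dot;
    }
    assert(prev_dot < last_dot); // rfind() searched strictly before last_dot

    std::string const second = host.substr(prev_dot + 1, last_dot - prev_dot - 1);
    return top->second.count(second) != 0 ? prev_dot : last_dot;
}
```

With this table, "www.example.co.uk" yields the offset of ".co.uk" while "www.example.com" yields the offset of ".com"; the sketch expects lowercase input, as tld() does.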
enum tld_result tld_check_uri (const char *uri, struct tld_info *info, const char *protocols, int flags)
This function very quickly parses a URI to determine whether it is valid.
Note that it does not (currently) support local naming conventions which means that a host such as "localhost" will fail the test.
The protocols parameter can be set to a comma separated list of protocol names that are considered valid. For example, for the HTTP protocol one could use "http,https". To accept any protocol use an asterisk as in: "*". Each protocol name must consist only of letters, digits, or underscores ([0-9A-Za-z_]+) and be at least one character long.
The flags parameter accepts the values VALID_URI_ASCII_ONLY and VALID_URI_NO_SPACES; OR them together to set multiple flags at the same time.
The return value is generally TLD_RESULT_BAD_URI when an invalid character is found in the URI string. The TLD_RESULT_NULL is returned if the URI is a NULL pointer or an empty string. Other results may be returned by the tld() function. If a result other than TLD_RESULT_SUCCESS is returned then the info structure may or may not be updated.
[in] | uri | The URI whose validity is being checked. |
[out] | info | The resulting information about the URI domain and TLD. |
[in] | protocols | List of comma separated protocols accepted. |
[in] | flags | A set of flags to tell the function what is valid/invalid. |
Definition at line 1311 of file tld.cpp.
References tld_info::f_offset, tld_info::f_tld, tld(), tld_clear_info(), TLD_RESULT_BAD_URI, TLD_RESULT_NULL, VALID_URI_ASCII_ONLY, and VALID_URI_NO_SPACES.
Referenced by check_uri(), and PHP_FUNCTION().
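The protocol rule above (one or more characters in [0-9A-Za-z_], listed in a comma separated set, or "*" for any) can be sketched independently of the library. The helper name protocol_is_accepted is hypothetical, and the case-sensitive comparison is an assumption of this sketch.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper, not part of libtld: returns true when `protocol` is
// made only of [0-9A-Za-z_]+ and appears in the comma separated `accepted`
// list; an `accepted` value of "*" accepts any well-formed protocol.
// The comparison is case sensitive (an assumption of this sketch).
bool protocol_is_accepted(std::string const & protocol, std::string const & accepted)
{
    if(protocol.empty())
    {
        return false;
    }
    for(unsigned char c : protocol)
    {
        if(!std::isalnum(c) && c != '_')
        {
            return false;
        }
    }
    if(accepted == "*")
    {
        return true;
    }

    // walk the comma separated list one name at a time
    std::string::size_type start = 0;
    while(start <= accepted.length())
    {
        std::string::size_type end = accepted.find(',', start);
        if(end == std::string::npos)
        {
            end = accepted.length();
        }
        if(accepted.compare(start, end - start, protocol) == 0)
        {
            return true;
        }
        start = end + 1;
    }
    return false;
}
```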
void tld_clear_info (struct tld_info *info)
This function initializes the info structure with defaults. The different TLD functions that make use of this structure will generally call this function first to represent a failure case.
Note that by default the category and status are set to undefined (TLD_CATEGORY_UNDEFINED and TLD_STATUS_UNDEFINED). Also the country and tld pointer are set to NULL and thus they cannot be used as strings.
[out] | info | The tld_info structure to clear. |
Definition at line 705 of file tld.cpp.
References tld_info::f_category, tld_info::f_country, tld_info::f_offset, tld_info::f_status, tld_info::f_tld, TLD_CATEGORY_UNDEFINED, and TLD_STATUS_UNDEFINED.
Referenced by tld(), tld_check_uri(), and tld_next_tld().
void tld_free_tlds ()
Once you are done with the library and if you want to make sure you do not have a memory leak, you can use this function to delete the TLD file which resides in memory.
You can also re-use the library later by either calling the tld_load_tlds() function or just functions that call tld() in which case you'll get the default .tld file loaded or the fallback. However, you cannot use the tld_info and other such structures after this call. Some of the pointers found in those structures may not be valid anymore since we use pointers directly to the TLD file data.
Definition at line 828 of file tld.cpp.
References g_tld_file.
uint32_t tld_get_static_tlds_buffer_size ()
This function is used to retrieve the size of the TLD buffer saved statically inside the library. This buffer gets used whenever the external tlds.tld file cannot be used for whatever reason. The size is used to create an std::stringstream file with the static data which is read as if the data came from a disk file.
Definition at line 1662 of file tld.cpp.
Referenced by tld_load_tlds().
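Reading a static buffer through an std::stringstream, as if it came from a disk file, can be sketched as follows. The buffer g_static_data, its content, and the helper names are hypothetical; only the idea of wrapping the static bytes in a stream comes from the description above.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical stand-ins for the data compiled statically in the library.
static char const g_static_data[] = { 'T', 'L', 'D', 'S', 0x01, 0x00 };

std::uint32_t static_buffer_size()
{
    return static_cast<std::uint32_t>(sizeof(g_static_data));
}

// Reader written against std::istream so the same code accepts a
// std::ifstream (disk file) or a std::stringstream (static buffer).
std::string read_all(std::istream & in)
{
    std::ostringstream out;
    out << in.rdbuf();
    return out.str();
}

std::string load_from_static_buffer()
{
    // the explicit size keeps embedded null bytes in the copy
    std::string const bytes(g_static_data, static_buffer_size());
    std::stringstream in(bytes, std::ios::in | std::ios::binary);
    return read_all(in);
}
```

Passing the size explicitly matters because the data is binary and may contain null bytes, so it cannot be treated as a C string.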
enum tld_result tld_get_tag (struct tld_info *info, int tag_idx, struct tld_tag_definition *tag)
const struct tld_file * tld_get_tlds ()
This function returns the list of TLDs that were loaded by the tld_load_tlds() function. If the TLDs were not yet loaded, then the function returns a nullptr.
The structure must be considered 100% read-only. It is possible that the TLDs were loaded from the tld_data.c buffer which means it is read-only data from the library.
Definition at line 809 of file tld.cpp.
References g_tld_file.
enum tld_result tld_load_tlds (char const *filename, int fallback)
This function loads the specified filename as the current set of data to be used by the tld() function.
You generally do not need to call this function. Instead, it is automatically called with a null pointer, which loads the default file.
The fallback flag can be set to true (the default) to fall back to the static version of the data compiled internally. This is used if the specified or default external file cannot be loaded.
[in] | filename | The file to load or NULL to load the default. |
[in] | fallback | Whether to fall back to the internal data if the input file cannot be loaded. |
Definition at line 744 of file tld.cpp.
References g_tld_file, tld_get_static_tlds_buffer_size(), TLD_RESULT_INVALID, TLD_RESULT_NOT_FOUND, and TLD_RESULT_SUCCESS.
Referenced by tld_load_tlds_if_not_loaded().
static enum tld_result tld_load_tlds_if_not_loaded ()
The user can call the tld_load_tlds() function to load or reload the TLDs from a file the user chooses.
However, if one of the functions, such as tld(), gets called before the TLDs are loaded, it would crash since the pointer is still nullptr. Instead, these functions call the tld_load_tlds_if_not_loaded() function to make sure that the g_tld_file is not a null pointer anymore.
Definition at line 460 of file tld.cpp.
References g_tld_file, tld_load_tlds(), and TLD_RESULT_SUCCESS.
Referenced by search(), tld(), and tld_next_tld().
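The "load on first use" guard described above follows a common pattern, sketched here with hypothetical names (fake_file, load_file, entry_count). It is not the library's code and, like the library's loader, it is not thread safe.

```cpp
#include <cassert>

// Hypothetical stand-in for the loaded TLD data.
struct fake_file
{
    int f_entries;
};

static fake_file * g_file = nullptr;    // null until the first load
static int g_load_calls = 0;            // counts loads, for illustration only

// Hypothetical loader; a real one would read a file or the static buffer.
bool load_file()
{
    static fake_file data = { 42 };
    g_file = &data;
    ++g_load_calls;
    return true;
}

// Loads the data only when the pointer is still null.
bool load_if_not_loaded()
{
    if(g_file != nullptr)
    {
        return true;
    }
    return load_file();
}

// Any function that needs the data calls the guard first, so it never
// dereferences a null pointer.
int entry_count()
{
    if(!load_if_not_loaded())
    {
        return -1;
    }
    assert(g_file != nullptr);
    return g_file->f_entries;
}
```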
enum tld_result tld_next_tld (struct tld_enumeration_state *state, struct tld_info *info)
This function is used to read all the TLDs one at a time.
To read the first TLD, make sure the state structure is cleared the first time you call the tld_next_tld() function:
The function may return various values and it is important to verify those values to know the state of the info parameter. In particular, TLD_RESULT_INVALID means that the returned domain name is considered to exist but it is currently not a valid domain name (i.e. it could be a deprecated or unused intermediate).
info is considered to be a valid domain name.
[in] | state | The current state. Reset to get the very first domain name. |
[out] | info | The structure where the information of the next domain name is saved. |
Definition at line 888 of file tld.cpp.
References tld_description::f_end_offset, tld_info::f_offset, tld_info::f_status, tld_info::f_tld, g_tld_file, tld(), tld_clear_info(), tld_load_tlds_if_not_loaded(), TLD_RESULT_BAD_URI, TLD_RESULT_INVALID, TLD_RESULT_NO_TLD, TLD_RESULT_NOT_FOUND, TLD_RESULT_NULL, TLD_RESULT_SUCCESS, and TLD_STATUS_VALID.
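The enumeration pattern (clear the state once, then call repeatedly until the end is reported) can be sketched with a self-contained stand-in. Every name here (enumeration_state, g_names, next_name, count_names) is hypothetical; the real state structure and result codes are those of the library.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical state: only remembers the position of the next entry.
struct enumeration_state
{
    std::size_t f_next;
};

// Hypothetical list of names to enumerate.
static char const * const g_names[] = { ".ac.uk", ".co.uk", ".com" };

// Stores the next name in *name and returns true, or returns false once
// all the names were returned.
bool next_name(enumeration_state * state, char const ** name)
{
    assert(state != nullptr && name != nullptr);

    if(state->f_next >= sizeof(g_names) / sizeof(g_names[0]))
    {
        *name = nullptr;
        return false;
    }
    *name = g_names[state->f_next++];
    return true;
}

int count_names()
{
    enumeration_state state = {};   // cleared: start at the first entry
    char const * name = nullptr;
    int count = 0;
    while(next_name(&state, &name))
    {
        ++count;
    }
    return count;
}
```

Keeping the position in a caller-owned structure lets several enumerations run independently, provided each one starts from a cleared state.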
const char * tld_version ()
This function returns the version of this library. The version is defined with three numbers: <major>.<minor>.<patch>.
You should be able to use the libversion to compare different libtld versions and know which one is the newest version.
Definition at line 1646 of file tld.cpp.
References LIBTLD_VERSION.
Referenced by main().
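Comparing two version strings of the <major>.<minor>.<patch> form can be done numerically, field by field. The sketch below is independent of the library; compare_versions and parse_version are hypothetical helpers, and treating a missing field as 0 is an assumption.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical helper: parse up to three dot separated numbers.
// A missing or unparsable field is read as 0 (an assumption of this sketch).
static void parse_version(char const * s, long (&v)[3])
{
    for(int idx = 0; idx < 3; ++idx)
    {
        char * end = nullptr;
        v[idx] = std::strtol(s, &end, 10);
        s = (*end == '.') ? end + 1 : end;
    }
}

// Returns <0 when `a` is older than `b`, 0 when equal, >0 when newer.
int compare_versions(char const * a, char const * b)
{
    assert(a != nullptr && b != nullptr);

    long va[3];
    long vb[3];
    parse_version(a, va);
    parse_version(b, vb);
    for(int idx = 0; idx < 3; ++idx)
    {
        if(va[idx] != vb[idx])
        {
            return va[idx] < vb[idx] ? -1 : 1;
        }
    }
    return 0;
}
```

A numeric comparison is needed because a plain string comparison would sort "2.0.14" before "2.0.9".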
static struct tld_file * g_tld_file = nullptr
This pointer is the TLD file that was specifically or automatically loaded. The tld() function calls tld_load_tlds() if this pointer is still NULL. This loads the TLDs in memory.
You can change the TLDs at any time by calling tld_load_tlds() again.
The loading of the TLDs is not thread safe. If you want to use the library in a multi-threaded environment, make sure to call the tld_load_tlds() before you start your threads. Then you'll be safe as long as you do not want to reload a file of TLDs while running your threads.
The tld_load_tlds_if_not_loaded() can be used to load the TLDs if the g_tld_file is still a null pointer. At the moment, this is only an internal function.
Definition at line 340 of file tld.cpp.
Referenced by search(), tld(), tld_free_tlds(), tld_get_tlds(), tld_load_tlds(), tld_load_tlds_if_not_loaded(), and tld_next_tld().
This document is part of the Snap! Websites Project.
Copyright by Made to Order Software Corp.