Line data Source code
1 : /* TLD library -- TLD, domain name, and sub-domain extraction
2 : * Copyright (c) 2011-2021 Made to Order Software Corp. All Rights Reserved
3 : *
4 : * Permission is hereby granted, free of charge, to any person obtaining a
5 : * copy of this software and associated documentation files (the
6 : * "Software"), to deal in the Software without restriction, including
7 : * without limitation the rights to use, copy, modify, merge, publish,
8 : * distribute, sublicense, and/or sell copies of the Software, and to
9 : * permit persons to whom the Software is furnished to do so, subject to
10 : * the following conditions:
11 : *
12 : * The above copyright notice and this permission notice shall be included
13 : * in all copies or substantial portions of the Software.
14 : *
15 : * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
16 : * OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
17 : * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
18 : * IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
19 : * CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
20 : * TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
21 : * SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
22 : */
23 :
24 : /** \file
25 : * \brief Implementation of the TLD parser library.
26 : *
27 : * This file includes all the functions available in the C library
28 : * of libtld that pertain to the parsing of URIs and extraction of
29 : * TLDs.
30 : */
31 :
32 : #include "libtld/tld.h"
33 : #include "tld_data.h"
34 : #if defined(MO_DARWIN)
35 : # include <malloc/malloc.h>
36 : #endif
37 : #if !defined(MO_DARWIN) && !defined(MO_FREEBSD)
38 : #include <malloc.h>
39 : #endif
40 : #include <stdlib.h>
41 : #include <limits.h>
42 : #include <string.h>
43 : #include <ctype.h>
44 :
45 : #ifdef WIN32
46 : #define strncasecmp _strnicmp
47 : #endif
48 :
49 : /** \mainpage
50 : *
51 : * \section introduction The libtld Library
52 : *
53 : * The libtld project is a library that gives you the capability to
54 : * determine the TLD part of any Internet URI or email address.
55 : *
56 : * The main function of the library, tld(), takes a URI string and a
57 : * tld_info structure. From that information it computes the position
58 : * where the TLD starts in the URI. For email addresses (see the
59 : * tld_email_list C++ object, or the tld_email.cpp file for the C
60 : * functions), it breaks down a full list of emails, verifying the
61 : * syntax as defined in RFC 5322.
62 : *
63 : * \section c_programmers For C Programmers
64 : *
65 : * The C functions that you are expected to use are listed here:
66 : *
67 : * \li tld_version() -- return a string representing the TLD library version
68 : * \li tld() -- find the position of the TLD of any URI
69 : * \li tld_domain_to_lowercase() -- force lowercase on the domain name before
70 : * calling other tld functions
71 : * \li tld_check_uri() -- verify a full URI, with scheme, path, etc.
72 : * \li tld_clear_info() -- reset a tld_info structure for use with tld()
73 : * \li tld_email_alloc() -- allocate a tld_email_list object
74 : * \li tld_email_free() -- free a tld_email_list object
75 : * \li tld_email_parse() -- parse a list of email addresses
76 : * \li tld_email_count() -- number of emails found by tld_email_parse()
77 : * \li tld_email_rewind() -- go back to the start of the list of emails
78 : * \li tld_email_next() -- read the next email from the list of emails
79 : *
80 : * \section cpp_programmers For C++ Programmers
81 : *
82 : * For C++ users, please make use of these tld classes:
83 : *
84 : * \li tld_object
85 : * \li tld_email_list
86 : *
87 : * In C++, you may also make use of tld_version() to check the current
88 : * version of the library.
89 : *
90 : * To check whether the version is valid for your tool, you may look at the
91 : * version handling of the libdebpackages library of the wpkg project. The
92 : * libtld version is always a Debian compatible version.
93 : *
94 : * http://windowspackager.org/documentation/implementation-details/debian-version-api
95 : *
96 : * \section php_programmers For PHP Programmers
97 : *
98 : * At this point I do not have a very good environment to recompile everything
99 : * for PHP. The main reason is that the library is compiled with cmake as
100 : * opposed to the automake toolchain that Zend expects.
101 : *
102 : * This being said, the php directory includes all you need to make use of the
103 : * library under PHP. It works like a charm for me and there should be no reason
104 : * for you not to be able to do the same with the library.
105 : *
106 : * The way I rebuild everything for PHP:
107 : *
108 : * \code
109 : * # from within the libtld directory:
110 : * mkdir ../BUILD
111 : * (cd ../BUILD; cmake ../libtld)
112 : * make -C ../BUILD
113 : * cd php
114 : * ./build
115 : * \endcode
116 : *
117 : * The build script will copy the resulting php_libtld.so file where it
118 : * needs to go using sudo. Your system (Red Hat, Mandrake, etc.) may use
119 : * su instead. Update the script as required.
120 : *
121 : * Note that the libtld will be linked statically inside the php_libtld.so
122 : * so you do not have to actually install the libtld environment to make
123 : * everything work as expected.
124 : *
125 : * The resulting functions added to PHP via this extension are:
126 : *
127 : * \li %check_tld()
128 : * \li %check_uri()
129 : * \li %check_email()
130 : *
131 : * For information about these functions, check out the php/php_libtld.c
132 : * file which describes each function, its parameters, and its results
133 : * in great detail.
134 : *
135 : * \section not_linux Compiling on Other Platforms
136 : *
137 : * We can successfully compile the library under MS-Windows with cygwin
138 : * and the Microsoft IDE. To do so, we use the CMakeLists.txt file found
139 : * under the dev directory. Overwrite the CMakeLists.txt file in the
140 : * main directory before configuring and you'll get a library without
141 : * having to first compile Qt4.
142 : *
143 : * \code
144 : * cp dev/libtld-only-CMakeLists.txt CMakeLists.txt
145 : * \endcode
146 : *
147 : * At this point this configuration only compiles the library. It gives
148 : * you a shared (.DLL) and a static (.lib) version. With the IDE you may
149 : * create a debug and a release version.
150 : *
151 : * Later we'll look into having a single CMakeLists.txt so you do not
152 : * have to make this copy.
153 : *
154 : * \section example Example
155 : *
156 : * We offer a file named example.c that shows you how to use the
157 : * library in C. It is very simple, a single main() function, so it is
158 : * very easy to get started with libtld.
159 : *
160 : * For a C++ example, check out the src/validate_tld.cpp tool which was
161 : * created as a command line tool coming with the libtld library.
162 : *
163 : * \include example.c
164 : *
165 : * \section dev Programmers & Maintainers
166 : *
167 : * If you want to work on the library, there are certainly things to
168 : * enhance. We could for example offer more offsets in the info
169 : * string, or functions to clearly define each part of the URI.
170 : *
171 : * However, the most important part of this library is the XML file
172 : * which defines all the TLDs. Maintaining that file is what will
173 : * help the most. It includes all the TLDs known at this point
174 : * (as defined in different places such as Wikipedia and each
175 : * different authority in that area.) The file is easy to read so
176 : * you can easily find whether your extension is defined and if not
177 : * you can let us know.
178 : *
179 : * \section requirements Library Requirements
180 : *
181 : * \li Usage
182 : *
183 : * The library doesn't need anything special. It's a few C functions.
184 : *
185 : * The library also offers C++ classes. You do not need a C++ compiler
186 : * to use the library, but if you do program in C++, you can use the
187 : * tld_object and tld_email_list instead of the C functions. It makes
188 : * things a lot easier!
189 : *
190 : * Also if you are programming using PHP, the library includes a PHP
191 : * extension so you can check URIs and emails directly from PHP without
192 : * trying to create crazy regular expressions (that most often do not work
193 : * right!)
194 : *
195 : * \li Compiling
196 : *
197 : * To compile the library, you'll need CMake, a C++ compiler for different
198 : * parts and the Qt library as we use the QtXml and QtCore (Qt4). The QtXml
199 : * library is used to parse the XML file (tld_data.xml) which defines all
200 : * the TLDs, worldwide.
201 : *
202 : * To regenerate the documentation we use Doxygen. It is optional, though.
203 : *
204 : * \li PHP
205 : *
206 : * In order to recompile the PHP extension the Zend environment is required.
207 : * Under a Debian or Ubuntu system you can install the php5-dev package.
208 : *
209 : * \section tests Tests Coming with the Library
210 : *
211 : * We have the following tests at this time:
212 : *
213 : * \li tld_test.c
214 : *
215 : * \par
216 : * This test checks the tld() function the way end users of the
217 : * library would call it. It checks all the existing TLDs, a few unknown
218 : * TLDs, and invalid TLDs.
219 : *
220 : * \li tld_test_object.cpp
221 : *
222 : * \par
223 : * This test verifies that the tld_object works as expected. It is not
224 : * exhaustive in regard to the tld library itself, only of the tld_object.
225 : *
226 : * \li tld_internal_test.c
227 : *
228 : * \par
229 : * This test includes the tld.c directly so it can check each
230 : * internal function directly. This test checks the cmp() and
231 : * search() functions, with full coverage.
232 : *
233 : * \li tld_test_domain_lowercase.c
234 : *
235 : * \par
236 : * This test runs 100% coverage of the tld_domain_to_lowercase() function.
237 : * This includes conversion of %XX encoded characters and UTF-8 to wide
238 : * characters that can be case folded and saved back as encoded %XX
239 : * characters. The test verifies that all characters are properly
240 : * supported and that errors are properly handled.
241 : *
242 : * \li tld_test_tld_names.cpp
243 : *
244 : * \par
245 : * The Mozilla foundation offers a file with a complete list of all the
246 : * domain names defined throughout the world. This test reads that list
247 : * and checks all the TLDs against the libtld system. Some TLDs may be
248 : * checked in multiple ways. We support the TLDs that start with an
249 : * asterisk (*) and those that start with an exclamation mark (!) which
250 : * means all the TLDs are now being checked out as expected.
251 : * This test reads the public_suffix_list.dat file which has to be
252 : * available in your current directory.
253 : *
254 : * \par
255 : * A copy of the Mozilla file is included with each version of the TLD
256 : * library. It is named tests/public_suffix_list.dat and should be
257 : * up to date when we produce a new version for download on
258 : * SourceForge.net.
259 : *
260 : * \li tld_test_full_uri.c
261 : *
262 : * \par
263 : * The library includes an advanced function that checks the validity
264 : * of complete URIs making it very simple to test such in any software.
265 : * The URI must include a scheme (often called protocol), fully qualified
266 : * domain (sub-domains, domain, TLD), an absolute path, variables (after
267 : * the question mark,) and an anchor. The test ensures that all the
268 : * checks the parser uses are working as expected and allow valid URIs
269 : * while it forbids any invalid URIs.
270 : *
271 : * \li tld_test_emails.cpp
272 : *
273 : * \par
274 : * The libtld supports verifying and breaking up emails in different
275 : * parts. This is done to make sure users enter valid emails (although
276 : * it doesn't mean that the email address exists, it at least allows
277 : * us to know when an email is definitively completely incorrect and
278 : * should be immediately rejected.) The test ensures that all the
279 : * different types of invalid emails are properly being caught (i.e.
280 : * emails with control characters, invalid domain name, missing parts,
281 : * etc.)
282 : *
283 : * \li tld_test_versions.c
284 : *
285 : * \par
286 : * This test checks that the versions in all the files (two
287 : * CMakeLists.txt and the changelog) are equal. If one of those
288 : * does not match, then the test fails.
289 : *
290 : * \li tld_test_xml.sh
291 : *
292 : * \par
293 : * Shell script to run against the tld_data.xml file to ensure its validity.
294 : * This is a good idea any time you make changes to the file. It runs with
295 : * the xmllint tool. If you do not have the tool, it won't work. The tool
296 : * is part of the libxml2-utils package under Ubuntu.
297 : */
298 :
299 :
300 :
301 :
302 : /** \brief Compare two strings, one of which is limited by length.
303 : * \internal
304 : *
305 : * This internal function was created to handle a simple string
306 : * (no locale) comparison with one string being limited in length.
307 : *
308 : * The comparison does not require locale since all characters are
309 : * ASCII (a URI with Unicode characters encodes them in UTF-8 and
310 : * replaces all those bytes with %XX.)
311 : *
312 : * The length applies to the string in \p b. This allows us to make
313 : * use of the input string all the way down to the cmp() function.
314 : * In other words, we avoid a copy of the string.
315 : *
316 : * The string in \p a is 'nul' (\0) terminated. This means \p a
317 : * may be longer or shorter than \p b. In other words, the function
318 : * is capable of returning the correct result with a single call.
319 : *
320 : * The "*" wildcard is not matched here; search() handles it separately.
321 : *
322 : * \param[in] a The pointer in an f_tld field of the tld_descriptions.
323 : * \param[in] b Pointer directly referencing the user domain string.
324 : * \param[in] n The number of characters that can be checked in \p b.
325 : *
326 : * \return -1 if a < b, 0 when a == b, and 1 when a > b
327 : */
328 1021033 : static int cmp(const char *a, const char *b, int n)
329 : {
330 : /* if `a == "*"` then we have a bug in the table
331 : if(a[0] == '*'
332 : && a[1] == '\0')
333 : {
334 : return 0;
335 : }
336 : */
337 :
338 : /* n represents the maximum number of characters to check in b */
339 3333894 : while(n > 0 && *a != '\0')
340 : {
341 2128108 : if(*a < *b)
342 : {
343 417441 : return -1;
344 : }
345 1710667 : if(*a > *b)
346 : {
347 418839 : return 1;
348 : }
349 1291828 : ++a;
350 1291828 : ++b;
351 1291828 : --n;
352 : }
353 184753 : if(*a == '\0')
354 : {
355 148212 : if(n > 0)
356 : {
357 : /* in this case n > 0 so b is larger */
358 4295 : return -1;
359 : }
360 143917 : return 0;
361 : }
362 : /* in this case n == 0 so a is larger */
363 36541 : return 1;
364 : }
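
The comparison rule documented for cmp() can be tried in isolation. The block below is a minimal sketch, not the library's code: bounded_cmp() is a hypothetical stand-in that is assumed to follow the same contract (a nul-terminated table string compared against the first n bytes of the user string, returning -1, 0, or 1).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the internal cmp(): compare the
 * nul-terminated string `a` against the first `n` bytes of `b`.
 * Returns -1 if a < b, 0 if they are equal, 1 if a > b.
 */
int bounded_cmp(const char *a, const char *b, int n)
{
    assert(a != NULL && b != NULL && n >= 0);

    /* walk both strings while `a` has characters and `b` has budget */
    while(n > 0 && *a != '\0')
    {
        if(*a != *b)
        {
            return *a < *b ? -1 : 1;
        }
        ++a;
        ++b;
        --n;
    }
    if(*a != '\0')
    {
        /* the budget for `b` ran out first, so `a` is longer */
        return 1;
    }
    /* `a` ended; `b` is longer if some of its budget is left */
    return n > 0 ? -1 : 0;
}
```

Because only n bytes of the second string are read, a label such as the "co" in "co.uk" can be compared in place, without copying it into its own nul-terminated buffer.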
365 :
366 :
367 : /** \brief Search for the specified domain.
368 : * \internal
369 : *
370 : * This function executes one search for one domain. The
371 : * search is binary, which means the tld_descriptions are
372 : * expected to be 100% in order at all levels.
373 : *
374 : * The \p i and \p j parameters represent the boundaries
375 : * of the current level to be checked. Know that for a
376 : * given TLD, there is a start and end boundary that is
377 : * used to define \p i and \p j. So except for the top
378 : * level, the bounds are limited to one TLD, sub-TLD, etc.
379 : * (for example, .uk has a sub-layer with .co, .ac, etc.
380 : * and that ground is limited to the second level entries
381 : * accepted within the .uk TLD.)
382 : *
383 : * This search does one search at one level. If sub-levels
384 : * are available for that TLD, then it is the responsibility
385 : * of the caller to call the function again to find out whether
386 : * one of those sub-domain names is in use.
387 : *
388 : * When the TLD cannot be found, the function returns -1, or the offset of a leading "*" entry if the level has one.
389 : *
390 : * \param[in] i The start point of the search (included.)
391 : * \param[in] j The end point of the search (excluded.)
392 : * \param[in] domain The domain name to search.
393 : * \param[in] n The length of the domain name.
394 : *
395 : * \return The offset of the domain found, or -1 when not found.
396 : */
397 159730 : int search(int i, int j, const char *domain, int n)
398 : {
399 159730 : int auto_match = -1, p, r;
400 : const struct tld_description *tld;
401 :
402 159730 : if(i < j)
403 : {
404 : /* the "*" breaks the binary search, we have to handle it specially */
405 149685 : tld = tld_descriptions + i;
406 149685 : if(tld->f_tld[0] == '*' && tld->f_tld[1] == '\0')
407 : {
408 1167 : auto_match = i;
409 1167 : ++i;
410 : }
411 :
412 1176448 : while(i < j)
413 : {
414 1020991 : p = (j - i) / 2 + i;
415 1020991 : tld = tld_descriptions + p;
416 1020991 : r = cmp(tld->f_tld, domain, n);
417 1020991 : if(r < 0)
418 : {
419 : /* eliminate the first half */
420 421726 : i = p + 1;
421 : }
422 599265 : else if(r > 0)
423 : {
424 : /* eliminate the second half */
425 455352 : j = p;
426 : }
427 : else
428 : {
429 : /* match */
430 143913 : return p;
431 : }
432 : }
433 : }
434 :
435 15817 : return auto_match;
436 : }
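
The wildcard handling in search() can be illustrated on a small table. This is a sketch under stated assumptions: level_names and search_level() are hypothetical, the table holds plain strings rather than tld_description entries, and strncmp() replaces the internal cmp(). It assumes the entries are sorted in byte order, so a "*" entry, if present, comes first.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical table for one level, sorted in byte order; '*' sorts
 * before the letters, so the wildcard entry is always the first one.
 */
const char * const level_names[] = { "*", "ac", "co", "gov", "org" };

/* Search names[i..j) for a label of n bytes. Returns the index of the
 * exact match, else the index of a leading "*" entry, else -1.
 * The label must provide at least n bytes without a nul.
 */
int search_level(const char * const *names, int i, int j,
                 const char *label, int n)
{
    int fallback = -1;

    assert(names != NULL && label != NULL && n >= 0);

    /* the "*" entry is set aside so it cannot disturb the bisection */
    if(i < j && strcmp(names[i], "*") == 0)
    {
        fallback = i;
        ++i;
    }
    while(i < j)
    {
        int p = i + (j - i) / 2;
        int r = strncmp(names[p], label, (size_t) n);
        if(r == 0 && names[p][n] != '\0')
        {
            r = 1; /* the table entry is longer than the label */
        }
        if(r < 0)
        {
            i = p + 1; /* eliminate the first half */
        }
        else if(r > 0)
        {
            j = p; /* eliminate the second half */
        }
        else
        {
            return p;
        }
    }
    return fallback;
}
```

Setting the wildcard aside before bisecting keeps the remaining range strictly ordered, and an exact match still takes precedence over the fallback.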
437 :
438 :
439 : /** \brief Clear the info structure.
440 : *
441 : * This function initializes the info structure with defaults.
442 : * The different TLD functions that make use of this structure
443 : * will generally call this function first to represent a
444 : * failure case.
445 : *
446 : * Note that by default the category and status are set to
447 : * undefined (TLD_CATEGORY_UNDEFINED and TLD_STATUS_UNDEFINED).
448 : * Also the country and tld pointer are set to NULL and thus
449 : * they cannot be used as strings.
450 : *
451 : * \param[out] info The tld_info structure to clear.
452 : */
453 62702 : void tld_clear_info(struct tld_info *info)
454 : {
455 62702 : info->f_category = TLD_CATEGORY_UNDEFINED;
456 62702 : info->f_status = TLD_STATUS_UNDEFINED;
457 62702 : info->f_country = (const char *) 0;
458 62702 : info->f_tld = (const char *) 0;
459 62702 : info->f_offset = -1;
460 62702 : }
461 :
462 :
463 : /** \brief Get information about the TLD for the specified URI.
464 : *
465 : * The tld() function searches for the specified URI in the TLD
466 : * descriptions. The results are saved in the info parameter for
467 : * later interpretation (i.e. extraction of the domain name,
468 : * sub-domains and the exact TLD.)
469 : *
470 : * The function extracts the last \em extension of the URI. For
471 : * example, in the following:
472 : *
473 : * \code
474 : * example.co.uk
475 : * \endcode
476 : *
477 : * the function first extracts ".uk". With that \em extension, it
478 : * searches the list of official TLDs. If not found, an error is
479 : * returned and the info parameter is set to \em unknown.
480 : *
481 : * When found, the function checks whether that TLD (".uk" in our
482 : * previous example) accepts sub-TLDs (second, third, fourth and
483 : * fifth level TLDs.) If so, it extracts the next TLD entry (the
484 : * ".co" in our previous example) and searches for that second
485 : * level TLD. If found, it again tries with the third level, etc.
486 : * until all the possible TLDs were exhausted. At that point, it
487 : * returns the last TLD it found. In case of ".co.uk", it returns
488 : * the information of the ".co" TLD, second-level domain name.
489 : *
490 : * All the comparisons are done in lowercase. This is because
491 : * all the data is saved in lowercase and we expect the input
492 : * of the tld() function to already be in lowercase. If you
493 : * have a doubt and your input may actually be in uppercase,
494 : * make sure to call the tld_domain_to_lowercase() function
495 : * first. That function makes a duplicate of your domain name
496 : * in lowercase. It understands the %XX characters (since the
497 : * URI is expected to still be encoded) and properly handles
498 : * UTF-8 characters in order to define the lowercase characters
499 : * of the input. Note that the function returns a newly
500 : * allocated pointer that you are responsible for freeing once
501 : * you are done with it.
502 : *
503 : * \warning
504 : * If you call tld() with the pointer returned by
505 : * tld_domain_to_lowercase(), keep in mind that the tld()
506 : * function saves pointers of the input string directly in
507 : * the tld_info structure. In other words, you want to free()
508 : * that string AFTER you are done with the tld_info structure.
509 : *
510 : * The \p info structure includes:
511 : *
512 : * \li f_category -- the category of TLD, unless set to
513 : * TLD_CATEGORY_UNDEFINED, it is considered valid
514 : * \li f_status -- the status of the TLD, unless set to
515 : * TLD_STATUS_UNDEFINED, it was defined from the tld_data.xml file;
516 : * however, only those marked as TLD_STATUS_VALID are considered to
517 : * currently be in use, all the other statuses can be used by your
518 : * software, one way or another, but it should not be accepted as
519 : * valid in a URI
520 : * \li f_country -- if the category is set to TLD_CATEGORY_COUNTRY
521 : * then this pointer is set to the name of the country
522 : * \li f_tld -- is set to the full TLD of your domain name; this is
523 : * a pointer WITHIN your uri string so make sure you keep your URI
524 : * string valid if you intend to use this f_tld string
525 : * \li f_offset -- the offset to the first period within the domain
526 : * name TLD (i.e. in our previous example, it would be the offset to
527 : * the first period in ".co.uk", so in "example.co.uk" the offset would
528 : * be 7. Assuming you prepend "www." to have the URI "www.example.co.uk"
529 : * then the offset would be 11.)
530 : *
531 : * \note
532 : * In our previous example, the ".uk" TLD is properly used: it includes
533 : * a second level domain name (".co".) The URI "example.uk" should have
534 : * returned TLD_RESULT_INVALID since .uk by itself was not supposed to be
535 : * acceptable. This changed a few years ago. The good thing is that it
536 : * resolves some problems as some companies were given a simple ".uk"
537 : * TLD and these were exceptions the library does not need to support
538 : * anymore. There are still some countries, such as ".bd", which do not
539 : * accept names registered directly at the second level, so "example.bd" does return
540 : * an \em error (TLD_RESULT_INVALID).
541 : *
542 : * Assuming that you always get valid URIs, you should get one of those
543 : * results:
544 : *
545 : * \li TLD_RESULT_SUCCESS -- success! the URI is valid and the TLD was
546 : * properly determined; use the f_tld or f_offset to extract the TLD
547 : * domain and sub-domains
548 : * \li TLD_RESULT_INVALID -- known TLD, but not currently valid; this
549 : * result is returned when we know that the TLD is not to be accepted
550 : *
551 : * Other results are returned when the input string is considered invalid.
552 : *
553 : * \note
554 : * The function only accepts a bare URI, in other words: no protocol, no
555 : * path, no anchor, no query string, and still URI encoded. Also, it
556 : * should not start and/or end with a period or you are likely to get
557 : * an invalid response. (i.e. don't use any of ".example.co.uk.",
558 : * "example.co.uk.", nor ".example.co.uk")
559 : *
560 : * \include example.c
561 : *
562 : * \param[in] uri The URI to be checked.
563 : * \param[out] info A pointer to a tld_info structure to save the result.
564 : *
565 : * \return One of the TLD_RESULT_... enumeration values.
566 : */
567 62434 : enum tld_result tld(const char *uri, struct tld_info *info)
568 : {
569 62434 : const char *end = uri;
570 : const char **level_ptr;
571 62434 : int level = 0, start_level, i, r, p, offset;
572 : enum tld_result result;
573 :
574 : /* set defaults in the info structure */
575 62434 : tld_clear_info(info);
576 :
577 62434 : if(uri == (const char *) 0 || uri[0] == '\0')
578 : {
579 3 : return TLD_RESULT_NULL;
580 : }
581 :
582 62431 : level_ptr = malloc(sizeof(const char *) * tld_max_level);
583 :
584 3169926 : while(*end != '\0')
585 : {
586 3045066 : if(*end == '.')
587 : {
588 362668 : if(level >= tld_max_level)
589 : {
590 : /* At this point the maximum number of levels in the
591 : * TLDs is 5
592 : */
593 742700 : for(i = 1; i < tld_max_level; ++i)
594 : {
595 594160 : level_ptr[i - 1] = level_ptr[i];
596 : }
597 148540 : level_ptr[tld_max_level - 1] = end;
598 : }
599 : else
600 : {
601 214128 : level_ptr[level] = end;
602 214128 : ++level;
603 : }
604 362668 : if(level >= 2 && level_ptr[level - 2] + 1 == level_ptr[level - 1])
605 : {
606 : /* two periods one after another */
607 2 : free(level_ptr);
608 2 : return TLD_RESULT_BAD_URI;
609 : }
610 : }
611 3045064 : ++end;
612 : }
613 : /* if level is not at least 1 then there are no periods */
614 62429 : if(level == 0)
615 : {
616 : /* no TLD */
617 10 : free(level_ptr);
618 10 : return TLD_RESULT_NO_TLD;
619 : }
620 :
621 62419 : start_level = level;
622 62419 : --level;
623 124838 : r = search(tld_start_offset, tld_end_offset,
624 124838 : level_ptr[level] + 1, (int) (end - level_ptr[level] - 1));
625 62419 : if(r == -1)
626 : {
627 : /* unknown */
628 17 : free(level_ptr);
629 17 : return TLD_RESULT_NOT_FOUND;
630 : }
631 :
632 : /* check for the next level if there is one */
633 62402 : p = r;
634 196480 : while(level > 0 && tld_descriptions[r].f_start_offset != USHRT_MAX)
635 : {
636 225744 : r = search(tld_descriptions[r].f_start_offset,
637 75248 : tld_descriptions[r].f_end_offset,
638 75248 : level_ptr[level - 1] + 1,
639 75248 : (int) (level_ptr[level] - level_ptr[level - 1] - 1));
640 75248 : if(r == -1)
641 : {
642 : /* we are done, return the previous level */
643 3572 : break;
644 : }
645 71676 : p = r;
646 71676 : --level;
647 : }
648 62402 : offset = (int) (level_ptr[level] - uri);
649 :
650 : /* if there are exceptions we may need to search those now if level is 0 */
651 62402 : if(level == 0)
652 : {
653 23084 : r = search(tld_descriptions[p].f_start_offset,
654 11542 : tld_descriptions[p].f_end_offset,
655 : uri,
656 11542 : (int) (level_ptr[0] - uri));
657 11542 : if(r != -1)
658 : {
659 346 : p = r;
660 346 : offset = 0;
661 : }
662 : }
663 :
664 62402 : info->f_status = tld_descriptions[p].f_status;
665 62402 : switch(info->f_status)
666 : {
667 59911 : case TLD_STATUS_VALID:
668 59911 : result = TLD_RESULT_SUCCESS;
669 59911 : break;
670 :
671 109 : case TLD_STATUS_EXCEPTION:
672 : /* return the actual TLD and not the exception
673 : * i.e. "nacion.ar" is valid and the TLD is just ".ar"
674 : * even though top level ".ar" is forbidden by default
675 : */
676 109 : p = tld_descriptions[p].f_exception_apply_to;
677 109 : level = start_level - tld_descriptions[p].f_exception_level;
678 109 : offset = (int) (level_ptr[level] - uri);
679 109 : info->f_status = TLD_STATUS_VALID;
680 109 : result = TLD_RESULT_SUCCESS;
681 109 : break;
682 :
683 2382 : default:
684 2382 : result = TLD_RESULT_INVALID;
685 2382 : break;
686 :
687 : }
688 :
689 62402 : info->f_category = tld_descriptions[p].f_category;
690 62402 : info->f_country = tld_descriptions[p].f_country;
691 62402 : info->f_tld = level_ptr[level];
692 62402 : info->f_offset = offset;
693 :
694 62402 : free(level_ptr);
695 :
696 62402 : return result;
697 : }
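
The first loop of tld() keeps only the most recent period positions, since a TLD cannot span more levels than the table allows. The sketch below shows that sliding window on its own. All names are hypothetical, and MAX_LEVELS is an assumption that mirrors the "maximum number of levels is 5" comment rather than the real tld_max_level value.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LEVELS 5 /* assumption: stands in for tld_max_level */

/* Record pointers to the last MAX_LEVELS periods of `host` in `dots`,
 * oldest first. Returns how many were recorded, or -1 when two
 * periods are adjacent.
 */
int last_periods(const char *host, const char *dots[MAX_LEVELS])
{
    int count = 0;
    const char *s;

    assert(host != NULL && dots != NULL);

    for(s = host; *s != '\0'; ++s)
    {
        if(*s != '.')
        {
            continue;
        }
        if(count > 0 && dots[count - 1] + 1 == s)
        {
            return -1; /* two periods one after another */
        }
        if(count == MAX_LEVELS)
        {
            /* window full: drop the oldest period, append the new one */
            int i;
            for(i = 1; i < MAX_LEVELS; ++i)
            {
                dots[i - 1] = dots[i];
            }
            dots[MAX_LEVELS - 1] = s;
        }
        else
        {
            dots[count++] = s;
        }
    }
    return count;
}

/* Offset of the oldest recorded period, i.e. the leftmost position at
 * which a TLD could start; -1 when there is no usable period.
 */
int first_candidate_offset(const char *host)
{
    const char *dots[MAX_LEVELS];
    int count = last_periods(host, dots);
    return count > 0 ? (int) (dots[0] - host) : -1;
}
```

For "example.co.uk" the first candidate is offset 7, which matches the f_offset value described for ".co.uk" above. The real function then narrows the candidates by searching the TLD table from the rightmost label inward.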
698 :
699 :
700 : /** \brief Internal function used to transform %XX values.
701 : *
702 : * This function transforms a hexadecimal (h) character to (2) a
703 : * decimal number (d).
704 : *
705 : * \param[in] c The hexadecimal character to transform
706 : *
707 : * \return The number the hexadecimal character represents (0 to 15)
708 : */
709 4 : static int h2d(int c)
710 : {
711 4 : if(c >= 'a')
712 : {
713 1 : return c - 'a' + 10;
714 : }
715 3 : if(c >= 'A')
716 : {
717 1 : return c - 'A' + 10;
718 : }
719 2 : return c - '0';
720 : }
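
h2d() trusts its caller to pass a valid hexadecimal digit; any other character produces a value outside 0 to 15. The following sketch shows one way a %XX triplet could be validated and decoded in a single step. hex_value() and decode_percent() are hypothetical helpers written for this illustration and are not part of libtld.

```c
#include <assert.h>
#include <stddef.h>

/* Validating variant of h2d() (hypothetical): returns 0 to 15, or -1
 * when `c` is not a hexadecimal digit.
 */
int hex_value(int c)
{
    if(c >= '0' && c <= '9')
    {
        return c - '0';
    }
    if(c >= 'a' && c <= 'f')
    {
        return c - 'a' + 10;
    }
    if(c >= 'A' && c <= 'F')
    {
        return c - 'A' + 10;
    }
    return -1;
}

/* Decode the "%XX" triplet at the start of `s`. Returns the byte value
 * (0 to 255), or -1 when `s` does not start with a well-formed triplet.
 */
int decode_percent(const char *s)
{
    int hi, lo;

    assert(s != NULL);

    if(s[0] != '%')
    {
        return -1;
    }
    hi = hex_value((unsigned char) s[1]);
    if(hi < 0)
    {
        return -1; /* also covers a string that ends right after '%' */
    }
    lo = hex_value((unsigned char) s[2]);
    if(lo < 0)
    {
        return -1;
    }
    return hi * 16 + lo;
}
```

The second digit is only read after the first one has been accepted, so a truncated input such as "%" or "%2" never reads past its terminating nul.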
721 :
722 :
723 : /** \brief Check that a URI is valid.
724 : *
725 : * This function very quickly parses a URI to determine whether it
726 : * is valid.
727 : *
728 : * Note that it does not (currently) support local naming conventions
729 : * which means that a host such as "localhost" will fail the test.
730 : *
731 : * The \p protocols variable can be set to a list of protocol names
732 : * that are considered valid. For example, for HTTP protocol one
733 : * could use "http,https". To accept any protocol use an asterisk
734 : * as in: "*". The protocol must contain only letters, digits, or
735 : * underscores ([0-9A-Za-z_]+) and it must be at least one character.
736 : *
737 : * The flags can be set to the following values; OR them together to set
738 : * multiple flags at the same time:
739 : *
740 : * \li VALID_URI_ASCII_ONLY -- refuse characters that are not in the
741 : * 7-bit ASCII range (we expect the URI to be UTF-8 encoded and any
742 : * byte with bit 7 set is considered invalid if this flag is set,
743 : * including encoded bytes such as %A0)
744 : * \li VALID_URI_NO_SPACES -- refuse spaces whether they are encoded
745 : * with + or %20 or verbatim.
746 : *
747 : * The return value is generally TLD_RESULT_BAD_URI when an invalid
748 : * character is found in the URI string. The TLD_RESULT_NULL is
749 : * returned if the URI is a NULL pointer or an empty string.
750 : * Other results may be returned by the tld() function. If a result
751 : * other than TLD_RESULT_SUCCESS is returned then the info structure
752 : * may or may not be updated.
753 : *
754 : * \param[in] uri The URI whose validity is being checked.
755 : * \param[out] info The resulting information about the URI domain and TLD.
756 : * \param[in] protocols List of comma separated protocols accepted.
757 : * \param[in] flags A set of flags to tell the function what is valid/invalid.
758 : *
759 : * \return The result of the operation, TLD_RESULT_SUCCESS if the URI is
760 : * valid.
761 : *
762 : * \sa tld()
763 : */
764 268 : enum tld_result tld_check_uri(const char *uri, struct tld_info *info, const char *protocols, int flags)
765 : {
766 : const char *p, *q, *username, *password, *host, *port, *n, *a, *query_string;
767 : char domain[256];
768 : int protocol_length, length, valid, c, i, j, anchor;
769 : enum tld_result result;
770 :
771 : /* set defaults in the info structure */
772 268 : tld_clear_info(info);
773 :
774 268 : if(uri == NULL || uri[0] == '\0')
775 : {
776 2 : return TLD_RESULT_NULL;
777 : }
778 :
779 : /* check the protocol: [0-9A-Za-z_]+ */
780 1337 : for(p = uri; *uri != '\0' && *uri != ':'; ++uri)
781 : {
782 1072 : if((*uri < 'a' || *uri > 'z')
783 5 : && (*uri < 'A' || *uri > 'Z')
784 1 : && (*uri < '0' || *uri > '9')
785 1 : && *uri != '_')
786 : {
787 1 : return TLD_RESULT_BAD_URI;
788 : }
789 : }
790 265 : valid = 0;
791 265 : protocol_length = (int) (uri - p);
792 265 : c = tolower(*p);
793 4304 : for(q = protocols; *q != '\0';)
794 : {
795 4037 : if(q[0] == '*' && (q[1] == '\0' || q[1] == ','))
796 : {
797 1 : valid = 1;
798 1 : break;
799 : }
800 4036 : if(tolower(*q) == c)
801 : {
802 273 : if(strncasecmp(p, q, protocol_length) == 0
803 262 : && (q[protocol_length] == '\0' || q[protocol_length] == ','))
804 : {
805 262 : valid = 1;
806 262 : break;
807 : }
808 : }
809 : /* move to the next protocol */
810 3774 : for(; *q != '\0' && *q != ','; ++q);
811 3774 : for(; *q == ','; ++q);
812 : }
813 265 : if(valid == 0)
814 : {
815 2 : return TLD_RESULT_BAD_URI;
816 : }
817 263 : if(uri[1] != '/' || uri[2] != '/')
818 : {
819 3 : return TLD_RESULT_BAD_URI;
820 : }
821 260 : uri += 3; /* skip the '://' */
822 :
823 : /* extract the complete domain name with sub-domains, etc. */
824 260 : username = NULL;
825 260 : host = uri;
826 4671 : for(; *uri != '/' && *uri != '\0'; ++uri)
827 : {
828 4419 : if((unsigned char) *uri < ' ')
829 : {
830 : /* forbid control characters in domain name */
831 1 : return TLD_RESULT_BAD_URI;
832 : }
833 4418 : if(*uri == '@')
834 : {
835 7 : if(username != NULL)
836 : {
837 : /* two '@' signs are not possible */
838 1 : return TLD_RESULT_BAD_URI;
839 : }
840 6 : username = host;
841 6 : host = uri + 1;
842 : }
843 4411 : else if(*uri & 0x80)
844 : {
845 1 : if(flags & VALID_URI_ASCII_ONLY)
846 : {
847 : /* only ASCII allowed by caller */
848 1 : return TLD_RESULT_BAD_URI;
849 : }
850 : }
851 4410 : else if(*uri == ' ' || *uri == '+')
852 : {
853 : /* spaces not allowed in domain name */
854 2 : return TLD_RESULT_BAD_URI;
855 : }
856 4408 : else if(*uri == '%')
857 : {
858 : /* the next two digits must be hex
859 : * note that the first digit must be at least 2 because
860 : * we do not allow control characters
861 : */
862 5 : if(((uri[1] < '2' || uri[1] > '9')
863 2 : && (uri[1] < 'a' || uri[1] > 'f')
864 2 : && (uri[1] < 'A' || uri[1] > 'F'))
865 4 : || ((uri[2] < '0' || uri[2] > '9')
866 2 : && (uri[2] < 'a' || uri[2] > 'f')
867 1 : && (uri[2] < 'A' || uri[2] > 'F')))
868 : {
869 1 : return TLD_RESULT_BAD_URI;
870 : }
871 4 : if(uri[1] == '2' && uri[2] == '0')
872 : {
873 : /* spaces not allowed in domain name */
874 1 : return TLD_RESULT_BAD_URI;
875 : }
876 3 : if(uri[1] >= '8' && (flags & VALID_URI_ASCII_ONLY))
877 : {
878 : /* only ASCII allowed by caller */
879 1 : return TLD_RESULT_BAD_URI;
880 : }
881 : /* skip the two digits right away */
882 2 : uri += 2;
883 : }
884 : }
885 252 : if(username != NULL)
886 : {
887 5 : password = username;
888 5 : for(; *password != '@' && *password != ':'; ++password);
889 5 : if(*password == ':')
890 : {
891 4 : if((host - 1) - (password + 1) <= 0)
892 : {
893 : /* an empty password is not acceptable */
894 2 : return TLD_RESULT_BAD_URI;
895 : }
896 : }
897 3 : if(password - username - 1 <= 0)
898 : {
899 : /* username cannot be empty */
900 2 : return TLD_RESULT_BAD_URI;
901 : }
902 : }
903 248 : for(port = host; *port != ':' && port < uri; ++port);
904 248 : if(*port == ':')
905 : {
906 : /* we have a port, it must be digits [0-9]+ */
907 6 : for(n = port + 1; *n >= '0' && *n <= '9'; ++n);
908 6 : if(n != uri || n == port + 1)
909 : {
910 : /* port is empty or includes invalid characters */
911 3 : return TLD_RESULT_BAD_URI;
912 : }
913 : }
914 :
915 : /* check the address really quick */
916 245 : query_string = NULL;
917 245 : anchor = 0;
918 774 : for(a = uri; *a != '\0'; ++a)
919 : {
920 544 : if((unsigned char) *a < ' ')
921 : {
922 : /* no control characters allowed */
923 2 : return TLD_RESULT_BAD_URI;
924 : }
925 542 : else if(*a == '+' || *a == ' ') /* old space encoding */
926 : {
927 2 : if(flags & VALID_URI_NO_SPACES)
928 : {
929 : /* spaces not allowed by caller */
930 2 : return TLD_RESULT_BAD_URI;
931 : }
932 : }
933 540 : else if(*a == '?')
934 : {
935 7 : query_string = a + 1;
936 : }
937 533 : else if(*a == '&' && anchor == 0)
938 : {
939 4 : if(query_string == NULL)
940 : {
941 : /* & must be encoded if used before ? */
942 1 : return TLD_RESULT_BAD_URI;
943 : }
944 3 : query_string = a + 1;
945 : }
946 529 : else if(*a == '=')
947 : {
948 10 : if(query_string != NULL && a - query_string == 0)
949 : {
950 : /* a query string variable name cannot be empty */
951 3 : return TLD_RESULT_BAD_URI;
952 : }
953 : }
954 519 : else if(*a == '#')
955 : {
956 1 : query_string = NULL;
957 1 : anchor = 1;
958 : }
959 518 : else if(*a == '%')
960 : {
961 : /* the next two digits must be hex
962 : * note that the first digit must be at least 2 because
963 : * we do not allow control characters
964 : */
965 7 : if(((a[1] < '2' || a[1] > '9')
966 3 : && (a[1] < 'a' || a[1] > 'f')
967 3 : && (a[1] < 'A' || a[1] > 'F'))
968 4 : || ((a[2] < '0' || a[2] > '9')
969 3 : && (a[2] < 'a' || a[2] > 'f')
970 1 : && (a[2] < 'A' || a[2] > 'F')))
971 : {
972 4 : return TLD_RESULT_BAD_URI;
973 : }
974 3 : if(a[1] == '2' && a[2] == '0' && (flags & VALID_URI_NO_SPACES))
975 : {
976 : /* spaces not allowed by caller */
977 1 : return TLD_RESULT_BAD_URI;
978 : }
979 2 : if(a[1] >= '8' && (flags & VALID_URI_ASCII_ONLY))
980 : {
981 : /* only ASCII allowed by caller */
982 1 : return TLD_RESULT_BAD_URI;
983 : }
984 : /* skip the two digits right away */
985 1 : a += 2;
986 : }
987 511 : else if(*a & 0x80)
988 : {
989 3 : if(flags & VALID_URI_ASCII_ONLY)
990 : {
991 : /* only ASCII allowed by caller */
992 1 : return TLD_RESULT_BAD_URI;
993 : }
994 : }
995 : }
996 :
997 : /* check the domain */
998 :
999 : /** \todo
1000 : * The following is WRONG:
1001 : * \li the domain \%XX are not being checked properly, as it stands the
1002 : * characters following % can be anything!
1003 : * \li the tld() function must be called with the characters still
1004 : * encoded; if you look at the data, you will see that I kept
1005 : * the data encoded (i.e. with the \%XX characters)
1006 : * \li what could be checked (which I guess could be for the entire
1007 : * domain name) is whether the entire string represents valid
1008 : * UTF-8; I don't think I'm currently doing so here. (I have
1009 : * such functions in the tld_domain_to_lowercase() now)
1010 : */
1011 :
1012 230 : length = (int) (port - host);
1013 230 : if(length >= (int) (sizeof(domain) / sizeof(domain[0])))
1014 : {
1015 : /* sub-domains + domain + TLD is more than 255 characters?!
1016 : * note that the host may include many %XX characters but
1017 : * we ignore the fact here at this time; we could move this
1018 : * test in the for() loop below though.
1019 : */
1020 1 : return TLD_RESULT_BAD_URI;
1021 : }
1022 229 : if(length == 0)
1023 : {
1024 : /* although we could return TLD_RESULT_NULL it would not be
1025 : * valid here because "http:///blah.com" is invalid, not NULL
1026 : */
1027 1 : return TLD_RESULT_BAD_URI;
1028 : }
1029 3787 : for(i = 0, j = 0; i < length; ++i, ++j)
1030 : {
1031 3559 : if(host[i] == '%') {
1032 2 : domain[j] = (char) (h2d(host[i + 1]) * 16 + h2d(host[i + 2]));
1033 2 : i += 2; /* skip the 2 digits */
1034 : }
1035 : else
1036 : {
1037 3557 : domain[j] = host[i];
1038 : }
1039 : /* TODO: check that characters are acceptable in a domain name */
1040 : }
1041 228 : domain[j] = '\0';
1042 228 : result = tld(domain, info);
1043 228 : if(info->f_tld != NULL)
1044 : {
1045 : /* define the TLD inside the source string which "unfortunately"
1046 : * is not null terminated by '\0'; also fix the offset since in
1047 : * the complete URI the TLD is a bit further away
1048 : */
1049 227 : info->f_tld = host + info->f_offset;
1050 227 : info->f_offset = (int) (info->f_tld - p);
1051 : }
1052 228 : return result;
1053 : }
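The protocol check at the top of the function above walks a comma separated list and accepts a `*` entry as a wildcard. A minimal standalone sketch of that matching follows. The helper name `sketch_protocol_allowed` is hypothetical and not part of libtld, and it compares entry lengths up front rather than calling `strncasecmp()`:

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of libtld: return 1 when the first
 * `length` characters of `scheme` match one entry of the comma separated
 * `protocols` list (case insensitive) or when the list has a "*" entry.
 */
static int sketch_protocol_allowed(const char *scheme, size_t length, const char *protocols)
{
    const char *q = protocols;
    size_t i;

    assert(scheme != NULL && protocols != NULL);
    while(*q != '\0')
    {
        /* length of the current list entry */
        size_t n = strcspn(q, ",");

        if(n == 1 && q[0] == '*')
        {
            /* wildcard entry, any well formed scheme is accepted */
            return 1;
        }
        if(n == length && length > 0)
        {
            for(i = 0; i < n; ++i)
            {
                if(tolower((unsigned char) scheme[i]) != tolower((unsigned char) q[i]))
                {
                    break;
                }
            }
            if(i == n)
            {
                return 1;
            }
        }
        /* move to the next entry, skipping the separator(s) */
        q += n;
        while(*q == ',')
        {
            ++q;
        }
    }
    return 0;
}
```

Comparing the entry length first means that `http` never matches the longer `https` entry, which is the same effect the original obtains by testing the character that follows the match.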
1054 :
1055 :
1056 : /** \brief Return the version of the library.
1057 : *
1058 : * This function returns the version of this library. The version
1059 : * is defined with three numbers: \<major>.\<minor>.\<patch>.
1060 : *
1061 : * You can use the version string to compare different libtld
1062 : * versions and determine which one is the newest.
1063 : *
1064 : * \return A constant string with the version of the library.
1065 : */
1066 9 : const char *tld_version()
1067 : {
1068 9 : return LIBTLD_VERSION;
1069 : }
1070 :
1071 :
1072 : /** \def LIBTLD_EXPORT
1073 : * \brief The export API used by MS-Windows DLLs.
1074 : *
1075 : * This definition is used to mark functions and classes as exported
1076 : * from the library. This allows other programs to automatically use
1077 : * functions defined in the library.
1078 : *
1079 : * The LIBTLD_EXPORT may be set to dllexport or dllimport depending
1080 : * on whether you compile the library or you intend to link against it.
1081 : */
1082 :
1083 : /** \def LIBTLD_VERSION
1084 : * \brief The version of the library as a string.
1085 : *
1086 : * This definition represents the version of the libtld header you
1087 : * are compiling against. You can compare it to the returned value
1088 : * of the tld_version() function to make sure that everything is
1089 : * compatible (i.e. if the version is not the same, then the
1090 : * tld_info structure may have changed.)
1091 : */
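Comparing the compile-time and run-time versions can be sketched as follows. This assumes both strings follow the `<major>.<minor>.<patch>` form described for tld_version(). The helper name `sketch_compare_versions` is hypothetical and not part of libtld:

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper, not part of libtld: compare two
 * "<major>.<minor>.<patch>" strings. Returns a negative value, 0, or a
 * positive value like strcmp(). When either string cannot be parsed,
 * *ok is set to 0 and the function returns 0.
 */
static int sketch_compare_versions(const char *a, const char *b, int *ok)
{
    int va[3], vb[3], i;

    assert(a != NULL && b != NULL && ok != NULL);
    *ok = sscanf(a, "%d.%d.%d", &va[0], &va[1], &va[2]) == 3
       && sscanf(b, "%d.%d.%d", &vb[0], &vb[1], &vb[2]) == 3;
    if(!*ok)
    {
        return 0;
    }
    for(i = 0; i < 3; ++i)
    {
        if(va[i] != vb[i])
        {
            /* numeric comparison, so "1.5.13" is newer than "1.5.2" */
            return va[i] < vb[i] ? -1 : 1;
        }
    }
    return 0;
}
```

With libtld available, the two arguments would typically be LIBTLD_VERSION and the string returned by tld_version().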
1092 :
1093 : /** \def LIBTLD_VERSION_MAJOR
1094 : * \brief The major version as a number.
1095 : *
1096 : * This definition represents the major version of the libtld header
1097 : * you are compiling against.
1098 : */
1099 :
1100 : /** \def LIBTLD_VERSION_MINOR
1101 : * \brief The minor version as a number.
1102 : *
1103 : * This definition represents the minor version of the libtld header
1104 : * you are compiling against.
1105 : */
1106 :
1107 : /** \def LIBTLD_VERSION_PATCH
1108 : * \brief The patch version as a number.
1109 : *
1110 : * This definition represents the patch version of the libtld header
1111 : * you are compiling against. Some people call this number the release
1112 : * number.
1113 : */
1114 :
1115 : /** \def VALID_URI_ASCII_ONLY
1116 : * \brief Whether to check that the URI only includes ASCII.
1117 : *
1118 : * By default the tld_check_uri() function accepts any extended character
1119 : * (i.e. characters over 0x80). This flag can be used to refuse such
1120 : * characters.
1121 : */
1122 :
1123 : /** \def VALID_URI_NO_SPACES
1124 : * \brief Whether to check that the URI does not include any spaces.
1125 : *
1126 : * By default the tld_check_uri() function accepts spaces as valid
1127 : * characters in a URI (whether they are explicit " ", or written as
1128 : * "+" or "%20".) This flag can be used to refuse all spaces (i.e.
1129 : * this means the "+" and "%20" are also refused.)
1130 : */
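The three spellings of a space named above (" ", "+", and "%20") can be detected with a small scan. The following is only a sketch with a hypothetical helper name, `sketch_has_space`. In the listing above the flag governs the part of the URI after the host, while spaces in the host name are always refused:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of libtld: return 1 when `s` includes a
 * space written as " ", "+", or "%20", and 0 otherwise.
 */
static int sketch_has_space(const char *s)
{
    assert(s != NULL);
    for(; *s != '\0'; ++s)
    {
        if(*s == ' ' || *s == '+')
        {
            return 1;
        }
        /* s[2] is only read once s[1] is known to be '2', so the scan
         * never reads past the terminating '\0'
         */
        if(s[0] == '%' && s[1] == '2' && s[2] == '0')
        {
            return 1;
        }
    }
    return 0;
}
```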
1131 :
1132 : /** \enum tld_category
1133 : * \brief The list of categories for the different TLDs.
1134 : *
1135 : * Defines the category of the TLD. The most well known categories
1136 : * are International TLDs (such as .com and .info) and the country
1137 : * TLDs (such as .us, .uk, .fr, etc.)
1138 : *
1139 : * IANA offers and is working on other extensions such as .pro for
1140 : * professionals, and .arpa for their internal infrastructure.
1141 : */
1142 :
1143 : /** \var TLD_CATEGORY_INTERNATIONAL
1144 : * \brief International TLDs
1145 : *
1146 : * This category represents TLDs that can be used by anyone anywhere
1147 : * in the world. In some cases, these have some limits (i.e. only a
1148 : * museum can register a .museum TLD.) However, the most well known
1149 : * international extension is .com and this one has absolutely no
1150 : * restrictions.
1151 : */
1152 :
1153 : /** \var TLD_CATEGORY_PROFESSIONALS
1154 : * \brief Professional TLDs
1155 : *
1156 : * This category is offered to professionals. Some countries already
1157 : * offer second-level domain name registrations for professionals and
1158 : * either way they are not used very much. These are reserved for people
1159 : * such as accountants, attorneys, and doctors.
1160 : *
1161 : * Only people who have a license with a government can register a .pro
1162 : * domain name.
1163 : */
1164 :
1165 : /** \var TLD_CATEGORY_LANGUAGE
1166 : * \brief Language specific TLDs
1167 : *
1168 : * At time of writing, there is one language extension: .cat for the
1169 : * Catalan language. The idea of the language extensions is to offer
1170 : * a language, rather than a country, a way to have a website that
1171 : * all the people on the Earth can read in their language.
1172 : */
1173 :
1174 : /** \var TLD_CATEGORY_GROUPS
1175 : * \brief Groups specific TLDs
1176 : *
1177 : * The concept of groups is similar to the language grouping, but in
1178 : * this case it may refer to a specific group of people (but not
1179 : * based on anything such as ethnicity.)
1180 : *
1181 : * Examples of groups are Kids, Gay people, Ecologists, etc. This is
1182 : * only proposed at this point.
1183 : */
1184 :
1185 : /** \var TLD_CATEGORY_REGION
1186 : * \brief Region specific TLDs
1187 : *
1188 : * It has been proposed, like the .eu, to have extensions based on
1189 : * well defined regions such as .asia for all of Asia. We currently
1190 : * also have .aq for Antarctica. Some proposed regions are .africa
1191 : * and city names such as .paris and .wien.
1192 : *
1193 : * Old TLDs that were for countries but are no longer assigned to those
1194 : * because the country \em disappeared (i.e. in general was split in
1195 : * two and both new countries have different names,) and future
1196 : * regions appear in this category.
1197 : *
1198 : * We keep old TLDs because it is not unlikely that such will be
1199 : * used every now and then and they can, in this way, cleanly be
1200 : * refused by your software.
1201 : */
1202 :
1203 : /** \var TLD_CATEGORY_TECHNICAL
1204 : * \brief Technical extensions are considered internal.
1205 : *
1206 : * These are likely valid (i.e. the .arpa is valid) but are used for
1207 : * technical reasons and not for regular URIs. So they are present
1208 : * but must certainly be ignored by your software.
1209 : *
1210 : * To avoid returning TLD_RESULT_SUCCESS when a TLD with such a
1211 : * category is found, we mark these with the
1212 : * TLD_STATUS_INFRASTRUCTURE.
1213 : */
1214 :
1215 : /** \var TLD_CATEGORY_COUNTRY
1216 : * \brief A country extension.
1217 : *
1218 : * Most of the extensions are country extensions. Country extensions
1219 : * are generally further broken down with second-level domain names.
1220 : * Some countries even have third, fourth, and fifth level domain
1221 : * names.
1222 : */
1223 :
1224 : /** \var TLD_CATEGORY_ENTREPRENEURIAL
1225 : * \brief A private extension.
1226 : *
1227 : * Some private companies and individuals purchased domains that they
1228 : * then use as a TLD reselling sub-domains from that main domain name.
1229 : *
1230 : * For example, the ".blogspot.com" domain is offered by blogspot as
1231 : * a TLD to their users. This gives the users the capability to
1232 : * define a cookie at the ".blogspot.com" level but not directly
1233 : * under ".com". In other words, two distinct sites such as:
1234 : *
1235 : * \li "a.blogspot.com", and
1236 : * \li "b.blogspot.com"
1237 : *
1238 : * cannot share their cookies. Yet, ".com" by itself is also a
1239 : * top-level domain name that anyone can use.
1240 : */
1241 :
1242 : /** \var TLD_CATEGORY_BRAND
1243 : * \brief The TLD is owned and represents a brand.
1244 : *
1245 : * This category is used to mark top level domain names that are
1246 : * specific to one company. Note that certain TLDs are owned by
1247 : * companies now, but they are not automatically marked as a
1248 : * brand (i.e. ".lol").
1249 : */
1250 :
1251 : /** \var TLD_CATEGORY_UNDEFINED
1252 : * \brief The TLD was not found.
1253 : *
1254 : * This category is used to initialize the information structure and
1255 : * is used to show that the TLD was not found.
1256 : */
1257 :
1258 : /** \enum tld_status
1259 : * \brief Defines the current status of the TLD.
1260 : *
1261 : * Each TLD has a status. By default, it is generally considered valid,
1262 : * however, many TLDs are either proposed or deprecated.
1263 : *
1264 : * Proposed TLDs are not yet officially accepted by the official entities
1265 : * taking care of those TLDs. They should be refused, but may become
1266 : * available later.
1267 : *
1268 : * Deprecated TLDs were in use before but got dropped. They may be dropped
1269 : * because a country doesn't follow up on their Internet TLD, or because
1270 : * the extension is found to be \em boycotted.
1271 : */
1272 :
1273 : /** \var TLD_STATUS_VALID
1274 : * \brief The TLD is currently valid.
1275 : *
1276 : * This status represents a TLD that is currently fully valid and supported
1277 : * by the owners.
1278 : *
1279 : * These can be part of URIs representing valid resources.
1280 : */
1281 :
1282 : /** \var TLD_STATUS_PROPOSED
1283 : * \brief The TLD was proposed but not yet accepted.
1284 : *
1285 : * The TLD is nearly considered valid, at least it is in the process of being
1286 : * accepted. The TLD will not work until officially accepted.
1287 : *
1288 : * No valid URIs can include this TLD until it becomes TLD_STATUS_VALID.
1289 : */
1290 :
1291 : /** \var TLD_STATUS_DEPRECATED
1292 : * \brief The TLD was once in use.
1293 : *
1294 : * This status is used by TLDs that were valid (TLD_STATUS_VALID) at some point
1295 : * in time and were changed to another TLD, rendering that one useless (or
1296 : * \em incorrect in the case of a country name change.)
1297 : *
1298 : * This status means such URIs are not to be considered valid. However, it may
1299 : * be possible to emit a 301 (in terms of HTTP protocol) to fix the problem.
1300 : */
1301 :
1302 : /** \var TLD_STATUS_UNUSED
1303 : * \brief The TLD was officially assigned but not put to use.
1304 : *
1305 : * This special status is used for all the TLDs that were assigned to a specific
1306 : * entity, but never actually put to use. Many smaller countries (especially
1307 : * islands) are assigned this status.
1308 : *
1309 : * Unused TLDs are not valid in any URI until marked valid.
1310 : */
1311 :
1312 : /** \var TLD_STATUS_RESERVED
1313 : * \brief The TLD is reserved so no one can use it.
1314 : *
1315 : * This special case forces the specified TLDs into a "do not use" list. Such
1316 : * TLDs may be used by people who wish they were official, but they are not
1317 : * considered \em legal.
1318 : *
1319 : * A reserved TLD may represent a second TLD that was assigned to a specific
1320 : * country or other category. It may be possible to do a transfer from that
1321 : * TLD to the official TLD (i.e. Great Britain was assigned .gb, but instead
1322 : * uses .uk; URIs with .gb could be transformed with .uk and checked for
1323 : * validity.)
1324 : */
1325 :
1326 : /** \var TLD_STATUS_INFRASTRUCTURE
1327 : * \brief These TLDs are reserved for the Internet infrastructure.
1328 : *
1329 : * These TLDs cannot be used with standard URIs. These are used to make the
1330 : * Internet functional instead.
1331 : *
1332 : * All URIs for standard resources must refuse these TLDs.
1333 : */
1334 :
1335 : /** \var TLD_STATUS_UNDEFINED
1336 : * \brief Special status to indicate we did not find the TLD.
1337 : *
1338 : * The info structure is returned with an \em undefined status whenever the
1339 : * TLD could not be found in the list of existing TLDs. This means the URI
1340 : * is completely invalid. (The only exception would be if you support some
1341 : * internal TLDs.)
1342 : *
1343 : * URIs that cannot get a TLD_STATUS_VALID should all be considered invalid.
1344 : * But those marked as TLD_STATUS_UNDEFINED are completely invalid. This
1345 : * being said, you may want to make sure you passed the correct string.
1346 : * The URI must be just and only the set of sub-domains, the domain, and
1347 : * the TLDs. No protocol, slashes, colons, paths, query strings, anchors
1348 : * are accepted in the URI.
1349 : */
1350 :
1351 : /** \var TLD_STATUS_EXCEPTION
1352 : * \brief Special status to indicate an exception which is not directly a TLD.
1353 : *
1354 : * When a NIC decides to change their setup it can generate exceptions. For
1355 : * example, the UK first made use of .uk and as such offered a few customers
1356 : * to use .uk. Later they decided to only offer second level domain names
1357 : * such as the .co.uk and .ac.uk. This generates a few exceptions on the .uk
1358 : * domain name. For example, the police.uk domain is still in use and thus
1359 : * it is an exception. We reference it as ".police.uk" in our XML data file
1360 : * yet the TLD in that case is just ".uk".
1361 : */
1362 :
1363 :
1364 : /** \enum tld_result
1365 : * \brief The result returned by tld().
1366 : *
1367 : * This enumeration defines all the possible results of the tld() function.
1368 : *
1369 : * Only the TLD_RESULT_SUCCESS is considered to represent a valid result.
1370 : *
1371 : * The TLD_RESULT_INVALID represents a TLD that was found but is not currently
1372 : * marked as valid (it may be deprecated or proposed, for example.)
1373 : */
1374 :
1375 : /** \var TLD_RESULT_SUCCESS
1376 : * \brief Success! The TLD of the specified URI is valid.
1377 : *
1378 : * This result is returned when the URI includes a valid TLD. The function
1379 : * further includes valid results in the tld_info structure.
1380 : *
1381 : * You can accept this URI as valid.
1382 : */
1383 :
1384 : /** \var TLD_RESULT_INVALID
1385 : * \brief The TLD was found, but it is marked as invalid.
1386 : *
1387 : * This result represents a TLD that is not valid as is for a URI, but it
1388 : * was defined in the TLD data. The function includes further information
1389 : * in the tld_info structure. There you can check the category, status,
1390 : * and other parameters to determine what the TLD really represents.
1391 : *
1392 : * It may be possible to use such a TLD, although as far as web addresses
1393 : * are concerned, these are not considered valid. As mentioned in the
1394 : * statuses, some may mean that the TLD can be changed for another and
1395 : * work (i.e. a country name that changed.)
1396 : */
1397 :
1398 : /** \var TLD_RESULT_NULL
1399 : * \brief The input URI is empty.
1400 : *
1401 : * The tld() function returns this value whenever the input URI pointer is
1402 : * NULL or the empty string (""). Obviously, no TLD is found in this case.
1403 : */
1404 :
1405 : /** \var TLD_RESULT_NO_TLD
1406 : * \brief The input URI has no TLD defined.
1407 : *
1408 : * Whenever the URI does not include at least one period (.), this error
1409 : * is returned. Local URIs are considered valid and don't generally include
1410 : * a period (i.e. "localhost", "my-computer", "johns-computer", etc.) We
1411 : * expect that the tld() function would not be called with such URIs.
1412 : *
1413 : * A valid Internet URI must include a TLD.
1414 : */
1415 :
1416 : /** \var TLD_RESULT_BAD_URI
1417 : * \brief The URI includes characters that are not accepted by the function.
1418 : *
1419 : * This value is returned if a character is found to be incompatible or a
1420 : * sequence of characters is found incompatible.
1421 : *
1422 : * At this time, tld() returns this error if two periods (.) are found one
1423 : * after another. The errors will be increased with time to detect invalid
1424 : * characters (anything outside of [-a-zA-Z0-9.%].)
1425 : *
1426 : * Note that the URI should not start or end with a period. This error will
1427 : * also be returned (at some point) when the function detects such problems.
1428 : */
1429 :
1430 : /** \var TLD_RESULT_NOT_FOUND
1431 : * \brief The URI has a TLD that could not be determined.
1432 : *
1433 : * The TLD of the URI was searched in the TLD data and could not be found
1434 : * there. This means the TLD is not a valid Internet TLD.
1435 : */
1436 :
1437 :
1438 : /** \struct tld_info
1439 : * \brief Set of information returned by the tld() function.
1440 : *
1441 : * This structure is used by the tld() function to define the results to
1442 : * return to the caller.
1443 : *
1444 : * Remember that this is a C structure. By default, the fields are undefined.
1445 : * The tld() function will first define these fields, before returning any
1446 : * result.
1447 : *
1448 : * It is acceptable to clear the structure before calling the tld() function
1449 : * but it is not required.
1450 : */
1451 :
1452 : /** \var enum tld_category tld_info::f_category;
1453 : * \brief The category of the TLD.
1454 : *
1455 : * This represents the category of the TLD. One of the tld_category enumeration
1456 : * values can be found in this field.
1457 : *
1458 : * \sa enum tld_category
1459 : */
1460 :
1461 : /** \var enum tld_status tld_info::f_status;
1462 : * \brief The status of the TLD.
1463 : *
1464 : * This value defines the current status of the TLD. Most of the TLDs we define
1465 : * are valid, but some are either deprecated, unused, or proposed.
1466 : *
1467 : * Only a TLD marked as TLD_STATUS_VALID should be considered valid, although
1468 : * others may be accepted in some circumstances.
1469 : *
1470 : * \sa enum tld_status
1471 : */
1472 :
1473 : /** \var const char *tld_info::f_country;
1474 : * \brief The country where this TLD is used.
1475 : *
1476 : * When the f_category is set to TLD_CATEGORY_COUNTRY then this field is a
1477 : * pointer to the name of the country in English (although some may include
1478 : * accents, the strings are in UTF-8.)
1479 : *
1480 : * This field is set to NULL if the category is not Country or the TLD was
1481 : * not found.
1482 : *
1483 : * \sa tld_info::f_category
1484 : * \sa enum tld_category
1485 : */
1486 :
1487 : /** \var const char *tld_info::f_tld;
1488 : * \brief Pointer to the TLD in the URI string you supplied.
1489 : *
1490 : * This is a pointer to the TLD section that the tld() function found in
1491 : * your URI. Note that it is valid only as long as your URI string pointer is.
1492 : *
1493 : * It is also possible to make use of the tld_info::f_offset value to
1494 : * extract the TLD, domain, or sub-domains.
1495 : *
1496 : * If the TLD is not found, this field is NULL.
1497 : */
1498 :
1499 : /** \var int tld_info::f_offset;
1500 : * \brief The offset to the TLD in the URI string you supplied.
1501 : *
1502 : * This offset, when added to the URI string pointer, gets you to the
1503 : * TLD of that URI. The offset can also be used to start searching
1504 : * for the beginning of the domain name by searching for the previous
1505 : * period from that offset minus one. In effect, this gives you a
1506 : * way to determine the list of sub-domains.
1507 : */
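The backward search described above can be sketched as follows. The sketch assumes a bare host name and assumes that the offset designates the leading period of the TLD. The helper name `sketch_domain_start` is hypothetical and not part of libtld:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of libtld: given a bare host name and
 * the offset of its TLD (assumed to designate the TLD's leading period),
 * return a pointer to the first character of the domain name.
 */
static const char *sketch_domain_start(const char *host, int tld_offset)
{
    int i;

    assert(host != NULL && tld_offset >= 0);
    /* search for the previous period starting at the offset minus one */
    for(i = tld_offset - 1; i >= 0; --i)
    {
        if(host[i] == '.')
        {
            /* everything before this period is a list of sub-domains */
            return host + i + 1;
        }
    }
    /* no period found: there are no sub-domains */
    return host;
}
```

For "www.example.co.uk" with a TLD offset of 11, the returned pointer designates "example.co.uk" and the characters before it form the sub-domain list. With an offset computed by tld_check_uri(), a search like this would also need to stop at the start of the host rather than at index zero.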
1508 :
1509 : /** \struct tld_description
1510 : * \brief [internal] The description of one TLD.
1511 : * \internal
1512 : *
1513 : * The XML data is transformed into an array of TLD descriptions saved in this
1514 : * structure.
1515 : *
1516 : * This structure is internal to the database. You never are given direct
1517 : * access to it. However, some of the constant pointers (i.e. country names)
1518 : * do point to that data.
1519 : */
1520 :
1521 : /** \var tld_description::f_category
1522 : * \brief The category of this entry.
1523 : *
1524 : * The XML data must define the different TLDs inside categorized area
1525 : * tags. This variable represents that category.
1526 : */
1527 :
1528 : /** \var tld_description::f_country
1529 : * \brief The name of the country owning this TLD.
1530 : *
1531 : * The name of the country owning this entry. Many TLDs do not have a
1532 : * country attached to them (.com and .info, for example), in which case
1533 : * this pointer is NULL.
1534 : */
1535 :
1536 : /** \var tld_description::f_start_offset
1537 : * \brief The first offset of a list of TLDs.
1538 : *
1539 : * This offset represents the start of a list of TLDs. The start offset is
1540 : * inclusive so that very offset IS included in the list.
1541 : *
1542 : * The TLDs being referenced from this TLD are those between f_start_offset
1543 : * and f_end_offset - 1 also written:
1544 : *
1545 : * [f_start_offset, f_end_offset)
1546 : */
1547 :
1548 : /** \var tld_description::f_end_offset
1549 : * \brief The last offset of a list of TLDs.
1550 : *
1551 : * This offset represents the end of a list of TLDs. The end offset is
1552 : * exclusive so that very offset is NOT included in the list.
1553 : *
1554 : * The TLDs being referenced from this TLD are those between f_start_offset
1555 : * and f_end_offset - 1 also written:
1556 : *
1557 : * [f_start_offset, f_end_offset)
1558 : */
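The half-open range [f_start_offset, f_end_offset) maps directly onto a loop with an inclusive lower bound and an exclusive upper bound. The sketch below uses a simplified, hypothetical stand-in structure (`sketch_entry`) and a linear scan. The real table layout and search strategy live in the library's generated data and may differ:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for one table entry (not the real
 * tld_description structure).
 */
struct sketch_entry
{
    const char *name;
    int start_offset;   /* inclusive */
    int end_offset;     /* exclusive */
};

/* Search the entries in [start, end) for `name`; return its index in
 * the table or -1 when it is not in that range.
 */
static int sketch_find_child(const struct sketch_entry *table, int start, int end, const char *name)
{
    int i;

    assert(table != NULL && name != NULL);
    for(i = start; i < end; ++i)   /* `end` itself is never visited */
    {
        if(strcmp(table[i].name, name) == 0)
        {
            return i;
        }
    }
    return -1;
}
```

An entry without sub-entries can then be represented by equal start and end offsets, which makes the loop body run zero times.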
1559 :
1560 : /** \var tld_description::f_exception_apply_to
1561 : * \brief This TLD is an exception of the "apply to" TLD.
1562 : *
1563 : * With time, some TLDs were expected to have or not have certain sub-domains
1564 : * and when removal of those was partial (i.e. did not force existing owners
1565 : * to lose their domain) then we have exceptions. This variable holds the
1566 : * necessary information to support such exceptions.
1567 : *
1568 : * The "apply to" is only defined if the entry is an exception (see f_status.)
1569 : * The f_exception_apply_to value is an offset to the very TLD we want to
1570 : * return when we get this exception.
1571 : */
1572 :
1573 : /** \var tld_description::f_exception_level
1574 : * \brief This entry is an exception representing a TLD at this specified level.
1575 : *
1576 : * When we find an exception, it may be more than 1 level below the TLD it uses
1577 : * (a.b.c.d may be viewed as part of TLD .d thus .a has to be bumped 3 levels
1578 : * up.) In most cases, this is equal to this TLD level - 1.
1579 : */
1580 :
1581 : /** \var tld_description::f_status
1582 : * \brief The status of this TLD.
1583 : *
1584 : * The status of a TLD is TLD_STATUS_VALID by default. Using the different
1585 : * tags available in the XML file we can define other statuses such as the
1586 : * TLD_STATUS_DEPRECATED status.
1587 : *
1588 : * In the TLD table the status can be TLD_STATUS_EXCEPTION.
1589 : */
1590 :
1591 : /** \var tld_description::f_tld
1592 : * \brief The actual TLD of this entry.
1593 : *
1594 : * In this table, the TLD is actually just one name and no period. Other
1595 : * parts of a multi-part TLD are found at the [f_start_offset, f_end_offset).
1596 : *
1597 : * The TLD is built by starting a search at the top level which is defined as
1598 : * [tld_start_offset, tld_end_offset). These offsets are global variables defined
1599 : * in the tld_data.c file.
1600 : */
1601 :
1602 : /* vim: ts=4 sw=4 et
1603 : */
|