2025-01-12 04:36:52 +08:00

138 lines
4.0 KiB
Plaintext

---
title: "URL Validation"
author: "Jim Hester"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{URL Validation}
%\VignetteEngine{knitr::rmarkdown}
\usepackage[utf8]{inputenc}
---
Consider the task of correctly [validating a URL](https://mathiasbynens.be/demo/url-regex).
From that page two conclusions can be made.
1. Validating URLs require complex regular expressions.
2. Creating a correct regular expression is hard! (only 1 out of 13 regexs were valid for all cases).
Because of this one may be tempted to simply copy the best regex you can find ([gist](https://gist.github.com/dperini/729294)).
The problem with this is that while you can copy it now, what happens later when you find a case that is not handled correctly? Can you correctly interpret and modify this?
```{r url_parsing_stock, eval=F}
"^(?:(?:http(?:s)?|ftp)://)(?:\\S+(?::(?:\\S)*)?@)?(?:(?:[a-z0-9\u00a1-\uffff](?:-)*)*(?:[a-z0-9\u00a1-\uffff])+)(?:\\.(?:[a-z0-9\u00a1-\uffff](?:-)*)*(?:[a-z0-9\u00a1-\uffff])+)*(?:\\.(?:[a-z0-9\u00a1-\uffff]){2,})(?::(?:\\d){2,5})?(?:/(?:\\S)*)?$"
```
However if you re-create the regex with `rex` it is much easier to understand and modify later if needed.
```{r url_parsing_url}
library(rex)
library(magrittr)
valid_chars <- rex(except_some_of(".", "/", " ", "-"))
re <- rex(
start,
# protocol identifier (optional) + //
group(list("http", maybe("s")) %or% "ftp", "://"),
# user:pass authentication (optional)
maybe(non_spaces,
maybe(":", zero_or_more(non_space)),
"@"),
#host name
group(zero_or_more(valid_chars, zero_or_more("-")), one_or_more(valid_chars)),
#domain name
zero_or_more(".", zero_or_more(valid_chars, zero_or_more("-")), one_or_more(valid_chars)),
#TLD identifier
group(".", valid_chars %>% at_least(2)),
# server port number (optional)
maybe(":", digit %>% between(2, 5)),
# resource path (optional)
maybe("/", non_space %>% zero_or_more()),
end
)
```
We can then validate that it correctly identifies both good and bad URLs. (_IP address validation removed_)
```{r url_parsing_validate}
good <- c("http://foo.com/blah_blah",
"http://foo.com/blah_blah/",
"http://foo.com/blah_blah_(wikipedia)",
"http://foo.com/blah_blah_(wikipedia)_(again)",
"http://www.example.com/wpstyle/?p=364",
"https://www.example.com/foo/?bar=baz&inga=42&quux",
"http://✪df.ws/123",
"http://userid:password@example.com:8080",
"http://userid:password@example.com:8080/",
"http://userid@example.com",
"http://userid@example.com/",
"http://userid@example.com:8080",
"http://userid@example.com:8080/",
"http://userid:password@example.com",
"http://userid:password@example.com/",
"http://➡.ws/䨹",
"http://⌘.ws",
"http://⌘.ws/",
"http://foo.com/blah_(wikipedia)#cite-1",
"http://foo.com/blah_(wikipedia)_blah#cite-1",
"http://foo.com/unicode_(✪)_in_parens",
"http://foo.com/(something)?after=parens",
"http://☺.damowmow.com/",
"http://code.google.com/events/#&product=browser",
"http://j.mp",
"ftp://foo.bar/baz",
"http://foo.bar/?q=Test%20URL-encoded%20stuff",
"http://مثال.إختبار",
"http://例子.测试",
"http://-.~_!$&'()*+,;=:%40:80%2f::::::@example.com",
"http://1337.net",
"http://a.b-c.de",
"http://223.255.255.254")
bad <- c(
"http://",
"http://.",
"http://..",
"http://../",
"http://?",
"http://??",
"http://??/",
"http://#",
"http://##",
"http://##/",
"http://foo.bar?q=Spaces should be encoded",
"//",
"//a",
"///a",
"///",
"http:///a",
"foo.com",
"rdar://1234",
"h://test",
"http:// shouldfail.com",
":// should fail",
"http://foo.bar/foo(bar)baz quux",
"ftps://foo.bar/",
"http://-error-.invalid/",
"http://-a.b.co",
"http://a.b-.co",
"http://0.0.0.0",
"http://3628126748",
"http://.www.foo.bar/",
"http://www.foo.bar./",
"http://.www.foo.bar./")
all(grepl(re, good) == TRUE)
all(grepl(re, bad) == FALSE)
```
You can now see the power and expressiveness of building regular expressions with `rex`!