In which a perfectly reasonable project spins off an obsessive sub-project… https://github.com/autogram-is/url-tools
We’re poking at an interesting @autogram_is project that required some ad-hoc site spidering for a client’s site audit. Among other things, we wanted a decent way to take all the different permutations of a given URL we encountered and map them back to the ‘real’ one for a page.
e.g., http://example.com, https://www.example.com/en/, and https://example.com?utm_campaign=summer are all the same ‘page’ as far as content publishing is concerned. Normalizing them in a way that reflects that is usually site-specific and full of odd exceptions…
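A rough sketch of the idea, nothing more — this uses node’s built-in WHATWG URL class, and the rules (force https, strip ‘www.’, drop utm_* params, trim trailing slashes) are just examples, not url-tools’ actual behavior. Notice that the ‘/en/’ prefix is exactly the kind of site-specific exception a generic rule can’t cover:

```typescript
// Illustrative only — a hypothetical normalizer, not the url-tools API.
function normalizeUrl(input: string): string {
  const url = new URL(input);
  url.protocol = 'https:';                                 // collapse http/https
  url.hostname = url.hostname.replace(/^www\./, '');       // collapse www/bare host
  for (const key of [...url.searchParams.keys()]) {
    if (key.startsWith('utm_')) url.searchParams.delete(key); // drop tracking params
  }
  url.pathname = url.pathname.replace(/\/+$/, '') || '/';  // trim trailing slashes
  url.hash = '';
  return url.toString();
}

// Both collapse to 'https://example.com/'…
console.log(normalizeUrl('http://example.com'));
console.log(normalizeUrl('https://example.com?utm_campaign=summer'));
// …but 'https://www.example.com/en/' keeps its '/en/' path — treating a
// language prefix as equivalent to the root is a site-specific rule.
console.log(normalizeUrl('https://www.example.com/en/'));
```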
This is particularly vexing when the “site” you’re examining is actually a cluster of intertwined sites, each running on different software — common in most big organizations’ web presence.
Writing quick and dirty ad-hoc filters works for prototyping, but is difficult to keep consistent as we accumulate a dataset; it’s not exactly ‘big data’, but tens of thousands of pages with millions of URL permutations is common.
So, here we are. A URL wrapper that adds a few parsing and manipulation options; a set of composable filter and transform functions for sorting and normalizing the URLs; and a collection class that can take a pile of strings and spit back a set normalized to your specifications.
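Something in that spirit, heavily simplified — every name and type here is made up for illustration, not the library’s real API:

```typescript
// Hypothetical shapes for composable filters and transforms.
type UrlFilter = (url: URL) => boolean;
type UrlMutator = (url: URL) => URL;

// Compose a list of mutators into a single normalizer.
const normalizer = (...mutators: UrlMutator[]): UrlMutator =>
  (url) => mutators.reduce((u, fn) => fn(u), url);

const forceHttps: UrlMutator = (u) => { u.protocol = 'https:'; return u; };
const stripWww: UrlMutator = (u) => { u.hostname = u.hostname.replace(/^www\./, ''); return u; };
const sameSite = (host: string): UrlFilter => (u) => u.hostname.endsWith(host);

// A tiny collection that keeps only matching URLs and dedupes by normalized form.
class UrlSet {
  private seen = new Map<string, URL>();
  constructor(private keep: UrlFilter, private normalize: UrlMutator) {}
  add(raw: string): void {
    const url = new URL(raw);
    if (!this.keep(url)) return;
    const normalized = this.normalize(url);
    this.seen.set(normalized.href, normalized);
  }
  get values(): URL[] { return [...this.seen.values()]; }
}

const set = new UrlSet(sameSite('example.com'), normalizer(forceHttps, stripWww));
['http://example.com/about', 'https://www.example.com/about'].forEach((u) => set.add(u));
console.log(set.values.map((u) => u.href)); // [ 'https://example.com/about' ]
```

The nice part of splitting things this way is that the gnarly site-specific exceptions live in small, swappable functions instead of one monolithic cleanup routine.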
In any case, the resulting library is a tiny drop in the node.js ocean, but a gratifying one — like getting the closet shelves tidied or mowing the lawn when it’s a bit ragged.