Scraping Websites with Racket
August 25, 2024
My Personal History with Early Blogging Tools
From Jan 15, 2002 until Feb 18, 2009 I kept a personal blog with Radio Userland.
Radio Userland was built on top of Userland Frontier. Frontier was an amazingly capable system: it featured its own scripting language (UserTalk - an outline-based language!), an object database, a mini word processor, and an outliner. Originally targeted at scripting and cross-application communication, it turned into a web publishing system in the late ’90s and early ’00s.
Fun fact: A desire to return to this workbench environment got me into Pharo Smalltalk.
I always felt that Radio Userland was targeted specifically at early ‘bloggers: it included a local editing platform and free hosting for your blog by default.
I used this free hosting - radio.userland.com/SOME_NUMBER - to write a blog, which I’m sure nobody read. I blogged on 592 days over that almost exactly seven-year period, mostly in linkblog style.
I forgot about that blog, until recently, when I found the content still on the web at http://radio-weblogs.com.
Having already lost that content once, I need to save it for posterity now.
Webscraping With Racket
I decided to scrape the site with Racket, mostly because Rash makes it quick and easy to call command-line programs.
Walking backwards in time
First, could I build a custom functional iterator that walks back through days, calling the provided function at every day?
This requires a date library, because subtracting a day from the first of a month has to wrap into the previous month. I found one in gregor. A bit of poking later:
(define (iterate-into-past start-date end-date fn)
  ; step back one day first, so start-date itself is never visited
  (define current-date (-days start-date 1))
  (fn current-date)
  ; stop once end-date has been visited; otherwise recurse in tail position
  (unless (date<=? current-date end-date)
    (iterate-into-past current-date end-date fn)))
Called by (iterate-into-past my-start-date my-end-date (lambda (d) (println d)))
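To check which days the walker actually visits, here is a minimal sketch that collects them into a list. It restates the walker (using gregor's date<=? for the comparison) so the block runs on its own; visited-days is a hypothetical helper added for illustration, and the dates are chosen only to show the wrap from March back into February.

```racket
#lang racket
(require gregor) ; third-party package: raco pkg install gregor

;; The walker from the post, restated so this block is self-contained.
(define (iterate-into-past start-date end-date fn)
  (define current-date (-days start-date 1))
  (fn current-date)
  (unless (date<=? current-date end-date)
    (iterate-into-past current-date end-date fn)))

;; visited-days is a hypothetical helper: it records every day passed to fn
;; as an ISO 8601 string, in the order the walker visited them.
(define (visited-days start end)
  (define seen '())
  (iterate-into-past start end
                     (lambda (d) (set! seen (cons (date->iso8601 d) seen))))
  (reverse seen))

(visited-days (iso8601->date "2009-03-03") (iso8601->date "2009-02-27"))
;; → '("2009-03-02" "2009-03-01" "2009-02-28" "2009-02-27")
```

Two things stand out: the start date itself is skipped, because the walker steps back before calling the function, and the end date is included.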
Parsing Content
Now that I have dates, I can request a URL for each one. A 404 status from the request means I didn’t make a blog entry that day.
By default the http-client package returns an HTML response not as an HTML string, but as S-expressions.
Technically the serialization result - the S-expressions that come back from parsing HTML - is called an XML-Expression aka an xexp, mostly because HTML can be parsed to SXML by the Racket library html-parsing.
This automatic parsing behavior, the automatic conversion of HTML into sexp/xexp, doesn’t seem to be documented anywhere. You can turn it off with (current-http-client/response-auto #f).
Now, Racket has a library called sxpath, which lets you give an XPath expression to select nodes in an xexp. I use an XPath to find the HTML element with the blog posts: I don’t want to include the chrome around the blog (just the facts content!).
(define (process-date-url current-date)
  (define path (string-append my-weblog-id (my-date-format-function current-date "/") ".html"))
  ; ^^^ my-date-format-function is a helper function defined in the full listing
  (define res (http-get "http://radio-weblogs.com" #:path path))
  ; (current-http-client/response-auto #f)
  (match (http-response-code res)
    ; sxpath returns a procedure that you apply to your document;
    ; the actual blog post is in a <div class="body">
    [200 (println
          ((sxpath "//div[contains(@class, 'body')]") (http-response-body res)))]
    [404 (println (string-append "no entry for " (my-date-format-function current-date "-")))]))
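To see what the selector does without touching the network, here is a small sketch that applies the same XPath string to a hand-written document. The document and its class names are invented for illustration; only sxpath and the XPath expression come from the code above.

```racket
#lang racket
(require sxml) ; third-party package: raco pkg install sxml

;; A tiny hand-written document in the SXML shape that an HTML parser produces.
;; The content and class names are made up for this example.
(define doc
  '(*TOP*
    (html
     (body
      (div (@ (class "sidebar")) (p "navigation chrome"))
      (div (@ (class "body")) (p "first post"))
      (div (@ (class "item body")) (p "second post"))))))

;; sxpath compiles the XPath string into a procedure; applying that procedure
;; to the document returns a list of the matching nodes.
(define select-body (sxpath "//div[contains(@class, 'body')]"))
(define matches (select-body doc))

;; Print each matched div; the sidebar div is not among them.
(for-each println matches)
```

Note that contains() is a substring test, so a class such as "bodytext" would match as well. That was fine for this site, but a stricter predicate may be needed elsewhere.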
To translate the sexp back into HTML we use xexp->html (from the html-writing package), and write the result to a file.
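The listing in this post prints the HTML rather than saving it, so here is a minimal sketch of the file-writing step using only the standard library. save-entry! and the one-file-per-day naming scheme are assumptions made for illustration, not part of the original script.

```racket
#lang racket
;; #lang racket already includes racket/file, which provides
;; display-to-file and make-directory*.

;; save-entry! is a hypothetical helper.
;; dir:  output directory (created if missing)
;; day:  a date string such as "2009-02-18", used as the file name
;; html: the entry markup as a string
(define (save-entry! dir day html)
  (make-directory* dir) ; creates dir and any missing parents; no error if it exists
  (define out (build-path dir (string-append day ".html")))
  (display-to-file html out #:exists 'replace) ; overwrite when the scrape is re-run
  out)
```

In the 200 branch, the println call could then become something like (save-entry! "archive" (my-date-format-function current-date "-") (xexp->html body-results)), where "archive" is a placeholder directory name.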
Putting it together
#lang racket
(require gregor) ; install gregor package
(require http-client)
(require sxml/sxpath) ; install sxml package
(require html-writing) ; install html-writing package
(define start-date (iso8601->date "2009-02-25"))
(define end-date (iso8601->date "2009-02-01"))
(define my-weblog-id "000000/")
; left-pad with a 0 if the string is 1 character long
(define (at-least-two s)
  (match (string-length s)
    [1 (string-append "0" s)]
    [_ s]))
(define (my-date-format-function current-date separator)
  (string-append
   (at-least-two (number->string (->year current-date)))
   separator
   (at-least-two (number->string (->month current-date)))
   separator
   (at-least-two (number->string (->day current-date)))))
(define (iterate-into-past start-date end-date fn)
  ; step back one day first, so start-date itself is never visited
  (define current-date (-days start-date 1))
  (fn current-date)
  ; stop once end-date has been visited; otherwise recurse in tail position
  (unless (date<=? current-date end-date)
    (iterate-into-past current-date end-date fn)))
(define (process-date-url current-date)
  (define path (string-append my-weblog-id (my-date-format-function current-date "/") ".html"))
  (define res (http-get "http://radio-weblogs.com" #:path path))
  ; (current-http-client/response-auto #f)
  (match (http-response-code res)
    [200
     ; sxpath returns a procedure that you apply to your document;
     ; the actual blog post is in a <div class="body">
     (define body-results ((sxpath "//div[contains(@class, 'body')]") (http-response-body res)))
     (println (xexp->html body-results))]
    [404 (println (string-append "no entry for " (my-date-format-function current-date "-")))]))
(iterate-into-past start-date end-date process-date-url)
This was a fun challenge, and the Racket ecosystem made complex things easier (selecting only the HTML elements I am interested in, for example)!