Wilcox Development Solutions Blog

Scraping Websites with Racket

August 25, 2024

My Personal History with Early Blogging Tools

From Jan 15, 2002 until Feb 18, 2009 I kept a personal blog with Radio Userland.

Radio Userland was built on top of Userland Frontier. Frontier was an amazingly capable system: it featured its own scripting language (UserTalk, an outline-based language!), an object database, a mini word processor, and an outliner. Originally an application for scripting and cross-application communication, it turned into a web publishing system in the late ’90s and early ’00s.

Fun fact: A desire to return to this workbench environment got me into Pharo Smalltalk.

I always felt that Radio Userland was targeted specifically at early ‘bloggers: it included a local editing platform and free hosting for your blog by default.

I used this free hosting (radio.userland.com/SOME_NUMBER) to write a blog, which I’m sure nobody read. I posted on 592 days in that almost exactly 7-year period, mostly linkblog style.

I had forgotten about that blog until recently, when I found the content still on the web at http://radio-weblogs.com.

I already lost that content once, so I need to save it for posterity now.

Webscraping With Racket

I decided to scrape the site with Racket, mostly because Rash gives me a quick and easy way to call command-line programs.

Walking backwards in time

First, could I build a custom functional iterator that walks back through days, calling the provided function at every day?

This requires a date library, since stepping back one day from the first of a month has to wrap into the previous month. I found one in gregor. A bit of poking later:


(define (iterate-into-past start-date end-date fn)
  (define current-date (-days start-date 1))

  (fn current-date)
  (if (>= (->posix end-date) (->posix current-date))
      (void) ; reached end-date: nothing left to do
      (iterate-into-past current-date end-date fn)  ; in tail position, so the stack does not grow
  )
)

Called with (iterate-into-past my-start-date my-end-date (lambda (day) (println "hi")))
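To see what the date library is doing for us, here is a minimal sketch, assuming the gregor package is installed. It steps back across a month boundary and, as an alternative to the recursive iterator, builds the list of days with a loop. The days-back helper is hypothetical (it is not part of gregor).

```racket
#lang racket
(require gregor) ; assumes the gregor package is installed

;; Stepping back one day from the 1st lands on the last day of the
;; previous month, leap years included.
(define march-first (date 2008 3 1))
(define day-before (-days march-first 1))
(displayln (date->iso8601 day-before)) ; prints 2008-02-29

;; days-back is a hypothetical helper: the `count` days before
;; start-date, most recent first.
(define (days-back start-date count)
  (for/list ([n (in-range 1 (add1 count))])
    (-days start-date n)))

(for-each (lambda (d) (displayln (date->iso8601 d)))
          (days-back march-first 3))
```

A list like this is handy when you want to inspect or filter the dates before making any requests; the recursive version above avoids building the list at all.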

Parsing Content

Now that I have dates, I can request a URL for each one. A 404 status from the request means I didn’t write a blog entry that day.
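Each day's page lives at a path of the form weblog-id/year/month/day.html, with month and day zero-padded, which is what the listings below build. As a rough sketch of that step using only base Racket: day-path and pad2 are hypothetical names, and this version takes plain numbers rather than a gregor date.

```racket
#lang racket

;; pad2 is a hypothetical helper: left-pad a number to two digits.
(define (pad2 n)
  (~r n #:min-width 2 #:pad-string "0"))

;; day-path is a hypothetical helper: build the path for one day's page.
;; weblog-id is expected to end in "/", as in the full listing.
(define (day-path weblog-id year month day)
  (string-append weblog-id
                 (number->string year) "/"
                 (pad2 month) "/"
                 (pad2 day) ".html"))

(displayln (day-path "000000/" 2009 2 5)) ; prints 000000/2009/02/05.html
```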

By default, the http-client package returns an HTML response not as raw HTML, but parsed into S-expressions.

Technically the result (the S-expressions that come back from parsing HTML) is called an XML-Expression, aka an xexp, mostly because HTML can be parsed to SXML by the Racket library html-parsing.

This automatic parsing behavior, the conversion of HTML into sexp/xexp, doesn’t seem to be documented anywhere. You can turn it off with (current-http-client/response-auto #f).

Racket also has a library called sxpath, which lets you give an XPath expression to select elements in an xexp. I use an XPath to find the HTML element that holds the blog posts: I don’t want to include the chrome around the blog (just the content!).
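Here is a small sketch of the idea on a hand-written document, assuming the sxml package is installed. Note that it uses sxpath's list-style paths and a plain filter rather than the string XPath used in the scraper, and that the page structure and the class-of helper are made up for illustration.

```racket
#lang racket
(require sxml/sxpath) ; assumes the sxml package is installed

;; A tiny hand-written stand-in for a parsed page.
(define page
  '(*TOP*
    (html
     (body
      (div (@ (class "sidebar")) (p "blogroll"))
      (div (@ (class "body")) (p "An old linkblog entry"))))))

;; sxpath returns a procedure; apply it to a document to get matching nodes.
(define all-divs ((sxpath '(// div)) page))

;; class-of is a hypothetical helper: the class attribute values of a node.
(define (class-of node)
  ((sxpath '(@ class *text*)) node))

;; Keep only the divs whose class is "body", dropping the page chrome.
(define post-divs
  (filter (lambda (d) (member "body" (class-of d))) all-divs))

(pretty-print post-divs)
```

The string XPath in the scraper does the selection and the class test in one expression; splitting it up like this just makes each step visible.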


(define (process-date-url current-date)
    (define path (string-append my-weblog-id (my-date-format-function current-date "/") ".html"))
    ; ^^^ my-date-format-function is a helper function, defined in the full listing below

    (define res (http-get "http://radio-weblogs.com" #:path path))
    ; (current-http-client/response-auto #f)

    (match (http-response-code res)
        [200 (println
            ((sxpath "//div[contains(@class, 'body')]") (http-response-body res)))]
             ; sxpath returns a procedure you apply your document to
             ; ^^^ the actual blog post is in a <div class="body">

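        [404 (println (string-append "no entry for " (my-date-format-function current-date "-")))]
        [other (printf "unexpected status ~a for ~a\n" other path)] ; report anything else instead of raising a match error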
    )
)

To translate the xexp back into HTML we use xexp->html from the html-writing package, and write the result to a file.
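The file-writing step might look something like the sketch below, which uses only base Racket and assumes you already have one day's HTML as a string. The save-post helper, the radio-archive folder name, and the date-based file name are all hypothetical choices, not something from the original script.

```racket
#lang racket

;; save-post is a hypothetical helper: write one day's HTML string to
;; <out-dir>/<name>.html and return the path that was written.
(define (save-post out-dir name html)
  (make-directory* out-dir) ; create the folder if it does not exist yet
  (define out-path (build-path out-dir (string-append name ".html")))
  (display-to-file html out-path #:exists 'replace) ; overwrite on re-runs
  out-path)

;; Demo with a stand-in string; in the scraper this would be the HTML
;; string produced for one day's post.
(define archive-dir
  (build-path (find-system-path 'temp-dir) "radio-archive"))
(define saved-path
  (save-post archive-dir
             "2009-02-05"
             "<div class=\"body\"><p>hello</p></div>"))

(displayln (path->string saved-path))
```

Using #:exists 'replace means the scrape can be re-run without failing on files left over from an earlier attempt.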

Putting it together


#lang racket

(require gregor)  ; install gregor package
(require http-client) ; install http-client package
(require sxml/sxpath) ; install sxml package
(require html-writing) ; install html-writing package


(define start-date (iso8601->date "2009-02-25"))
(define end-date   (iso8601->date "2009-02-01"))
(define my-weblog-id "000000/")


; left pad a 0 if the string is = 1 character long
(define (at-least-two s)
  (match (string-length s)
    [1 (string-append "0" s)]
    [_ s]
    )
)


(define (my-date-format-function current-date separator)
  (string-append
     (at-least-two (number->string (->year current-date)))
     separator
     (at-least-two (number->string (->month current-date)))
     separator
     (at-least-two (number->string (->day current-date)))
     )
)


(define (iterate-into-past start-date end-date fn)
  (define current-date (-days start-date 1))

  (fn current-date)
  (if (>= (->posix end-date) (->posix current-date))
      (void) ; reached end-date: nothing left to do
      (iterate-into-past current-date end-date fn)  ; in tail position, so the stack does not grow
  )
)


(define (process-date-url current-date)
    (define path (string-append my-weblog-id (my-date-format-function current-date "/") ".html"))

    (define res (http-get "http://radio-weblogs.com" #:path path))
    ; (current-http-client/response-auto #f)

    (match (http-response-code res)
        [200
         (define body-results ((sxpath "//div[contains(@class, 'body')]") (http-response-body res)))
         ; sxpath returns a procedure you apply your document to
         ; ^^^ the actual blog post is in a <div class="body">

         (println (xexp->html body-results))
        ]

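        [404 (println (string-append "no entry for " (my-date-format-function current-date "-")))]
        [other (printf "unexpected status ~a for ~a\n" other path)] ; report anything else instead of raising a match error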
    )
)


(iterate-into-past
    start-date
    end-date
    process-date-url
)

This was a fun challenge, and the Racket ecosystem made complex things easier (selecting only the HTML elements I am interested in, for example)!


Written by Ryan Wilcox Chief Developer, Wilcox Development Solutions... and other things