Archive for the ‘ForumIO’ Category

in-labeled-groups

Thursday, April 8th, 2010

So I found myself needing to group a sequence by two different ways of grouping it without iterating through it multiple times.

;; (in-labeled-groups `((a . ,f) (b . ,g)) seq) returns a sequence that's the
;; interleaving of (in-map (curry cons 'a) (in-groups f seq)) and (in-map
;; (curry cons 'b) (in-groups g seq)), except that seq is only consumed once.
;; (Mapping and grouping separately would try to consume the sequence twice,
;; which doesn't actually work.)  This can be generalized to any number of
;; labeled predicates, of course.
;;
;; (sequence->list (in-labeled-groups `((eq . ,equal?) (lte . ,<=)) '(1 2 2 2 3 2 2 3 4 5)))
;; ((eq 1)
;;  (eq 2 2 2)
;;  (eq 3)
;;  (lte 1 2 2 2 3)
;;  (eq 2 2)
;;  (eq 3)
;;  (eq 4)
;;  (eq 5)
;;  (lte 2 2 3 4 5))
(define (in-labeled-groups binary-pred-assocs seq)
  (in-once (let-values ([(has-next? next!) (sequence-generate seq)])
             (when (has-next?)
               (let loop ([current-lists-reversed (let ([x (next!)])
                                                    (make-list (length binary-pred-assocs) (list x)))])
                 (if (has-next?)
                     (let ([next (next!)])
                       (loop (for/list ([reversed-group current-lists-reversed]
                                        [labeled-binary-pred binary-pred-assocs])
                               (if ((cdr labeled-binary-pred) (car reversed-group) next)
                                   (cons next reversed-group)
                                   (begin
                                     (yield (cons (car labeled-binary-pred) (reverse reversed-group)))
                                     (list next))))))
                     (for ([reversed-group current-lists-reversed]
                           [labeled-binary-pred binary-pred-assocs])
                       (yield (cons (car labeled-binary-pred) (reverse reversed-group))))))))))

So that was very ugly.

One nice thing about this function is that it’s an example of one where I’d feel pretty uncomfortable writing it in Haskell. What would its type signature be? How convenient would it be to use? This implementation here scales well with respect to the number of ways in which I’m grouping the sequence, since it’s just another label.

One question is: how am I going to like splitting the sequence of groups up later? I’m starting to feel the call of reactive programming. How good is PLT Scheme at having lots of tiny threads? I might need to look into that. I’m having one of those moments where you recognize threads as a tool for making cleaner code, instead of as a tool for doing two things at the same time. But this is a moment of insanity.

in-once

Wednesday, April 7th, 2010

In a previous post I was complaining about PLT Scheme’s sequences.  Specifically, in-generator makes a sequence that destroys itself on the first consumption, and then silently does nothing.

I like sequences that destroy themselves on the first consumption, because it forces you to pipe things through efficiently.  And so I’ve replaced all uses of in-generator with in-once, as defined below:

#lang scheme

;; Provides in-once (and once-producer) which is like in-generator and
;; in-producer except that instead of silently not working, it raises
;; an exception should you decide that you want to iterate through the
;; sequence more than one time.  You may only iterate through the
;; sequence exactly one time.  I recommend using this if you want to
;; use "one-time" sequences.

(require scheme/generator)
(provide in-once once-producer)
(provide (except-out (all-from-out scheme/generator)
                     in-generator))

(define stop-value (gensym))

(define (once-producer producer stop . more)
  (let ([started-already #f]
        [sem (make-semaphore 1)])
    (make-do-sequence
     (lambda ()
       (if (call-with-semaphore sem
                                (lambda ()
                                  (begin0 started-already
                                          (set! started-already #t))))
           (error "Already iterated through this producer")
           (begin (set! started-already #t)
                  (values (if (null? more)
                              (lambda (_) (producer))
                              (lambda (_) (apply producer more)))
                          void
                          (void)
                          void
                          (if (procedure? stop)
                              (if (equal? 1 (procedure-arity stop))
                                  (lambda (x) (not (stop x)))
                                  (lambda xs (not (apply stop xs))))
                              (lambda (x) (not (eq? x stop))))
                          void)))))))

(define-sequence-syntax in-once
  (syntax-rules ()
    [(_ body0 body ...)
     (once-producer (generator body0 body ... stop-value) stop-value)])
  (lambda (stx)
    (syntax-case stx ()
      [((id ...) (_ body0 body ...))
       #'[(id ...)
          (in-producer (generator body0 body ... stop-value) stop-value)]])))

I decided to make it thread-safe. Or at least I think it’s thread-safe. It’s moments like these where I miss Haskell.

As far as I can tell, for comprehensions don’t try to iterate through sequences twice, so I’ve decided to trust that. (With define-sequence-syntax, I don’t really know what I’m doing, I’m just following the pattern shown elsewhere.)

Oh, and I suppose I’m distributing the above code under the LGPL Version 2, since it’s almost directly lifted from PLT Scheme v4.2.4/collects/scheme/generate.ss

Yuck, inverse bbcode

Tuesday, April 6th, 2010

So we now have crappy HTML-to-bbcode parsing going. It’s not elegant, but that’s okay. I’ve sworn off elegance this year. We use a dynamically scoped parameter to keep track of whether we’re inside a <pre> block, and…

If the server sends us unexpected HTML, we say “unexpected.” We try not to do too much intelligent stuff here (such as caring about illegally nested tags like <b>bold <i>bold and italic</b> italic</i> and such) because we’ll want the client to be able to quote a post and make a reply — and we don’t want unnatural bbcode constructs, justas a matter of principle.  Anyway the client will have to keep track of all this stuff anyway, so we might as well keep the server as dumb as possible.

I made a little macro for convenientizing the syntax, but I’m uncomfortable with macros, so it’s not maximally convenient.

We once had different struct types for each possible kind of bbcode tag, but that turned out to be a bit too heavyweight, so now we just use vectors.

(define currently-inside-preformatted (make-parameter #f))

(define (strip-text text)
  (if (currently-inside-preformatted)
      text
      (regexp-replace* #px"\\s+" text " ")))

(define-syntax-rule (p-pair pattern clause) (cons (list . pattern) (match-lambda clause)))

(define parse-handlers
  (list (p-pair [is-text] [(list text)
                           (strip-text text)])
        (p-pair [(is-tag #"br")] ['() #(break)])
        ;; quote
        (p-pair [(both (is-tag #"div")
                       (has-class "bbc-block"))
                 (is-tag #"h4")
                 (is-text-match #px"^(?:quote:|(.+)\\s+posted:)$")
                 (is-end-tag #"h4")
                 (is-tag #"blockquote")]
                [(list maybe-name) (make-bb 'quote maybe-name)])
        ;; end quote
        (p-pair [(is-end-tag #"blockquote")
                 (is-end-tag #"div")]
                ['() #(end quote)])
        ;; php block
        (p-pair [(both (is-tag #"div")
                       (has-class "bbc-block")
                       (has-class "php"))
                 (is-tag #"h5")
                 (is-text-match #px"php")
                 (is-end-tag #"h5")
                 (is-tag #"pre")
                 (is-tag #"code")]
                ['()
                 (currently-inside-preformatted #t)
                 #(php)])
        ;; code block
        (p-pair [(both (is-tag #"div")
                       (has-class "bbc-block")
                       (has-class "code"))
                 (is-tag #"h5")
                 (is-text-match #px"code")
                 (is-end-tag #"h5")
                 (is-tag #"pre")
                 (is-tag #"code")]
                ['()
                 (currently-inside-preformatted #t)
                 #(code)])
        ;; end php block / end code block
        (p-pair [(is-end-tag #"code")
                 (is-end-tag #"pre")
                 (is-end-tag #"div")]
                ['()
                 (currently-inside-preformatted #f)
                 #(end php-or-code)])
        ;; SA emoticons
        (p-pair [(both (is-tag #"img")
                       (has-attr-match #"src" emot-pattern)
                       (has-attr-match #"title" #px"^(.+)$"))]
                [(list url title) (make-bb 'emot url title)])
        ;; weird, an editedby marker
        (p-pair [(both (is-tag #"p")
                       (has-class "editedby"))
                 (is-tag #"span")
                 (is-text-match #px"^(.+) fucked around with this message at (.+) around (.+)$")
                 (is-end-tag #"span")]
                [(list editor-name date time)
                 (make-bb 'editedby editor-name date time)])
        ;; bold
        (p-pair [(is-tag #"b")] ['() #(b)])
        (p-pair [(is-end-tag #"b")] ['() #(end b)])
        ;; strikethrough
        (p-pair [(is-tag #"s")] ['() #(s)])
        (p-pair [(is-end-tag #"s")] ['() #(end s)])
        ;; spoiler
        (p-pair [(both (is-tag #"span")
                       (has-class "bbc-spoiler"))]
                ['() #(spoiler)])
        (p-pair [(is-end-tag #"span")] ['() #(end spoiler)])
        ;; underline
        (p-pair [(is-tag #"u")] ['() #(u)])
        (p-pair [(is-end-tag #"u")] ['() #(end u)])
        ;; italic
        (p-pair [(is-tag #"i")] ['() #(i)])
        (p-pair [(is-end-tag #"i")] ['() #(end i)])
        ;; super
        (p-pair [(is-tag #"sup")] ['() #(super)])
        (p-pair [(is-end-tag #"sup")] ['() #(end super)])
        ;; sub
        (p-pair [(is-tag #"sub")] ['() #(sub)])
        (p-pair [(is-end-tag #"sub")] ['() #(end sub)])
        ;; fixed
        (p-pair [(both (is-tag #"tt")
                       (has-class "bbc"))]
                ['() #(fixed)])
        (p-pair [(is-end-tag #"tt")] ['() #(end fixed)])

        ;; mailto.  order matters: we check for mailto before general urls.
        (p-pair [(both (is-tag #"a")
                       (has-attr-match #"href" #px"^mailto:(.+)$"))]
                [(list email-address) (make-bb 'mailto email-address)])
        ;; [url=blah]
        (p-pair [(both (is-tag #"a")
                       (has-attr-match #"href" #px"^(.+)$"))]
                [(list url) (make-bb 'url url)])
        ;; end of email / end of url
        (p-pair [(is-end-tag #"a")]
                ['() #(end mailto-or-url)])

        ;; bullet-list
        (p-pair [(is-tag #"ul")] ['() #(bullet-list)])
        (p-pair [(is-end-tag #"ul")] ['() #(end bullet-list)])

        ;; list item
        (p-pair [(is-tag #"li")] ['() #(list-item)])
        ))

(define (parse-post-body segments)
  (parameterize ([currently-inside-preformatted #f])
    (let loop ([ret '()]
               [segments segments])
      (if (null? segments)
          (reverse ret)
          (let hloop ([handlers parse-handlers])
            (if (null? handlers)
                (loop (cons 'unrecognized ret)
                      (cdr segments))
                (let* ([tail #f]
                       [report-tail (lambda (t) (set! tail t))]
                       [captures (matches-predicate-chain segments (caar handlers) report-tail)])
                  (if captures
                      (loop (cons ((cdar handlers) captures) ret)
                            tail)
                      (hloop (cdr handlers))))))))))

Character Entities

Saturday, April 3rd, 2010

I finally got character entity conversion going. Now I have this ugly bastard mix of strings and bytestrings. The rule of thumb is that stuff that needs character entity conversion goes in strings.

There is now beautiful stuff like this:

(define entity-table
  '(("quot" . "\"")
    ("amp" . "&")
    ("apos" . "'")
    ("lt" . "<")     ("gt" . ">")
    ("nbsp" . "\u00A0")
    ("iexcl" . "¡")
    ("cent" . "¢")
    ...

and stuff like this…

(define entity-regexp
  (let ([unicode "#(?:[0-9]+|x[0-9A-Fa-f]+)"]
        [entity-name (string-append "(?:"
                                    (string-join (map car entity-table) "|")
                                    ")")])
    (pregexp (string-append "&(?:" unicode "|" entity-name ");"))))

;; Returns the replacement string for a character entity, or, if the entity is unrecognized, returns
;; the original entity string.  Entity strings are expected to be in correct syntax, but invalid
;; entity names are tolerated.  For example, (entity-replacement "&") => "&", and
;; (entity-replacement "&bogus;") => "&bogus;", but (entity-replacement "totallybogus") has
;; undefined behavior.
(define (entity-replacement entity-string)
  (when (not (and (char=? #\& (string-ref entity-string 0))
                  (char=? #\; (string-ref entity-string (- (string-length entity-string) 1)))))
    (error (format "invalid entity string: ~a" entity-string)))
  (if (char=? #\# (string-ref entity-string 1))
      (string (integer->char (string->number
                              (substring entity-string
                                         ;; This is a bit glib.  "#xABCD" is parsed correctly
                                         ;; string->number as hexidecimal, but with #1234 we want to
                                         ;; omit the octothorpe.
                                         (if (char=? #\x (string-ref entity-string 2))
                                             1
                                             2)
                                         (- (string-length entity-string) 1)))))
      (hash-ref entity-hash-table
                (substring entity-string 1 (- (string-length entity-string) 1))
                entity-string)))

That’s how it originally was. In entity-regexp it ensured the proper entity name. But it would be just as good to do this:

(define entity-regexp
  (let ([unicode "#(?:[0-9]+|x[0-9A-Fa-f]+)"]
        [entity-name "[A-Za-z0-9]+"])
    (pregexp (string-append "&(?:" unicode "|" entity-name ");"))))

since we already check for valid entity names in the lookup table. So I’m going to go with that.

More Monstrous HTML Grepping

Friday, April 2nd, 2010

I wasn’t able to get to work on this for the past few days. I’m sure you cared.

I now have a very ugly interface to the HTML grepping monster I’ve created.

Here’s the result:

> (take (sequence->list (in-full-thread-descriptions 202 1)) 5)
(#(struct:full-thread-description
   2779598
   #"Ask General Programming Questions Not Worth Their Own Thread"
   #"csammis"
   3625
   152656
   #"01:15 Apr 03, 2010"
   #"Haystack")
 #(struct:full-thread-description
   2836504
   #"Cavern of Cobol FAQ (Read this first)"
   #"Scaevolus"
   0
   22367
   #"19:14 Apr 28, 2008"
   #"Scaevolus")
 #(struct:full-thread-description
   3048157
   #"iPhone Development Megathread"
   #"tinabeatr"
   3888
   131476
   #"05:37 Apr 03, 2010"
   #"Dr. Glasscock")
 #(struct:full-thread-description
   2675400
   #"Python information and short questions megathread."
   #"m0nk3yz"
   3968
   170465
   #"02:33 Apr 03, 2010"
   #"UberJumper")
 #(struct:full-thread-description
   2585949
   #"Ruby on Rails Love-In"
   #"MrSaturn"
   1543
   69960
   #"02:08 Apr 03, 2010"
   #"Nolgthorn"))

Here’s the code for in-full-thread-descriptions, which shows the new “gather” feature for gathering specific results within a given match group.

(define (in-full-thread-descriptions forum-number page-number)
  (let ([chain (list (gather capture-text 'spacer (has-class #"thread_title"))
                     (gather capture-text 'spacer (both (is-tag #"td")
                                                        (has-class #"author")))
                     (gather capture-text 'spacer (has-class #"replies"))
                     (gather capture-text 'spacer (has-class #"views"))
                     (gather capture-text 'spacer (has-class #"date") (has-class #"lastpost"))
                     (gather capture-text 'spacer (has-class #"author") (has-class #"lastpost"))
                     'spacer
                     (both (is-tag #"tr")
                           (has-class #"thread")
                           (has-attr-match #"id" #px#"thread(\\d+)")))]
        [combiners (list (list text-combiner)
                         (list text-combiner)
                         (list text-combiner)
                         (list text-combiner)
                         (list text-combiner)
                         (list text-combiner)
                         car)])
    (in-generator
     (for ([item (in-match-and-combine chain combiners (get-forum-html-stacks forum-number page-number))])
       (match-let ([(list (list (list thread-title))
                          (list (list thread-author))
                          (list (list num-replies))
                          (list (list num-views))
                          (list (list lastpost-date))
                          (list (list lastpost-author))
                          thread-id) item])
         (yield (make-full-thread-description (bytes->number thread-id)
                                              thread-title
                                              thread-author
                                              (bytes->number num-replies)
                                              (bytes->number num-views)
                                              lastpost-date
                                              lastpost-author)))))))

Yuck.

Inelegant Combobulation

Wednesday, March 31st, 2010

Here’s the code:

(define (in-forum-thread-descriptions forum-number)
  (let ([chain (list capture-text
                     'spacer
                     (both (is-tag #"a")
                           (has-class #"thread_title")
                           (has-attr-match #"href" #px#"showthread\\.php\\?threadid=(\\d+)")))]
        [combiners (list text-combiner car)])
    (in-match-and-combine chain combiners (get-forum-html-stacks forum-number))))

And here’s the result:

> (for ([desc (in-forum-thread-descriptions 202)])
    (printf "~a: ~a\n" (cadr desc) (car desc)))
2779598: Ask General Programming Questions Not Worth Their Own Thread
2836504: Cavern of Cobol FAQ (Read this first)
3048157: iPhone Development Megathread
3161913: WordPress - Development, Themes, Plugins.
2672629: SELECT * FROM Questions WHERE Type = 'Stupid'
2773485: C/C++ Programming Questions Not Worth Their Own Thread
3286714: "Run As Administrator" on Win2008 getting in the way of SQL SSIS jobs?
3263809: Another How Much Money Should I Be Making Thread
3283309: Ada Lovelace Day
2841382: Post screenshots of stuff you're working on!
2675400: Python information and short questions megathread.
2802621: <?PHP questions that don\\'t need their own thread ?>
3246449: Goons for Hire: Get your Developers here
3286440: Powershell
2803713: Coding horrors: post the code that makes you laugh (or cry)
3070034: Javascript questions which don't deserve their own thread.
2692947: Game Development Megathread
2262300: .Net Questions Megathread Part 2
2790475: Django Web Framework: The 'D' is silent dumbass
2718078: Web Design/Development Small Questions - Rev Holy Grail
3286273: Planning/Building Family Websites: What's this family tree doing in my web?
2780384: Java questions which don't deserve their own thread, yet.
3108969: a Vim thread
2662688: Xcode and Cocoa Megathread
2664804: The Perl Short Questions Megathread: executable line noise
3281048: When to (and not to) use goto;
2897255: 3D graphics questions that do not deserve their own thread (OpenGL / Dx10)
3113983: Version Control Questions Megathread (SVN / git / whatever else)
2585949: Ruby on Rails Love-In
2385157: Flash Questions Megathread

PLT Scheme sequences are annoying

Tuesday, March 30th, 2010

I don’t really know much about PLT Scheme sequences, but they seem to be fairly warty. First of all, consuming a sequence may destroy the sequence (at least if you use in-generator to make one). This is annoying, but I suppose it could also be regarded as a dangerous feature. Also, there’s no easy way (as far as I can tell) to convert sequences to lists. All I can find is

(for/list ([x seq])
  x)

Generators are pretty nice, and in-generator is okay, you can make sequences like this:

(define (in-filter pred seq)
  (in-generator (for ([x seq])
                  (when (pred x)
                    (yield x)))))

Right now, if you try to enumerate a sequence defined by uses of in-generator that have side-effects, you end up eating the sequence, so that subsequent enumerations nonchalantly tell you that the sequence has no elements. I’ll have to make a version of in-generator that throws an exception on subsequent enumerations. Here’s a function that would benefit:

(define (in-matches pattern port)
  (in-generator (let loop ()
                  (let ([matches (regexp-match pattern port)])
                    (when matches
                      (yield matches)
                      (loop))))))

This function consumes the port. It either needs to use (in-memo (in-generator ...)) or it needs to use in-destructive, which would be the exception-throwing generator I would have to implement.

Here’s my implementation of in-memo, for what it’s worth.

;;; Makes a memoized version of the sequence, which you can iterate
;;; through without destroying the sequence.
(define (in-memo seq)
  (let-values ([(has-next? next!) (sequence-generate)])
    (letrec ([next-node (λ () (if (has-next?)
                                (cons (next!) (delay (next-node)))
                                '()))])
      (let ([stream-head (delay (next-node))])
        (in-generator (let loop ([forced-stream (force stream-head)])
                        (when (not (null? forced-stream))
                          (yield (car forced-stream))
                          (loop (force (cdr forced-stream))))))))))

I don’t know, that seems fairly ugly.

--------------------------------
Side discussion below this point
--------------------------------

Sometimes I get annoyed by Scheme’s lack of laziness. I’m used to Haskell’s laziness, you see. For example, you can’t implement Data.Function.on in Scheme — not in the same way — because something like ((&&) `on` f) wouldn’t short-circuit. I wanted short-circuiting behavior, and the only way to do it was to define a macro…

(define-syntax-rule (on binary-op f) (λ (x y) (binary-op (f x) (f y))))

This is much uglier. Scheme is generally uglier, but more comely, than Haskell.

Slight HTML “Parsing” Adjustments

Monday, March 29th, 2010

I made some slight adjustments to the HTML Parsing.

> (define txt (string-append "<!DOCTYPE HTML><html lang=\"en\"><head><title>"
                             "Hello</title>  </head><body>text</body></html>"))
> (map (λ (stack) (matches-predicate-chain stack (list 'spacer (both (is-tag "html")
                                                                     (has-attr-value "lang" "en")))))
       (get-html-stacks (get-html-segments txt)))
(#f #t #t #t #t #t #t #t #t #t #t #f)
> (map (λ (stack) (matches-predicate-chain stack (list 'spacer (both (is-tag "html")
                                                                     (has-attr-value "lang" "foo")))))
       (get-html-stacks (get-html-segments txt)))
(#f #f #f #f #f #f #f #f #f #f #f #f)

Now attributes can be reasoned with. And the parser now generates stacks with top nodes on the top — with text nodes on top, and so on.

Minimalist HTML “Parsing”

Sunday, March 28th, 2010

I’ve decided to go with my own cheap HTML “parser.” Right now there are just some functions for cutting an HTML document into tags and text node substrings, and then code for building a stacks of start tags that lie under each substring. There’s also a primitive matcher that matches search expressions that check for a tag name. For example, here we mark as #t the substrings <title> and Hello, because they’re directly atop the TITLE element, followed by some other containing elements, followed by the HTML element.

> (define txt (string-append "<!DOCTYPE HTML><html lang=\"en\"><head><title>"
                             "Hello</title>  </head><body>text</body></html>"))
> (map (λ (s) (matches-predicate-chain s (list (is-tag "title")
                                               'spacer
                                               (is-tag "html"))))
       (get-html-stacks (get-html-segments txt)))
(#f #f #f #t #t #f #f #f #f #f #f #f)

It looks like I beat my 24-hour posting time limit by 6 minutes.

Goon Project

Saturday, March 27th, 2010

I’m going to break from my habit of scheduling most posts at 5:00 AM and instead intend to go for a more brutal pattern, one of posting at least once in any 24 hour period. I’ve decided to start a little project to make a little iPhone app that I have wished the existence of for a little period of time. It’s going to be the usual sort of forum reader software like “Forums” or “Forum Pro” or what have you, except that it won’t suck, it’ll use a proxy server with a relatively custom protocol (probably just gzipped json or equivalent), for the purpose of lowering bandwidth and for the purpose of being able to add support for new forums that use weird skins or structures whenever it’s needed, so that users can complain and have their problems resolved instead of having to give up or wait for an update. (And I guess I better check if that’s allowed, according to the app store policies.)

My postings will be about what I’ve done in the previous sub-24-hour period. This will force me to actually get stuff done.

I have made my first decision about this project already.  I’m going to use PLT Scheme for the server end.  Part of the reason is that I just read this comment by boskone on Hacker News, basically saying,

IMHO, there are currently only 2 top tier active hotspots where the cool theoretical meets with the practical and usable in programming language theory, the Haskell and PLT ecospheres.

Well that’s very sweet.  Anyway, I think it deserves a leap of faith, especially since, if I ever decide that I really want a statically typed language, I can just go with #lang typed-scheme.  So I will try it and report back on my trials and tribble-ations with PLT Scheme.

Edit: tribble-ations? Sigh.