Arc Forum | From zip to smart tokenization

Arc Forum

From zip to smart tokenization

1 point by akkartik 5726 days ago | 4 comments

A couple of days ago I found myself scratching my head over this terse definition of zip from anarki [1]:

  (def zip ls
    (apply map list ls))

  (test-iso "zip merges corresponding elements of lists"
            '((1 2 3) (2 4 6) (3 6 9))
            (zip '(1 2 3) '(2 4 6 7) '(3 6 9)))

When I understood it and replaced the naive recursive implementation in my toolbox, I found myself wondering if there's a better implementation for zipmax, which doesn't stop when the shortest list runs out:

  (def zipmax l
    (if (some acons l)
      (cons (map car l)
        (apply zipmax (map cdr l)))))

  (test-iso "zipmax returns as many elements as the longest list"
            '((1 2) (3 4) (5 6) (nil 8))
            (zipmax '(1 3 5) '(2 4 6 8)))

(Please feel free to suggest better names or implementations, as before - http://arclanguage.org/item?id=11111)

Why is zipmax useful? It allows us to construct nctx.

  (def nctx(n l)
    (apply zipmax (cdrs (- n 1) l)))

  (def cdrs(n l)
    (if (is n 0)
      (list l)
      (cons l (cdrs (- n 1) (cdr l)))))

  (test-iso "nctx slides a window over a list"
            '((1 2 3) (2 3 4) (3 4 nil) (4 nil nil))
            (nctx 3 '(1 2 3 4)))

  (test-iso "cdrs"
            '((1 2 3) (2 3) (3))
            (cdrs 2 '(1 2 3)))

nctx is surprisingly useful anytime you want to iterate over a list in a stateful way, with each iteration aware of the current and previous element, or current and next, etc.

--- An illustration of nctx

Here's how I use nctx to build a context-sensitive tokenizer.[2] In english, certain characters may or may not mark word boundaries depending on context. For example, single quotes shouldn't break up words when used as apostrophes, but shouldn't attach themselves to touching words within quoted sentences.

  (mac conscar(a l)
    `(= ,l (cons (cons ,a (car ,l)) (cdr ,l))))

  (def partition-words(s)
    (unless (blank s)
      (withs (firstchar (s 0)
              ans (list (list firstchar))
              state (charclass firstchar))
        (each (last curr next) (nctx 3 (coerce s 'cons))
          (if curr
            (let newstate (charclass curr)
              (if (is newstate 1)
                (if (or (whitec last) (whitec next))
                  (= newstate 0)
                  (= newstate 2)))
              (if
                (is newstate state) (conscar curr ans)
                                    (push (list curr) ans))
              (= state newstate))))
        (rev:map [coerce (rev _) 'string] ans))))

  (with (NEVER-WORD* ";\"![]() \n\t\r"
         MAYBE-WORD* ".,'=-/:&?")
    (def charclass(c)
      (let c (coerce c 'string)
        (if
          (posmatch c NEVER-WORD*)  0
          (posmatch c MAYBE-WORD*)  1
                                    2))))

So anyway, I'd love people's comments on this ball of parens. Is there an idiomatic or cheaper (in conses) non-recursive definition of zipmax? A better name for any of these functions?

---

Footnotes:

[1] rntz finally helped me figure it out: Apply right-associatively conses extra args together to form the list.

  (test-iso "apply strips outermost parens from last arg"
    (list 3 '(1 2) '(3 4))
    (apply list 3 '((1 2) (3 4))))

[2] PG's bayesian spam filter (http://paulgraham.com/better.html) uses a similar tokenizer, but it's described to only allow periods and commas within words. Here I permit other characters as well to match how humans read english: ' for apostrophes, - for hyphens, /=?& for urls, and : for timestamps. I couldn't find any scenarios where these characters should form word partitions without adjoining whitespace. Am I missing something?

---

Appendix: Some more automated tests (anarki/lib/arctap.arc)

  (test-iso "nctx trivial"
            '((1 nil))
            (nctx 2 '(1)))


  (test-iso "partition-words should partition along whitespace"
            '("abc" " " "def")
            (partition-words "abc def"))

  (test-iso "partition-words should partition along punctuation"
            '("abc" ", " "def")
            (partition-words "abc, def"))

  (test-iso "partition-words should intelligently partition along punctuation 1"
            '("abc" " - " "def")
            (partition-words "abc - def"))

  (test-iso "partition-words should intelligently partition along punctuation 2"
            '("abc-def")
            (partition-words "abc-def"))

  (test-iso "partition-words should intelligently partition along punctuation 3"
            '("abc" " \"" "def" "\"")
            (partition-words "abc \"def\""))

2 points by fallintothis 5726 days ago | link

I'd use a different name for nctx because I don't know what it's supposed to mean. You define it as a sliding window, so I call it windows (alternatively sliding-windows, but that's a bit longer).

I notice that it conses a lot for cdrs, which is ultimately useless since (cdrs (- n 1) l) is just used in an apply. Also, the stateful iteration remark reminded me of the discussion at http://arclanguage.org/item?id=11104. So, without studying zipmax at all, I rewrote nctx (in vanilla Arc) using reclist:

  (def windows (n xs)
    (accum a
      (reclist [do (a:firstn n _) nil] xs)))

This doesn't work quite the same way.

  arc> (windows 3 '(1 2 3 4))
  ((1 2 3) (2 3 4) (3 4) (4))

But each list is suitable for destructuring in a let or each, which seems to be your intended use. e.g.,

  arc> (each (a b c) (windows 3 '(1 2 3 4))
         (prs a b c)
         (prn))
  1 2 3
  2 3 4
  3 4 nil
  4 nil nil
  nil

As for efficiency:

  arc> (do1 'ok (= xs (n-of 1000000 'x)))
  ok
  arc> (jtime (nctx 3 xs))
  time: 28047 msec.
  ok
  arc> (jtime (windows 3 xs))
  time: 21909 msec.
  ok

Hm. Not that much faster. Does it still cons less?

  arc> (= cons-count* 0)
  0
  arc> (let old-cons cons
         (def cons (x y)
           (++ cons-count*)
           (old-cons x y)))
  *** redefining cons
  #<procedure: cons>
  arc> (jtime (nctx 3 xs))
  time: 39062 msec.
  ok
  arc> cons-count*
  7000038
  arc> (= cons-count* 0)
  0
  arc> (jtime (windows 3 xs))
  time: 29046 msec.
  ok
  arc> cons-count*
  5000032

It doesn't cons nearly as much, so how else could it be made faster? Turns out reclist bites us.

  (def reclist (f xs)
    (and xs (or (f xs) (reclist f (cdr xs)))))

f is meant to be a boolean test, which we don't even care about: we just want to get past the first or clause. So, I write a more imperative loop.

  (def windows (n xs)
    (accum a
      (a (firstn n xs))
      (whilet xs (cdr xs)
        (a (firstn n xs)))))

This time (in a new session, so cons is back to normal):

  arc> (do1 'ok (= xs (n-of 1000000 'x)))
  ok
  arc> (jtime (nctx 3 xs))
  time: 28850 msec.
  ok
  arc> (jtime (windows 3 xs))
  time: 13012 msec.
  ok

That seems to do the trick! It even conses just as much as the old windows:

  arc> (= cons-count* 0)
  0
  arc> (let old-cons cons
         (def cons (x y)
           (++ cons-count*)
           (old-cons x y)))
  *** redefining cons
  #<procedure: cons>
  arc> (jtime (nctx 3 xs))
  time: 40256 msec.
  ok
  arc> cons-count*
  7000038
  arc> (= cons-count* 0)
  0
  arc> (jtime (windows 3 xs))
  time: 20308 msec.
  ok
  arc> cons-count*
  5000032

The next way I might try speeding it up is to inline firstn, which means we need a macro, but that seems hard to do at expansion time without passing a literal number instead of n. Plus, I think this definition strikes a nice balance between efficiency and self-containment (i.e., we need no new definitions).

I haven't figured out a way to simplify partition-words yet, but when I get a chance later I'll have a look. At first glance, I'd use symbols instead of 0, 1, and 2 for your character classes. e.g.,

  (with (NEVER-WORD* ";\"![]() \n\t\r"
         MAYBE-WORD* ".,'=-/:&?")
    (def charclass(c)
      (let c (coerce c 'string)
        (if (posmatch c NEVER-WORD*)
              'never
            (posmatch c MAYBE-WORD*)
              'maybe
              'always))))

It's more readable simply because words have more meaning than numbers.

-----

1 point by akkartik 5726 days ago | link

That was illuminating, thanks! That final definition of windows is awesome. I wrote these functions about six months ago, when I wasn't aware of reclist. Come to think of it, I still haven't used reclist before, so perhaps this thread should be called reclist not zip.

nctx => short for 'context of n'. For this post I deliberately tried to come up with three different perspectives on each function (description, name, test string) just to see what sticks.

-----

1 point by fallintothis 5725 days ago | link

That was illuminating, thanks!

Sure thing!

nctx => short for 'context of n'

Ah. I had guessed that, but it still seemed clunky. I'm liking sliding-window, but it's kind of long.

As for partition-words, it could be made clearer (less "ball of parens") with shorter definitions and more evenly distributed logic.

(conscar a l) is just (push a (car l)), so we can get rid of a macro.

Notice that charclass is used in two places: once for firstchar, and several times for any subsequent curr character. But each time it's called on curr, the newstate is set to something that's definitely not 'maybe (renamed per my previous comment). So it seems that this logic could be moved into charclass itself.

  (with (NEVER-WORD* ";\"![]() \n\t\r"
         MAYBE-WORD* ".,'=-/:&?")
    (def charclass (c prev next)
      (let c (coerce c 'string)
        (if (posmatch c NEVER-WORD*)
              'never
            (posmatch c MAYBE-WORD*)
              (if (or (whitec prev) (whitec next))
                  'never
                  'always)
              'always))))

This handles the curr characters. Before handling firstchar, we can simplify thus:

  (= never-word* ";\"![]() \n\t\r"
     maybe-word* ".,'=-/:&?")

  (def charclass (c prev next)
    (if (find c never-word*)
          'never
        (find c maybe-word*)
          (if (or (whitec prev) (whitec next))
              'never
              'always)
          'always))

Changes: (1) It's not a big deal, but Arc tends toward lowercase. (2) Since your variables were named like globals (i.e., with asterisks), I made them actually global. There's nothing wrong with the with block, but if you use that you probably shouldn't put asterisks after the names. (3) You can use find instead of the coerce & posmatch stuff. (4) I use prev instead of last to avoid confusion because Arc already defines a last function.

With this abstracted, we can analyze how state changes. Specifically, we see that firstchar can give a 'maybe result, though no other character can. I suspect this is a flaw in the logic, but preserved it anyway. It's cleaner to separate these two cases into their own functions, so I write

  (= never-word* ";\"![]() \n\t\r"
     maybe-word* ".,'=-/:&?")

  (def charclass (c)
    (if (find c never-word*)
          'never
        (find c maybe-word*)
          'maybe
          'always))

  (def charstate (c prev next)
    (caselet class (charclass c)
      maybe (if (or (whitec prev) (whitec next))
                'never
                'always)
            class))

Updating gives us the program:

  (= never-word* ";\"![]() \n\t\r"
     maybe-word* ".,'=-/:&?")

  (def sliding-window (n xs)
    (accum a
      (a (firstn n xs))
      (whilet xs (cdr xs)
        (a (firstn n xs)))))

  (def charclass (c)
    (if (find c never-word*)
          'never
        (find c maybe-word*)
          'maybe
          'always))

  (def charstate (c prev next)
    (caselet class (charclass c)
      maybe (if (or (whitec prev) (whitec next))
                'never
                'always)
            class))

  (def partition-words (s)
    (unless (blank s)
      (with (ans (list:list:s 0) state (charclass s.0))
        (each (prev curr next) (sliding-window 3 (coerce s 'cons))
          (when curr
            (let newstate (charstate curr prev next)
              (if (is newstate state)
                  (push curr (car ans))
                  (push (list curr) ans))
              (= state newstate))))
        (rev:map [coerce (rev _) 'string] ans))))

That's as far as I got. I think the code's clearer, but I'm not even sure how the original was supposed to work. For example, yours and mine both give these results:

  arc> (partition-words "this slash / has spaces")
  ("this" " " "slash" " / " "has" " " "spaces")
  arc> (partition-words "/ this slash doesn't")
  ("/" " " "this" " " "slash" " " "doesn't")
  arc> (partition-words "My parents' car is fast.  It's all like 'vroom'.")
  ("My" " " "parents" "' " "car" " " "is" " " "fast" ".  " "It's" " " "all" " " "like" " '" "vroom'.")

But I defer to your judgment, since it's your program. ;)

-----

2 points by akkartik 5725 days ago | link

Very nice, thanks. I'm leaning toward windows as well. nctx is shorter, but it's not clear I use it often enough to justify the terseness.

Those single quote issues on the last line are def bugs, thanks.

partition-words isn't a tokenizer, it partitions the string along word boundaries. Hence the tokens containing spaces and punctuation. You can easily extract tokens with a single keep, but I found it useful to be able to easily recombine the partitions to reconstruct the original string.

-----