Experiencing irrational feelings of jealousy watching rin and kaniini rewrite a fast html scrubber. Not sure it would be a reasonable thing for me to dive into. Alas.

@tedu @kaniini hopefully "fast" doesn't mean "foolable". Any scrubber should be super conservative with what it allows through and be unable to "fail open." If in doubt, throw it out!


@kaniini @tedu I mostly mention this because if I needed to scrub HTML I'd probably do it the slow way: use a maintained parser, a visitor that reconstructs the parse tree while pulling out only the info it understands (i.e. no blind recursion), and then re-serialize that. To do it the fast way I'd probably use an event-driven parser that tries as hard as possible to avoid any memory allocation or backtracking, possibly even operating in-place.
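A minimal sketch of the event-driven allowlist approach, using Python's stdlib `html.parser.HTMLParser` (event callbacks, no tree built). The tag and attribute allowlists here are illustrative, not anyone's actual policy; a real scrubber also needs to vet URL schemes, CSS, and more.

```python
# Sketch only: allowlist scrubber on top of the stdlib event-driven parser.
# ALLOWED_TAGS / ALLOWED_ATTRS are illustrative, not a vetted policy.
from html import escape
from html.parser import HTMLParser

ALLOWED_TAGS = {"p", "b", "i", "a", "br"}
ALLOWED_ATTRS = {"a": {"href"}}

class Scrubber(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_TAGS:
            return  # drop anything we don't understand
        kept = [(k, v) for k, v in attrs
                if k in ALLOWED_ATTRS.get(tag, set()) and v is not None]
        attr_str = "".join(f' {k}="{escape(v, quote=True)}"' for k, v in kept)
        self.out.append(f"<{tag}{attr_str}>")

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Caveat: text inside a dropped <script> still reaches this
        # callback; it comes out escaped (inert but visible), so a real
        # scrubber would track and suppress content of dropped elements.
        self.out.append(escape(data))

def scrub(html: str) -> str:
    s = Scrubber()
    s.feed(html)
    s.close()
    return "".join(s.out)
```

Unknown tags and unlisted attributes simply never get emitted, which is the "only pull out info it understands" property: the default is to drop, not to pass through.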

@tedu @kaniini I guess one could make that reasonably safe by assuming reasonably valid HTML and bailing out completely if anything unexpected shows up, perhaps even zeroing out the string (if it operates in-place) to make sure the caller can't go on and use it anyway.
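The fail-closed idea can be sketched in a few lines. This is a deliberately trivial policy (reject all markup outright, ASCII only) invented for illustration; the point is the shape of the failure path: any surprise zeroes the buffer in place so a caller can't shrug off the error and use the data anyway.

```python
def scrub_strict(buf: bytearray) -> bool:
    """Fail-closed in-place sketch: returns True only if buf is clearly
    safe; on ANY unexpected input, zeroes the buffer and returns False.
    The policy here (no markup characters, ASCII only) is a stand-in."""
    try:
        text = buf.decode("ascii")  # assumption: non-ASCII is "unexpected"
        if "<" in text or ">" in text or "&" in text:
            raise ValueError("markup not allowed")
        return True
    except (UnicodeDecodeError, ValueError):
        for i in range(len(buf)):
            buf[i] = 0  # destroy the data so it can't be used as-is
        return False
```

Zeroing on failure turns "caller ignored the return code" from a security hole into a visible bug: the downstream output is obviously blank rather than silently unsanitized.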

@freakazoid @tedu that's basically what we did. the previous gold standard for html parsing used mochiweb, we replaced it with a parser from an actual browser.