Accelerating Unicode Processing with StringZilla: 50x Speedups over ICU #61092

@ashvardanian

What is the problem this feature will solve?

Text processing isn't fast in JavaScript, especially when it comes to Unicode handling. Operations like case-folding or case-insensitive substring search are orders of magnitude slower than they can be with modern SIMD kernels.

What is the feature you are proposing to solve the problem?

Following up on this eXchange with @mcollina, I'm curious whether the latest release of my StringZilla library could be of help to Node.js and the broader JavaScript community.

In short: I grouped all Unicode 17 case-folding rules and wrote ~3K lines of AVX-512 kernels around them to enable fully compliant case-insensitive substring search across the full Unicode range of 1M+ code points, operating directly on the original UTF-8 bytes. It's not only often ~50× faster than ICU, but also "less wrong" than most search tools you'll reach for, from low-level grep to products like Google Docs, Microsoft Excel, and VS Code.
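To make the "less wrong" point concrete, here is a minimal sketch of why lowercasing both sides is not the same as Unicode case folding. The helper names `approxCaseFold` and `includesCaseInsensitive` and the four-entry table are hypothetical illustrations, not StringZilla's or ICU's API, and the table covers only a handful of the full case-folding rules.

```javascript
// Hypothetical helper, not part of StringZilla or ICU: approximates Unicode
// full case folding with toLowerCase() plus a tiny hand-written table.
// A compliant implementation needs the complete CaseFolding.txt rule set.
const FOLD_EXTRAS = new Map([
  ["\u00DF", "ss"],     // ß (sharp s) folds to "ss"
  ["\u017F", "s"],      // ſ (long s) folds to "s"
  ["\u03C2", "\u03C3"], // ς (final sigma) folds to σ
  ["\uFB01", "fi"],     // ﬁ (ligature) folds to "fi"
]);

function approxCaseFold(text) {
  let folded = "";
  // Iterate by code point so astral characters are not split.
  for (const ch of text.toLowerCase()) {
    folded += FOLD_EXTRAS.get(ch) ?? ch;
  }
  return folded;
}

function includesCaseInsensitive(haystack, needle) {
  return approxCaseFold(haystack).includes(approxCaseFold(needle));
}

// Lowercasing alone misses the match, because "ß" stays "ß".
const naive = "STRASSE".toLowerCase().includes("stra\u00DFe".toLowerCase());
// Folding both sides maps "ß" to "ss", so the match is found.
const folded = includesCaseInsensitive("STRASSE", "stra\u00DFe");

console.log({ naive, folded });
```

Note that this sketch allocates folded copies of both strings and cannot report match offsets in the original text, which is exactly the kind of overhead that searching on the original UTF-8 bytes avoids.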

I already have Node.js bindings available on npm, but, given the lack of Node-API functions for zero-copy access to internal string representations from C, my API is limited to accepting and returning Buffers. That's questionable from an ergonomics perspective, and the library would, of course, be much more usable if it were properly integrated with Node's native strings.
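The ergonomics cost of a Buffer-only API can be sketched as follows. The function `findViaBuffers` is a hypothetical wrapper, and Node's built-in `Buffer.prototype.indexOf` is used purely as a stand-in for a native byte-level search; it is not the StringZilla binding.

```javascript
// Hypothetical wrapper showing the data flow around a Buffer-only search.
// Buffer.prototype.indexOf is a Node.js built-in used here as a stand-in
// for any native routine that only accepts UTF-8 bytes.
function findViaBuffers(haystack, needle) {
  const haystackBytes = Buffer.from(haystack, "utf8"); // copy 1: encode haystack
  const needleBytes = Buffer.from(needle, "utf8");     // copy 2: encode needle
  const byteOffset = haystackBytes.indexOf(needleBytes);
  if (byteOffset === -1) {
    return { byteOffset: -1, stringIndex: -1 };
  }
  // The result is a byte offset, not a UTF-16 index into the JS string,
  // so converting it back means decoding the prefix (copy 3).
  const stringIndex = haystackBytes
    .subarray(0, byteOffset)
    .toString("utf8").length;
  return { byteOffset, stringIndex };
}

// "ï" takes two bytes in UTF-8, so the two offsets diverge.
const result = findViaBuffers("na\u00EFve caf\u00E9 search", "caf\u00E9");
console.log(result); // { byteOffset: 7, stringIndex: 6 }
```

Each call pays for encoding both inputs and for translating the offset back, which is the overhead that direct access to the engine's string storage would remove.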

What alternatives have you considered?

ICU4C and ICU4X are the only options for this functionality beyond StringZilla. They are the rock-solid reference implementations, but often 5-150× slower than StringZilla.

Metadata

Labels: feature request (Issues that request new features to be added to Node.js.)

Status: Awaiting Triage