Add utf8 string view #784

tiehuis · 2018-02-23T08:28:55Z

An initial implementation for a utf-8 view of a byte slice. A buffer alternative may be useful down the track, analogous to &str and String in Rust. I don't need this just yet, though.

I've created a new Rune type to indicate a unicode codepoint (or scalar value which it looks like the decoding is based on) as well. Are we happy with this name?

The module structure probably needs to be decided. How do we want to structure unicode and utf8? Do we want to include other encodings in the future (UTF-16 for Windows APIs), and if so, how will they fit in? My current view is simply that utf8.StringView is more explanatory than unicode.StringView, which doesn't give any detail about the underlying storage.

bnoordhuis · 2018-02-23T09:06:48Z

> I've created a new Rune type to indicate a unicode codepoint (or scalar value which it looks like the decoding is based on) as well. Are we happy with this name?

It's a type alias, not a new type? Not something you can comptime discriminate on?

Speaking for myself, I would just stick with plain u32. Less mental overhead.

tiehuis · 2018-02-23T09:19:17Z

Sorry yes, just an alias.

thejoshwolfe · 2018-02-23T14:54:16Z

> Are we happy with this name?

In Unicode, a "rune" is a character from the runic block, like "ᚠᚢᛉ". A grapheme cluster is a sequence of codepoints that work together to form a single functional character, called a grapheme. A glyph is the vectors or pixels you use to render a grapheme.

The first thing that came to mind when I read Rune was grapheme. I think the word that best describes what you're working with is simply Codepoint (or CodePoint). But I agree with @bnoordhuis that just u32 would be best.

It looks like the heart of this PR is the iterators, and the StringView type exists for two reasons: (1) to prevalidate, so that the iterators aren't returning errorables, and (2) to serve as an iterable factory, so you can make multiple iterables from the same string view. So an alternative implementation would be to omit the StringView type and just make iterators directly that return errorable codepoints. I can see both convenience and performance arguments for either approach.

You really only need one iterator type with two different next methods. nextCodepoint() and nextCodepointSlice() or whatever names. (And the codepoint one can call the slice one, if you want)
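The single-iterator shape suggested above might look roughly like the following. This is a stand-alone sketch, not the code from this PR: the struct, its fields, and the u32 return type are assumptions made for illustration, and it assumes the bytes were already validated, which is why neither method returns an error.

```zig
const std = @import("std");

// Hypothetical stand-alone iterator, written for illustration only.
// Assumes `bytes` is already known to be valid UTF-8.
const Utf8Iterator = struct {
    bytes: []const u8,
    i: usize = 0,

    // Returns the bytes that encode the next codepoint, or null at the end.
    pub fn nextCodepointSlice(it: *Utf8Iterator) ?[]const u8 {
        if (it.i >= it.bytes.len) return null;
        // The lead byte alone determines the sequence length.
        const len: usize = switch (it.bytes[it.i]) {
            0x00...0x7f => 1,
            0xc0...0xdf => 2,
            0xe0...0xef => 3,
            else => 4,
        };
        const slice = it.bytes[it.i .. it.i + len];
        it.i += len;
        return slice;
    }

    // Built on top of the slice method, as suggested in the comment above.
    pub fn nextCodepoint(it: *Utf8Iterator) ?u32 {
        const slice = it.nextCodepointSlice() orelse return null;
        // Keep only the payload bits of the lead byte.
        var cp: u32 = switch (slice.len) {
            1 => slice[0],
            2 => slice[0] & 0x1f,
            3 => slice[0] & 0x0f,
            else => slice[0] & 0x07,
        };
        // Each continuation byte contributes its low 6 bits.
        for (slice[1..]) |b| {
            cp = (cp << 6) | (b & 0x3f);
        }
        return cp;
    }
};

test "one iterator, two next methods" {
    var it = Utf8Iterator{ .bytes = "a\u{20ac}" };
    try std.testing.expect(it.nextCodepoint().? == 'a');
    try std.testing.expect(std.mem.eql(u8, it.nextCodepointSlice().?, "\xe2\x82\xac"));
    try std.testing.expect(it.nextCodepoint() == null);
}
```

Because both methods advance the same index, a caller can mix them freely on one iterator.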

Regarding naming and namespacing, I would put everything in the unicode module, at least according to the module's API. We can make subdirectories and an index.zig to bring all the functionality into a single namespace if we want, but it doesn't look like it's time for that yet. In the unicode module, name things with the word "utf8" where appropriate. Like I would call your types Utf8View and Utf8Iterator. I like being explicit about "utf8" right in the name of the type rather than calling anything a "string", which sounds like you're trying to hide the underlying representation from the API.

thejoshwolfe · 2018-02-23T14:55:09Z

std/utf8.zig

+            return false;
+        }
+    }
+    return true;


if (i > s.len) return false;

You need this check in case a multi-byte sequence would run past the end of the buffer, e.g. validate("\xe2\x82").
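To see why the check matters, here is a minimal length-only validator. It is a sketch under assumptions, not the function from the diff: sequenceLength and validateLengths are hypothetical names, and the loop only steps over sequences by their declared length without decoding them.

```zig
const std = @import("std");

// Hypothetical helper: the sequence length implied by a lead byte, or null
// if the byte cannot start a sequence.
fn sequenceLength(first: u8) ?usize {
    if (first < 0x80) return 1;
    if ((first & 0xe0) == 0xc0) return 2;
    if ((first & 0xf0) == 0xe0) return 3;
    if ((first & 0xf8) == 0xf0) return 4;
    return null;
}

// Hypothetical length-only validation loop.
fn validateLengths(s: []const u8) bool {
    var i: usize = 0;
    while (i < s.len) {
        const len = sequenceLength(s[i]) orelse return false;
        i += len;
        // Without this check, a sequence cut off by the end of the buffer
        // would leave i > s.len, end the loop, and be reported as valid.
        if (i > s.len) return false;
    }
    return true;
}

test "a truncated sequence is rejected" {
    try std.testing.expect(validateLengths("\xe2\x82\xac"));
    try std.testing.expect(!validateLengths("\xe2\x82"));
}
```

In "\xe2\x82" the lead byte announces three bytes but only two are present, so the index jumps to 3 while the buffer length is 2.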

tiehuis · 2018-02-24T03:37:07Z

Thanks for the comments.

This should be more in line with them. With regard to pre-validating and performance, I think it is okay for the moment. It means catching errors earlier when using this view, which is often nicer program behavior (fail early). If performance is really important, the user can always write a modified iterator themselves to avoid the dual decode, as the core decoding primitives are still available.

I noticed the validation function wasn't actually doing a decode (beyond checking the length), which could let various bad utf8 through. This is fixed.
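The validate-once, iterate-freely pattern described here can be exercised as below. This sketch assumes the names found in the std.unicode module of recent Zig releases (Utf8View.init, iterator, nextCodepoint, utf8ValidateSlice); the API at the time of this PR, or in other releases, may differ.

```zig
const std = @import("std");

test "validate once in init, then iterate without errors" {
    // init is the only step that can fail; iteration afterwards returns
    // plain optionals rather than error unions.
    const view = try std.unicode.Utf8View.init("a\u{20ac}");
    var it = view.iterator();
    try std.testing.expect(it.nextCodepoint().? == 'a');
    try std.testing.expect(it.nextCodepoint().? == 0x20ac);
    try std.testing.expect(it.nextCodepoint() == null);

    // "\xc0\x80" has a plausible length (a 2-byte lead plus one more byte)
    // but is an overlong encoding, so a length-only check would accept it.
    try std.testing.expectError(error.InvalidUtf8, std.unicode.Utf8View.init("\xc0\x80"));

    // A sequence cut off by the end of the buffer is rejected as well.
    try std.testing.expect(!std.unicode.utf8ValidateSlice("\xe2\x82"));
}
```

The cost is that well-formed input is decoded twice, once in init and once during iteration, which is the dual decode mentioned above.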

thejoshwolfe · 2018-02-24T18:32:48Z

Thanks! There's still a lot more work to be done in the unicode module, but this looks good for now.

andrewrk · 2018-02-24T18:37:53Z

Yeah. Just a reminder that we're going to audit the entire standard library before 1.0.0, so if something is an improvement over the status quo and it has tests, it's good to merge.

thejoshwolfe reviewed Feb 23, 2018

Add utf8 string view

df043e7

tiehuis force-pushed the utf8-string-view branch from 3bd26d9 to df043e7 Compare February 24, 2018 03:31

thejoshwolfe merged commit 08d595b into master Feb 24, 2018

tiehuis deleted the utf8-string-view branch March 8, 2018 08:45
