Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add utf8 string view #784

Merged
merged 1 commit into from Feb 24, 2018
Merged

Add utf8 string view #784

merged 1 commit into from Feb 24, 2018

Conversation

tiehuis
Copy link
Member

@tiehuis tiehuis commented Feb 23, 2018

An initial implementation for a utf-8 view of a byte slice. A buffer alternative may be useful down the track, analogous to &str and String in Rust. I don't need this just yet, though.

I've created a new Rune type to indicate a unicode codepoint (or scalar value which it looks like the decoding is based on) as well. Are we happy with this name?

The module structure probably needs to be decided. How do we want to structure unicode and utf8? Do we want to include other decoding/encoding in the future (utf-16 for windows apis) or not and if so how will this fit in? My current view was simply that utf8.StringView is more explanatory than unicode.StringView which doesn't give any detail to the underlying storage.

@bnoordhuis
Copy link
Contributor

I've created a new Rune type to indicate a unicode codepoint (or scalar value which it looks like the decoding is based on) as well. Are we happy with this name?

It's a type alias, not a new type? Not something you can comptime discriminate on?

Speaking for myself, I would just stick with plain u32. Less mental overhead.

@tiehuis
Copy link
Member Author

tiehuis commented Feb 23, 2018

Sorry yes, just an alias.

@thejoshwolfe
Copy link
Sponsor Contributor

Are we happy with this name?

In Unicode, a "rune" is a character from the runic block, like "ᚠᚢᛉ". A grapheme cluster is a sequence of codepoints that work together to form a single functional character, called a grapheme. A glyph is the vectors or pixels you use to render a grapheme.

The first thing that came to mind when I read Rune was grapheme. I think the word that best describes what you're working with is simply Codepoint (or CodePoint). But I agree with @bnoordhuis that just u32 would be best.

It looks like the heart of this PR is the iterators, and the StringView type exists for two reasons: prevalidate so that the iterators aren't returning errorables; serve as an iterable factory so you can make multiple iterables from the same string view. So an alternative implementation would be to omit the StringView type, and just make iterators directly that return errorable codepoints. I can see both convenience and performance arguments for both approaches.

You really only need one iterator type with two different next methods. nextCodepoint() and nextCodepointSlice() or whatever names. (And the codepoint one can call the slice one, if you want)

Regarding naming and namespacing, I would put everything in the unicode module, at least according to the module's API. We can make subdirectories and an index.zig to bring all the functionality into a single namespace if we want, but it doesn't look like it's time for that yet. In the unicode module, name things with the word "utf8" where appropriate. Like I would call your types Utf8View and Utf8Iterator. I like being explicit about "utf8" right in the name of the type rather than calling anything a "string", which sounds like you're trying to hide the underlying representation from the API.

std/utf8.zig Outdated
return false;
}
}
return true;
Copy link
Sponsor Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if (i > s.len) return false;

You need this check in case a multi-byte sequence would overflow the end of the buffer. validate("\xe2\x82")

@tiehuis
Copy link
Member Author

tiehuis commented Feb 24, 2018

Thanks for the comments.

This should be more in line with them. In regards to pre-validating and performance I think it is okay for the moment. It means catching errors earlier when using this view which is often nicer program behavior (fail early). If performance is really important then the user can always use a modified iterator themselves to avoid the dual decode as the core decoding primitives are still available.

I noticed the validation function wasn't actually doing a decode (beyond checking the length) which could various bad utf8 through. This is fixed.

@thejoshwolfe thejoshwolfe merged commit 08d595b into master Feb 24, 2018
@thejoshwolfe
Copy link
Sponsor Contributor

Thanks! There's still a lot more work to be done in the unicode module, but this looks good for now.

@andrewrk
Copy link
Member

Yeah. Just a reminder that we're going to audit the entire standard library before 1.0.0 so if something is an improvement over status quo and it has tests, it's good to merge

@tiehuis tiehuis deleted the utf8-string-view branch March 8, 2018 08:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants