Add utf8 string view #784
It's a type alias, not a new type? Not something you can comptime discriminate on? Speaking for myself, I would just stick with plain
Sorry yes, just an alias.
In Unicode, a "rune" is a character from the runic block, like "ᚠᚢᛉ". A grapheme cluster is a sequence of codepoints that work together to form a single functional character, called a grapheme. A glyph is the vectors or pixels you use to render a grapheme.

The first thing that came to mind when I read

It looks like the heart of this PR is the iterators, and the

You really only need one iterator type with two different

Regarding naming and namespacing, I would put everything in the
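To make the distinction between bytes, codepoints, and graphemes concrete, here is a minimal sketch. `countCodepoints` is a hypothetical helper written for this illustration, not part of the PR, and it assumes its input is already valid UTF-8.

```zig
const std = @import("std");

// Hypothetical helper: counts codepoints by counting every byte that is
// not a continuation byte (10xxxxxx). Assumes `s` is valid UTF-8.
fn countCodepoints(s: []const u8) usize {
    var n: usize = 0;
    for (s) |b| {
        if ((b & 0xC0) != 0x80) n += 1;
    }
    return n;
}

test "one grapheme, two codepoints, three bytes" {
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT renders as a single grapheme.
    const s = "e\u{301}";
    try std.testing.expectEqual(@as(usize, 3), s.len);
    try std.testing.expectEqual(@as(usize, 2), countCodepoints(s));
}
```

An iterator over codepoints would yield two items for this string even though a reader sees one character, which is why the name of the element type matters.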
std/utf8.zig
            return false;
        }
    }
    return true;
if (i > s.len) return false;
You need this check in case a multi-byte sequence would run past the end of the buffer, for example validate("\xe2\x82").
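A minimal sketch of where such a bounds check sits in a validation loop. This is not the PR's code: `sequenceLength` and this `validate` are hypothetical stand-ins, and the sketch checks only structure (lead byte, length, continuation bytes), not overlong encodings or surrogates.

```zig
const std = @import("std");

// Hypothetical helper: sequence length implied by the lead byte,
// or null if the byte cannot start a sequence.
fn sequenceLength(first: u8) ?usize {
    if (first < 0x80) return 1;
    if ((first & 0xE0) == 0xC0) return 2;
    if ((first & 0xF0) == 0xE0) return 3;
    if ((first & 0xF8) == 0xF0) return 4;
    return null;
}

// Structural validation only; an assumption-laden sketch, not the PR's function.
fn validate(s: []const u8) bool {
    var i: usize = 0;
    while (i < s.len) {
        const len = sequenceLength(s[i]) orelse return false;
        // The bounds check under discussion: the sequence must fit in the buffer.
        if (i + len > s.len) return false;
        // Every trailing byte must be a continuation byte (10xxxxxx).
        var j: usize = 1;
        while (j < len) : (j += 1) {
            if ((s[i + j] & 0xC0) != 0x80) return false;
        }
        i += len;
    }
    return true;
}

test "truncated multi-byte sequence is rejected" {
    try std.testing.expect(!validate("\xe2\x82")); // 3-byte lead, only 2 bytes present
    try std.testing.expect(validate("\xe2\x82\xac")); // U+20AC, complete
    try std.testing.expect(validate("abc"));
}
```

Without the length check, the inner loop would index past `s.len` on the truncated input.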
3bd26d9 to df043e7
Thanks for the comments. This should be more in line with them. Regarding pre-validating and performance, I think it is okay for the moment. It means catching errors earlier when using this view, which is often nicer program behavior (fail early). If performance is really important, the user can always use a modified iterator themselves to avoid the dual decode, as the core decoding primitives are still available. I noticed the validation function wasn't actually doing a decode (beyond checking the length), which could let various bad UTF-8 through. This is fixed.
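The validate-once-then-iterate trade-off described above can be sketched roughly as follows. The names `View`, `Iterator`, `init`, `iterator`, and `next` are hypothetical and chosen for illustration; they are not taken from the PR. The sketch assumes `std.unicode.utf8ValidateSlice` from the current standard library for the up-front check.

```zig
const std = @import("std");

// Hypothetical view type: validates the whole slice once in init,
// so the iterator can decode without re-checking for errors.
const View = struct {
    bytes: []const u8,

    fn init(s: []const u8) error{InvalidUtf8}!View {
        if (!std.unicode.utf8ValidateSlice(s)) return error.InvalidUtf8;
        return View{ .bytes = s };
    }

    fn iterator(self: View) Iterator {
        return Iterator{ .bytes = self.bytes, .i = 0 };
    }
};

// Hypothetical iterator: relies on the bytes having been validated already.
const Iterator = struct {
    bytes: []const u8,
    i: usize,

    // Returns the next codepoint, or null at the end of the slice.
    fn next(self: *Iterator) ?u21 {
        if (self.i >= self.bytes.len) return null;
        const first = self.bytes[self.i];
        const len: usize = if (first < 0x80) 1 else if (first < 0xE0) 2 else if (first < 0xF0) 3 else 4;
        // Payload bits of the lead byte.
        var cp: u21 = switch (len) {
            1 => first,
            2 => first & 0x1F,
            3 => first & 0x0F,
            else => first & 0x07,
        };
        // Each continuation byte contributes its low 6 bits.
        var k: usize = 1;
        while (k < len) : (k += 1) {
            cp = (cp << 6) | (self.bytes[self.i + k] & 0x3F);
        }
        self.i += len;
        return cp;
    }
};

test "view validates once, then iterates codepoints" {
    const view = try View.init("a\xe2\x82\xac");
    var it = view.iterator();
    try std.testing.expectEqual(@as(?u21, 'a'), it.next());
    try std.testing.expectEqual(@as(?u21, 0x20AC), it.next());
    try std.testing.expectEqual(@as(?u21, null), it.next());
    // Invalid input fails at construction, not mid-iteration.
    try std.testing.expectError(error.InvalidUtf8, View.init("\xe2\x82"));
}
```

The cost is that each sequence is examined twice, once in `init` and once in `next`. A caller who cares about that could iterate with a fallible decoder directly and skip the view.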
Thanks! There's still a lot more work to be done in the unicode module, but this looks good for now.
Yeah. Just a reminder that we're going to audit the entire standard library before 1.0.0, so if something is an improvement over status quo and it has tests, it's good to merge.
An initial implementation for a utf-8 view of a byte slice. A buffer alternative may be useful down the track, analogous to `&str` and `String` in Rust. I don't need this just yet, though.

I've created a new `Rune` type to indicate a unicode codepoint (or scalar value, which it looks like the decoding is based on) as well. Are we happy with this name?

The module structure probably needs to be decided. How do we want to structure `unicode` and `utf8`? Do we want to include other decoding/encoding in the future (`utf-16` for Windows APIs) or not, and if so, how will this fit in? My current view was simply that `utf8.StringView` is more explanatory than `unicode.StringView`, which doesn't give any detail about the underlying storage.