Type Independent String Functions #891

BraedonWooding · 2018-04-04T12:34:55Z

What is this PR

See here #912, this explains it in a deeper level.

What EXACTLY is in this PR

Adds 'endsWith' to mem; this is purely since there is a 'startsWith'
Moves 'join' to stringUtils and makes it type independent according to Type Independent String Manipulation #912
Adds AsciiView and AsciiIterator, and makes Utf8View and Utf8Iterator comply with the standards of Type Independent String Manipulation #912
Adds encode function to Utf8 (to go from codepoint to array of bytes)
Adds CodepointSequenceLength function to utf8 (to count length of bytes that a codepoint will be)
Restructured utf8 functions to not have the utf8 prefix as they exist within 'utf8' library now; such that you call them like utf8.View rather than utf8View should just make the code within it easier to understand :), this is similar to how its been done in Math
Added representation folder and corresponding index.zig, this is just since most likely we will have around 4 representations (utf8/16, ascii, and char* strings), note: you can access them through std without having to go through representations.

What this PR ISN'T

This PR is not going to be ANYTHING to do with LOCALE.

Furthermore this PR isn't going to edit any code to use Unicode string functions as that'll push it away from its focused use, it'll still add tests of course and I presume any new code will use unicode but any old code should continue to use Ascii (and will merge over to ascii views) as for example I don't know if Path.zig just pretends the bytes it gets are ascii but really are unicode or if it reports an error if you give it a unicode or whatever, and due to the size of path.zig along with the fact that Windows uses a special Utf16 version that should be a separate PR :).

Checklist

Comply Utf8View/Iterator to the View/Iterator standards for string manipulation (basically more functions and some variables have to be named a certain way).
Build AsciiView
Expand trim to allow specifying side
Expand IndexOfPos and its similar forms to allow specifying choosing the highest index that matches or the lowest (though currently it doesn't go back to front if choosing highest, which I think it probably should but I didn't want to add a ton of conditions, though I think I'll make highest a comptime variable then I'll be happy adding more conditions)
Move join to stringUtils.zig, this is due to the fact that it explicitly requires strings it isn't for memory joining; i.e. as the description says its for joining strings!
Make trim for strings (i.e. string version independent)
Cleanup and fix bugs

Short Explanation of Views/Iterators

Views are just a way to view a memory in two ways simultaneously, both as raw data and as what it represents for example utf8 you can index the raw data or the code points it represents (which are u32 instead of u8 for example), you can of course do other actions like comparing views (if same type of course) and slicing/indexing both code points and raw data. The only function that can return an error is view.init which should perform a validity check on the data (there is also an initUnchecked function) as when indexing data there is no reason why it should be invalid.

Iterators just allow you to cycle through a view, they allow you to obtain the 'nextBytes' or the 'nextCodepoint' representing either the next set of bytes or the actual code point for some iterators they are the same thing for example Ascii which will always return a byte array of size 1 for nextBytes (though of course it'll return it as an array rather than an individual element to fit the spec of iterators), for that reason you should commonly use nextCodepoint as that is more efficient in cases like Ascii and the reason you are using iterators is to get the functionality of turning the raw unicode data into code points :). There are no error throwing functions in iterators, this is to make your loops nicer and your code much nicer :).

Note: When I saw no error throwing functions I mean that those functions don't have an opportunity to throw because of the way they were designed; that is IF the data is valid (checked by the initial init) then all the functions won't throw. If that data isn't valid then that is undefined behaviour, and that is why using initUnchecked is typically bad behaviour and you should rather use try init :).

BraedonWooding · 2018-04-04T12:54:48Z

I'll style guide it up sometime soon :).

Hejsil · 2018-04-04T12:55:57Z

A few of these seem very simular to function from std.mem

find_stris very simular to the mem.indexOf family.
strip is very simular to mem.trim.
starts_with is very simular to mem.startsWith.
split takes a different approuch to mem.split, so maybe it should be a wrapper to that instead.
str_eql is very simular to mem.eql.

I don't think we need the same functions in two places. As for the rest, I like the idea of having an ascii.zig, just like we have unicode.zig. After all, a string is just bytes. What the string means is a matter for encoding. Functions like to_upper, is_num and all those could go in that.

Hejsil · 2018-04-04T13:10:56Z

std/string_utils.zig

+    return byte >= 'A' and byte <= 'Z';
+}
+
+test "String_Utils" {


I recommend splitting this into some smaller tests like:

test "string_utils.is_ascii_letter" { assert(is_ascii_letter('C')); assert(is_ascii_letter('e')); assert(!is_ascii_letter('2')); }

Yes of course, I just wanted to get a basic testing suite up first :).

BraedonWooding · 2018-04-04T13:29:57Z

find_str handles both lowest and highest index which mem.indexOf doesn't, strip handles the ability to just do left or right side, or both where mem.trim doesn't. Split is well not lazy so yes perhaps it does, and str_eql calls mem_eql :), it is just there to replace mem.str_eql (I moved it as per my description).

So overall the only function that really doesn't need to exist is starts_with which I just added due to ends_with existing. Would it be better for me to port my methods over to memory instead of having a separate string utility class? That is add the non-lazy method, the trimming either side, the highest/lowest indexOf choice, and the str_eql (since its used a lot for hashmaps) over to memory?

But still keep the other 'ifX' functions in string utilities?

Hejsil · 2018-04-04T13:34:21Z

I think that is better. After all, trimming, finding and splitting is not unique to strings. Unicode strings would probably need their own versions of some of these (if that is useful).

BraedonWooding · 2018-04-04T13:35:31Z

Yes, I plan to do that separately :). I would say it is useful? Though we need a sort of locale sorted first.

andrewrk

Thanks for the PR! I'm a little worried that this leads us down a path of misuse of strings and unicode. See #234

andrewrk · 2018-04-04T14:31:48Z

.gitignore

@@ -1,3 +1,4 @@
 zig-cache/
 build/
 build-*/
+.vscode/


Fair enough I'll remove it :), I shouldn't be calling 'git add .' anyways.

I think you should be able to use git add .. Everything to ignore falls into 2 categories: system-wide ignore, or project-specific ignore. If both are correct then git add . works just fine.

andrewrk · 2018-04-04T14:40:18Z

std/index.zig

@@ -30,6 +30,7 @@ pub const rand = @import("rand/index.zig");
 pub const sort = @import("sort.zig");
 pub const unicode = @import("unicode.zig");
 pub const zig = @import("zig/index.zig");
+pub const stringUtils = @import("string_utils.zig");


The decision to put these things in something called "memory" rather than "string" is intentional.

A "string" would be something like []u32, where each item is a unicode point. Zig std lib does not have a string data type yet - and it looks like this PR does not introduce one. Although we do have some utf8 decoding and encoding code in std/unicode.zig.

String utilities would be a welcome addition to the zig standard library, but they should operate on unicode points, not encoded bytes. Utilities that operate on encoded bytes should be clear about that, using parameter names such as "bytes" and explaining the difference between encoded strings and actual strings in the docs.

andrewrk · 2018-04-04T14:42:46Z

std/string_utils.zig

+    return byte >= 'a' and byte <= 'z';
+}
+
+pub fn is_ascii_upper(byte: u8) bool {


I'm fairly confident that we should not have asciiUpper and asciiLower in the standard library for these reasons:

it's too easy to misuse, when actual strings should be used instead

if it really is the odd case where you need ascii upper/ascii lower, writing out the function body is easy enough. it should be enough tedium to get people to consider using the actual string functions.

Do you have an actual use case for ascii upper/lower right now?

In the randomizer I use a few ascii functions(They take &const u8 because I use them as fn ptrs to generic functions), because I know the strings I get from the nds headers are ascii. I don't mind not having it in std though.

Hm, that's fair. Maybe it's not so bad to have them. We should definitely put a bunch of noticeable warning signs near them and avoid the use of the word "string".

They could live in ascii.zig. Then you sign up for ascii problems by importing that.

I'm ok with that.

note that "ascii" is different from []u8. ascii requires the top bit of every byte to be 0, so if you're doing ascii string processing, you might consider validating that top bit.

those ascii functions you linked are correct, because they check for specific ranges of values, but something like ascii_to_upper() should perhaps assert the top bit is 0.

One could also just check if it was within 127? That avoids the bit masking operations (which are more expensive and remove clarity)? But yes it probably does have to be checked, though I would rather verify data through the locale viewer (which would check just once) rather than an indeterminate amount of times :).

BraedonWooding · 2018-04-05T00:39:16Z

std/unicode.zig

+        // Is made up of one byte
+        // Can just add a '0' and then the code point
+        // Thus can just output 'c'
+        var result: [1]u8 = undefined;


@andrewrk, @Hejsil
This is an idiom that I've done a lot since returning arrays from functions ain't pretty due to the arrays being allocated within the function stack, is there a better way to get an array output, and furthermore is there a better way to assign a slice to a new slice size without doing this 'trick'.

In most languages there is a syntax for creating a new array on the stack and assigning its value to something else, currently this requires a hidden memcpy presumably; which I think we could optimise away.

This is returning stack-allocated data from within the function to the caller and isn't valid.

Consider the following example:

const std = @import("std"); const warn = std.debug.warn; fn example(out: &[]const u8, s: u8) void { var a: [4]u8 = undefined; a[0] = s; a[1] = s + 1; a[2] = s + 2; a[3] = s + 3; *out = a; } pub fn main() void { var a: []const u8 = undefined; example(&a, 0); warn("a: "); for (a) |e| { warn("{} ", e); } warn("\n"); var b: []const u8 = undefined; example(&b, 1); warn("b: "); for (b) |e| { warn("{} ", e); } warn("\n"); warn("a: "); for (a) |e| { warn("{} ", e); } warn("\n"); }

Gives the output:

$ zig run example.zig a: 0 1 2 3 b: 1 2 3 4 a: 1 2 3 4

That is the best case that is happening, there are no guarantees about the data returned since everything within the function call is out of scope on exit unless it was dynamically allocated.

I would suggest instead taking a raw slice and simply asserting that the user has provided a sufficiently sized buffer. This allows the user to use their own buffer which they may want and provides the most flexibility.

Fixing up that previous example:

const std = @import("std"); const warn = std.debug.warn; fn example(out: []u8, s: u8) void { std.debug.assert(out.len >= 4); out[0] = s; out[1] = s + 1; out[2] = s + 2; out[3] = s + 3; } pub fn main() void { var a: [4]u8 = undefined; example(a[0..], 0); warn("a: "); for (a) |e| { warn("{} ", e); } warn("\n"); var b: [4]u8 = undefined; example(b[0..], 1); warn("b: "); for (b) |e| { warn("{} ", e); } warn("\n"); warn("a: "); for (a) |e| { warn("{} ", e); } warn("\n"); }

If you really wanted to return fixed data you could create a new struct and wrap the array. This requires bounding the array size to the maximum needed. This is less flexible and I wouldn't implement it this way.

const std = @import("std"); const warn = std.debug.warn; const Encoded = struct { out: [4]u8 }; fn example(s: u8) Encoded { var a: Encoded = undefined; a.out[0] = s; a.out[1] = s + 1; a.out[2] = s + 2; a.out[3] = s + 3; return a; } pub fn main() void { const a = example(0); warn("a: "); for (a.out) |e| { warn("{} ", e); } warn("\n"); const b = example(1); warn("b: "); for (b.out) |e| { warn("{} ", e); } warn("\n"); warn("a: "); for (a.out) |e| { warn("{} ", e); } warn("\n"); }

To return the length I'd just return it as the return value.

The only problem with it is that this is used for encoding bytes back to a byte[] which means the size isn't known prior for example, this means that it gets a bit finicky. The way I think may be better for this unknown size thing could either be requiring them to run a function to get how many bytes the code point will be OR;

fn example(s: [4]const u8) []const u8 { // ... do stuff with 's' return s[0..sizeWanted]; }

Perhaps? Or just result to a full buffer based system and return the length of that byte? Giving them the opportunity to either slice it or leave it? Tricky problem this one is.

I agree, I don't like the struct way; it feels icky.

You could do this, and get a static guarantee that the buffer is a certain size:

fn example(out: &[4]u8, s: u8) void { (*out)[0] = s; (*out)[1] = s + 1; (*out)[2] = s + 2; (*out)[3] = s + 3; }

Related: #863

Oooh I do like that too.

One issue with requiring an array of a specific size is that you are forcing the caller to use an array of this type. The caller will need to memcpy the bytes if they want to move the bytes into any other sized array. The static guarantees are nice, but if the data is copied anywhere else you will be using a slice interface anyway so will simply be performing the implied check later down the track.

Also, does anywhere else in std take a fixed array argument?

This is why I want #863, which will allow slice -> array pointer casts.

It is true that the check has to be performed some other place than the call, but isn't this the same as the maybe type. Giving you a static guarantee that something is true, requiring the caller to ensure the truth.

I suppose the main point I'm trying to say is that since a slice is a view into a backing array so in my view it doesn't make sense to limit the output type to an array of a specific size, when you could use a slice and take an arbitrary sized array.

Your proposal looks nice and handles exactly this scenario. For now, I'd stick with taking a slice and if down the track we get better guarantees we can update any of these occurences.

… this)

…erator

BraedonWooding · 2018-04-05T08:32:21Z

@thejoshwolfe Pinging you because I know you don't think we should do locales :). If you look at locale.zig and unicode.zig (I just made a few minor improvements) in this PR (ignoring string_utils.zig as that will disappear I'm just keeping it around as I'll reflect some of the nice options of it like finding highest index over to memory). This is kinda the intention of the scope, it is purely based around unicode and not trying to handle localisation. It can also handle things like time/number formatting of course but that's again around what I would call unicode. Most countries would fit a blanket toUpper and toLower and some would need a custom, at minimum this would be quite effective for atleast the english language as it does allow you to insert whatever type you want, though I would cut down on the formatter and make it just english. I however feel that we could cover a series of similar languages with it atleast?

Hejsil · 2018-04-06T07:45:25Z

std/unicode.zig

@@ -155,6 +156,18 @@ pub const Utf8View = struct {
        return initUnchecked(s);
    }

+    pub fn eql(self: &const Utf8View, other: &const Utf8View) bool {


I think mem.eql should be used when you don't care about encoding and only about memory, while something like unicode.eql should take into account that a "letter" can be a combination of many unicode code points, and that the order of these code points, don't always matter.

I did not realise that, I'll look into it and build a proper solution :0

Hejsil · 2018-04-06T07:47:52Z

std/unicode.zig

+        return mem.eql(u8, self.bytes, other.bytes);
+    }
+
+    pub fn slice(self: &const Utf8View, start: usize, end: usize) !Utf8View {


I would expect that slicing a unicode string would be from start_code_point to end_code_point, and not just slice the memory and validate that the substring still is valid.

But then again, the string could be big so maybe this is better

Yes that check is not needed :), as said in my other response. This slice will be renamed because that is a valid point it mainly is used in my split function for example which doesn't need a concept of end points; how about sliceRaw? I'll do a proper slice one since I can just check the first bit to detect when to iterate the index though it'll be max O(N) { N := the raw point corresponding to the code point index } just to detect the range which is unavoidable.

Somehow I missed your second point, it isn't always needed :). So I'll probably add it as 'sliceRawUnchecked' for example

Hejsil · 2018-04-06T07:51:42Z

std/mem.zig

-        .split_bytes = split_bytes,
-    };
-}
+pub fn split(comptime viewType: type, comptime iterator_type: type, comptime baseType: type) type {


I don't think the things in mem should use these "string iterators". When I use mem.split I only care about memory, not encoding, views, interators and all that.

The next fn in here is a good reason, why mem.split shouldn't use these iterators. Iterating over memory can't fail with an error, because the elements don't need to be validated to decoded.

I disagree with both points because of a few reasons :).

We need split in a few cases (just in std); first for string splitting in windows which is Utf16 with a few weird oddities, with my example you can just create a Utf16Win view and corresponding iterator and you don't have to create a unique function for it :), we also need a generic Utf8 one for 'general use' and a raw Ascii one for memory or when you know your just dealing with bland non encoded data.

Thus if we just pretend that iterators and views don't exist it becomes combersome :).

You generally don't have to care about views and iterators :) as shown in the test example there exists a type under stringUtils called string_it which maps the iterators and views for you :). That is instead of calling split("A/B/C/", "/") you call stringUtils.string_it.init("A/B/C/", "/"), currently 'next' returns a view however I'm also gonna add a function that returns the raw type that you specify which in the case of the string one is []const u8, this should solve your issue of views as you'll never need to touch one. Views are typically a good thing as they generalise.

Your second statement is also false, Iterators don't EVER fail with an error :), just look at both the Utf8 Iterator and Ascii Iterator they have type ?[]u8 not !?[]u8 because of course they don't fail, HOWEVER the views can fail if you give them invalid data, I can change it to just set the data regardless however this would then ensure that the data you originally give it has to be valid which it doesn't necessarily have to be :), however technically you are right next doesn't have to be error checked since I can validate that the slice can use initUnchecked.

So overall I can make next() not return an error ever and it kinda is just due to me wanting to be 100% safe and that kinda eroded the practicality of it :). So with all that said I don't particularly get why you don't like it?

My only real problem is that I don't think it should be in mem. Memory splitting and string splitting is different. When we split memory, there is no concept of valid code points, encoding and all that. Therefore I think the interface should be simpler.

I agree, that for splitting paths, the string splitting functions you provide should be used, as paths are strings. However, there are cases where you just have some memory ([]u8, []16, []32, []64) and want to split on some values. Btw, stringUtils.string_it only works on ascii, so you can't split any arbitrary sequence of bytes.

I guess you could provide a memory view and iterator though, and that might work out. I really think split should not be able to fail not next for these though :)

Yeh it won't fail for next I already finished that, currently it can only fail upon initialisation of the iterator. Also you can get it to return the raw data instead of a view.

And yes it only works for ascii that is just due to me not bothering naming it properly. I think I'll name it stringUtils.split_ascii_it, and have stringUtils.split_it for Utf8 and have stringUtils.split_win_it for the windows special Utf16 (that is super annoying why can't windows do anything right lel).

Again I completely agree next will not fail for anything after my next set of changes :).

One last thing; so you would prefer mem.split to use the old system and have the new split exist under stringUtils :)?

Really appreciate you having an in-depth look at my code by the way, your points are really valid and are helpful 😄 .

I think so. I think mem should only concern itself with slices of data. Not a more complicated iterator/view system. Ofc, I don't have the final say, as I'm not a "designer" of Zig or it's std library.

One note: paths are not strings in linux. They are null-terminated arrays of bytes. You can have invalid utf-8 as a linux file path.

Oh yes I'll make sure that it allows for null-terminated arrays of bytes :). Pretty sure that path sanitises it to a slice anyway but I'll double check (then converts it back at the end).

…ce now

there have been changes

andrewrk · 2018-04-13T15:50:07Z

CMakeLists.txt

@@ -452,6 +452,10 @@ set(ZIG_STD_FILES
    "index.zig"
    "io.zig"
    "linked_list.zig"
+    "string_utils.zig"
+    "representations/utf8.zig"


representations is an ambiguous word. it asks the question, "representing what?"

Ahh fair, I'll work on a better name :).

andrewrk · 2018-04-13T15:51:00Z

build.zig

@@ -55,8 +56,8 @@ pub fn build(b: &Builder) !void {
    addCppLib(b, exe, cmake_binary_dir, "zig_cpp");
    if (lld_include_dir.len != 0) {
        exe.addIncludeDir(lld_include_dir);
-        var it = mem.split(lld_libraries, ";");
-        while (it.next()) |lib| {
+        var it = try string.asciiSplit(lld_libraries, ";");


it looks to me that every place you replaced mem.split with asciiSplit is incorrect. Here we are quite intentionally splitting on utf8-encoded bytes.

Yes :), mem.split only worked on ascii bytes so I've replaced it to only work with ascii then I was aiming to go through and replace it with utf8 where needed :).

Also I want to note that the fact you can just change it to string.utf8Split(...) and you suddenly use unicode correctly kinda shows how 'cool' it is.

andrewrk · 2018-04-13T15:51:24Z

std/os/path.zig


-pub const sep_windows = '\\';
-pub const sep_posix = '/';
+pub const sep_windows : u8 = '\\';


why add u8 ?

Haven't gone through and removed little needless changes sorry.

andrewrk · 2018-04-13T15:52:17Z

std/os/path.zig

-    } else {
-        return joinPosix(allocator, paths);
-    }
+    // Currently a work around for the expansion to allow string separators 


std.os.path.join as it is currently implemented is correct, fully implemented, finished, and passes all tests. Why are we adding workarounds for anything?

Just converting a character into a string since the new join takes a string separator since I found it rarer to want a character than a string i.e. ", " is very common. If you know of a way to convert a character into a string without this little workaround then I would be overjoyed :).

Currently, I've decided to have a function to handle that, it literally does just convert it into an array like []u8 { sep } so functionally its the same but atleast its a bit nicer :). So you can just call string.joinCharSep() and give a single character sep rather than a string :).

…ion :)

BraedonWooding · 2018-04-14T10:48:24Z

@andrewrk I don't have much else I want to do with this PR, so it comes down to whether or not you want me to keep the type invariant stuff or make it JUST utf8. I'm happy to change it over, however I still stand by my original view that its more helpful to have it then it is to remove it :).

I'll have another look

andrewrk

Let's discuss the best way to solve this use case in #912

andrewrk · 2018-04-16T20:55:33Z

build.zig

@@ -55,8 +56,8 @@ pub fn build(b: &Builder) !void {
    addCppLib(b, exe, cmake_binary_dir, "zig_cpp");
    if (lld_include_dir.len != 0) {
        exe.addIncludeDir(lld_include_dir);
-        var it = mem.split(lld_libraries, ";");
-        while (it.next()) |lib| {
+        var it = try string.utf8Split(lld_libraries, ";");


I haven't changed my mind about this since the last PR comments

I need more clarification about what specifically you dislike :). Is it the call name ? Like is it just because it is not called string.split or is it the fact it uses generics behind the scene?

Again i'm happy to remove the type independent stuff if need be :)

andrewrk · 2018-04-16T20:56:06Z

doc/docgen.zig

@@ -309,7 +310,7 @@ const Node = union(enum) {
 const Toc = struct {
    nodes: []Node,
    toc: []u8,
-    urls: std.HashMap([]const u8, Token, mem.hash_slice_u8, mem.eql_slice_u8),
+    urls: std.HashMap([]const u8, Token, string.hashStr, string.strEql),


this hash map is hashing bytes, not strings.

I think this shows a problem with the way its done because the hash_slice_u8 for example and hashStr would be identical and that leads to a singular name meaning two different things, I guess how would you approach this because in situations where it is hashing strings not bytes it leads to this same problem? I would either declare aliases or just live with one of them.

andrewrk · 2018-04-16T20:57:45Z

std/mem.zig

@@ -130,6 +130,16 @@ pub fn copy(comptime T: type, dest: []T, source: []const T) void {
    for (source) |s, i| dest[i] = s;
 }

+pub fn copyRange(comptime T: type, dest: []T, destStart: usize, destEnd: usize, source: []const T, sourceStart: usize, sourceEnd: usize) void {


instead, callers should use slicing, and then call mem.copy.

Yes, I used that but then I realised slicing is soo much easier then just slipped my mind to remove it, even when you go through all your commits it is hard to find little things like this haha.

andrewrk · 2018-04-16T20:58:32Z

std/mem.zig

 /// Remove values from the beginning and end of a slice.
-pub fn trim(comptime T: type, slice: []const T, values_to_strip: []const T) []const T {
+pub fn trim(comptime T: type, slice: []const T, values_to_strip: []const T, side: Side) []const T {
+    // Asserting 


what is this comment communicating?

That I forgot to remove it :P

andrewrk · 2018-04-16T20:59:06Z

std/mem.zig

-pub fn trim(comptime T: type, slice: []const T, values_to_strip: []const T) []const T {
+pub fn trim(comptime T: type, slice: []const T, values_to_strip: []const T, side: Side) []const T {
+    // Asserting 
+    assert(side == Side.LEFT or side == Side.RIGHT or side == Side.BOTH);


this is guaranteed by the type system. any time a type is casted to an enum there is a runtime safety check.

Oh sweet, still learning things

andrewrk · 2018-04-16T21:04:15Z

std/os/child_process.zig

@@ -63,9 +64,11 @@ pub const ChildProcess = struct {
        PermissionDenied,
        InvalidUserId,
        ResourceLimitReached,
+        InvalidCharacter,


these should not be errors that are possible for ChildProcess.spawn to emit.

Why? They are possible now due to it now using Utf8? In saying that it should only return one of them, I believe InvalidCharacter, so I'll remove the other one.

andrewrk · 2018-04-16T21:10:18Z

std/string/string_utils.zig

+const ascii = std.string.ascii;
+const utf8 = std.string.utf8;
+
+/// Returns an iterator that iterates over the slices of `buffer` that are not


these docs are incorrect

The only incorrect bit is that it references 'bytes' :)

BraedonWooding · 2018-04-20T13:08:14Z

@andrewrk I fixed all the changes requested BUT the hash situation and the split one. This is because I need clarification on those two things :). I will be happy to make the changes once I get clarification.

I'll resolve this PR this weekend, one way or another

andrewrk · 2018-04-22T23:11:47Z

The stdlib string data type and unicode handling is extremely important. We need to get it right. And how we do it is going to be affected by #130 and #770. It's not time to introduce a string type yet.

This is an area where it's difficult to contribute to. Thank you for putting effort into this change; you've given the zig project some things to think about.

For now, zig's stance on strings is that we deal with encoded UTF-8 bytes. The standard library does not yet have any concept of a string, although it provides some unicode encoding and decoding facilities. When zig becomes more mature, we will solve this problem.

As always, if someone is working on a project using zig and runs into a use case where this would have solved the problem, please open an issue and we can figure out short-term and long-term solutions.

thejoshwolfe · 2018-04-22T23:18:55Z

From the OP:

Adds 'endsWith' to mem; this is purely since there is a 'startsWith'

Adds encode function to Utf8 (to go from codepoint to array of bytes)

These sound valuable regardless of the rest of this PR. You could probably make a PR for each one.

BraedonWooding · 2018-04-22T23:23:37Z

Fair enough, thanks for having a look :).

@thejoshwolfe I'll do a smaller PR with just a few of the key changes like those :) (as well as ones like codePointSequenceLength, as well as like allowing to trim from only one side and so on).

BraedonWooding added 4 commits April 4, 2018 22:20

String Utilities

40f016f

string utilities

54b5c00

remove weird files

ae1b7e4

Changed higher to upper

3f20233

Hejsil reviewed Apr 4, 2018

View reviewed changes

andrewrk previously requested changes Apr 4, 2018

View reviewed changes

Added unicode encoding

39d92e8

BraedonWooding commented Apr 5, 2018

View reviewed changes

BraedonWooding added 2 commits April 5, 2018 12:08

Fixes an overflow bug (wow Zig is actually good at protecting against…

ec3194a

… this)

Basics of locale

714b89c

BraedonWooding mentioned this pull request Apr 5, 2018

Locale in Std #893

Closed

BraedonWooding added 4 commits April 5, 2018 13:14

Fixed up array

4793904

allowing slice size to be larger then 4

23d424c

Fixed bugs in unicode mostly type issues, and then generalised the it…

1f47341

…erator

Finished ascii example

48028ce

Updated init to be check the characters

ebca31a

BraedonWooding changed the title ~~String Utilities~~ Locale Module (focused on string functions) Apr 5, 2018

BraedonWooding added 2 commits April 5, 2018 20:34

Removed change in git ignore

66a0559

Mem.Split doesn't care about your type now

932eb6f

Hejsil reviewed Apr 6, 2018

View reviewed changes

BraedonWooding added 3 commits April 7, 2018 10:51

Locale can figure out if a codepoint is a number, letter, or whitespa…

1648192

…ce now

Unicode encoding now throws

1a75cdf

Added Utf8 Locale

15ccce7

BraedonWooding added 7 commits April 13, 2018 17:11

Merge branch 'master' of github.com:zig-lang/zig into StringUtils

2f027b7

Cleaned up and added docs

3639e7a

Name changes stuffed compile

dc3cfda

Tests both windows and posix joins

dd349fb

Fixed up term color error

5b8012b

This 'should' fix the issues with appveyor

f3bf4f6

removed const qualifier

b0ea58a

andrewrk previously requested changes Apr 13, 2018

View reviewed changes

BraedonWooding and others added 8 commits April 14, 2018 10:32

Merge branch 'master' into StringUtils

6fcbf6e

Fixup merge

2d2477a

Made requested changes

b2cbcb5

Fixed commit

ded5e5d

Removed 'std.string' from main.zig

8453786

Fixed up wrong call

c6ff5e4

Restructured it nicer, and removed the workaround with a proper solut…

a2269f9

…ion :)

Things use utf8 for splitting rather than ascii call

6e61be7

BraedonWooding added 3 commits April 14, 2018 20:53

Added error to possible errors in spawnError

6294dd4

typo

4dc263e

Fixed up a bug I've been meaning to fix for a while

17dc853

andrewrk previously requested changes Apr 16, 2018

View reviewed changes

Made requested changes

b8d1995

andrewrk closed this Apr 22, 2018

thejoshwolfe mentioned this pull request Jul 15, 2018

Add lower and upper functions #1241

Closed

Type Independent String Functions #891

Type Independent String Functions #891

Conversation

BraedonWooding commented Apr 4, 2018 • edited

What is this PR

What EXACTLY is in this PR

What this PR ISN'T

Checklist

Short Explanation of Views/Iterators

BraedonWooding commented Apr 4, 2018

Hejsil commented Apr 4, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

BraedonWooding commented Apr 4, 2018 • edited

Hejsil commented Apr 4, 2018

BraedonWooding commented Apr 4, 2018

andrewrk left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

BraedonWooding Apr 5, 2018 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tiehuis Apr 5, 2018 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Hejsil Apr 5, 2018 • edited

Choose a reason for hiding this comment

tiehuis Apr 5, 2018 • edited

Choose a reason for hiding this comment

BraedonWooding commented Apr 5, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

BraedonWooding Apr 9, 2018 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

BraedonWooding commented Apr 14, 2018

andrewrk left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

BraedonWooding commented Apr 4, 2018 •

edited

BraedonWooding commented Apr 4, 2018 •

edited

BraedonWooding Apr 5, 2018 •

edited

tiehuis Apr 5, 2018 •

edited

Hejsil Apr 5, 2018 •

edited

tiehuis Apr 5, 2018 •

edited

BraedonWooding Apr 9, 2018 •

edited

BraedonWooding Apr 20, 2018 •

edited