Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Type Independent String Functions #891

Closed
wants to merge 45 commits into from
Closed

Type Independent String Functions #891

wants to merge 45 commits into from

Conversation

BraedonWooding
Copy link
Contributor

@BraedonWooding BraedonWooding commented Apr 4, 2018

What is this PR

See here #912, this explains it in a deeper level.

What EXACTLY is in this PR

  • Adds 'endsWith' to mem; this is purely since there is a 'startsWith'
  • Moves 'join' to stringUtils and makes it type independent according to Type Independent String Manipulation #912
  • Adds AsciiView and AsciiIterator, and makes Utf8View and Utf8Iterator comply with the standards of Type Independent String Manipulation #912
  • Adds encode function to Utf8 (to go from codepoint to array of bytes)
  • Adds CodepointSequenceLength function to utf8 (to count length of bytes that a codepoint will be)
  • Restructured utf8 functions to not have the utf8 prefix as they exist within 'utf8' library now; such that you call them like utf8.View rather than utf8View should just make the code within it easier to understand :), this is similar to how its been done in Math
  • Added representation folder and corresponding index.zig, this is just since most likely we will have around 4 representations (utf8/16, ascii, and char* strings), note: you can access them through std without having to go through representations.

What this PR ISN'T

This PR is not going to be ANYTHING to do with LOCALE.

Furthermore this PR isn't going to edit any code to use Unicode string functions as that'll push it away from its focused use, it'll still add tests of course and I presume any new code will use unicode but any old code should continue to use Ascii (and will merge over to ascii views) as for example I don't know if Path.zig just pretends the bytes it gets are ascii but really are unicode or if it reports an error if you give it a unicode or whatever, and due to the size of path.zig along with the fact that Windows uses a special Utf16 version that should be a separate PR :).

Checklist

  • Comply Utf8View/Iterator to the View/Iterator standards for string manipulation (basically more functions and some variables have to be named a certain way).
  • Build AsciiView
  • Expand trim to allow specifying side
  • Expand IndexOfPos and its similar forms to allow specifying choosing the highest index that matches or the lowest (though currently it doesn't go back to front if choosing highest, which I think it probably should but I didn't want to add a ton of conditions, though I think I'll make highest a comptime variable then I'll be happy adding more conditions)
  • Move join to stringUtils.zig, this is due to the fact that it explicitly requires strings it isn't for memory joining; i.e. as the description says its for joining strings!
  • Make trim for strings (i.e. string version independent)
  • Cleanup and fix bugs

Short Explanation of Views/Iterators

Views are just a way to view a memory in two ways simultaneously, both as raw data and as what it represents for example utf8 you can index the raw data or the code points it represents (which are u32 instead of u8 for example), you can of course do other actions like comparing views (if same type of course) and slicing/indexing both code points and raw data. The only function that can return an error is view.init which should perform a validity check on the data (there is also an initUnchecked function) as when indexing data there is no reason why it should be invalid.

Iterators just allow you to cycle through a view, they allow you to obtain the 'nextBytes' or the 'nextCodepoint' representing either the next set of bytes or the actual code point for some iterators they are the same thing for example Ascii which will always return a byte array of size 1 for nextBytes (though of course it'll return it as an array rather than an individual element to fit the spec of iterators), for that reason you should commonly use nextCodepoint as that is more efficient in cases like Ascii and the reason you are using iterators is to get the functionality of turning the raw unicode data into code points :). There are no error throwing functions in iterators, this is to make your loops nicer and your code much nicer :).

Note: When I saw no error throwing functions I mean that those functions don't have an opportunity to throw because of the way they were designed; that is IF the data is valid (checked by the initial init) then all the functions won't throw. If that data isn't valid then that is undefined behaviour, and that is why using initUnchecked is typically bad behaviour and you should rather use try init :).

@BraedonWooding
Copy link
Contributor Author

I'll style guide it up sometime soon :).

@Hejsil
Copy link
Sponsor Contributor

Hejsil commented Apr 4, 2018

A few of these seem very simular to function from std.mem

  • find_stris very simular to the mem.indexOf family.
  • strip is very simular to mem.trim.
  • starts_with is very simular to mem.startsWith.
  • split takes a different approuch to mem.split, so maybe it should be a wrapper to that instead.
  • str_eql is very simular to mem.eql.

I don't think we need the same functions in two places. As for the rest, I like the idea of having an ascii.zig, just like we have unicode.zig. After all, a string is just bytes. What the string means is a matter for encoding. Functions like to_upper, is_num and all those could go in that.

return byte >= 'A' and byte <= 'Z';
}

test "String_Utils" {
Copy link
Sponsor Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend splitting this into some smaller tests like:

test "string_utils.is_ascii_letter" {
    assert(is_ascii_letter('C'));
    assert(is_ascii_letter('e'));
    assert(!is_ascii_letter('2'));
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes of course, I just wanted to get a basic testing suite up first :).

@BraedonWooding
Copy link
Contributor Author

BraedonWooding commented Apr 4, 2018

find_str handles both lowest and highest index which mem.indexOf doesn't, strip handles the ability to just do left or right side, or both where mem.trim doesn't. Split is well not lazy so yes perhaps it does, and str_eql calls mem_eql :), it is just there to replace mem.str_eql (I moved it as per my description).

So overall the only function that really doesn't need to exist is starts_with which I just added due to ends_with existing. Would it be better for me to port my methods over to memory instead of having a separate string utility class? That is add the non-lazy method, the trimming either side, the highest/lowest indexOf choice, and the str_eql (since its used a lot for hashmaps) over to memory?

But still keep the other 'ifX' functions in string utilities?

@Hejsil
Copy link
Sponsor Contributor

Hejsil commented Apr 4, 2018

I think that is better. After all, trimming, finding and splitting is not unique to strings. Unicode strings would probably need their own versions of some of these (if that is useful).

@BraedonWooding
Copy link
Contributor Author

Yes, I plan to do that separately :). I would say it is useful? Though we need a sort of locale sorted first.

Copy link
Member

@andrewrk andrewrk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! I'm a little worried that this leads us down a path of misuse of strings and unicode. See #234

.gitignore Outdated
@@ -1,3 +1,4 @@
zig-cache/
build/
build-*/
.vscode/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #888

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough I'll remove it :), I shouldn't be calling 'git add .' anyways.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you should be able to use git add .. Everything to ignore falls into 2 categories: system-wide ignore, or project-specific ignore. If both are correct then git add . works just fine.

std/index.zig Outdated
@@ -30,6 +30,7 @@ pub const rand = @import("rand/index.zig");
pub const sort = @import("sort.zig");
pub const unicode = @import("unicode.zig");
pub const zig = @import("zig/index.zig");
pub const stringUtils = @import("string_utils.zig");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The decision to put these things in something called "memory" rather than "string" is intentional.

A "string" would be something like []u32, where each item is a unicode point. Zig std lib does not have a string data type yet - and it looks like this PR does not introduce one. Although we do have some utf8 decoding and encoding code in std/unicode.zig.

String utilities would be a welcome addition to the zig standard library, but they should operate on unicode points, not encoded bytes. Utilities that operate on encoded bytes should be clear about that, using parameter names such as "bytes" and explaining the difference between encoded strings and actual strings in the docs.

return byte >= 'a' and byte <= 'z';
}

pub fn is_ascii_upper(byte: u8) bool {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fairly confident that we should not have asciiUpper and asciiLower in the standard library for these reasons:

  • it's too easy to misuse, when actual strings should be used instead
  • if it really is the odd case where you need ascii upper/ascii lower, writing out the function body is easy enough. it should be enough tedium to get people to consider using the actual string functions.

Do you have an actual use case for ascii upper/lower right now?

Copy link
Sponsor Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the randomizer I use a few ascii functions(They take &const u8 because I use them as fn ptrs to generic functions), because I know the strings I get from the nds headers are ascii. I don't mind not having it in std though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, that's fair. Maybe it's not so bad to have them. We should definitely put a bunch of noticeable warning signs near them and avoid the use of the word "string".

Copy link
Sponsor Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They could live in ascii.zig. Then you sign up for ascii problems by importing that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ok with that.

Copy link
Sponsor Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note that "ascii" is different from []u8. ascii requires the top bit of every byte to be 0, so if you're doing ascii string processing, you might consider validating that top bit.

those ascii functions you linked are correct, because they check for specific ranges of values, but something like ascii_to_upper() should perhaps assert the top bit is 0.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One could also just check if it was within 127? That avoids the bit masking operations (which are more expensive and remove clarity)? But yes it probably does have to be checked, though I would rather verify data through the locale viewer (which would check just once) rather than an indeterminate amount of times :).

std/unicode.zig Outdated
// Is made up of one byte
// Can just add a '0' and then the code point
// Thus can just output 'c'
var result: [1]u8 = undefined;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andrewrk, @Hejsil
This is an idiom that I've done a lot since returning arrays from functions ain't pretty due to the arrays being allocated within the function stack, is there a better way to get an array output, and furthermore is there a better way to assign a slice to a new slice size without doing this 'trick'.

In most languages there is a syntax for creating a new array on the stack and assigning its value to something else, currently this requires a hidden memcpy presumably; which I think we could optimise away.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is returning stack-allocated data from within the function to the caller and isn't valid.

Consider the following example:

const std = @import("std");
const warn = std.debug.warn;

fn example(out: &[]const u8, s: u8) void {
    var a: [4]u8 = undefined;
    a[0] = s;
    a[1] = s + 1;
    a[2] = s + 2;
    a[3] = s + 3;
    *out = a;
}

pub fn main() void {
    var a: []const u8 = undefined;
    example(&a, 0);

    warn("a: "); for (a) |e| { warn("{} ", e); } warn("\n");

    var b: []const u8 = undefined;
    example(&b, 1);

    warn("b: "); for (b) |e| { warn("{} ", e); } warn("\n");
    warn("a: "); for (a) |e| { warn("{} ", e); } warn("\n");
}

Gives the output:

$ zig run example.zig
a: 0 1 2 3 
b: 1 2 3 4 
a: 1 2 3 4

That is the best case that is happening, there are no guarantees about the data returned since everything within the function call is out of scope on exit unless it was dynamically allocated.

I would suggest instead taking a raw slice and simply asserting that the user has provided a sufficiently sized buffer. This allows the user to use their own buffer which they may want and provides the most flexibility.

Fixing up that previous example:

const std = @import("std");
const warn = std.debug.warn;

fn example(out: []u8, s: u8) void {
    std.debug.assert(out.len >= 4);
    out[0] = s;
    out[1] = s + 1;
    out[2] = s + 2;
    out[3] = s + 3;
}

pub fn main() void {
    var a: [4]u8 = undefined;
    example(a[0..], 0);

    warn("a: "); for (a) |e| { warn("{} ", e); } warn("\n");

    var b: [4]u8 = undefined;
    example(b[0..], 1);

    warn("b: "); for (b) |e| { warn("{} ", e); } warn("\n");
    warn("a: "); for (a) |e| { warn("{} ", e); } warn("\n");
}

If you really wanted to return fixed data you could create a new struct and wrap the array. This requires bounding the array size to the maximum needed. This is less flexible and I wouldn't implement it this way.

const std = @import("std");
const warn = std.debug.warn;

const Encoded = struct { out: [4]u8 };

fn example(s: u8) Encoded {
    var a: Encoded = undefined;
    a.out[0] = s;
    a.out[1] = s + 1;
    a.out[2] = s + 2;
    a.out[3] = s + 3;
    return a;
}

pub fn main() void {
    const a = example(0);

    warn("a: "); for (a.out) |e| { warn("{} ", e); } warn("\n");

    const b = example(1);

    warn("b: "); for (b.out) |e| { warn("{} ", e); } warn("\n");
    warn("a: "); for (a.out) |e| { warn("{} ", e); } warn("\n");
}

Copy link
Member

@tiehuis tiehuis Apr 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To return the length I'd just return it as the return value.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only problem with it is that this is used for encoding bytes back to a byte[] which means the size isn't known prior for example, this means that it gets a bit finicky. The way I think may be better for this unknown size thing could either be requiring them to run a function to get how many bytes the code point will be OR;

fn example(s: [4]const u8) []const u8 {
    // ... do stuff with 's'
    return s[0..sizeWanted];
}

Perhaps? Or just result to a full buffer based system and return the length of that byte? Giving them the opportunity to either slice it or leave it? Tricky problem this one is.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, I don't like the struct way; it feels icky.

Copy link
Sponsor Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could do this, and get a static guarantee that the buffer is a certain size:

fn example(out: &[4]u8, s: u8) void {
    (*out)[0] = s;
    (*out)[1] = s + 1;
    (*out)[2] = s + 2;
    (*out)[3] = s + 3;
}

Related: #863

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oooh I do like that too.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One issue with requiring an array of a specific size is that you are forcing the caller to use an array of this type. The caller will need to memcpy the bytes if they want to move the bytes into any other sized array. The static guarantees are nice, but if the data is copied anywhere else you will be using a slice interface anyway so will simply be performing the implied check later down the track.

Also, does anywhere else in std take a fixed array argument?

Copy link
Sponsor Contributor

@Hejsil Hejsil Apr 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is why I want #863, which will allow slice -> array pointer casts.

It is true that the check has to be performed some other place than the call, but isn't this the same as the maybe type. Giving you a static guarantee that something is true, requiring the caller to ensure the truth.

Copy link
Member

@tiehuis tiehuis Apr 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose the main point I'm trying to say is that since a slice is a view into a backing array so in my view it doesn't make sense to limit the output type to an array of a specific size, when you could use a slice and take an arbitrary sized array.

Your proposal looks nice and handles exactly this scenario. For now, I'd stick with taking a slice and if down the track we get better guarantees we can update any of these occurences.

@BraedonWooding BraedonWooding mentioned this pull request Apr 5, 2018
@BraedonWooding
Copy link
Contributor Author

@thejoshwolfe Pinging you because I know you don't think we should do locales :). If you look at locale.zig and unicode.zig (I just made a few minor improvements) in this PR (ignoring string_utils.zig as that will disappear I'm just keeping it around as I'll reflect some of the nice options of it like finding highest index over to memory). This is kinda the intention of the scope, it is purely based around unicode and not trying to handle localisation. It can also handle things like time/number formatting of course but that's again around what I would call unicode. Most countries would fit a blanket toUpper and toLower and some would need a custom, at minimum this would be quite effective for atleast the english language as it does allow you to insert whatever type you want, though I would cut down on the formatter and make it just english. I however feel that we could cover a series of similar languages with it atleast?

@BraedonWooding BraedonWooding changed the title String Utilities Locale Module (focused on string functions) Apr 5, 2018
std/unicode.zig Outdated
@@ -155,6 +156,18 @@ pub const Utf8View = struct {
return initUnchecked(s);
}

pub fn eql(self: &const Utf8View, other: &const Utf8View) bool {
Copy link
Sponsor Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think mem.eql should be used when you don't care about encoding and only about memory, while something like unicode.eql should take into account that a "letter" can be a combination of many unicode code points, and that the order of these code points, don't always matter.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not realise that, I'll look into it and build a proper solution :0

std/unicode.zig Outdated
return mem.eql(u8, self.bytes, other.bytes);
}

pub fn slice(self: &const Utf8View, start: usize, end: usize) !Utf8View {
Copy link
Sponsor Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would expect that slicing a unicode string would be from start_code_point to end_code_point, and not just slice the memory and validate that the substring still is valid.

Copy link
Sponsor Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But then again, the string could be big so maybe this is better

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that check is not needed :), as said in my other response. This slice will be renamed because that is a valid point it mainly is used in my split function for example which doesn't need a concept of end points; how about sliceRaw? I'll do a proper slice one since I can just check the first bit to detect when to iterate the index though it'll be max O(N) { N := the raw point corresponding to the code point index } just to detect the range which is unavoidable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Somehow I missed your second point, it isn't always needed :). So I'll probably add it as 'sliceRawUnchecked' for example

std/mem.zig Outdated
.split_bytes = split_bytes,
};
}
pub fn split(comptime viewType: type, comptime iterator_type: type, comptime baseType: type) type {
Copy link
Sponsor Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the things in mem should use these "string iterators". When I use mem.split I only care about memory, not encoding, views, interators and all that.

Copy link
Sponsor Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The next fn in here is a good reason, why mem.split shouldn't use these iterators. Iterating over memory can't fail with an error, because the elements don't need to be validated to decoded.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree with both points because of a few reasons :).

  1. We need split in a few cases (just in std); first for string splitting in windows which is Utf16 with a few weird oddities, with my example you can just create a Utf16Win view and corresponding iterator and you don't have to create a unique function for it :), we also need a generic Utf8 one for 'general use' and a raw Ascii one for memory or when you know your just dealing with bland non encoded data.

Thus if we just pretend that iterators and views don't exist it becomes combersome :).

  1. You generally don't have to care about views and iterators :) as shown in the test example there exists a type under stringUtils called string_it which maps the iterators and views for you :). That is instead of calling split("A/B/C/", "/") you call stringUtils.string_it.init("A/B/C/", "/"), currently 'next' returns a view however I'm also gonna add a function that returns the raw type that you specify which in the case of the string one is []const u8, this should solve your issue of views as you'll never need to touch one. Views are typically a good thing as they generalise.

  2. Your second statement is also false, Iterators don't EVER fail with an error :), just look at both the Utf8 Iterator and Ascii Iterator they have type ?[]u8 not !?[]u8 because of course they don't fail, HOWEVER the views can fail if you give them invalid data, I can change it to just set the data regardless however this would then ensure that the data you originally give it has to be valid which it doesn't necessarily have to be :), however technically you are right next doesn't have to be error checked since I can validate that the slice can use initUnchecked.

So overall I can make next() not return an error ever and it kinda is just due to me wanting to be 100% safe and that kinda eroded the practicality of it :). So with all that said I don't particularly get why you don't like it?

Copy link
Sponsor Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My only real problem is that I don't think it should be in mem. Memory splitting and string splitting is different. When we split memory, there is no concept of valid code points, encoding and all that. Therefore I think the interface should be simpler.

I agree, that for splitting paths, the string splitting functions you provide should be used, as paths are strings. However, there are cases where you just have some memory ([]u8, []16, []32, []64) and want to split on some values. Btw, stringUtils.string_it only works on ascii, so you can't split any arbitrary sequence of bytes.

Copy link
Sponsor Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess you could provide a memory view and iterator though, and that might work out. I really think split should not be able to fail not next for these though :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeh it won't fail for next I already finished that, currently it can only fail upon initialisation of the iterator. Also you can get it to return the raw data instead of a view.

And yes it only works for ascii that is just due to me not bothering naming it properly. I think I'll name it stringUtils.split_ascii_it, and have stringUtils.split_it for Utf8 and have stringUtils.split_win_it for the windows special Utf16 (that is super annoying why can't windows do anything right lel).

Again I completely agree next will not fail for anything after my next set of changes :).

One last thing; so you would prefer mem.split to use the old system and have the new split exist under stringUtils :)?

Really appreciate you having an in-depth look at my code by the way, your points are really valid and are helpful 😄 .

Copy link
Sponsor Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so. I think mem should only concern itself with slices of data. Not a more complicated iterator/view system. Ofc, I don't have the final say, as I'm not a "designer" of Zig or it's std library.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One note: paths are not strings in linux. They are null-terminated arrays of bytes. You can have invalid utf-8 as a linux file path.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yes I'll make sure that it allows for null-terminated arrays of bytes :). Pretty sure that path sanitises it to a slice anyway but I'll double check (then converts it back at the end).

@andrewrk andrewrk dismissed their stale review April 9, 2018 04:22

there have been changes

CMakeLists.txt Outdated
@@ -452,6 +452,10 @@ set(ZIG_STD_FILES
"index.zig"
"io.zig"
"linked_list.zig"
"string_utils.zig"
"representations/utf8.zig"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

representations is an ambiguous word. it asks the question, "representing what?"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh fair, I'll work on a better name :).

build.zig Outdated
@@ -55,8 +56,8 @@ pub fn build(b: &Builder) !void {
addCppLib(b, exe, cmake_binary_dir, "zig_cpp");
if (lld_include_dir.len != 0) {
exe.addIncludeDir(lld_include_dir);
var it = mem.split(lld_libraries, ";");
while (it.next()) |lib| {
var it = try string.asciiSplit(lld_libraries, ";");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it looks to me that every place you replaced mem.split with asciiSplit is incorrect. Here we are quite intentionally splitting on utf8-encoded bytes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes :), mem.split only worked on ascii bytes so I've replaced it to only work with ascii then I was aiming to go through and replace it with utf8 where needed :).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also I want to note that the fact you can just change it to string.utf8Split(...) and you suddenly use unicode correctly kinda shows how 'cool' it is.

std/os/path.zig Outdated

pub const sep_windows = '\\';
pub const sep_posix = '/';
pub const sep_windows : u8 = '\\';
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why add u8 ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't gone through and removed little needless changes sorry.

std/os/path.zig Outdated
} else {
return joinPosix(allocator, paths);
}
// Currently a work around for the expansion to allow string separators
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std.os.path.join as it is currently implemented is correct, fully implemented, finished, and passes all tests. Why are we adding workarounds for anything?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just converting a character into a string since the new join takes a string separator since I found it rarer to want a character than a string i.e. ", " is very common. If you know of a way to convert a character into a string without this little workaround then I would be overjoyed :).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, I've decided to have a function to handle that, it literally does just convert it into an array like []u8 { sep } so functionally its the same but atleast its a bit nicer :). So you can just call string.joinCharSep() and give a single character sep rather than a string :).

@BraedonWooding
Copy link
Contributor Author

@andrewrk I don't have much else I want to do with this PR, so it comes down to whether or not you want me to keep the type invariant stuff or make it JUST utf8. I'm happy to change it over, however I still stand by my original view that its more helpful to have it then it is to remove it :).

@andrewrk andrewrk dismissed their stale review April 15, 2018 17:28

I'll have another look

Copy link
Member

@andrewrk andrewrk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's discuss the best way to solve this use case in #912

@@ -55,8 +56,8 @@ pub fn build(b: &Builder) !void {
addCppLib(b, exe, cmake_binary_dir, "zig_cpp");
if (lld_include_dir.len != 0) {
exe.addIncludeDir(lld_include_dir);
var it = mem.split(lld_libraries, ";");
while (it.next()) |lib| {
var it = try string.utf8Split(lld_libraries, ";");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't changed my mind about this since the last PR comments

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need more clarification about what specifically you dislike :). Is it the call name ? Like is it just because it is not called string.split or is it the fact it uses generics behind the scene?

Again i'm happy to remove the type independent stuff if need be :)

@@ -309,7 +310,7 @@ const Node = union(enum) {
const Toc = struct {
nodes: []Node,
toc: []u8,
urls: std.HashMap([]const u8, Token, mem.hash_slice_u8, mem.eql_slice_u8),
urls: std.HashMap([]const u8, Token, string.hashStr, string.strEql),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this hash map is hashing bytes, not strings.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this shows a problem with the way its done because the hash_slice_u8 for example and hashStr would be identical and that leads to a singular name meaning two different things, I guess how would you approach this because in situations where it is hashing strings not bytes it leads to this same problem? I would either declare aliases or just live with one of them.

std/mem.zig Outdated
@@ -130,6 +130,16 @@ pub fn copy(comptime T: type, dest: []T, source: []const T) void {
for (source) |s, i| dest[i] = s;
}

pub fn copyRange(comptime T: type, dest: []T, destStart: usize, destEnd: usize, source: []const T, sourceStart: usize, sourceEnd: usize) void {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead, callers should use slicing, and then call mem.copy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I used that but then I realised slicing is soo much easier then just slipped my mind to remove it, even when you go through all your commits it is hard to find little things like this haha.

std/mem.zig Outdated
/// Remove values from the beginning and end of a slice.
pub fn trim(comptime T: type, slice: []const T, values_to_strip: []const T) []const T {
pub fn trim(comptime T: type, slice: []const T, values_to_strip: []const T, side: Side) []const T {
// Asserting
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is this comment communicating?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That I forgot to remove it :P

std/mem.zig Outdated
pub fn trim(comptime T: type, slice: []const T, values_to_strip: []const T) []const T {
pub fn trim(comptime T: type, slice: []const T, values_to_strip: []const T, side: Side) []const T {
// Asserting
assert(side == Side.LEFT or side == Side.RIGHT or side == Side.BOTH);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is guaranteed by the type system. any time a type is casted to an enum there is a runtime safety check.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh sweet, still learning things

@@ -63,9 +64,11 @@ pub const ChildProcess = struct {
PermissionDenied,
InvalidUserId,
ResourceLimitReached,
InvalidCharacter,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these should not be errors that are possible for ChildProcess.spawn to emit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why? They are possible now due to it now using Utf8? In saying that it should only return one of them, I believe InvalidCharacter, so I'll remove the other one.

const ascii = std.string.ascii;
const utf8 = std.string.utf8;

/// Returns an iterator that iterates over the slices of `buffer` that are not
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these docs are incorrect

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only incorrect bit is that it references 'bytes' :)

@BraedonWooding
Copy link
Contributor Author

@andrewrk I fixed all the changes requested BUT the hash situation and the split one. This is because I need clarification on those two things :). I will be happy to make the changes once I get clarification.

@andrewrk andrewrk dismissed their stale review April 20, 2018 15:07

I'll resolve this PR this weekend, one way or another

@andrewrk
Copy link
Member

The stdlib string data type and unicode handling is extremely important. We need to get it right. And how we do it is going to be affected by #130 and #770. It's not time to introduce a string type yet.

This is an area where it's difficult to contribute to. Thank you for putting effort into this change; you've given the zig project some things to think about.

For now, zig's stance on strings is that we deal with encoded UTF-8 bytes. The standard library does not yet have any concept of a string, although it provides some unicode encoding and decoding facilities. When zig becomes more mature, we will solve this problem.

As always, if someone is working on a project using zig and runs into a use case where this would have solved the problem, please open an issue and we can figure out short-term and long-term solutions.

@andrewrk andrewrk closed this Apr 22, 2018
@thejoshwolfe
Copy link
Sponsor Contributor

From the OP:

  • Adds 'endsWith' to mem; this is purely since there is a 'startsWith'
  • Adds encode function to Utf8 (to go from codepoint to array of bytes)

These sound valuable regardless of the rest of this PR. You could probably make a PR for each one.

@BraedonWooding
Copy link
Contributor Author

Fair enough, thanks for having a look :).

@thejoshwolfe I'll do a smaller PR with just a few of the key changes like those :) (as well as ones like codePointSequenceLength, as well as like allowing to trim from only one side and so on).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants