Data processor block/upgrade. #1077

Closed
Kubuxu opened this issue Apr 17, 2015 · 19 comments

Comments

@Kubuxu
Contributor

Kubuxu commented Apr 17, 2015

Data processor would allow advanced data processing (pun not intended).
Its main purpose would be to provide functions for bulk data storage or lower level functions on data.

Proposed functions:

  • tobytes(data1: Primitive[, data2: Primitive[, ...]]) -> ByteArray, format: String - Returns the byte-encoded primitives. Throws an error if there is more than one argument and one of them is a string containing a null byte.
  • tobytesf(format: String, ...) -> ByteArray, format: String - Accepts a format string and a variable number of arguments. Returns a ByteArray and the same format string for consistency with
  • frombytes(format/type: String, bytes: ByteArray) -> Primitive (of type 'type') or ..., depending on whether a type or a format was specified - Returns the data decoded from bytes.
  • Format String uses single characters to denote Lua's basic types:
    • d: for double
    • b: for boolean
    • s: for string (String can't contain 0 byte.)
    • r (like raw): for byte array. r is required to be followed by decimal number of bytes.
  • deflate(bytes: ByteArray) -> ByteArray - applies DEFLATE compression.
  • inflate(bytes: ByteArray) -> ByteArray - applies INFLATE decompression.
  • endianness() -> String - returns "big" in case of big-endian systems and "little" in case of little-endian systems.
  • encode64(bytes: ByteArray) -> ByteArray - returns bytes encoded to base64.
  • decode64(bytes: ByteArray) -> ByteArray - returns bytes decoded from base64.

Almost all of those functions can be implemented from inside of Lua but the implementation is never perfect and is usually very slow; Example.
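For illustration, here is a minimal pure-Lua sketch of what an encode64 along the lines proposed above might look like. It is not taken from the linked example or from OpenComputers; the name ALPHABET and the use of plain arithmetic instead of bit operations are assumptions chosen so that it runs on both Lua 5.2 and 5.3. It follows the standard Base64 alphabet with "=" padding.

```lua
-- Standard Base64 alphabet; the index of a character is its 6-bit value.
local ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

-- Hypothetical pure-Lua counterpart of the proposed encode64(bytes).
local function encode64(bytes)
  local out = {}
  for i = 1, #bytes, 3 do
    -- Near the end of the input, b and/or c may be nil.
    local a, b, c = bytes:byte(i, i + 2)
    -- Combine up to three bytes into one 24-bit number; missing bytes count as 0.
    local n = a * 65536 + (b or 0) * 256 + (c or 0)
    -- Split the 24-bit number into four 6-bit indices.
    local i1 = math.floor(n / 262144) % 64
    local i2 = math.floor(n / 4096) % 64
    local i3 = math.floor(n / 64) % 64
    local i4 = n % 64
    out[#out + 1] = ALPHABET:sub(i1 + 1, i1 + 1)
      .. ALPHABET:sub(i2 + 1, i2 + 1)
      .. (b and ALPHABET:sub(i3 + 1, i3 + 1) or "=")
      .. (c and ALPHABET:sub(i4 + 1, i4 + 1) or "=")
  end
  return table.concat(out)
end

print(encode64("foobar"))
```

Every input byte goes through several interpreted arithmetic and string operations here, which is the kind of overhead a native implementation would avoid.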

EDIT2: Edited the main part of post to incorporate changes.

@dgelessus
Contributor

General question: why an extra component for this? For the codec/compression methods that would make sense, but data loading/dumping from/to bytes should be part of the base API.

  • tobytes/frombytes: Those would be very useful indeed. Integers are easy enough to load/dump with some bit shifting, but floating-point numbers are a pain. However I'd suggest that these functions should be able to load/dump multiple values based on a format string, similar to Python's struct.pack/unpack.
  • dataoutputstream/datainputstream: I'm not sure what exactly these would be used for... are these meant for dumping/loading more than one value?
  • endianness: IMO not very useful. If anything OC computers should have a standardized endianness, or the frombytes/tobytes functions should allow specifying an explicit endianness.
  • encode64/decode64: I might be wrong, but I wouldn't expect these to be much faster in Java/Scala than in Lua. And if Base64 would be supported, so should Base16, Base32, Ascii85 and uuencode.
  • deflate/inflate: These would perhaps have a more noticeable speed difference between Java/Scala and Lua. Again, this is just one of many compression algorithms; for completeness there should also be zip, gzip, cpio, bzip2 and lzma support.
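As a rough sketch of the "integers are easy enough" point, an unsigned 32-bit integer can be dumped and loaded in pure Lua with an explicit byte order, so the host's endianness never matters. The helper names pack_u32 and unpack_u32 and the "big"/"little" argument are hypothetical, not part of any existing API; plain arithmetic is used instead of bit32 so the code runs on Lua 5.2 and 5.3.

```lua
-- Hypothetical helper: encode an unsigned 32-bit integer as 4 bytes.
-- endian is "big" or anything else for little-endian (an assumed convention).
local function pack_u32(n, endian)
  assert(n >= 0 and n < 2^32 and n == math.floor(n),
    "expected an unsigned 32-bit integer")
  local b0 = n % 256
  local b1 = math.floor(n / 256) % 256
  local b2 = math.floor(n / 65536) % 256
  local b3 = math.floor(n / 16777216) % 256
  if endian == "big" then
    return string.char(b3, b2, b1, b0)
  end
  return string.char(b0, b1, b2, b3)
end

-- Hypothetical helper: decode the first 4 bytes of s.
local function unpack_u32(s, endian)
  local a, b, c, d = s:byte(1, 4)
  if endian == "big" then
    a, b, c, d = d, c, b, a
  end
  return a + b * 256 + c * 65536 + d * 16777216
end

print(unpack_u32(pack_u32(1234567, "big"), "big"))
```

Floating-point values have no comparably short equivalent, which is why a native tobytes/frombytes is more valuable for them.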

@fnuecke
Member

fnuecke commented Apr 17, 2015

I'd like this to be an extra component to keep the base API small. Changing the base API also potentially breaks persistence (i.e. will force computers to reboot when loaded from an old save), so I'd like to avoid it for that reason, too.

@Kubuxu
Contributor Author

Kubuxu commented Apr 17, 2015

@dgelessus

  • tobytes/frombytes: The problem with the change you suggest is that a \0 byte is perfectly valid in a Lua string, so you can't use it as a string terminator. This means that you would have to prefix every string with a byte count. I wanted to keep tobytes and frombytes as simple as possible.
  • dataoutputstream/datainputstream: This doesn't mean that I don't like the idea of a format string. It is a cool and more Lua-like way of doing this (too much Scala and Java to think this way). My proposal is to introduce tobytef and frombytef, which would work on format strings. There would be a distinction between String and ByteArray. Another change would be to let tobytes accept a variable number of arguments, throw an error if one of the arguments is a string containing a null byte, and return, apart from the byte array, a format string for those arguments as a second result.
  • endianness: Lua has no defined endianness, so it is impossible to choose one. Even OC computer/Eris has no defined endianness (if you move a save from a little-endian computer to a big-endian computer, save states will break).
  • encode64/decode64 are nice to have wherever you work with raw byte arrays. Those are unparsable by Java's String parsers.
  • deflate/inflate: I chose those two for three reasons: 1. They are implemented in Java as native libraries, so they are extremely fast. 2. They are the simplest, and implementations in pure Lua with somewhat reasonable speed are possible (and available). 3. They are very efficient processor- and memory-wise. We don't want a worker thread to be stuck in a Java method.

These proposals are only a base. With a separate component, further additions are easily possible.

@fnuecke this is why I proposed a separate component with more functions.

@dgelessus
Contributor

Maybe tobytes should accept both forms. If given a single argument, that object is dumped in some default form. If there are multiple arguments, the first one is a format string and all others are the values to be dumped.

frombytes always needs an extra parameter to tell it in what format the input bytes are. This could accept both Lua type names and format strings, since boolean, number, string and table are not valid format strings. When loading more than one value, strings should always require an explicit length. That way single strings can always be dumped/loaded without needing a format string, and for multiple strings there's no need to rely on null bytes.

Regarding endianness, there is no situation in which it would be necessary to know the real computer's endianness. None of the functions usable from Lua manipulate bytes in a way that endianness matters. The data processor component would be the only exception, and its default endianness should be consistent, no matter what the host computer uses.

@fnuecke
Member

fnuecke commented Apr 17, 2015

@Kubuxu I know you know ;-) My post was in reply to @dgelessus's question why it should be an extra component.

@Kubuxu
Contributor Author

Kubuxu commented Apr 17, 2015

It is necessary when communicating over a low-level internet protocol. There is also bit32.bswap.

There will be no table in tobytes. Tables are complex structures w/o any specific schema.

Look at my edit of the initial post: I proposed that tobytes returns a format string. It would make the API easier to use, but I will reverse the order of arguments in frombytes to incorporate your proposal.

@MyNameIsKodos
Contributor

If you need a texture, I may or may not have this lying around in a spritesheet somewhere. Not sure if it's compatible or not though.

Edit: Looks like an easy enough fix.

@dgelessus
Contributor

@Kubuxu I didn't mean that tobytes should return a format string, it should accept one. Otherwise it's impossible to specify e. g. how a number should be dumped (char, short, int, long, float, double, signedness, endianness). Returning a format string isn't a bad idea though... How about this:

  • tobytes(...) -> format: string, data: string - Takes any number of objects. Returns a format string to unpack the objects, plus the packed objects as a byte array (string).
  • tobytesf(format: string, ...) -> data: string - Takes a format string and any number of additional objects. Returns a byte array (string) containing all objects packed according to the format string.
  • frombytes(format: string, data: string) -> ... - Takes a format string and a byte array (string) containing packed objects. Returns all objects from the byte array, unpacked according to the format string. Instead of a format string, the type names "boolean", "number", or "string" may be given. In this case the byte array is assumed to hold a single boolean, floating-point or string value, respectively.

A few examples below. (There's no need to understand what the "data" strings mean, they are just the packed binary data for the values.)

-- Un/packing a string
format, data = tobytes("string with a \0 byte")
print(format, data) --> 20s    string with a \0 byte
print(frombytes(format, data)) --> string with a \0 byte

-- Packing a double
format, data = tobytes(math.pi)
print(format, data) --> d     \024-DT\251!\009@

-- Packing multiple values
format, data = tobytes(42, 24, "string", false)
print(format, data) --> 2b6s?    *\024string\000
print(frombytes(format, data)) --> 42    24    string    false

-- Packing using a format string
data = tobytesf("5si6s", "Hello", 1234567, "World!")
print(data) --> Hello\135\214\018\000World!\000
print(frombytes("5si6s", data)) --> Hello    1234567    World!

@natedogith1
Contributor

for tobytes/frombytes are y'all looking for something like Lua 5.3's string.pack and string.unpack (http://www.lua.org/manual/5.3/manual.html#pdf-string.pack) ?
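For reference, a short sketch of how the Lua 5.3 functions behave, assuming a Lua 5.3 (or later) interpreter. The format string "<zi4z" is chosen here to mirror the examples earlier in the thread: "<" forces little-endian, "z" is a zero-terminated string and "i4" is a 4-byte signed integer.

```lua
-- Requires Lua 5.3 or later.
local fmt = "<zi4z"
local data = string.pack(fmt, "Hello", 1234567, "World!")

-- string.unpack returns the values plus the index of the first unread byte.
local a, n, b, next_pos = string.unpack(fmt, data)
print(a, n, b, next_pos)

-- Endianness is explicit in the format, independent of the host machine.
local big = string.pack(">i4", 1234567)
local little = string.pack("<i4", 1234567)
print(#big, #little, big == little:reverse())

-- Doubles, the painful case in pure Lua, are a single option.
local pi_bytes = string.pack("<d", math.pi)
print(#pi_bytes, string.unpack("<d", pi_bytes))
```

Note that "z" cannot carry strings containing a \0 byte; for those, "s" (length-prefixed) or "cn" (fixed size n) would be the options to reach for.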

@Kubuxu
Contributor Author

Kubuxu commented Apr 18, 2015

I must say that people from Lua thought it out great. What about:

  • pack and unpack from Lua5.3 specification (I can write this).
  • and tobytes or autopack or packauto which accepts variable number of arguments and returns them packed in most expanded form, returns data, fmt. Data is first as sometimes you don't even want to see the fmt.
    Then you would do:
-- Un/packing a string
data, format = tobytes("string with a \0 byte")
print(format, data) --> c20    string with a \0 byte
print(unpack(format, data)) --> string with a \0 byte

-- Packing a double
data, format = tobytes(math.pi)
print(format, data) --> n     \024-DT\251!\009@

-- Packing multiple values
data, format = tobytes(42, 24, "string", false)
print(format, data) --> nnc6b    *\024string\0
print(frombytes(format, data)) --> 42    24    string    false

-- Packing using a format string
data = pack("zi6z", "Hello", 1234567, "World!")
print(data) --> Hello\0\135\214\018\000World!\0
print(frombytes("zi6z", data)) --> Hello    1234567    World!
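The autopack idea above could be sketched on top of Lua 5.3's string.pack roughly as follows. This is only an illustration under stated assumptions: the function name autopack, the choice of "j"/"n" for integers/floats, "cN" for strings and a single "B" byte for booleans are all hypothetical and not an agreed design. It needs Lua 5.3 or later.

```lua
-- Hypothetical autopack: packs its arguments and returns data, fmt.
-- Requires Lua 5.3 or later (string.pack, math.type, table.unpack).
local function autopack(...)
  local parts, values = {}, {}
  for i = 1, select("#", ...) do
    local v = select(i, ...)
    local t = type(v)
    if t == "number" then
      -- "j" is a lua_Integer, "n" is a lua_Number (float).
      parts[i] = math.type(v) == "integer" and "j" or "n"
      values[i] = v
    elseif t == "string" then
      -- Fixed-size string, so embedded \0 bytes are preserved.
      parts[i] = "c" .. #v
      values[i] = v
    elseif t == "boolean" then
      -- string.pack has no boolean option; store 1 or 0 in one byte.
      parts[i] = "B"
      values[i] = v and 1 or 0
    else
      error("cannot pack value of type " .. t, 2)
    end
  end
  -- "<" pins the byte order to little-endian regardless of the host.
  local fmt = "<" .. table.concat(parts)
  return string.pack(fmt, table.unpack(values)), fmt
end

local data, fmt = autopack(42, 1.5, "string with a \0 byte", false)
print(fmt, #data)
print(string.unpack(fmt, data))
```

With this encoding string.unpack gives back 1 or 0 for booleans rather than true or false, so a matching unpack wrapper would have to convert them; that is one reason a dedicated boolean format character was part of the original proposal.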

@dgelessus
Contributor

Those functions from Lua 5.3 are basically what I'm suggesting. (I bet there are some underlying C functions that do the same thing.)
@Kubuxu When would you want to have the packed data without a format string to unpack it? If you want to do anything useful with the data, you need to know which types are packed there, and in what order. Which is what the format string is for.

@Kubuxu
Contributor Author

Kubuxu commented Apr 18, 2015

The tobytes/autopack returns bytes and format string.

@MaHuJa

MaHuJa commented May 26, 2015

@dgelessus You'd want to skip the format string when you already know what it is and have coded it in - or cached it - on the receiving end.

I'm curious as to what the intended use case for this is.
I can see uses for several of these things when an internet card is involved, in which case it would make more sense to include the lib from the internet card (comparable to how two programs, I think wget/pastebin, are available when you insert it).

But other than these external services, when would you use (un)pack in preference to http://ocdoc.cil.li/api:serialization ?

@Kubuxu
Contributor Author

Kubuxu commented May 26, 2015

My intended use was to be able to compress data and save binary data in a database w/o ugly hacks.
What I mean by an ugly hack: https://gist.github.com/Kubuxu/e5e04c028d8aaeab4be8

@fnuecke
Member

fnuecke commented May 26, 2015

I've kind of lost track, what's the current consensus? That this would provide Lua 5.3's string.pack, string.packsize, and string.unpack? In that case, I'd actually say let's implement those in a default API (the cleanest being a new StringAPI, I suppose), and expose the methods to the sandbox. The main reason I now prefer this to an extra block is that those exact methods would kind of need to become available in the 5.3 architecture (whenever that may or may not get finished). So the block would be utterly useless for that arch.

Would someone else like to take a shot at this? That'd be highly appreciated. If so, for reference have a look at the unicode API for example. Keep in mind this needs to more or less be implemented twice, once for the native arch and once for the LuaJ arch.

@MaHuJa

MaHuJa commented May 26, 2015

String.pack would fix one (the most significant) part of the request. It does request compression and base64 utilities, but I think those were secondary, as in "if we make a block anyway, may as well add this too". So yes, string.pack and family would provide what's requested.

Presuming 5.3 is coming anyway it would be less work to simply wait for it and require it for this functionality 😈

@Kubuxu
Contributor Author

Kubuxu commented May 26, 2015

@MaHuJa OC on Lua 5.3 is only a matter of fixing one bug in Eris for Lua 5.3, but we can't reproduce it outside of OC, and OC itself is a huge place to look for it.

@MaHuJa

MaHuJa commented May 27, 2015

I saw it - but my argument is basically that resolving #811 would resolve this with no added work.
5 hours after your post, an interesting note was added to that issue.

@fnuecke
Member

fnuecke commented Jun 14, 2015

All right, since it's in now: if you want the string packing stuff, use Lua 5.3; for other things, there's also the data card now, so I'll be closing this.

@fnuecke fnuecke closed this as completed Jun 14, 2015