
New file format #75

Open
whitequark opened this issue Oct 14, 2016 · 64 comments

Comments

@whitequark
Contributor

whitequark commented Oct 14, 2016

SolveSpace would benefit from a file format with the following required properties:

  • compact, with a binary representation;
  • has an alternate text representation, for tests;
  • fast to read and write;
  • easily extensible, while being backwards compatible;
  • easily supports deeply nested structures, such as for hierarchical sketches.

In addition, an optional but desirable property would be:

  • easily readable and perhaps even writable from 3rd party software.

The current file format fulfills none of the above (not even the last requirement, since there is no explicit schema). I propose to replace it with a format based on Google Protocol Buffers, which has the following properties:

  • the binary representation is naturally compact: it writes out your data essentially as-is, with no special tricks beyond a variable-length encoding for integers;
  • the text representation is available alongside binary representation, and requires no manual work;
  • has a highly optimized reader and writer implementation in C++;
  • was designed specifically to handle protocol evolution well;
  • is naturally nested;
  • has a separate schema and very broad library support.

Such a format, with the initial version closely resembling the existing file format, seems like an ideal choice for us. The plan is to evolve it further by gradual extension, while remaining compatible with old versions, as opposed to explicit versioning; unknown fields are generally ignored, so with some care and testing it is possible to retain an excellent degree of compatibility.
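To make the binary/text duality concrete, here is a minimal C++ sketch. The Sketch message type, its version field, and sketch.pb.h are hypothetical placeholders, assuming the usual protoc-generated class for a schema along the lines of message Sketch { uint32 version = 1; }:

    #include <string>
    #include <google/protobuf/text_format.h>
    // #include "sketch.pb.h"   // hypothetical generated header

    void RoundTrip(const Sketch &sk) {
        std::string bin, txt;
        sk.SerializeToString(&bin);   // compact wire format, for .slvs files
        google::protobuf::TextFormat::PrintToString(sk, &txt);   // for tests
        Sketch reparsed;              // the text form parses right back
        google::protobuf::TextFormat::ParseFromString(txt, &reparsed);
    }

Both forms come straight from the schema, which is exactly the "requires no manual work" property above.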

@whitequark whitequark added this to the 3.0 milestone Oct 14, 2016
@whitequark
Contributor Author

@jwesthues Any objections? Disregard my earlier thoughts about using LLVM bitcode; after an in-depth discussion with LLVM folks, that turned out to be a rather poor choice.

@whitequark
Contributor Author

The new file format should also use 64-bit handles (as would the in-memory representation). The rationale is that, once we make large assemblies fast, remapping of entities can rapidly exhaust the 65535-entity limit for one group, if one imports many instances of a large sketch. This migration is an ideal opportunity to add the 32→64 bit expansion code robustly.
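As an illustration only, not a proposed layout: a 64-bit handle leaves room to pack a full 32-bit copy number next to a 32-bit base handle, where today remapped entities must squeeze into a 16-bit space per group.

    #include <cstdint>

    // Hypothetical packing: the upper half identifies the imported
    // copy/instance, the lower half the original entity, so each instance
    // of a large sketch gets its own 32-bit entity space.
    struct hEntity64 { uint64_t v; };

    inline hEntity64 Remapped(uint32_t original, uint32_t copyNumber) {
        return { ((uint64_t)copyNumber << 32) | original };
    }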

@jwesthues
Member

That seems generally reasonable. It would be fashionable to use JSON or some other text representation today; but especially if you optimize assemblies to make better use of the GPU, a binary format's speedup seems material.

@whitequark
Contributor Author

Actually, version 3 of Protocol Buffers has a canonical JSON serialization, so we could just use that. (Google generally recommends version 2, for reasons that I do not fully understand; but 3 is probably fine too.)

@whitequark
Contributor Author

It's also worth taking a look at Cap'n Proto, designed by the original author of Protocol Buffers while taking into account many of the downsides of the latter. It has some rather desirable properties, but a) completely lacks a human-readable format usable at runtime (in fact the author recommends the use of JSON in that case); b) lacks a bidirectional JSON parser and serializer; c) is not supported nearly as broadly. Overall I feel it's not the right choice, but it deserves a mention.

@Evil-Spirit
Collaborator

The rationale is that, once we make large assemblies fast, remapping of entities can rapidly exhaust the 65535-entity limit

Actually, when I was implementing DXF import, I easily found a file that doesn't fit into 64k handles.

@Evil-Spirit
Collaborator

The main problem is to design a format that doesn't contain implementation details like remap lists. This format must be human-readable (in text form). I think it is a good idea to have two sections:

  1. The basic compact representation: only the requests, groups, and main parameters. This part is fully human-readable and logically structured, and it must be sufficient to rebuild the whole sketch on its own.
  2. The precomputed part, for fast loading while building assemblies. By default this is included in the file, but the user can choose to omit it (for more compact files).

For the new-version migration we can just use 1) and not bother with 2).
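A rough structural sketch of this two-section split, with hypothetical types throughout (Request, Group, and the section names are illustrative, not a schema proposal):

    #include <cstdint>
    #include <optional>
    #include <vector>

    struct Request { /* type, group, workplane, parameters ... */ };
    struct Group   { /* type, source group, operation ... */ };

    struct CoreSection {                   // 1) human-readable, regenerable
        std::vector<Request> requests;
        std::vector<Group>   groups;
        std::vector<double>  params;
    };
    struct PrecomputedSection {            // 2) cache for fast assembly loads
        std::vector<uint8_t> meshData;     // placeholder for mesh/surface data
    };
    struct SlvsFile {
        CoreSection core;                       // always present
        std::optional<PrecomputedSection> pre;  // omitted for compact files
    };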

@jwesthues
Member

The main problem is to design a format that doesn't contain implementation details like remap lists.

Without remapping, how do you propose to support nested assemblies?

@Evil-Spirit
Collaborator

Evil-Spirit commented Oct 15, 2016

@jwesthues
Use some efficient way to represent a handle-chain path to entities, like the FullSubentPath used in AutoCAD.
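A minimal sketch of what such a variable-length identifier could look like, in the spirit of AutoCAD's FullSubentPath; the field names are assumptions, not an actual proposal:

    #include <cstdint>
    #include <vector>

    struct EntityPath {
        std::vector<uint32_t> containers;  // assembly group, sub-assembly, ...
        uint32_t entity;                   // id within the innermost sketch
        bool operator==(const EntityPath &o) const {
            return containers == o.containers && entity == o.entity;
        }
    };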

@jwesthues
Member

You mean by making the entity's unique identifier variable-length?

@Evil-Spirit
Collaborator

Evil-Spirit commented Oct 15, 2016

@jwesthues, yes. Since we want to have nested sketches (assemblies being one of the cases), we don't want one big heap of entities, so we can use variable-length identifiers.

@jwesthues
Member

@Evil-Spirit So what used to be a fixed 64-bit field now must be allocated dynamically for every single entity, and you also break backwards compatibility. What's the offsetting benefit?

@Evil-Spirit
Collaborator

@jwesthues, entities themselves can contain only the current sketch id (or group), as before. Only constraints would need to contain a full path, since they can be applied between different groups/sketches.

Benefits:

  1. It removes the remap lists from files.
  2. It allows nested asm-in-asm structures without exponential growth in memory needs.
  3. It makes generation faster.
  4. It makes it possible to store only one copy of a sketch in memory and insert it multiple times, using just an asm-group with a csys basis and a pointer to the sketch (see the sketch in code below).

I know that this is a dramatic change and that not all the implementation aspects are clear. But this is just a way we can think about the new format. Maybe we can invent something more robust.
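Point 4 above, sketched with hypothetical types: the sketch geometry is stored once, and each placement is just a coordinate-system basis plus a pointer to it.

    #include <vector>

    struct Sketch;                  // shared geometry, stored exactly once
    struct Csys {                   // coordinate-system basis of an instance
        double origin[3];
        double basis[9];            // 3x3 rotation
    };
    struct AsmInstance {
        Csys          csys;
        const Sketch *sketch;       // a pointer, not a copy
    };
    struct AsmGroup {
        std::vector<AsmInstance> instances;
    };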

@jwesthues
Member

@Evil-Spirit The thing that dominates regeneration time is constraint solution and surface Booleans, and this speeds up neither. The memory saving is not material. You've proposed this change before, and I still don't see the benefit.

@Evil-Spirit
Collaborator

Selection can operate on the nested structures and generate paths for constraints. Actually, the existence of each entity in memory is not needed; it can be generated on the fly from the requests and stored parameters, for writing constraint equations or performing visualization. Looking up an entity by its id inside a group can use the request id stored in the entity id, the entity id inside the request, and the unique entity id inside the group to decide which type and which parameters to take...

@Evil-Spirit
Collaborator

Evil-Spirit commented Oct 15, 2016

@jwesthues
Constraint solution time is not the problem; I know what we can do about it.
Booleans are painful, but they can still be improved by using external libraries or by improving the current solutions. They can be abstracted (as is actually done for choosing between meshes and surfaces).

But we are talking about the file format, the basis for all the other aspects. It is as if solving one small problem now would solve a couple of big problems in the future. I am not talking about actually changing things yet, just thinking about how it could be done. Maybe some of these ideas will help make a better file format or solve some problems.

@whitequark
Contributor Author

@Evil-Spirit

Actually, when I was implementing DXF import, I easily found a file that doesn't fit into 64k handles.

I don't consider importing such files useful (what are you going to do with them exactly?), so it doesn't matter. This is just an artificial case.

This format must be human-readable (in text form).

I agree that there should be a human-readable representation, for one reason only: we need to write tests. I see no reason to have the human-readable representation as our only or main one. There is no benefit (it is actually easier to read protobuf files, because you don't have to write any parsers), and there are drawbacks: the files are bloated, and the parsing is slow.

For the new-version migration we can just use 1) and not bother with 2).

Nope, the new file format should be (at least) functionally equivalent to the old one.

@whitequark
Contributor Author

@jwesthues Can you please write a (more in-depth than the comment in sketch.h) explanation of the function that remapping serves? I see that it's used for both assemblies and any groups that produce entities from entities, but I don't follow the logic behind that.

@jwesthues
Member

@whitequark To maintain an associative link between two entities (like an entity imported from a part, and the imported entity in the assembly), you can either identify entities by a complete path that specifies the associative link, or keep a table that records the association. Assuming arbitrarily deep import nesting, the former requires arbitrarily long identifiers. The latter is that remap table.
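A minimal sketch of such a remap table (hypothetical types; the actual SolveSpace table differs in detail): the key records which input entity, under which copy, produced the new entity, so the assigned handle stays stable even as entities are added to or deleted from the input group.

    #include <cstdint>
    #include <map>

    struct EntityKey {
        uint32_t input;        // handle of the entity in the source group
        int      copyNumber;   // which copy/instance produced it
        bool operator<(const EntityKey &o) const {
            return input != o.input ? input < o.input
                                    : copyNumber < o.copyNumber;
        }
    };
    using RemapTable = std::map<EntityKey, uint32_t>;

    // Return the stable handle for (input, copyNumber), allocating one the
    // first time this association is seen.
    uint32_t Remap(RemapTable &t, EntityKey k, uint32_t &nextHandle) {
        auto it = t.find(k);
        if(it != t.end()) return it->second;
        return t[k] = nextHandle++;
    }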

@Evil-Spirit
Collaborator

Evil-Spirit commented Oct 15, 2016

I don't consider importing such files useful (what are you going to do with them exactly?), so it doesn't matter. This is just an artificial case.

Actually, this is not as big a file as you might think. And SolveSpace is good enough to be a good pure 2D editor (since we added DXF import/export). Look at
https://github.com/whitequark/solvespace-testcases/tree/master/tests/import/Download

@Evil-Spirit
Collaborator

@jwesthues, this sounds reasonable for assemblies (where we actually have the asm-in-asm hierarchy case and the case of an arbitrary number of instances of the same part), but why is this needed for every group? Can we painlessly avoid it?

@whitequark
Contributor Author

Speaking of Protocol Buffers... I'm looking further into it and I see some potential scalability problems. Specifically, Protocol Buffers are designed to work by deserializing the entire message at once. The problems here are twofold:

  • Parsing demands. In general, the documentation recommends avoiding messages larger than 1 MB; there is a 64 MB soft limit to guard against DoS, and a 512 MB hard limit at which point parsing breaks due to integer overflows. Additionally, parsing involves allocation (which is cheap, as it uses arenas, but it still expands the input in memory) as well as deserialization from the wire format.
  • Lack of a seek capability. When we are loading a sketch that we know we will regenerate, it is absolutely wasteful to read all the meshes and such. Similarly, with hierarchical sketches, we might not be interested in many of them at all.

Cap'n Proto solves both quite elegantly. It looks like I'll have to prototype some code with both...
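The first problem, in code: with the stock protobuf API the entire message is decoded eagerly, even when only a small part of it is wanted. SlvsFile and its header().version() accessor are hypothetical protoc-generated names.

    #include <cstdint>
    #include <string>
    // Hypothetical generated type with a small header and large mesh fields.

    uint32_t ReadVersionOnly(const std::string &wholeFileBytes) {
        SlvsFile f;
        f.ParseFromString(wholeFileBytes);  // decodes everything, meshes too
        return f.header().version();        // ...even if this is all we need
    }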

@jwesthues
Member

@Evil-Spirit The nesting of associative links can get pretty deep even without assemblies, like if you sketch a contour, step translating, then step translating again for a 2d array, then extrude, then ...

@Evil-Spirit
Collaborator

@jwesthues, this seems like a case where we could apply just a remapping function instead of storing all of this inside a map... or not?

@jwesthues
Member

@Evil-Spirit How do you remap in a way that (a) uses a fixed-length identifier, (b) maintains the associative link, even when entities in the input group are added and deleted, and (c) doesn't require that table?

@Evil-Spirit
Collaborator

@jwesthues, yes, (b) is a serious issue. Actually, this is solved if we use pointers instead of ids at runtime and use ids only for saving... but that is a different story.

@baryluk

baryluk commented Jan 16, 2020

+1 for the Protocol Buffers based format. It makes adding new fields very easy, and parsing/exporting is 1) fast, 2) easy to get right, 3) extensible in the future, 4) easy to use in other tools (like a CLI post-processor or analyzer, possibly written in another language), 5) compact, 6) easy to debug even without access to the protocol buffer description, and 7) type-checked out of the box.

I can't stand XML and JSON in any serious project.

Maybe I am biased, but I have used Protocol Buffers for the last 8 years and never found them problematic.

@traverseda

I still think capnproto is a good idea, even years later. You can memory-map a capnproto file (read-only) and use copy-on-write techniques to make loading of complicated assemblies nearly instant, if your disk format closely follows the actual data structures you use internally.
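A minimal POSIX sketch of the memory-mapping half of this idea; MAP_PRIVATE gives the copy-on-write behavior mentioned, and a zero-copy reader can then point straight into the mapped bytes:

    #include <cstddef>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    // Map a file read-only; pages are loaded lazily on first access, so
    // "loading" is nearly instant regardless of file size.
    const void *MapFileReadOnly(const char *path, size_t *size) {
        int fd = open(path, O_RDONLY);
        if(fd < 0) return nullptr;
        struct stat st;
        if(fstat(fd, &st) != 0) { close(fd); return nullptr; }
        void *p = mmap(nullptr, (size_t)st.st_size, PROT_READ,
                       MAP_PRIVATE, fd, 0);
        close(fd);   // the mapping stays valid after the fd is closed
        if(p == MAP_FAILED) return nullptr;
        *size = (size_t)st.st_size;
        return p;
    }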

@baryluk

baryluk commented Jan 16, 2020

@traverseda capnproto is not significantly faster; in fact, if you include object creation, it can be much slower. capnproto can be a security risk: AFAIK it doesn't validate all the fields as strictly as protobuf does during reads. capnproto output can be, and often is, significantly bigger than protobuf's (2-4x in pathological cases). I did many benchmarks in the past, and capnproto was often the slowest of all the extensible binary serialization methods.

So I would still lean toward protobuf, because it is good enough, extensible, and has good cross-platform support in many tools. Using a niche format wouldn't be a good idea.

Sure, capnproto is somewhat popular, but for how long? Plus, IMHO, the capnproto IDL is ugly.

The best approach is to actually do benchmarks and comparisons and do proper engineering on this topic, since selecting a format is quite a big deal in the long term. It impacts changeability, stability of the file API, and also how you use things in the C++ code (setters, getters, extensions, etc.).

@whitequark
Contributor Author

The current plan is to use neither capnproto nor protobuf but rather flatbuffers. Protobuf v3 especially has many awful design decisions that only pass for good ones because people assume Google knows what they're doing, and they don't.

@baryluk

baryluk commented Jan 16, 2020

Well, I don't like protobuf v3 either, not only due to the changes to some features (things like optional fields, extensions, etc.), but also due to how they are used in C++ and Go. I like v2 and have the most experience with it.

I don't have an opinion on flatbuffers. I am just always skeptical of new solutions that claim to be superior. Usually they are better in some aspects, but worse in others.

@whitequark
Contributor Author

It's a good thing I spent a while going into detail on the possible solutions here, such as the two reasons protobuf v2 isn't suitable for the task.

@baryluk

baryluk commented Jan 16, 2020

@whitequark Makes sense. Out of curiosity, what kind of slvs file sizes or memory usage have you seen at the extreme?

My slvs files are usually 1 MB and pretty simple, but that is in the text format; doing a quick gzip -1 on them, to "emulate" a binary format and remove basic field-name repetition, brings them down to 132 kB.

I can imagine that in the current slvs format some medium projects can be pretty big indeed.

@whitequark
Contributor Author

Out of curiosity, what kind of slvs file sizes or memory usage have you seen at the extreme?

The current slvs format includes the entire rendered 3D model for inclusion in subassemblies. This can grow in an essentially unbounded way, easily to many megabytes, and well over the recommended limit for protobuf. If the authors of protobuf specifically say not to use it in cases like this one, I would rather trust them.

Flatbuffers allows arbitrary seeking and provides zero-copy parsing, so it solves this issue nicely. An unused rendered model will not incur any overhead if we memory-map the uncompressed file, or only minor overhead if we handle compressed models carefully.
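A sketch of how that looks with the flatbuffers C++ API, given a hypothetical generated root type Sketch (say, from table Sketch { version:uint; mesh:[ubyte]; }); a memory-mapped buffer can be handed to it directly:

    #include <cstdint>
    #include "flatbuffers/flatbuffers.h"
    // The hypothetical generated header for the schema above would go here.

    uint32_t ReadVersionZeroCopy(const void *mappedBytes) {
        auto *sk = flatbuffers::GetRoot<Sketch>(mappedBytes);
        return sk->version();   // follows offsets in place; the mesh vector's
                                // pages are never touched unless dereferenced
    }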

@baryluk

baryluk commented Jan 16, 2020

@whitequark Yes, I can confirm that Protocol Buffers are a rather bad fit for 100 MB+ messages. They were simply not designed/optimized with that in mind.

@baryluk

baryluk commented Jan 16, 2020

I see SolveSpace already has flatbuffers as a dependency, used in one of the exporters. Nice. 10% already done :D

I looked into file.cpp and it is relatively short, so at least starting to use flatbuffers there doesn't look like an impossible task. It would not get all the benefits initially, but it could be a good start.

Did anybody start designing a flatbuffers schema that also takes into account the other future wishlist items (like hierarchical sketches) that could impact the format?

@whitequark
Contributor Author

No. I've been sick for several years and only had a limited ability to work on SolveSpace, unfortunately.

@baryluk

baryluk commented Jan 16, 2020

@whitequark Ok, no worries! I will see what I can do, and do some prototyping for fun.

@whitequark
Contributor Author

Just keep in mind that it is unlikely that your prototype will be merged in any reasonable timeframe.

@ConductiveInsulation

I think it would also be great to have the possibility to add comments inside the slvs file. Currently it's possible to add a comment somewhere in the text, but it gets removed when the "corrupted" file is repaired.

I use the comments when I share files via git (which works great, since they are just text files), so it's not really helpful when they just disappear; keeping them in a separate file also adds some inconvenience.

A possible solution would be to allow longer group names; I haven't counted the characters, but after a certain length SolveSpace just crashes while opening the file. Alternatively, a group or file description might be doable.

@phkahler
Member

phkahler commented Dec 6, 2021

@ConductiveInsulation you know you can put comments directly on the sketch, right? Press ;. You can also select a point first to position the comment relative to it.

@ConductiveInsulation

directly on the sketch

Do you mean the Text function? It bloats the file enormously, since not only the message gets stored but also the shape/extrusion of the message. Unfortunately, the message itself also doesn't get stored at the beginning of the file, so it won't be visible when the file is viewed, for example when it has been directly linked.

I made an example of how I currently do it: https://bitbucket.org/conductiveinsulation/solvespace/raw/b638da1b1d850ff2bbe52965f9cc919ecf7775f9/Logo.slvs

@ruevs
Member

ruevs commented Dec 6, 2021

@ConductiveInsulation No, not the "Text" function. The "Comment" function from the menu Constrain | Comment or with the ; (semicolon) shortcut key. If you select a point before adding the comment it will be positioned relative to the point and will move with it.

@ConductiveInsulation

@ruevs I've been using SolveSpace for a pretty long time now, but I have to admit that I must have totally overlooked that function. I'll look into it after work and give you an update.

@ConductiveInsulation

@ruevs it appears to be impossible to store the comments at the beginning of the file, since the constraints get stored at the end of the file.

@hinell

hinell commented Feb 12, 2022

I would certainly go for a JSON-based format. It would be best for web-based software handling such files. Loading performance is less important than modelling performance.

I would also vote for a zip-based archiving approach. FreeCAD uses something like that for its main format.

@hinell

hinell commented Aug 11, 2023

Check out SQLite. I think the best approach would be to use an SQLite DB as the application file container. It has so many benefits I can't even describe them all. The only major drawback is that it's not easily streamable: you have to buffer data.
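For what it's worth, a minimal sketch of the SQLite-as-container idea, with a hypothetical table layout: each logical section of the model becomes a table, and partial loading is just a query instead of a custom seek scheme.

    #include <sqlite3.h>

    // Open (or create) a model container and ensure a hypothetical schema.
    int OpenModel(const char *path, sqlite3 **db) {
        int rc = sqlite3_open(path, db);
        if(rc != SQLITE_OK) return rc;
        return sqlite3_exec(*db,
            "CREATE TABLE IF NOT EXISTS groups(h INTEGER PRIMARY KEY, type INT);"
            "CREATE TABLE IF NOT EXISTS params(h INTEGER PRIMARY KEY, val REAL);",
            nullptr, nullptr, nullptr);
    }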

@KronosTheLate

KronosTheLate commented Oct 10, 2023

I want to mention that I had some issues getting my system to recognize the correct MIME type for .slvs files (Linux).
My band-aid fix was to create the file /usr/share/mime/packages/solvespace.xml and put the following content inside:

<?xml version="1.0" encoding="utf-8"?>
<mime-type type="text/solvespace">
  <comment>Solvespace model source code</comment>
  <glob pattern="*.slvs"/>
</mime-type>

After running sudo update-mime-database /usr/share/mime in the shell, the file extension is now correctly recognized.

EDIT: I created this issue to get some help integrating the solvespace file type into the shared mime database.
