Support or supplant vsi* networking virtual file system functionality in GDAL (e.g. vsicurl, vsis3) #538
Comments
@sgillies to answer your question in the previous thread,
The answer as you might have expected is, in my experience, no. For example, I didn't set the ACL correctly when uploading tiles, and instead of an informative error I got ERROR 4: `/vsicurl/http://foo-bucket/bar.tiff' does not exist in the file system,
and is not recognised as a supported dataset name. I don't think relying on GDAL to give us nice information about what is happening over the network is going to be a good path forward without some PRs to GDAL. Perhaps @rouault or @warmerdam would have some more informed ideas about this? |
Yes /vsicurl/ might need some tweaking to better report errors |
This. VSI already does this (as I understand the purpose of VSI), so it might (having not read the code) be possible to expose the same hooks to Python (etc.) to facilitate alternate byte fetching implementations. |
Yes that should be feasible. You could conceptually have a VSIInstallPythonFileHandler( prefix, PyObject* pythonFileHandler ) that would be called from Python. That would create an instance of a C++ VSIPythonFilesystemHandler (implementing VSIFilesystemHandler interface) that would call back the Open() method of pythonFileHandler. That one would return a PyObject* pythonFileHandle that would be wrapped by a C++ VSIPythonVirtualHandle (implementing VSIVirtualHandle interface) to do the callbacks to read(), write(), seek(), tell(), flush()... |
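To make the callback shape above concrete, here is a minimal Python-side sketch of what such a handler object might look like. Nothing in it is an existing GDAL API: `InMemoryFileHandler` and `InMemoryFileHandle` are hypothetical names, the `/vsipy/` prefix is made up for illustration, and the methods simply mirror the Open(), read(), seek(), and tell() callbacks named in the comment.

```python
import io


class InMemoryFileHandle:
    """Hypothetical handle: the object a C++ VSIPythonVirtualHandle would wrap."""

    def __init__(self, data):
        self._buf = io.BytesIO(data)

    def read(self, nbytes):
        # Return up to nbytes bytes from the current position.
        return self._buf.read(nbytes)

    def seek(self, offset, whence=0):
        self._buf.seek(offset, whence)
        return 0  # assumption: mirror the C convention of 0 on success

    def tell(self):
        return self._buf.tell()


class InMemoryFileHandler:
    """Hypothetical filesystem handler: maps paths to byte strings."""

    def __init__(self, files):
        self._files = dict(files)

    def open(self, path, mode="rb"):
        if path not in self._files:
            return None  # a C++ wrapper could translate this into a NULL handle
        return InMemoryFileHandle(self._files[path])


# Exercise the callbacks the way a wrapper might: open, seek, read.
handler = InMemoryFileHandler({"/vsipy/demo.bin": b"0123456789"})
handle = handler.open("/vsipy/demo.bin")
handle.seek(4)
chunk = handle.read(3)
```

An alternate byte-fetching implementation (HTTP range requests, a cache, and so on) would only need to provide the same few methods.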
Embedding Python could be feasible, but we'd still require (more so, I suspect, because of new development and debugging to be done) work in GDAL to pass libcurl errors up to Rasterio (to be turned into Python exceptions). So, if we're discussing GDAL work, that's where I'd like to see it start. |
Who's up for winter in Paris? :) https://wiki.osgeo.org/wiki/Paris_Code_Sprint_2016 I'm likely to attend and try to work on a solution from the GDAL side; I have to work out the scheduling a bit. I can write C and have been in the guts of GDAL before, but it would be nice to be sitting next to @rouault to bug him with questions while I try. |
@lossyrob I won't be able to make the sprint, but I'll plan to have Rasterio mostly ready by then. Thanks for taking charge! |
I will be at the Paris sprint, and we've done something similar to this for PDAL. |
Using GDAL's trunk (post https://trac.osgeo.org/gdal/ticket/6243) and code in a branch of mine that I haven't committed yet, I'm able to access a raster dataset in a private bucket using temporary credentials:
I'm happy to report that signed S3 URLs work, too. There's a bit of work to do on the interface for this, but the raw power is helping me get over my misgivings, @rouault! |
#551 has my preliminary work on this feature. I've implemented an AWS Session object using Python's context manager concept: when you do with rasterio.open('s3://example/foo.tif', **aws.Session().options):
... The AWS Session object will attempt to fetch parameters from your AWS configuration in |
This is great stuff @sgillies. To be clear on the work items, this takes care of passing in config that GDAL wanted as environment variables, and from the GDAL side we need to bubble up libcurl errors and wrap them in python errors, correct? What other changes to GDAL (or rasterio) around this feature are we imagining will need to happen in order to do this right? |
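As a rough illustration of "passing in config that GDAL wanted as environment variables", the sketch below shows a context manager that sets variables on entry and restores the previous values on exit. It is not the code in #551: `EnvSession` is a hypothetical name, and the credential value is a placeholder. The variable name `AWS_ACCESS_KEY_ID` is one of the options GDAL's /vsis3/ handler reads.

```python
import os


class EnvSession:
    """Hypothetical stand-in for the session object discussed above."""

    def __init__(self, **options):
        # Drop unset options so they never reach the environment.
        self.options = {k: v for k, v in options.items() if v is not None}
        self._saved = {}

    def __enter__(self):
        for key, value in self.options.items():
            self._saved[key] = os.environ.get(key)  # remember the old value
            os.environ[key] = value
        return self

    def __exit__(self, exc_type, exc, tb):
        for key, old in self._saved.items():
            if old is None:
                os.environ.pop(key, None)  # variable did not exist before
            else:
                os.environ[key] = old
        self._saved.clear()
        return False  # never swallow exceptions


with EnvSession(AWS_ACCESS_KEY_ID="placeholder-key-id", AWS_SESSION_TOKEN=None):
    inside = os.environ.get("AWS_ACCESS_KEY_ID")
```

Scoping the credentials to a `with` block keeps them from leaking into unrelated code that runs later in the same process.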
Thanks @hobu, I'll be asking you about the PDAL code in Paris. |
@lossyrob You've got the gist of it. It will suffice for GDAL to raise curl errors in its own way. Rasterio can take care of mapping error codes to Python exceptions. A preview of this feature on OS X can be had like this:
I'm not decided on the API yet. In the code above I'm doing this: with rasterio.drivers(), aws.Session():
rasterio.open('s3://yourbucket/some.tif'):
... but might make some changes for interoperability with boto3 sessions or to enable default sessions for |
@perrygeo @sgillies wondering the proper implementation of these errors. I could use an existing What info would you want to be translated via CPLError? Thanks. |
@lossyrob Bonjour! For a start, Rasterio could get by with 4 distinct CPL errors, or one new CPL error with 4 distinct flavors. I'd like to be able to catch the 4 following kinds of failures and raise appropriate Rasterio exceptions.
Lack of Access Key Id
This seems like it will be the most common authentication error. It occurs where there are no authentication params at all. AWS returns a 403 status code and
^^ I'd like to be able to raise, i.e.,
Invalid authentication tokens
The Access Key Id might be valid, but other params are not. AWS returns a 403 status code and
^^ I'd like to be able to raise, i.e.,
Bucket not found
A CPL error unique to AWS 404
^^ I'd like to be able to raise, i.e.,
Object not found
Finally, a CPL error unique to 404 "object not found", distinct from the one above, would be great to have.
^^ I'd like to be able to raise, i.e., |
Thanks @sgillies. Based on mocked output, I guess these don't even need to be new codes (so no addition to here: https://github.com/mapbox/rasterio/blob/master/rasterio/_err.pyx#L44-L53 or actually any rasterio), but just supply the appropriate error message, and give the |
@lossyrob no, CPLE_OpenFailed doesn't quite suffice for me. These first three in my list are different errors in my view and have a different recovery mode (authenticate or re-authenticate, revise bucket name, etc). In https://github.com/OSGeo/gdal/blob/trunk/gdal/port/cpl_error.h#L63-L65 it looks like the door is open for new error codes, though the project might want to have an RFC on adding them? /cc @rouault |
@sgillies Ok, that was the original plan, I misunderstood your post. Something like
exception_map = {
    1: RuntimeError, # CPLE_AppDefined
    2: MemoryError, # CPLE_OutOfMemory
    3: IOError, # CPLE_FileIO
    4: IOError, # CPLE_OpenFailed
    5: TypeError, # CPLE_IllegalArg
    6: ValueError, # CPLE_NotSupported
    7: AssertionError, # CPLE_AssertionFailed
    8: IOError, # CPLE_NoWriteAccess
    9: KeyboardInterrupt, # CPLE_UserInterrupt
    10: ValueError, # CPLE_ObjectNull
    21: InvalidAccessKeyId, # CPLE_AWSInvalidAccessKeyId
    22: SignatureDoesNotMatch, # CPLE_AWSSignatureDoesNotMatch
    23: NoSuchBucket, # CPLE_AWSBucketNotFound
    24: AwsObjectNotFoundError, # CPLE_AWSObjectNotFound
    25: NoSuchObject, # CPLE_AWSNoSuchObject
    26: HttpError, # CPLE_HttpResponse
}
Not sure about the grouping or the names, so any suggestions are welcome; I can talk with Evan about it more tomorrow. |
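A small runnable sketch of how such a map could be used on the Python side follows. It is illustrative only: the error numbers are the ones proposed in the comment above and may not match what GDAL finally assigns, and `AWSError`, `exception_for`, and `describe_recovery` are hypothetical names rather than Rasterio API.

```python
class AWSError(IOError):
    """Hypothetical base class so callers can catch every AWS failure at once."""


class InvalidAccessKeyId(AWSError):
    pass


class SignatureDoesNotMatch(AWSError):
    pass


class NoSuchBucket(AWSError):
    pass


# Assumption: numbers follow the proposal above, not GDAL's released values.
exception_map = {
    3: IOError,                 # CPLE_FileIO
    4: IOError,                 # CPLE_OpenFailed
    21: InvalidAccessKeyId,
    22: SignatureDoesNotMatch,
    23: NoSuchBucket,
}


def exception_for(err_no, message):
    """Build the exception for a CPL error number, defaulting to RuntimeError."""
    exc_class = exception_map.get(err_no, RuntimeError)
    return exc_class(message)


def describe_recovery(exc):
    """Pick a recovery mode; distinct classes make this dispatch possible."""
    if isinstance(exc, (InvalidAccessKeyId, SignatureDoesNotMatch)):
        return "re-authenticate"
    if isinstance(exc, NoSuchBucket):
        return "revise bucket name"
    return "give up"


recovery = describe_recovery(exception_for(23, "bucket not found"))
```

Because the AWS classes derive from IOError here, existing `except IOError` handlers would keep working while new code can catch the narrower types.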
@lossyrob that looks great 👍 |
The GDAL work is completed and patched into trunk via OSGeo/gdal#98 |
🙇 @lossyrob. I'm just back from vacation and happy to say that the latest GDAL compiles and works fine. I'm going to try the new error codes this week. |
This feature is in the master branch and will be in 0.33. |
Related: OSGeo/gdal#2991. |
Not really. This will not help /vsicurl and friends, which don't use CPLHTTPFetch() but call curl directly. |
Thanks for the clarification @rouault ! |
Between 0.29.0 and 0.30.0, functionality was introduced to specifically support the virtual file systems vsigzip, vsizip, and vsitar. As can be seen in this code, along with explicit support for some virtual file systems, support for other virtual file systems that was previously and unintentionally exposed was explicitly disallowed, and calls that had taken advantage of that GDAL functionality began throwing a "VFS scheme {0} unknown" exception upon upgrade of rasterio.
Part of the GDAL functionality that became inaccessible through rasterio was accessing rasters over HTTP with vsicurl. That feature in particular was useful to me and others, since we were relying on it to do windowed reads of large rasters over the network, without pulling down the whole file. In this way we can have multiple processes pulling byte ranges of a GeoTIFF off of S3 (which supports the byte range HTTP request header), as a mechanism for doing tiling work using rasterio and PySpark (see the OpenAerialMap tiler code here; this "chunking" step had some problems dealing with very large rasters, so if you're interested in using these techniques make sure to look at the master branch's version and any unmerged PRs).
I started a discussion on a merged pull request, which is the place I found the most definitive language about why vsicurl was no longer supported. It was smartly recommended that we start a separate ticket instead of talking on a merged PR, and this ticket can serve this purpose. Please read the prior discussion as if they were the first comments on this issue.
/cc @sgillies @perrygeo @mojodna
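For readers unfamiliar with the byte range mechanism mentioned above, this sketch builds (but does not send) an HTTP request for one slice of a remote file using only the standard library. `range_request` is a hypothetical helper, and the URL is the placeholder from earlier in the thread; it is not how GDAL's vsicurl is implemented, only the same HTTP idea.

```python
import urllib.request


def range_request(url, offset, length):
    """Build a GET request for `length` bytes starting at byte `offset`."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    last = offset + length - 1  # Range end positions are inclusive
    headers = {"Range": "bytes={}-{}".format(offset, last)}
    return urllib.request.Request(url, headers=headers)


# Ask for the first 16 KiB only, e.g. enough to cover a file header.
req = range_request("http://foo-bucket/bar.tiff", 0, 16384)
range_header = req.get_header("Range")
```

A server that honors the header answers with status 206 and only the requested bytes, which is what lets several workers read different windows of one large file in parallel.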