Support or supplant vsi* networking virtual file system functionality in GDAL (e.g. vsicurl, vsis3) #538

Closed
lossyrob opened this issue Jan 10, 2016 · 26 comments

@lossyrob

Between 0.29.0 and 0.30.0, functionality was introduced to specifically support the virtual file systems vsigzip, vsizip, and vsitar. As can be seen in this code, along with explicit support for some virtual file systems, support for other virtual file systems that had previously been exposed unintentionally was explicitly disallowed, and calls that had taken advantage of that GDAL functionality began throwing a "VFS scheme {0} unknown" exception upon upgrade of rasterio.

Part of the GDAL functionality that became inaccessible through rasterio was accessing rasters over HTTP with vsicurl. That feature in particular was useful to me and others, since we were relying on it to do windowed reads of large rasters over the network, without pulling down the whole file. In this way we can have multiple processes pulling byte ranges of a GeoTIFF off of S3 (which supports the byte range HTTP request header), as a mechanism of doing tiling work using rasterio and PySpark (see the OpenAerialMap tiler code here. This "chunking" step had some problems dealing with very large rasters, so if you're interested in using these techniques make sure to look at the master branch's version and any unmerged PRs).
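
To make the windowed-read pattern concrete, here is a minimal sketch. It assumes a recent rasterio (1.x), where `rasterio.windows.Window` exists and `rasterio.open()` accepts an HTTP URL and hands it to GDAL's curl-based reader. The helper names `tile_windows` and `read_tile` are hypothetical and are not taken from the OpenAerialMap tiler.

```python
from typing import Iterator, Tuple

# (col_off, row_off, width, height), the same order rasterio's Window uses.
WindowSpec = Tuple[int, int, int, int]


def tile_windows(width: int, height: int, size: int) -> Iterator[WindowSpec]:
    """Yield windows of at most size x size pixels that cover the raster."""
    for row_off in range(0, height, size):
        for col_off in range(0, width, size):
            yield (col_off, row_off,
                   min(size, width - col_off),
                   min(size, height - row_off))


def read_tile(url: str, spec: WindowSpec):
    """Read band 1 of one window over HTTP (assumes rasterio >= 1.0)."""
    # Imported lazily so the window arithmetic above works without rasterio.
    import rasterio
    from rasterio.windows import Window

    with rasterio.open(url) as src:
        # Only the blocks intersecting the window should be fetched.
        return src.read(1, window=Window(*spec))


# Each worker (for example a Spark task) would take one window
# and call read_tile() with it.
windows = list(tile_windows(width=1024, height=1024, size=512))
```

Splitting the raster into windows up front keeps each task independent, so the windows can be distributed without any shared state.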

I started a discussion on a merged pull request, which is the place I found the most definitive language about why vsicurl was no longer supported. It was smartly recommended that we start a separate ticket instead of talking on a merged PR, and this ticket can serve that purpose. Please read the prior discussion as if its comments were the first comments on this issue.

/cc @sgillies @perrygeo @mojodna

@lossyrob
Author

@sgillies to answer your question in the previous thread,

Does vsicurl raise HTTP errors adequately?

The answer as you might have expected is, in my experience, no. For example, I didn't set the ACL correctly when uploading tiles, and instead of a 403 being returned, it gave the generic

ERROR 4: `/vsicurl/http://foo-bucket/bar.tiff' does not exist in the file system,
and is not recognised as a supported dataset name.

error. I don't think relying on GDAL to give us nice information about what is happening over the network is going to be a good path forward without some PRs to GDAL.

Perhaps @rouault or @warmerdam would have some more informed ideas about this?

@rouault
Contributor

rouault commented Jan 10, 2016

Yes /vsicurl/ might need some tweaking to better report errors

@mojodna
Contributor

mojodna commented Jan 11, 2016

From @perrygeo in #472:

An alternative solution might be some sort of two-way communication between the GDAL drivers and the networking layer so that users could plug in their own networking code. The driver would say "fetch these bytes", and the user's networking code could return the bytes. Decoupling the networking from the format driver would eliminate the concerns raised here about vsi. I have no idea how feasible this would be or how much work it might take to implement, just an idea.

This.

VSI already does this (as I understand the purpose of VSI), so it might (having not read the code) be possible to expose the same hooks to Python (etc.) to facilitate alternate byte fetching implementations.

@rouault
Contributor

rouault commented Jan 12, 2016

Yes that should be feasible. You could conceptually have a VSIInstallPythonFileHandler( prefix, PyObject* pythonFileHandler ) that would be called from Python. That would create an instance of a C++ VSIPythonFilesystemHandler (implementing VSIFilesystemHandler interface) that would call back the Open() method of pythonFileHandler. That one would return a PyObject* pythonFileHandle that would be wrapped by a C++ VSIPythonVirtualHandle (implementing VSIVirtualHandle interface) to do the callbacks to read(), write(), seek(), tell(), flush()...
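
The callback design described here can be sketched from the Python side. This is only an illustration: `VSIInstallPythonFileHandler` is a conceptual name from this comment, not an existing call, and the classes `MemoryHandle` and `MemoryFileHandler` plus the `/vsipy/` prefix below are hypothetical stand-ins showing which methods such a handler might need to expose. No GDAL code is called.

```python
import io
from typing import Dict


class MemoryHandle:
    """Hypothetical file handle: the methods a C++ wrapper would call back."""

    def __init__(self, data: bytes) -> None:
        self._buf = io.BytesIO(data)

    def read(self, size: int) -> bytes:
        return self._buf.read(size)

    def seek(self, offset: int, whence: int = io.SEEK_SET) -> int:
        return self._buf.seek(offset, whence)

    def tell(self) -> int:
        return self._buf.tell()


class MemoryFileHandler:
    """Hypothetical filesystem handler that serves bytes from a dict.

    A networked version would fetch byte ranges in read() instead.
    """

    def __init__(self, files: Dict[str, bytes]) -> None:
        self._files = files

    def open(self, path: str) -> MemoryHandle:
        if path not in self._files:
            raise FileNotFoundError(path)
        return MemoryHandle(self._files[path])


handler = MemoryFileHandler({"/vsipy/demo.bin": b"0123456789"})
handle = handler.open("/vsipy/demo.bin")
handle.seek(4)
chunk = handle.read(3)  # the "fetch these bytes" step
```

Because the handler raises ordinary Python exceptions, a design like this could also carry network errors back to the caller without going through GDAL's error codes.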

@sgillies
Member

Embedding Python could be feasible, but we'd still require (more so, I suspect, because of new development and debugging to be done) work in GDAL to pass libcurl errors up to Rasterio (to be turned into Python exceptions). So, if we're discussing GDAL work, that's where I'd like to see it start.

@lossyrob
Author

Who's up for winter in Paris? :) https://wiki.osgeo.org/wiki/Paris_Code_Sprint_2016

I'm likely to attend and try to work on a solution from the GDAL side; I have to work out the scheduling a bit. I can C and have been in the guts of GDAL before, but it would be nice to be sitting next to @rouault to bug him with questions while I try.

@sgillies
Member

@lossyrob I won't be able to make the sprint, but I'll plan to have Rasterio mostly ready by then. Thanks for taking charge!

@hobu
Contributor

hobu commented Jan 14, 2016

I will be at the Paris sprint, and we've done something similar to this for PDAL.

@sgillies
Member

Using GDAL's trunk (post https://trac.osgeo.org/gdal/ticket/6243) and code in a branch of mine that I haven't committed yet, I'm able to access a raster dataset in a private bucket using temporary credentials:

$ rio info s3://mapbox/rasterio/shade.tif
{"count": 1, "crs": "EPSG:3857", "res": [9.5546285343, 9.5546285343], "interleave": "band", "dtype": "uint8", "driver": "GTiff", "transform": [9.5546285343, 0.0, -11858134.818413004, 0.0, -9.5546285343, 4813698.2926443005], "lnglat": [-106.47949217280885, 39.605688173875606], "height": 1024, "width": 1024, "shape": [1024, 1024], "blockxsize": 256, "tiled": true, "blockysize": 256, "nodata": 255.0, "bounds": [-11858134.818413004, 4803914.3530251775, -11848350.87879388, 4813698.2926443005]}
$ rio shapes --mask s3://mapbox/rasterio/shade.tif | geojsonio


I'm happy to report that signed S3 URLs work, too.

There's a bit of work to do on the interface for this, but the raw power is helping me get over my misgivings, @rouault!

@sgillies
Member

#551 has my preliminary work on this feature. I've implemented an AWS Session object using Python's context manager concept: when you do with aws.Session(): ... the required GDAL CPL config options are set and when execution exits that block the config options are returned to their previous state. It's going to be possible to do this, too, if you prefer:

with rasterio.open('s3://example/foo.tif', **aws.Session().options):
    ...

The AWS Session object will attempt to fetch parameters from your AWS configuration in ~/.aws (piggybacking on boto and awscli). These will be overridden by well-known environment variables or explicit arguments to the Session constructor (again modeled after boto).
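
The set-and-restore behaviour and the override order can be illustrated with a small sketch. It is not the code in #551: the `CONFIG` dict stands in for GDAL's CPL config store, the constructor parameters are hypothetical, and the precedence (file values, then environment, then explicit arguments) is assumed to follow boto.

```python
import os
from typing import Dict, Mapping, Optional

# Stand-in for GDAL's CPL config store; real code would call into GDAL.
CONFIG: Dict[str, str] = {}


class Session:
    """Hypothetical sketch of a session that sets and restores options."""

    def __init__(self, aws_access_key_id: Optional[str] = None,
                 profile_defaults: Optional[Mapping[str, str]] = None,
                 environ: Optional[Mapping[str, str]] = None) -> None:
        env = os.environ if environ is None else environ
        # Lowest precedence: values read from ~/.aws (passed in here).
        options = dict(profile_defaults or {})
        # Environment variables override the config file.
        if "AWS_ACCESS_KEY_ID" in env:
            options["AWS_ACCESS_KEY_ID"] = env["AWS_ACCESS_KEY_ID"]
        # Explicit constructor arguments override everything else.
        if aws_access_key_id is not None:
            options["AWS_ACCESS_KEY_ID"] = aws_access_key_id
        self.options = options
        self._saved: Dict[str, Optional[str]] = {}

    def __enter__(self) -> "Session":
        for key, value in self.options.items():
            self._saved[key] = CONFIG.get(key)
            CONFIG[key] = value
        return self

    def __exit__(self, *exc_info) -> None:
        # Put every option back to the state it had before the block.
        for key, previous in self._saved.items():
            if previous is None:
                CONFIG.pop(key, None)
            else:
                CONFIG[key] = previous
        self._saved.clear()
```

Restoring in `__exit__` means the options are put back even when the body of the `with` block raises.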

@lossyrob
Author

This is great stuff @sgillies. To be clear on the work items, this takes care of passing in config that GDAL wanted as environment variables, and from the GDAL side we need to bubble up libcurl errors and wrap them in Python errors, correct? What other changes to GDAL (or rasterio) around this feature are we imagining will need to happen in order to do this right?

@lossyrob
Author

Thanks @hobu, I'll be asking you about the PDAL code in Paris.

@sgillies
Member

@lossyrob You've got the gist of it. It will suffice for GDAL to raise curl errors in its own way. Rasterio can take care of mapping error codes to Python exceptions.

A preview of this feature on OS X can be had like this:

$ virtualenv ~/envs/rios3
$ source ~/envs/rios3/bin/activate
(rios3)$ pip install rasterio==0.32.a1 --extra-index-url=https://dd8kvqd1c0bor.cloudfront.net/gdal-dev/simple/
(rios3)$ export AWS_ACCESS_KEY_ID=...
(rios3)$ rio info s3://yourbucket/some.tif

I'm not decided on the API yet. In the code above I'm doing this:

with rasterio.drivers(), aws.Session():
    with rasterio.open('s3://yourbucket/some.tif') as src:
        ...

but might make some changes for interoperability with boto3 sessions or to enable default sessions for rasterio.open() (there aren't any, now).

@lossyrob
Author

@perrygeo @sgillies I'm wondering about the proper implementation of these errors. I could use an existing CPLError code (e.g. CPLE_OpenFailed, which translates to an IOError), using the message body to carry through more info like the HTTP response code, or I could add a specific code here https://github.com/mapbox/rasterio/blob/master/rasterio/_err.pyx#L44-L53 for things like HTTP errors, with the message being the return code.

What info would you want to be translated via CPLError?

Thanks.

@sgillies
Member

@lossyrob Bonjour!

For a start, Rasterio could get by with 4 distinct CPL errors, or one new CPL error with 4 distinct flavors. I'd like to be able to catch the 4 following kinds of failures and raise appropriate Rasterio exceptions.

Lack of Access Key Id

This seems like it will be the most common authentication error. It occurs where there are no authentication params at all. AWS returns a 403 status code and InvalidAccessKeyId in XML.

$ rio -vvv info s3://mybucket/rasterio/RGB.byte.tif
DEBUG:rasterio:Creating a chief GDALEnv in drivers()
DEBUG:GDAL:Option CPL_DEBUG=ON
DEBUG:GDAL:Option AWS_REGION=us-east-1
DEBUG:GDAL:Option AWS_ACCESS_KEY_ID=OFF
DEBUG:GDAL:Option AWS_SECRET_ACCESS_KEY=OFF
DEBUG:GDAL:Option AWS_SESSION_TOKEN=OFF
DEBUG:GDAL:CPLE_None in S3: GetFileList(/vsis3/mybucket/rasterio)
ERROR:GDAL:CPLE_AppDefined in The AWS Access Key Id you provided does not exist in our records.
DEBUG:GDAL:CPLE_None in S3: <?xml version="1.0" encoding="UTF-8"?>
<Error><Code>InvalidAccessKeyId</Code><Message>The AWS Access Key Id you provided does not exist in our records.</Message><AWSAccessKeyId>OFF</AWSAccessKeyId><RequestId>5C484CE884E62639</RequestId><HostId>ErRA6RGGCN+ns8R921Souv4HVSDWxHcRHuo3RiHUzUKUTr7fdkok0/ZMZRjIynZq</HostId></Error>
ERROR:GDAL:CPLE_AppDefined in The AWS Access Key Id you provided does not exist in our records.
DEBUG:GDAL:CPLE_None in VSICURL: GetFileSize(https://mybucket.s3.amazonaws.com/rasterio/RGB.byte.tif)=0  response_code=403
ERROR:rio:Exception caught during processing
Traceback (most recent call last):
  File "/Users/sean/code/rasterio/rasterio/rio/info.py", line 240, in info
    with rasterio.open(input, mode) as src:
  File "/Users/sean/code/rasterio/rasterio/__init__.py", line 123, in open
    s.start()
  File "rasterio/_base.pyx", line 74, in rasterio._base.DatasetReader.start (rasterio/_base.c:2615)
    with cpl_errs:
  File "rasterio/_err.pyx", line 70, in rasterio._err.GDALErrCtxManager.__exit__ (rasterio/_err.c:994)
    raise exception_map[err_no](msg)
RuntimeError: The AWS Access Key Id you provided does not exist in our records.
Aborted!

^^ I'd like to be able to raise, i.e., rasterio.errors.InvalidAccessKeyId instead of RuntimeError.

Invalid authentication tokens

The Access Key Id might be valid, but other params are not. AWS returns a 403 status code and SignatureDoesNotMatch in XML.

$ AWS_SECRET_ACCESS_KEY=BOGUS rio -vvv info s3://mybucket/rasterio/RGB.byte.tif
DEBUG:rasterio:Creating a chief GDALEnv in drivers()
DEBUG:GDAL:Option CPL_DEBUG=ON
DEBUG:GDAL:Option AWS_REGION=us-east-1
DEBUG:GDAL:Option AWS_ACCESS_KEY_ID=xxxx
DEBUG:GDAL:Option AWS_SECRET_ACCESS_KEY=BOGUS
DEBUG:GDAL:Option AWS_SESSION_TOKEN=xxxx
DEBUG:GDAL:CPLE_None in S3: GetFileList(/vsis3/mybucket/rasterio)
ERROR:GDAL:CPLE_AppDefined in The request signature we calculated does not match the signature you provided. Check your key and signing method.
DEBUG:GDAL:CPLE_None in S3: <?xml version="1.0" encoding="UTF-8"?>
<Error><Code>SignatureDoesNotMatch</Code><Message>The request signature we calculated does not match the signature you provided. Check your key and signing method.</Message>
...
</Error>
ERROR:GDAL:CPLE_AppDefined in The request signature we calculated does not match the signature you provided. Check your key and signing method.
DEBUG:GDAL:CPLE_None in VSICURL: GetFileSize(https://mybucket.s3.amazonaws.com/rasterio/RGB.byte.tif)=0  response_code=403
ERROR:rio:Exception caught during processing
Traceback (most recent call last):
  File "/Users/sean/code/rasterio/rasterio/rio/info.py", line 240, in info
    with rasterio.open(input, mode) as src:
  File "/Users/sean/code/rasterio/rasterio/__init__.py", line 123, in open
    s.start()
  File "rasterio/_base.pyx", line 74, in rasterio._base.DatasetReader.start (rasterio/_base.c:2615)
    with cpl_errs:
  File "rasterio/_err.pyx", line 70, in rasterio._err.GDALErrCtxManager.__exit__ (rasterio/_err.c:994)
    raise exception_map[err_no](msg)
RuntimeError: The request signature we calculated does not match the signature you provided. Check your key and signing method.
Aborted!

^^ I'd like to be able to raise, i.e., rasterio.errors.SignatureDoesNotMatch instead of RuntimeError.

Bucket not found

A CPL error unique to AWS 404 NoSuchBucket would be helpful.

$ rio -vvv info s3://bogusbucket/rasterio/RGB.byte.tif
DEBUG:rasterio:Creating a chief GDALEnv in drivers()
DEBUG:GDAL:Option CPL_DEBUG=ON
DEBUG:GDAL:Option AWS_REGION=us-east-1
DEBUG:GDAL:Option AWS_ACCESS_KEY_ID=xxxx
DEBUG:GDAL:Option AWS_SECRET_ACCESS_KEY=xxxx
DEBUG:GDAL:Option AWS_SESSION_TOKEN=xxxx
DEBUG:GDAL:CPLE_None in S3: GetFileList(/vsis3/bogusbucket/rasterio)
ERROR:GDAL:CPLE_AppDefined in The specified bucket does not exist
DEBUG:GDAL:CPLE_None in S3: <?xml version="1.0" encoding="UTF-8"?>
<Error><Code>NoSuchBucket</Code><Message>The specified bucket does not exist</Message><BucketName>bogusbucket</BucketName>...
</Error>
ERROR:GDAL:CPLE_AppDefined in The specified bucket does not exist
DEBUG:GDAL:CPLE_None in VSICURL: GetFileSize(https://bogusbucket.s3.amazonaws.com/rasterio/RGB.byte.tif)=0  response_code=404
ERROR:rio:Exception caught during processing
Traceback (most recent call last):
  File "/Users/sean/code/rasterio/rasterio/rio/info.py", line 240, in info
    with rasterio.open(input, mode) as src:
  File "/Users/sean/code/rasterio/rasterio/__init__.py", line 123, in open
    s.start()
  File "rasterio/_base.pyx", line 74, in rasterio._base.DatasetReader.start (rasterio/_base.c:2615)
    with cpl_errs:
  File "rasterio/_err.pyx", line 70, in rasterio._err.GDALErrCtxManager.__exit__ (rasterio/_err.c:994)
    raise exception_map[err_no](msg)
RuntimeError: The specified bucket does not exist
Aborted!

^^ I'd like to be able to raise, i.e., rasterio.errors.NoSuchBucket instead of RuntimeError.

Object not found

Finally, a CPL error unique to 404 "object not found", distinct from the one above, would be great to have.

$ rio -vvv info s3://mybucket/rasterio/xRGB.byte.tif
DEBUG:rasterio:Creating a chief GDALEnv in drivers()
DEBUG:GDAL:Option CPL_DEBUG=ON
DEBUG:GDAL:Option AWS_REGION=us-east-1
DEBUG:GDAL:Option AWS_ACCESS_KEY_ID=xxxx
DEBUG:GDAL:Option AWS_SECRET_ACCESS_KEY=xxxx
DEBUG:GDAL:Option AWS_SESSION_TOKEN=xxxx
DEBUG:GDAL:CPLE_None in S3: GetFileList(/vsis3/mybucket/rasterio)
DEBUG:GDAL:CPLE_None in VSICURL: GetFileSize(https://mybucket.s3.amazonaws.com/rasterio/xRGB.byte.tif)=0  response_code=404
DEBUG:GDAL:CPLE_None in VSICURL: GetFileSize(https://mybucket.s3.amazonaws.com/rasterio/xRGB.byte.tif.xml)=0  response_code=404
ERROR:GDAL:CPLE_OpenFailed in `/vsis3/mybucket/rasterio/xRGB.byte.tif' does not exist in the file system,
and is not recognized as a supported dataset name.

ERROR:rio:Exception caught during processing
Traceback (most recent call last):
  File "/Users/sean/code/rasterio/rasterio/rio/info.py", line 240, in info
    with rasterio.open(input, mode) as src:
  File "/Users/sean/code/rasterio/rasterio/__init__.py", line 123, in open
    s.start()
  File "rasterio/_base.pyx", line 74, in rasterio._base.DatasetReader.start (rasterio/_base.c:2615)
    with cpl_errs:
  File "rasterio/_err.pyx", line 70, in rasterio._err.GDALErrCtxManager.__exit__ (rasterio/_err.c:994)
    raise exception_map[err_no](msg)
IOError: `/vsis3/mybucket/rasterio/xRGB.byte.tif' does not exist in the file system,
and is not recognized as a supported dataset name.

Aborted!

^^ I'd like to be able to raise, i.e., rasterio.errors.NoSuchObject instead of IOError.
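
As a rough illustration of how the S3 XML bodies in the logs above could be turned into distinct exceptions, here is a sketch using only the standard library. The exception classes and the `exception_from_s3_xml` helper are hypothetical and are not part of rasterio or GDAL. `NoSuchKey` is the code S3 uses for a missing object; it does not appear in the logs above, where the missing object shows up only as a 404 response code.

```python
import xml.etree.ElementTree as ET


class S3Error(Exception):
    """Base class for the hypothetical exceptions below."""


class InvalidAccessKeyId(S3Error):
    pass


class SignatureDoesNotMatch(S3Error):
    pass


class NoSuchBucket(S3Error):
    pass


class NoSuchObject(S3Error):
    pass


# Keys are S3 <Code> values; the first three appear in the logs above.
BY_S3_CODE = {
    "InvalidAccessKeyId": InvalidAccessKeyId,
    "SignatureDoesNotMatch": SignatureDoesNotMatch,
    "NoSuchBucket": NoSuchBucket,
    "NoSuchKey": NoSuchObject,
}


def exception_from_s3_xml(body: str) -> S3Error:
    """Build an exception from an S3 <Error> document."""
    # Parse bytes so the XML encoding declaration is honoured.
    root = ET.fromstring(body.encode("utf-8"))
    code = root.findtext("Code", default="")
    message = root.findtext("Message", default="")
    # Unknown codes fall back to the base class.
    return BY_S3_CODE.get(code, S3Error)(message)
```

A caller could then catch `InvalidAccessKeyId` separately from `NoSuchBucket` and choose a different recovery for each.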

/cc @perrygeo @rouault

@lossyrob
Author

Thanks @sgillies.

Based on the mocked output, I guess these don't even need to be new codes (so no addition here: https://github.com/mapbox/rasterio/blob/master/rasterio/_err.pyx#L44-L53, or actually any change to rasterio); GDAL could just supply the appropriate error message and give CPLE_OpenFailed as the reason (which translates to an IOError). Would that suffice?

@sgillies
Member

@lossyrob no, CPLE_OpenFailed doesn't quite suffice for me. The first three in my list are different errors in my view and have different recovery modes (authenticate or re-authenticate, revise bucket name, etc). In https://github.com/OSGeo/gdal/blob/trunk/gdal/port/cpl_error.h#L63-L65 it looks like the door is open for new error codes, though the project might want to have an RFC on adding them? /cc @rouault

@lossyrob
Author

@sgillies Ok, that was the original plan, I misunderstood your post. Something like

exception_map = {
    1: RuntimeError,             # CPLE_AppDefined
    2: MemoryError,              # CPLE_OutOfMemory
    3: IOError,                  # CPLE_FileIO
    4: IOError,                  # CPLE_OpenFailed
    5: TypeError,                # CPLE_IllegalArg
    6: ValueError,               # CPLE_NotSupported
    7: AssertionError,           # CPLE_AssertionFailed
    8: IOError,                  # CPLE_NoWriteAccess
    9: KeyboardInterrupt,        # CPLE_UserInterrupt
    10: ValueError,              # CPLE_ObjectNull
    21: InvalidAccessKeyId,      # CPLE_AWSInvalidAccessKeyId
    22: SignatureDoesNotMatch,   # CPLE_AWSSignatureDoesNotMatch
    23: NoSuchBucket,            # CPLE_AWSBucketNotFound
    24: AwsObjectNotFoundError,  # CPLE_AWSObjectNotFound
    25: NoSuchObject,            # CPLE_AWSNoSuchObject
    26: HttpError,               # CPLE_HttpResponse
}

Not sure about the grouping or the names, so any suggestions are welcome; I can talk with Evan about it more tomorrow.
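
For illustration only, here is a sketch of how a map like this could be consulted and how the new exceptions might be grouped. The class hierarchy and the `exception_for` helper are hypothetical, and the numeric codes are the placeholders proposed in this comment, which may differ from whatever GDAL ends up defining.

```python
class HttpError(IOError):
    """Hypothetical exception for a failed HTTP response."""


class NoSuchBucket(HttpError):
    """Hypothetical exception for a missing S3 bucket."""


# A subset of the proposed map; 23 and 26 are placeholder codes.
EXCEPTION_MAP = {
    1: RuntimeError,   # CPLE_AppDefined
    4: IOError,        # CPLE_OpenFailed
    23: NoSuchBucket,  # proposed CPLE_AWSBucketNotFound
    26: HttpError,     # proposed CPLE_HttpResponse
}


def exception_for(err_no: int, msg: str) -> Exception:
    """Build the exception for a CPL error number.

    Unknown numbers fall back to RuntimeError.
    """
    return EXCEPTION_MAP.get(err_no, RuntimeError)(msg)


# A caller can pick a recovery based on the exception type.
try:
    raise exception_for(23, "The specified bucket does not exist")
except NoSuchBucket:
    recovery = "revise bucket name"
```

Deriving the new classes from IOError would keep existing `except IOError` handlers working while letting newer code catch the specific cases.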

@sgillies
Member

@lossyrob that looks great 👍

@lossyrob
Author

The GDAL work is completed and patched into trunk via OSGeo/gdal#98

@sgillies
Member

🙇 @lossyrob. I'm just back from vacation and happy to say that the latest GDAL compiles and works fine. I'm going to try the new error codes this week.

@lossyrob
Author

@sgillies I added some handling of the new error types in #591; don't have the time currently to test it, feel free to swipe the code if you find it useful or close out if not.

@sgillies
Member

This feature is in the master branch and will be in 0.33.

@sgillies
Member

sgillies commented Oct 1, 2020

Related: OSGeo/gdal#2991.

@rouault
Contributor

rouault commented Oct 1, 2020

Related: OSGeo/gdal#2991.

Not really. This will not help /vsicurl and friends, which don't use CPLHTTPFetch() but call curl directly.

@sgillies
Member

sgillies commented Oct 1, 2020

Thanks for the clarification @rouault !
