5. Cloud Native
● Network is between storage and computation
● Storage is chunked and probably Key/Value
● Computation must scale
● Coordination is limited and risky
7. Cloud Optimized GeoTiff
● Defines an internal structure of GeoTiffs that is optimized for streaming reads from blob storage like S3 or Google Cloud Storage.
● That is, it is best utilized by requests that use HTTP GET Range requests
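A range read is nothing more than a `Range` header on an ordinary GET. A minimal sketch; the 16 KiB figure is an illustrative guess at how much of the file front a reader might grab first, not something the spec requires:

```python
def range_header(offset, length):
    """Build an HTTP Range header asking for bytes [offset, offset+length)."""
    return {"Range": f"bytes={offset}-{offset + length - 1}"}

# A COG reader's typical first move: fetch the front of the file,
# where the TIFF header and IFDs usually live.
first_read = range_header(0, 16384)  # {"Range": "bytes=0-16383"}
```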
10. GeoTiff
“A GeoTIFF file is a TIFF 6.0 file, and inherits the file structure as described in the corresponding portion of the TIFF spec. All GeoTIFF specific information is encoded in several additional reserved TIFF tags, and contains no private Image File Directories (IFD's), binary structures or other private information invisible to standard TIFF readers.”
http://geotiff.maptools.org/spec/geotiff2.4.html#2.4
12. TIFF
● Tagged Image File Format
● Binary image is separated out into “segments”
● Contains metadata in a sequence of “tags”
● Can contain metadata for multiple images in different “image file directories” (IFDs)
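The TIFF file structure above starts with a fixed 8-byte header: a byte-order mark, the magic number 42, and the offset of the first IFD. A minimal sketch of decoding it:

```python
import struct

def parse_tiff_header(buf):
    """Parse the 8-byte TIFF header: byte order, magic 42, first IFD offset."""
    order = {b"II": "<", b"MM": ">"}[buf[:2]]  # little- vs big-endian
    magic, ifd_offset = struct.unpack(order + "HI", buf[2:8])
    assert magic == 42, "not a TIFF file"
    return order, ifd_offset

# A minimal little-endian header whose first IFD starts at byte 8
order, off = parse_tiff_header(b"II\x2a\x00\x08\x00\x00\x00")
```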
13. GeoTiff Custom Tags
● GeoTiff adds GeoKeys via custom tags
○ non-TIFF keys for CRS, map space, etc.
● This is how we know where the pixels are on the surface of the earth (the geospatial bit).
http://geotiff.maptools.org/spec/geotiff6.html#6.1
14. GeoTiff Streaming Read
● Step 1: Fetch the header
● Step 2: Decide what segments you want
● Step 3: Read the segments
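The three steps above can be sketched against any byte-range reader. Both `read_range` and `parse_offsets` are hypothetical stand-ins here, for an HTTP GET with a Range header and for decoding the IFD's segment offset/byte-count tags respectively:

```python
def streaming_read(read_range, parse_offsets, wanted):
    """Three-step streaming read of selected segments from a remote TIFF."""
    header = read_range(0, 16384)                         # Step 1: fetch the header
    segments = parse_offsets(header)                      # Step 2: decide what you want
    return {i: read_range(*segments[i]) for i in wanted}  # Step 3: read the segments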
20. TIFF Segments
● Chunks of an image are stored at different locations. This allows parts of the image to be decompressed and loaded separately.
● There are two methods for segment layout: striped and tiled
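The two layouts index segments differently. An illustrative sketch of mapping a pixel to its segment number under each layout (not library code, just the arithmetic):

```python
def strip_index(row, rows_per_strip):
    """Strips span the full image width, so the segment index depends on row only."""
    return row // rows_per_strip

def tile_index(col, row, image_width, tile_w, tile_h):
    """Tiles form a grid; segments are numbered row-major across that grid."""
    tiles_across = -(-image_width // tile_w)  # ceiling division
    return (row // tile_h) * tiles_across + (col // tile_w)
```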
43. GeoTrellis Avro Layers
● Many small files, one per tile
○ PUT requests get expensive
● Data duplication hazard
○ Probably not authoritative
● Limited access
○ Requires GeoTrellis code to read
45. GeoTrellis Structured COG
● Base zoom tiles are in base GeoTiff layer
● Introduces “DATE_TIME” tag
● Includes VRT for layer files
● GeoTiff segments match tile size and layout
● Additional zoom levels stored in overviews
● Pyramid up when overviews “run out”
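Why overviews “run out”: each overview level halves the tile grid, so a layer only supports so many levels before it collapses to a single tile and the tops of the pyramids must be collected and reshuffled. Illustrative arithmetic only, not GeoTrellis's actual implementation:

```python
def overview_levels(layer_tiles_across, stop_at=1):
    """Count overview levels until the whole layer fits in a single tile,
    the point where pyramiding must reshuffle tiles into higher levels."""
    levels = 0
    n = layer_tiles_across
    while n > stop_at:
        n = -(-n // 2)  # ceiling halve the tile grid
        levels += 1
    return levels
```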
49. GeoTrellis Unstructured COG
● Lots of good raster sources are already COG
● They don’t have a strict layout though
● Let's “inline” the ingest process into the batch job
● Slower but reduces data duplication
51. GeoTrellis Unstructured COG
● Requires mechanism to query for raster names
○ WFS, STAC, PostgreSQL, Scan
● COG provides “gradual” wins over GeoTiff
● Reproject on read to match workflow
52. Coregistration Workflow
● Working with multiple raster layers with differing:
○ CRS, Resolution, Extent
● Must pick the working/target layout
● We can “push-down” this process into IO
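Picking the working/target layout can be sketched as intersecting the extents and taking the finest resolution. A simplified model: each layer is an `(xmin, ymin, xmax, ymax, resolution)` tuple already in a common CRS (in practice the reproject happens on read, as pushed-down IO):

```python
def target_layout(layers):
    """Pick a working grid: intersect the extents, take the finest resolution.
    Each layer is (xmin, ymin, xmax, ymax, resolution) in a shared CRS."""
    xmin = max(l[0] for l in layers)
    ymin = max(l[1] for l in layers)
    xmax = min(l[2] for l in layers)
    ymax = min(l[3] for l in layers)
    res = min(l[4] for l in layers)
    return (xmin, ymin, xmax, ymax, res)
```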
55. COGs for GeoTrellis
● Treating GeoTiff as a Key/Value tile store
○ With a built-in notion of level of detail
● Turns an ingest workflow into query workflow
○ Use the metadata to plan the read
● Opens up the results
○ QGIS, GeoServer, GDAL
56. COGs in General
● GeoJSON of raster formats
● Gradated spec
● Set of best practices
● Driven by implementations
This is the original formalization of the idea by Even Rouault; it's authoritative.
Includes benchmarks using GDAL showing the difference in range read speeds.
This site, managed by Chris Holmes, is a better overview and, critically, includes links to implementations and tools relevant to COG.
TIFF header.
Arrows are offsets in the TIFF file
No order is implied
Header is minimal and does not have image information; the IFDs have image information
All of these are valid structures for TIFF.
Left, easy to read, all the headers are at the top. (IFDs should be seen as headers)
Center, looks good but not clear what the advantage of this organization is.
Right, easy to write if you don’t know when the image stream will end.
Only the left organization is good for us because we get all the relevant header information at the top.
COG spec is not firm about compression; Deflate is common but slow, LZW is probably the most conservative choice.
In any case the choice must be something that is understood by GDAL across all installs to preserve interoperability.
In reality strips can even be one or two pixels high for large files.
Fetching larger AOI will require a lot of IO time and network traffic
Using overviews when downsampled information is sufficient (it very often is) saves the IO.
Provenance is a huge problem; the temptation is to use TIFF tags to keep track, but no standard exists yet.
Histograms are super useful if you ever intend to view the file
A file naming convention seems tempting, but it is impossible to enforce and thus to rely on
Catalogs are the only other option
This COG profile optimizes the IO to 1 request = 1 segment
No reason to require this for all COGs
Conclusion is that there is room to optimize COG layout for each system or application
STAC static files provide a reliable way to serve an index, described using GeoJSON files
REST and WFS 3.0 services on top of the static index provide an integration point for applications
You can play games with the names of the GeoTIFF files to come up with a scheme for finding the file we want using only request information: ID and geospatial bbox. You should consider sharding the scheme to avoid hotspotting your blob store like S3 (see S3 best practices). People from the GeoWave/GeoMesa/GeoTrellis projects will have a lot of lessons about building static indexes like this that they would be happy to share. There are also reusable components available from those projects.
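One common sharding trick is to prefix each object key with a short hash so writes spread across blob-store partitions instead of piling onto one lexicographic prefix. The `layer/zoom/x/y` naming below is a hypothetical convention, not a standard:

```python
import hashlib

def shard_key(layer, zoom, x, y, prefix_len=2):
    """Derive a sharded object key: a short, stable hash prefix followed by
    a human-readable name. The prefix distributes keys across partitions."""
    name = f"{layer}/{zoom}/{x}/{y}.tif"
    prefix = hashlib.md5(name.encode()).hexdigest()[:prefix_len]
    return f"{prefix}/{name}"
```

The prefix is deterministic, so the same request information always reproduces the same key.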
RasterFoundry is the raster management and manipulation system built on top of GeoTrellis.
Level 0 support for COGs is serving tiles. RF does this using GeoTrellis.
Tile request from RasterFoundry.
Catalog is application specific
COG Header is hot information, must be cached first
Fetching COG segments may be too slow for quick screen refreshes
An alternative view of the tile request. The highlight here is that we're reprojecting the COG on the fly per tile request.
GeoTrellis layers are the constructs that represent a distributed raster collection that we can work over in cluster memory.
Avro layers work really well on sorted stores like Accumulo, but on S3 we end up with lots of small files.
COG becomes our meta tile; it's great.
As a bonus the layers are now accessible from non GeoTrellis applications.
This has potential to drastically reduce the data duplication.
Note we had to add the DATE_TIME tag to keep track of the actual time.
VRT files tie them together and make them accessible.
There is only so far you can resample a set of tiles before you don't have enough information left
At that point you need to collect the tops of the pyramid and reshuffle them into higher levels
GeoTrellis handles this process when it writes out pyramided COG layers.
Snapshot from OSM / RasterFrames / GeoTrellis machine learning project
Best part is that now every layer written by GeoTrellis is accessible through QGIS
This is a workflow where we combine source data to produce a GeoTrellis layer and a published OSM layer
Thinking about COGs as mini-image databases
We can put tile reading process in front of a distributed batch job.
Instead of running it per request we run it per record.
This effectively inlines the ingest process into the batch job itself.
The same concerns we talked about for making viewable COG catalogs apply to making distributed imagery pipelines sourcing COGs.
Often we need more than one layer to do something interesting
Fetch the metadata from the catalog to find what we want
Fetch the metadata from the COGs to find out what it is
Filter the potential IO using the COG headers
Fetch the segments as tiles we know we need in a way we want
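The four steps above can be sketched as a metadata-driven read plan. Every name here is hypothetical: `catalog_query` stands in for a STAC/WFS/PostgreSQL lookup, `read_header` for a range read of a COG's front:

```python
def intersects(a, b):
    """Axis-aligned extent overlap test; extents are (xmin, ymin, xmax, ymax)."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def plan_reads(catalog_query, read_header, aoi):
    """Build a read plan: the catalog narrows candidate rasters, then each
    COG's own header narrows which segments are worth fetching."""
    plans = []
    for raster in catalog_query(aoi):                 # catalog metadata
        header = read_header(raster)                  # COG metadata
        segments = [s for s in header["segments"]
                    if intersects(s["extent"], aoi)]  # filter the potential IO
        plans.append((raster, segments))
    return plans
```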
Proof is in the pudding
This GeoTrellis Landsat tiler coregisters scenes from multiple CRSs
The responsiveness here suggests that the cost in a batch process is actually minimal.