Structure from motion: tips and tricks

While 3D modelling is neither the only nor the most important technique that I use in my work, I do spend some time playing, exploring and using structure-from-motion to create 3D models and orthophotos of objects, sites and landscapes. Occasionally someone asks me either general or specific questions about photo acquisition, processing etc. That’s why I am sharing my experiences here. This is not a manual but rather a collection of tips and tricks based on my own experiences as well as on discussions with colleagues and in the Agisoft user forum.

Things you can do with Structure-from-Motion

  • create 3D point clouds describing an object
  • create meshed 3D models
  • create digital surface models
  • create digital terrain models
  • create orthophotos

What you need

  • a set of overlapping digital photographs taken from different positions relative to the object
  • SfM software (I mostly use Agisoft Photoscan [This is not advertising – I do not get anything from Agisoft for mentioning their software],  but there is also free software like VisualSfM that can be used.)

Camera choice

  • Any camera can produce 3D models ranging from barely usable to very good. In many cases (and up to a certain limit), illumination, image acquisition and processing details have a stronger influence on the result than the type of camera used.
  • Fixed focal length, fixed focus cameras without any movement of lens elements relative to one another or relative to the sensor will allow grouped camera calibration which will produce better and more reliable results.
  • If lens elements can be expected to move between images (even very slightly) relative to one another or relative to the sensor (zoom, focus, image stabilisation), separate camera calibration will often produce better results.
  • Many cell phone and video cameras are subject to the rolling shutter problem (sensor lines are read sequentially, producing non-radial distortion which cannot be dealt with by the algorithms implemented in Photoscan). Image alignment may fail or be very poor if the camera was moved relative to the object, even though the images may not appear blurred. Such cameras are usable, but even more care should be taken to avoid movement.
  • For small objects, cameras with a physically small sensor are often better suited because depth of field will be wider. Narrow depth of field can be very problematic for very small objects. Focus stacking is an option but requires greatly increased time and effort.
  • Avoid extreme wide angle lenses. Camera calibration will not be very good for these, and the very large view angle changes between images will make it harder for the algorithm to find matching points.
  • Scanned paper photographs, negatives and slides: While this type of imagery is also useable for SfM, it can be very problematic. There are at least three reasons for this: (i) Many (low-quality) scanners introduce non-radial distortions somewhat similar to the rolling shutter problem. This is because the scan line is often not perfectly perpendicular to the direction of its movement across the image. An additional source of distortion can be warping of the images, negatives or slides. (ii) In most cases, the scan does not cover the entire image frame, i.e. the edges are cut off to an unknown and often unequal extent. This means that the centre of the digital image may be quite far from the principal point of the camera, and this is something that results in poor camera calibration. (iii) Many analog images suffer from grain, dust or scratches which can result in poor 3D modelling results due to spurious matching points and noisy depth maps.
  • Scanned aerial photographs taken with a professional camera (including fiducial markers): These can be very good and may often be the only available source, but they can suffer the same problems as other paper/negative photographs, in particular if they were digitised using a low-quality scanner. However, if all fiducial markers are visible in the images, the distortions due to scanning can (in principle) be removed to a large extent.
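
To make the depth-of-field point above concrete, here is a minimal sketch using the standard thin-lens depth-of-field approximation. All parameter values below (focal lengths, f-number, circles of confusion) are illustrative assumptions, not the specifications of any particular camera:

```python
import math

def depth_of_field(f_mm, f_number, distance_mm, coc_mm):
    """Approximate near/far limits of acceptable sharpness.

    f_mm: actual focal length, distance_mm: focus distance,
    coc_mm: circle of confusion (roughly the pixel pitch; it
    shrinks together with the sensor).
    """
    hyperfocal = f_mm ** 2 / (f_number * coc_mm) + f_mm
    near = distance_mm * (hyperfocal - f_mm) / (hyperfocal + distance_mm - 2 * f_mm)
    if distance_mm < hyperfocal:
        far = distance_mm * (hyperfocal - f_mm) / (hyperfocal - distance_mm)
    else:
        far = math.inf
    return near, far

# Same field of view, f-number and subject distance (500 mm), but a
# small sensor (f = 6 mm, CoC ~ 0.002 mm) vs. a sensor three times
# larger (f = 18 mm, CoC ~ 0.006 mm): the small sensor wins on DoF.
small = depth_of_field(6.0, 4.0, 500.0, 0.002)
large = depth_of_field(18.0, 4.0, 500.0, 0.006)
```

With these (assumed) numbers, the small-sensor setup yields roughly three times the depth of field, which is why narrow depth of field bites hardest with larger sensors at close range.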

Illumination

  • Agisoft Photoscan is able to produce decent 3D models even if images with very different illumination are used in a single project. However, camera alignment and 3D reconstruction will usually be better if changes in illumination are minimised.
  • The entire 3D reconstruction process relies on image texture. If possible, objects should be illuminated so as to enhance image texture. Textureless surfaces cannot be modelled. Contrast stretching (applying the same contrast stretch to all images) can help if texture is poor. Projecting a point pattern or an image onto the surface may be an option in some cases.
  • Avoid hard shadows, because 3D reconstruction in the shadowed areas may be poor.
  • Avoid built-in or direct flash, because this will result in very different illumination for every image and also hard shadows. If you need to use a flash, use tripods to set up one or several remotely triggered flashes with soft boxes, then take your images with this stable illumination. Of course, the same goes for other lights: soften the illumination and keep it as constant as possible between images. Don’t forget that you and your camera may cast undesirable shadows.
  • Using a lens-mounted circular LED illumination has been reported to work well because it creates a relatively even illumination of the field of view.
  • Glossy surfaces (water, wet surfaces, metal etc.) are problematic. Eliminating or reducing specular reflections / gloss will be very beneficial for camera calibration, alignment and 3D reconstruction. This can be done by allowing surfaces to dry before image capture, modifying illumination to minimise gloss and applying a polarising filter.

Image capture

  • Capture sufficient images with an overlap of at least 67% (i.e. every single point of the object is seen in at least 3 images). Often, an overlap of 80-90% will be a good choice. It does not cost much to take more pictures, but if you take too few you may never be able to go back and take more. If using a wide angle lens (often in interior settings), increase overlap to above 90%, because view angle changes from image to image will be very large, which will make it difficult for the algorithm to find matching points.
  • Avoid “corridor mapping” (i.e. a single or very few long parallel strips) because in such cases small errors tend to accumulate; this can lead to an overall warped model. When working with the pro version of Photoscan, using camera GPS positions as control points in the image alignment can reduce this problem.
  • When capturing an object, capture images all the way around it even if you are ultimately not interested in all sides. Closing the circle can greatly improve model quality as it reduces warping. For example, terrestrial image acquisition of the façade of a long building or wall will be equivalent to a “corridor mapping” approach as only one or very few parallel strips of images are acquired. Taking images all around the building will improve model accuracy.
  • Re-orienting the camera relative to object (i.e. not only capturing parallel strips but adding several additional strips perpendicular to the others) usually improves camera calibration and image alignment.
  • Shallow view angles can result in poor (or failed) camera alignment. The view angle should usually be between 45° and 90°. View angle change from image to image should be low (less than 45°). If views from very different directions are required, adding images at intermediate view angles will greatly improve camera alignment. The same goes for strips: If several strips of images are acquired under different view angles, view angle changes between strips should be low (less than 45°). Adding an additional strip in-between will often greatly improve camera alignment. Generally, taking pictures from an elevated position allows steeper view angles. While flying is expensive, drones, kites and poles can be options for getting elevated viewing positions.
  • Edges: If the object to be captured has relatively sharp edges (c. 90° or less), use higher overlap to make sure that the edges will be well-represented in the model.
  • Background masking: In principle, Photoscan is able to model both background and foreground, and then you can simply create your model from the object in the foreground and ignore those parts of the model that are background. However, the background can be problematic. One reason is that there may be movement in the background (people walking past the scene, clouds drifting across the sky etc.) which can have a negative effect on camera calibration and alignment. Another reason is that parts of the background may accidentally become part of the model you want to capture. Background masking therefore improves modelling results. Furthermore, by reducing the image areas that the software has to deal with, processing will also be faster. Using a blue (or any other colour not present in the object) screen can make background masking much easier and faster. A computer monitor has the advantage that there will be (almost) no shadow on the background.
  • Full body capture: The minimum number of photographs to properly capture an object from all directions is larger than one would expect when only thinking about the overlap constraint. Because the view incidence angle on many parts of the object is very shallow (even more so if a wider angle lens is used at close distance), more images are necessary to get good results. As a first approximation, this means that you need to acquire one or several roughly parallel circles (c. 16-20 images each) around the centre of the object plus several more to properly capture top and bottom. Background masking becomes particularly important for full body capture, and becomes indispensable if the object is rotated (rather than the camera moved around the object).
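
The overlap rule of thumb above can be turned into simple arithmetic. The sketch below (function names are mine) converts an overlap fraction into camera spacing and into the number of consecutive images that see the same point:

```python
def camera_spacing(footprint_m, overlap):
    """Distance to move the camera between exposures so that
    consecutive images overlap by the given fraction of the
    image footprint (measured along the direction of movement)."""
    return footprint_m * (1.0 - overlap)

def views_per_point(overlap):
    """Approximate number of consecutive images in a strip that
    see the same object point: 1 / (1 - overlap)."""
    return 1.0 / (1.0 - overlap)

# 67% overlap -> every point is seen in about 3 consecutive images;
# 90% overlap raises that to about 10.
```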

Rotation is the enemy

Last week I published a simple tool that calculates (among a few other things) motion blur resulting from camera movement relative to the photographed object. Looking at the results of those calculations, one could say that motion blur is a very minor issue in UAV photography: at a platform speed of 30 km/h and a shutter speed of 1/1000 s, motion blur is as low as 0.8 cm. Flying a Canon G12 at the wide angle limit (28 mm) and 200 m above ground, this amounts to only 0.25 image pixels. Judging from the calculation results of UAVphoto, motion blur does not appear to be a relevant issue. The need to take images at short intervals to achieve sufficient overlap appears to be much more important when using a UAV. But why do I even get blurred images when using a kite that is almost immobile relative to the ground?
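
The translation-blur figure quoted above is straightforward to reproduce. In the sketch below, the focal length (6.1 mm actual, i.e. 28 mm equivalent) and the 2 µm pixel pitch are my assumptions for a G12-class sensor, so the pixel value comes out in the same ballpark as, but not identical to, the number quoted:

```python
def motion_blur(speed_m_s, shutter_s, altitude_m, focal_mm, pixel_pitch_um):
    """Ground-projected blur from straight-line platform motion,
    assuming a vertical (nadir) camera view.

    Returns (blur on the ground in metres, blur in image pixels).
    """
    blur_ground_m = speed_m_s * shutter_s
    # Ground sampling distance: size of one pixel on the ground.
    gsd_m = altitude_m * (pixel_pitch_um * 1e-6) / (focal_mm * 1e-3)
    return blur_ground_m, blur_ground_m / gsd_m

# 30 km/h at 1/1000 s from 200 m above ground:
ground_m, blur_px = motion_blur(30 / 3.6, 1 / 1000, 200.0, 6.1, 2.0)
# ground_m is ~0.008 m (0.8 cm); blur_px stays well below half a pixel
```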

The point is that motion blur due to translation (i.e. linear movement of the camera relative to the object) is only one reason for blurred images. Another (and much more relevant) reason is rotation of the camera. Unfortunately, this is also much harder to measure and to control. To show how important rotation is for image blur, I have added the calculation of rotation blur to the new version of UAVphoto. Two types of rotation have to be distinguished: rotation about the lens axis and rotation about the focal point but perpendicular to the lens axis. I am not using the terms pitch, roll and yaw here because the relation of platform pitch, roll and yaw to rotation about different camera axes depends on how the camera is mounted to the platform.

Rotation about the lens axis results in rotation blur that is zero at the image centre and reaches a maximum at the image corners. Rotation about an axis orthogonal to the lens axis results in rotation blur that is at first sight indistinguishable from motion blur due to high-speed translation. Of course, all types of blur combine into the total image blur. Rotation blur about the lens axis is independent of focal length. Orthogonal rotation blur, on the other hand, increases with increasing focal length. In both cases, a faster shutter speed will result in a proportional decrease in image blur.

Most UAV rotation movements are due to short-term deflections by wind gusts or steering. Wind gusts are also the main source of rotation movements of kite-carried cameras. Let’s say we’re using a Canon G12 at the wide angle limit (28 mm). The maximum rotation rate which will not result in image blur (using a 0.5 pixel threshold) is 12.4 °/s (or 29 s for a full circle) for rotation about the lens axis and 8.1 °/s (or 44 s for a full circle) for rotation orthogonal to the lens axis. At a focal length of 140 mm, the maximum rotation rate orthogonal to the lens axis is only 1.9 °/s (or 189 s for a full circle). If all this sounds very slow to you, you’ve got the point: even slow rotation of the camera during image capture is a serious issue for UAV photography, in most cases much more important than flying speed.
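
These rotation limits follow from two short formulas, sketched below. The sensor geometry (3648 × 2736 pixels, ~2 µm pixel pitch) and the actual focal lengths (6.1 mm and 30.5 mm for the 28 mm and 140 mm equivalents) are my assumptions, so the results land close to, but not exactly on, the figures quoted above:

```python
import math

def max_rotation_rates(focal_mm, pixel_pitch_um, shutter_s,
                       sensor_px=(3648, 2736), blur_limit_px=0.5):
    """Maximum rotation rates (in deg/s) keeping rotation blur below
    blur_limit_px during one exposure.

    Returns (rate about the lens axis, rate orthogonal to it).
    """
    # About the lens axis: blur is worst at the image corners and
    # does not depend on focal length.
    corner_px = math.hypot(sensor_px[0] / 2, sensor_px[1] / 2)
    axial = blur_limit_px / corner_px / shutter_s                  # rad/s
    # Orthogonal to the lens axis: blur grows with focal length.
    orthogonal = (blur_limit_px * pixel_pitch_um * 1e-6
                  / (focal_mm * 1e-3) / shutter_s)                 # rad/s
    return math.degrees(axial), math.degrees(orthogonal)

axial_wide, orth_wide = max_rotation_rates(6.1, 2.0, 1 / 1000)
axial_tele, orth_tele = max_rotation_rates(30.5, 2.0, 1 / 1000)
```

The independence of the axial rate from focal length, and the sharp drop of the orthogonal rate at the long end, match the behaviour described above.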

UAVphoto – a simple calculation tool for aerial photography

I have to admit that I am sometimes a bit lazy. Rather than solving a problem once and working with the solution, in some cases I keep twiddling with the same problem again and again. Calculating things like viewing angles, ground resolution, motion blur or image overlap for aerial photography is a case in point. There must be a dozen or so spreadsheet files on my various computers which I used to do such calculations. I kept re-inventing the wheel again and again for myself and when others asked me for help.


Now I finally got around to writing a small piece of software for this specific task. It is a simple tool that calculates parameters like ground pixel size, motion blur and sequential image overlap from UAV flight parameters (velocity and altitude) and camera parameters (focal length, shutter time, image interval etc.). The calculation assumes a vertical camera view for simplicity. Image y dimensions are those in flight direction, image x dimensions are those perpendicular to flight direction. Default camera values are for the Canon G12 at the wide angle limit. Five to six seconds is the approximate minimum image interval using a CHDK interval script. In continuous shooting mode, a minimum interval of approximately one second can be achieved.
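
The sequential-overlap part of that calculation can be sketched in a few lines (the sensor dimension below, 5.6 mm in flight direction, is my assumption for a G12-class camera):

```python
def sequential_overlap(speed_m_s, interval_s, altitude_m,
                       focal_mm, sensor_y_mm):
    """Forward overlap (fraction 0..1) between consecutive images,
    assuming a vertical camera view. sensor_y_mm is the sensor
    dimension in flight direction."""
    footprint_y_m = altitude_m * sensor_y_mm / focal_mm  # ground footprint
    advance_m = speed_m_s * interval_s                   # distance flown
    return max(0.0, 1.0 - advance_m / footprint_y_m)

# 30 km/h, one image every 5 s, 200 m above ground, wide angle:
overlap = sequential_overlap(30 / 3.6, 5.0, 200.0, 6.1, 5.6)
```

With the roughly one-second interval of continuous shooting mode, the same flight parameters give well above 90% forward overlap.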

Now that I created this tool, why not share it? UAVphoto is published under the GNU General Public License and can be downloaded from Sourceforge.

SfM mapping of the giant walls of the Inca fortress Sacsayhuamán, Cuzco, Peru

Structure from Motion (SfM) is a widely used method for creating 3D models of almost anything from small objects to entire landscapes. For landscapes or large structures such as prehistoric or historic fortifications, it would be preferable to have aerial photographs, perhaps in combination with terrestrial photographs, to capture the entire feature. Often this is not possible, and all that is available are terrestrial photographs taken from the limited accessible areas, which rarely cover all parts of the structure.

This is the situation I faced when visiting the Inca fortress Sacsayhuamán high above the city of Cuzco. As a tourist, I could only access parts of the fortress, and there was no possibility to take pictures from above. Of course it is possible to use SfM to create 3D models of the giant walls, but any attempt to create a 3D model of the entire fortress would result in gaping holes.

The defensive walls of the Inca fortress Sacsayhuamán.

Besides admiring the famous Inca stonework, I wanted to use my visit to Sacsayhuamán to test the possibilities of SfM. Nothing fancy, but I wanted to see how far I could get by simply taking many photographs as I walked along the walls. Knowing that I would not be able to capture the entire fortress, I wanted to see how well SfM would be suited for creating a map of the giant walls. I had two GPS loggers in my backpack that recorded my position at one second intervals.

Satellite image of Sacsayhuamán showing the GPS positions of the photographs. Satellite image (c) Google Earth.

In the end I fed 171 photographs into SfM software (Agisoft Photoscan). Photoscan was able to automatically align 168 of these images and to create 163 depth maps at the highest quality settings. Out of this came a dense 3D point cloud containing 593 million points. I used Java Graticule 3D to compute transformation parameters (using modelled camera positions and GPS camera positions) that allowed me to georeference the point cloud. Using these transformation parameters, I rotated the point cloud and created gridded data sets of elevation (i.e., a DSM), vertex colour (i.e., an orthophoto) and the number of vertices per grid cell.

Of course, the DSM and the orthophoto have large holes because many parts of the fortress were not visible on any of my photographs. Because all of my photographs show parts of the giant walls, most of the vertices in the point cloud belong to the walls. The point density map therefore clearly shows the outlines of the walls.
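
The "number of vertices per grid cell" raster is essentially a 2D histogram of the point cloud. A minimal sketch (the cell indexing scheme is my choice):

```python
from collections import Counter

def point_density_grid(points_xy, cell_size):
    """Count points per grid cell, as in the point density map:
    each (x, y) point is binned into the cell that contains it."""
    counts = Counter()
    for x, y in points_xy:
        counts[(int(x // cell_size), int(y // cell_size))] += 1
    return counts

grid = point_density_grid([(0.2, 0.3), (0.4, 0.1), (1.5, 0.2)], 1.0)
# cell (0, 0) contains two points, cell (1, 0) contains one
```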

SfM-based point density map showing the outlines of the defensive walls of the Inca fortress Sacsayhuamán.

When loading this map into Google Earth, the outlines of the walls in the point density map matched the satellite image very well. So all of this worked perfectly? There is a “but”: everything worked perfectly when using the position data from one of my GPS loggers (i-blue 747). When using the data from the other GPS logger (Polaris), the modelled outlines of the walls did not match the Google Earth satellite images. I had realised earlier that the computation of the transformation parameters with Java Graticule 3D had resulted in much larger standard deviations for the Polaris (more than 5 metres) than for the i-blue 747 (0.25 metres). Comparing the two GPS tracks shows that the Polaris logger (which usually works well) had problems recording the correct track (walking diagonally across the walls is impossible!). In this case, checking the GPS tracks for plausibility helped to avoid problems. In areas where the plausibility of a track cannot be verified, standard GPS data can be very problematic. Looking at the standard deviations reported by Java Graticule 3D helps to see whether there may be a problem.

Comparison of the two GPS logger tracks (yellow: i-blue 747, red: Polaris). Only the positions from the i-blue 747 track resulted in a correctly georeferenced model and point density map. Satellite image (c) Google Earth.

3D acquisition techniques: Structure from Motion

Structure from Motion (SfM), usually combined with Multi-View Stereo (MVS), is a technique for creating three-dimensional digital models from a set of photographs. Fundamentally, it is a stereo vision approach, that is, it uses parallax (the difference in the apparent relative position of an object when seen from different directions) to derive 3D information (distance/depth) from 2D images. (Look at something that has foreground and background, e.g. a plant on your window sill and the house on the other side of the street, and alternate between your right and left eye – the shift you see is the parallax.)
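
The plant-and-house experiment corresponds to the classic stereo relation: depth is inversely proportional to parallax. A minimal sketch (the numbers are illustrative, with a 6.5 cm baseline standing in for the distance between the eyes):

```python
def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Pinhole stereo relation: depth = focal * baseline / disparity.
    focal_px is the focal length expressed in pixels; disparity_px
    is the apparent shift of the object between the two views."""
    return focal_px * baseline_m / disparity_px

# The nearby plant shifts a lot between the two views, the distant
# house across the street only a little:
plant_m = depth_from_disparity(3000.0, 0.065, 400.0)  # large parallax
house_m = depth_from_disparity(3000.0, 0.065, 10.0)   # small parallax
```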

What you need for SfM is several (2, 3, … thousands of) photographs capturing an object from many different directions. There are by now several software products that can perform SfM 3D reconstruction; VisualSfM (open source) and Agisoft Photoscan ($179 standard, $3499 professional) appear to be among the more popular solutions. There are several more, and there are quality, speed and usability differences, but essentially they all do the same thing.

A series of photographs of a rock in the desert.

The software will first analyse the photographs to find many (usually several thousand) characteristic feature points within each image, using SIFT (Scale-Invariant Feature Transform) or similar algorithms. The feature points are defined by analysing the surrounding pixels which will help to recognise similar feature points in other images. Algorithms like SIFT (rather than just trying to directly match small regions among images, e.g. by correlation) ensure that this works irrespective of scale or rotation.

The next step is to match these feature points among images to find corresponding points. Some feature points will not be found in any other image – no matter, there are many more. Others will be mismatched, so outliers have to be recognised and excluded. Then comes the actual heart of the SfM approach: bundle adjustment. The theory for this has been around for decades, but only with powerful computer hardware did it become possible to actually do it (who wants to do billions of computations on a piece of paper…). During bundle adjustment, the software tries to find appropriate camera calibration parameters and the relative positions of the cameras and of the feature points on the object. It is quite a challenging optimisation problem, but after all it’s not magic, just very sophisticated number crunching.
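
One common way to discard ambiguous correspondences before bundle adjustment is nearest-neighbour matching with Lowe's ratio test: a match is kept only if the best candidate is clearly better than the runner-up. This is a generic sketch (plain lists stand in for real SIFT descriptors; Photoscan's exact filtering is not documented here):

```python
def match_features(desc_a, desc_b, ratio=0.75):
    """Match each descriptor in desc_a to its nearest neighbour in
    desc_b, keeping only unambiguous matches (Lowe's ratio test)."""
    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    matches = []
    for i, d in enumerate(desc_a):
        ranked = sorted(range(len(desc_b)), key=lambda j: dist2(d, desc_b[j]))
        best, runner_up = ranked[0], ranked[1]
        # Keep the match only if the best candidate is clearly closer
        # than the second-best (squared distances, hence ratio**2).
        if dist2(d, desc_b[best]) < (ratio ** 2) * dist2(d, desc_b[runner_up]):
            matches.append((i, best))
    return matches

# The first descriptor has one clear partner and is matched; the
# second has two near-identical candidates and is dropped as ambiguous.
matches = match_features([[0.0, 0.0], [5.0, 5.0]],
                         [[0.1, 0.0], [5.0, 5.1], [5.0, 5.11]])
```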

The algorithm starts by using two images. It assumes the camera focal length known from the image’s EXIF data or (if no data are available) a standard lens. It then uses the relative (2D) positions of the feature points in the two images to derive a rough estimate of the 3D coordinates of these points relative to the camera positions, then iteratively refines both the camera parameters and the 3D point coordinates until either a certain number of iterations or a certain threshold in the variance of the point positions has been reached. Then the next image is added, and the software tries to fit the matched feature points of that image into the existing 3D model, iteratively refines the parameters and so on. After all images have been added, an additional bundle adjustment is run to refine the entire model. And then, voilà, there’s your sparse 3D point cloud and your modelled camera calibrations and positions.

Sparse point cloud with modelled camera positions.

Now, there are three or four things still worth considering. One is that you may observe slight differences in modelled camera calibration parameters or 3D data. That’s a relatively common issue when dealing with iteratively optimised data. What comes out of the process is not the “absolute truth” but something that is usually quite close to it.

The second point is that the 3D model somehow floats in space – it is not referenced to any coordinate system. Depending on the software, referencing can be done directly by inserting control points with known positions or by using known camera positions as control points. Alternatively, referencing can be done in external software (for example by applying a coordinate transform to the point cloud using Java Graticule 3D). For some purposes, the non-referenced model is good enough; for others you might at least want to scale it; for yet others a complete rotate-scale-translate transformation to reference the model to a coordinate system is necessary.
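
The rotate-scale-translate step is a similarity (Helmert-type) transform, shown here in 2D for brevity (the 3D case used for georeferencing adds two more rotation angles and a third translation component):

```python
import math

def similarity_2d(points, scale, angle_deg, tx, ty):
    """Apply a 2D rotate-scale-translate transform to (x, y) points."""
    c = math.cos(math.radians(angle_deg))
    s = math.sin(math.radians(angle_deg))
    return [(scale * (c * x - s * y) + tx,
             scale * (s * x + c * y) + ty) for x, y in points]

# Rotate by 90 degrees, double the size, then shift:
referenced = similarity_2d([(1.0, 0.0)], 2.0, 90.0, 10.0, 20.0)
# the single point ends up at approximately (10.0, 22.0)
```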

The third point is that you may want a denser point cloud. To get this, software like VisualSfM or Agisoft Photoscan uses the camera positions, orientation and calibration together with the sparse point cloud and the actual images to compute depth (distance) maps: for every pixel in the image (but usually resampled to a lower resolution), the distance between camera and object is computed. From these distance maps, a dense point cloud is then created.
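
Turning a depth map back into 3D points is a direct application of the pinhole camera model: each pixel defines a ray, and the depth value fixes the point along that ray. A minimal sketch (camera coordinate frame, no lens distortion):

```python
def backproject(depth_map, focal_px, cx, cy):
    """Convert a depth map (rows of depth values, None where no
    depth was estimated) into 3D points in the camera frame.
    (cx, cy) is the principal point in pixel coordinates."""
    points = []
    for v, row in enumerate(depth_map):
        for u, depth in enumerate(row):
            if depth is None:
                continue  # pixel without a depth estimate
            points.append(((u - cx) * depth / focal_px,
                           (v - cy) * depth / focal_px,
                           depth))
    return points

# A tiny 1x2 depth map with one missing value:
cloud = backproject([[2.0, None]], focal_px=1.0, cx=0.0, cy=0.0)
# yields the single point (0.0, 0.0, 2.0)
```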

Finally, what if you need a meshed model? It is either created directly after dense point cloud generation (Agisoft Photoscan), or meshing can be done in external software.

Meshed model of the rock in the desert.

SPAR Conference in The Hague

Taking part in the SPAR Europe Conference in The Hague this week made me feel a little out of place among all those business executives with their suits and ties. The t-shirt and sweater fraction was very likely below 10%, quite similar to the percentage of women attending this conference. So, yes, it was quite technical and business-oriented. It was interesting nonetheless to see the newest examples of 3D acquisition and application techniques.

Of course, the presentations by the business representatives were quite often of the “our product is the best on the market” or at least of the “look what we can do!” type. The more inquisitive approach that you find at scientific conferences would have made it more interesting (for me at least).

Still, the conference drew together a quite diverse range of people. The main focus seemed to be on terrestrial laser scanning on the one hand and on airborne structure-from-motion on the other. Especially in the field of 3D generation from multiple digital photographs, the differences in technological complexity (and hardware costs) could not have been greater: from using smartphones or 99 € digital cameras to 260-megapixel cameras with 5 spectral channels or multi-camera setups capable of acquiring one or two gigapixels per second, from working with a few images to processing 2.8 million images in one large process (Building Rome on a cloudless day), from consumer laptops running open-source software to dedicated multi-processor number-crunchers deployable in aircraft or hangars.

Asked about the processing time to create a dense 3D model from a given set of a few thousand photographs, one speaker said, well, that would depend on how many hundred processor cores we throw at the task. How surprising is it that one of the presentations with the technically most advanced setup had not only “gigapixels” but also “soldiers” in it? So while 3D methods (in particular photogrammetry / Structure from Motion) are becoming increasingly available for scientists working on a low budget (for example many archaeologists) or for non-professionals, military applications are, as usual, quite far ahead.