Stanford 2D-3D-Semantics Dataset (2D-3D-S)

RGB image

Semantics in 2D


3D Mesh

Semantics in 3D

Surface Normals

Use your mouse on the 360° images to look around.

Select from the images below to visualize them in the 360° viewer.

Select from the areas below to visualize them in 3D


Area 1


Area 2


Area 3


Area 4


Area 5


Area 6

(these point clouds are also part of the Stanford 3D Large-Scale Indoor Spaces Dataset)

back to top

> Download

  • Full 2D-3D-S Dataset To download the full 2D-3D-S dataset click here.
  • S3DIS Dataset: To download only the Stanford Large-Scale 3D Indoor Spaces Dataset (S3DIS) used in this paper, which contains only the 3D point clouds with ground truth annotations, click here. These 3D point clouds are included in the 2D-3D-S dataset.


  • If you use this dataset please cite the 2D-3D-S paper (bibtext).
  • If you use the S3DIS portion only, you can cite the 3D Semantic Parsing of Large-Scale Indoor Spaces, CVPR2016 paper (bibtex).

back to top

> Overview

The 2D-3D-S dataset provides a variety of mutually registered modalities from 2D, 2.5D and 3D domains, with instance-level semantic and geometric annotations. It covers over 6,000 m2 and contains over 70,000 RGB images, along with the corresponding depths, surface normals, semantic annotations, global XYZ images (all in forms of both regular and 360° equirectangular images) as well as camera information. It also includes registered raw and semantically annotated 3D meshes and point clouds. In addition, the dataset contains the raw RGB and Depth imagery along with the corresponding camera information per scan location. The dataset enables development of joint and cross-modal learning models and potentially unsupervised approaches utilizing the regularities present in large-scale indoor spaces.

In more detail, the dataset is collected in 6 large-scale indoor areas that originate from 3 different buildings of mainly educational and office use. For each area, all modalities are registered in the same reference system, yielding pixel to pixel correspondences among them. In a nutshell, the presented dataset contains a total of 70,496 regular RGB and 1,413 equirectangular RGB images, along with their corresponding depths, surface normals, semantic annotations, global XYZ OpenEXR format and camera metadata. It also contains the raw sensor data, which comprises of 18 HDR RGB and Depth images (6 looking forward, 6 towards the top, 6 towards the bottom) along with the corresponding camera metadata per each of the 1,413 scan locations, yielding a total of 25,434 RGBD raw images. In addition, we provide whole building 3D reconstructions as textured meshes, as well as the corresponding 3D semantic meshes. We also include the colored 3D point cloud data of these areas with the total number of 695,878,620 points, that has been previously presented in the Stanford large-scale 3D Indoor Spaces Dataset (S3DIS).

back to top

> Collection and Processing

We collected the data using the Matterport Camera, which combines 3 structured-light sensors at different pitches to capture 18 RGB and depth images during a 360° rotation at each scan location. Each 360° sweep is performed in increments of 60°, providing 6 triplets of RGB-D data per location. The output is the reconstructed 3D textured meshes of the scanned area, the raw RGB-D images, and camera metadata. We used this data as a basis to generate additional RGB-D data and make point clouds by sampling the meshes. We semantically annotated the data directly on the 3D point cloud, rather than images, and then projected the per point labels on the 3D mesh and the image domains.

3D modalities

The dataset contains colored point clouds and textured meshes for each scanned area. 3D semantic annotations for objects and scenes are offered for both modalities, with point-level and face-level labels correspondingly. The annotations were initially performed on the point cloud and then projected onto the closest surface on the 3D mesh model. Faces in the 3D mesh that account for no projected points belong to non-annotated parts of the dataset and are labeled with a default null value. We also provide the tightest axis-aligned boudning box of each object instance and further voxelize it into a 6x6x6 grid with binary occupancy and point index correspondence.

Fig 1. 3D Modalities. The dataset includes both the textured and semantic 3D mesh models of all areas as well as their point clouds.

2D modalities:

In addition to the raw RGB and depth data, the dataset contains densely sampled RGB images per scan location. These images were sampled from equirectangular images that were generated per scan location and modality using the raw data captured by the scanner. All images in the dataset are stored in full high-definition at 1080x1080 resolution. For more details on the random sampling of RGB images read section 4.2 in the paper. We provide the camera metadata for each generated image.

Fig 2. Data Processing. The RGB output of the scanner is registered on the 3D modalities (each yellow marker represents one scan). We then process the RGB and 3D data to make depth, surface normals, and 2D semantic (projected from 3D semantics) images for each scan. The processed equirectangular images are sampled to make new regular images.

Fig 3. Sampling images from the equirectangular projection. We use the equirectangular projections to sample 72 images per scan location, all with consistent depth, surface normal, and semantic information.

Fig 4. Sampling distributions. We sample camera parameters from the above distributions of yaw, pitch and Field of View (FOV).

We also provide the corresponding dense sampled depth images that were computed on the 3D mesh instead of directly from the scanner as with the raw Depth images, as well as surface normal images. 2D semantic annotations are computed for each image by projecting the 3D mesh labels on the image plane. Due to certain geometric artifacts present at the mesh model mainly because of the level of detail in the reconstruction, the 2D annotations occasionally present small local misalignment to the underlying pixels, especially for points that have a short distance to the camera. This issue can be easily addressed by fusing image content with the projected annotations using graphical models. The dataset also includes 3D coordinate encoded images where each pixel encodes the X, Y, Z location of the point in the world coordinate system. Such images can be used for conveniently relating the content of RGB images, e.g. forming dense correspondences. Last, an equirectangular projection is also provided per scan location and modality. For image statistics look at Table 2.

back to top

> Sample Data

The figure below provides a representative sample of the generated data showing the diversity of the indoor scenes, in terms of scene category, appearance, intra-class variation, density of objects and amount of clutter. This also shows the varying degree of difficulty of the data, which consists of both easy and hard examples.

Fig 5. Examples Images of 2D and 2.5D Modalities. RGB, Semantic, Surface Normals and Depth images.

back to top

> Train and Test splits

Certain areas in the dataset represent parts of buildings with similarities in their appearance and architectural features, thus we define standard training and testing splits so that no areas from similarly looking buildings appear in both. We split the 6 areas in the dataset as per the Table below and follow a 3-fold cross-validation scheme.

Fold # Training Testing
(Area #) (Area #)
1 1, 2, 3, 4, 6 5
2 1, 3, 5, 6 2, 4
3 2, 4, 5 1, 3, 6

Table 1. Training and Testing Splits (3-fold cross-validation)

back to top

> Paper

Joint 2D-3D-Semantic Data for Indoor Scene Understanding

Iro Armeni*, Alexander Sax*, Amir R. Zamir, Silvio Savarese

[Paper] [Bibtex]

back to top

Dataset Statistics

The annotations are instance-level, and consistent across all modalities and correspond to 13 object classes (structural elements: ceiling, floor, wall, beam, column, window, door, and movable elements: table, chair, sofa, bookcase, board and clutter for all other elements) and 11 scene categories (office, conference room, hallway, auditorium, open space, lobby, lounge, pantry, copy room, storage and WC). You can view more statistics at the Table below.

To see the 3D parsing results on the 3D point clouds of this dataset, see the results page.

# of Images per 2/2.5D modality
Area Image Type R Image Type I Image Type E Total
1 3,420 10,327 190 42,068
2 5,382 15,714 299 64,052
3 1,530 3,704 85 15,156
4 4,644 13,268 258 54,104
5 6,714 17,593 373 71,864
6 3,744 9,890 208 40,392
Total 25,434 70,496 1,413 287,636

Table 2. Statistics of Images.
(R: Raw Images, I: Regular Images,
E: Equirectangular Images)

Area (m2) Volume (m3) Office Conf. Room Auditorium Lobby Lounge Hallway Copy Room Pantry Open Space Storage WC Total Num
Area 1 965 2850 31 2 - - - 8 1 1 - - 1 45
Area 2 1100 3065 14 1 2 - - 12 - - - 9 2 39
Area 3 450 1215 10 1 - - 2 6 - - - 2 2 24
Area 4 870 2780 22 3 - 2 - 14 - - - 4 2 49
Area 5 1700 5370 42 3 - 1 - 15 - 1 - 4 2 55
Area 6 935 2670 37 1 - - 1 6 1 1 1 - - 53
Total Num 6020 17,360 156 11 2 3 3 61 2 3 1 19 9 270

Table 3. Disjoint Spaces Statistics per Building Area.

ceiling floor wall beam column door window table chair sofa bookcase board
Area 1 56 45 235 62 58 87 30 70 156 7 91 28
Area 2 82 51 284 62 58 94 9 47 546 7 49 18
Area 3 38 24 160 14 13 38 9 31 68 10 42 13
Area 4 74 51 281 4 39 108 41 80 160 15 99 11
Area 5 77 69 344 4 75 128 53 155 259 12 218 43
Area 6 64 50 248 69 55 94 32 78 180 10 91 30
Total Num 391 290 1,552 165 260 549 174 461 1,369 61 590 143

Table 4. Object Class Statistics.

back to top


back to top