The complete pipeline for producing digital tours (referred to as DIGITOUR and shown in Figure 3) is as follows.

2.1 Tag Placement and Image Capturing

While creating a digital tour for any real-estate property, it is essential to capture 360° images from different locations in the property, such as the bedroom, living room, and kitchen, and then automatically stitch them together to provide a β€œwalkthrough” experience without being physically present at the location. Therefore, to connect multiple equirectangular images, we propose placing paper tags on the floor at each location of the property, and placing the camera (in our case, a Ricoh-Theta) in the middle of the scene to capture the whole site (front, back, left, right and bottom).

DIGITOUR: Automatic Digital Tours for Real-Estate Properties 🏠 | by Prateek Chhikara | Mar, 2024
Figure 4: Proposed bi-colored tag format and color scheme for each digit with their corresponding HSV values. (Source: Image by the author)

Moreover, we ensure that the scene is free of noise such as dim lighting and β€˜unwanted’ artifacts, for better model training and inference. As shown in Figure 4, we have standardized the tags with dimensions of 6” Γ— 6” and two properties:

  1. they are numbered, which helps the photographer place the tags in sequence, and
  2. they are bi-colored, which lets us formulate digit recognition as a classification task and facilitates better learning in the downstream computer vision tasks (i.e., tag detection and digit recognition).

Please note that a different color is assigned to each digit (from 0 to 9) using the HSV color scheme, and the leading digit of a tag carries a black circle to distinguish it from the trailing digit, as shown in Figure 4. The intuition behind standardizing the paper tags is that it allows us to train tag detection and digit recognition models that are invariant to distortions, tag placement angle, reflections from lighting sources, blur, and camera quality.
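To make the color scheme concrete, here is a small Python sketch of a digit-to-color lookup. The evenly spaced hues below are illustrative only; the actual HSV value assigned to each digit is the one given in Figure 4.

```python
import colorsys

def digit_color(digit, s=1.0, v=1.0):
    """Return an (R, G, B) tuple in 0-255 for a digit 0-9.

    NOTE: the hue spacing here is a placeholder; the paper's
    per-digit HSV values are the ones shown in Figure 4.
    """
    assert 0 <= digit <= 9
    hue = digit / 10.0                       # 10 evenly spaced hues
    r, g, b = colorsys.hsv_to_rgb(hue, s, v)
    return (round(r * 255), round(g * 255), round(b * 255))

def tag_colors(number):
    """Colors for a tag numbered 1-20: one color per digit.

    On the printed tag, the leading digit's half additionally
    carries a black circle to disambiguate digit order.
    """
    assert 1 <= number <= 20
    return [digit_color(int(d)) for d in str(number)]
```

Keeping the lookup bijective (ten distinct hues for ten digits) is what turns digit recognition into a clean 20-class classification problem downstream.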

2.2 Mapping Equirectangular Image to Cubemap Projection

An equirectangular image is a single image whose width and height are in a 2:1 ratio (as shown in Figure 1). In our case, images are captured with a Ricoh-Theta camera and have dimensions 4096 Γ— 2048 Γ— 3. Each point in an equirectangular image corresponds to a point on a sphere, and the image is stretched in the β€˜latitude’ direction. Since the contents of an equirectangular image are distorted, it is challenging to detect tags and recognize digits directly from it; for example, in Figure 1, the tag is stretched at the middle-bottom of the image. Therefore, it is necessary to map the image to a less-distorted projection, and later switch back to the original equirectangular image to build the digital tour.

In this work, we propose to use a cubemap projection, which is a set of six images representing the six faces of a cube. Here, every point in spherical coordinate space corresponds to a point on a face of the cube. As shown in Figure 5, we map the equirectangular image to the six faces (left, right, front, back, top and bottom) of a cube, each of dimensions 1024 Γ— 1024 Γ— 3, using the Python library vrProjector.
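Under the hood, the mapping casts each cube-face pixel onto the unit sphere and reads off its longitude and latitude. The pure-Python sketch below illustrates that math; the face naming and axis conventions are our own choice (equivalent to a library's up to rotation), and vrProjector additionally handles the pixel resampling.

```python
from math import atan2, sqrt, pi

def cubeface_to_equirect(face, i, j, face_size, eq_w, eq_h):
    """Map pixel (i, j) on a cube face to (u, v) in the equirectangular image.

    Face names and axis conventions here are illustrative; a library such
    as vrProjector uses its own (equivalent up to a rotation).
    """
    # Normalized face coordinates in [-1, 1]
    a = 2.0 * (i + 0.5) / face_size - 1.0
    b = 2.0 * (j + 0.5) / face_size - 1.0
    # Direction vector from the cube center through this pixel
    dirs = {
        "front":  (a, -b, 1.0),
        "back":   (-a, -b, -1.0),
        "left":   (-1.0, -b, a),
        "right":  (1.0, -b, -a),
        "top":    (a, 1.0, b),
        "bottom": (a, -1.0, -b),
    }
    x, y, z = dirs[face]
    theta = atan2(x, z)                   # longitude in [-pi, pi]
    phi = atan2(y, sqrt(x * x + z * z))   # latitude in [-pi/2, pi/2]
    u = (theta / pi + 1.0) * 0.5 * eq_w   # longitude -> horizontal pixel
    v = (0.5 - phi / pi) * eq_h           # latitude  -> vertical pixel
    return u, v
```

Inverting this per-pixel mapping for every face pixel, then sampling the equirectangular image at (u, v), produces the six 1024 Γ— 1024 faces shown in Figure 5.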

Figure 5: Conversion of an equirectangular image to its corresponding six faces cubemap projection. (Source: Image by the author)

2.3 Tag Detection

Once we have the six images corresponding to the faces of the cube, we detect the locations of the tags placed in each image. For tag detection, we use the state-of-the-art YOLOv5 model, initialized with COCO weights and then trained on our dataset for 100 epochs with a batch size of 32. As shown in Figure 6, the model takes an image as input and returns the detected tags along with bounding-box coordinates and a prediction confidence.
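A detector like YOLOv5 first produces many overlapping candidate boxes; a standard post-processing step (performed internally by YOLOv5, shown here purely for illustration) filters them by confidence and applies greedy non-maximum suppression. The thresholds below are typical defaults, not values from the paper.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def filter_detections(dets, conf_thresh=0.25, iou_thresh=0.45):
    """Keep confident, non-overlapping detections (greedy NMS).

    `dets` is a list of (box, confidence) pairs, the generic shape
    of a detector's raw output.
    """
    dets = sorted((d for d in dets if d[1] >= conf_thresh),
                  key=lambda d: d[1], reverse=True)
    kept = []
    for box, conf in dets:
        if all(iou(box, k[0]) < iou_thresh for k in kept):
            kept.append((box, conf))
    return kept
```

The same IoU function reappears later when scoring detections: a predicted box counts as correct only if its IoU with a ground-truth tag exceeds 0.5.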

Figure 6: Tag detection using YOLOv5. (Source: Image by the author)

2.4 Digit Recognition

For each detected tag, we need to recognize the digits written on it. In a real-world environment, the detected tags might have incorrect orientation, poor luminosity, reflections from bulbs in the room, etc. For these reasons, it is hard for off-the-shelf Optical Character Recognition (OCR) engines to achieve good digit recognition performance. Therefore, we use a custom MobileNet model initialized with ImageNet weights, which exploits the color information in the tags for digit recognition. In the proposed architecture, we replace the final classification block of the original MobileNet with a dropout layer and a dense layer with 20 nodes representing our tags 1 to 20. Figure 7 illustrates the proposed architecture. For training, we use the Adam optimizer with a learning rate of 0.001 and a discounting factor (𝜌) of 0.1, categorical cross-entropy as the loss function, a batch size of 64, and 50 epochs.
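For reference, the categorical cross-entropy loss over the 20-node softmax head penalizes the negative log-probability the model assigns to the true tag class. A minimal sketch of the loss for a single sample:

```python
from math import log, exp

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    m = max(logits)
    exps = [exp(z - m) for z in logits]  # subtract max for numerical stability
    s = sum(exps)
    return [e / s for e in exps]

def categorical_cross_entropy(logits, true_class):
    """Loss for one sample: -log(probability of the true class)."""
    probs = softmax(logits)
    return -log(probs[true_class])
```

With 20 output nodes, an uninformed model (uniform logits) incurs a loss of log(20) β‰ˆ 3.0 per sample, and the loss falls toward 0 as the model concentrates probability on the correct tag.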

Figure 7: Digit recognition using custom MobileNet model. (Source: Image by the author)

2.5 Mapping Tag Coordinates to the Original 360° Image and Virtual Tour Creation

Once we have detected the tags and recognized the digits, we use the Python library vrProjector to map the cubemap coordinates back to the original equirectangular image. An example output is shown in Figure 8. For each equirectangular image, the detected tags form the nodes of a graph, with edges between them. In the subsequent equirectangular images of a property, the graph is populated with more nodes as more tags are detected. Finally, we connect the equirectangular images in sequence based on the digits recognized on the tags, and the resulting graph is the virtual tour, as shown in Figure 2(b).
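The graph construction can be sketched as follows. The input format (an image id mapped to the set of tag numbers recognized in it) and the helper names are assumptions made for illustration, not the paper's exact implementation.

```python
def build_tour_graph(detections):
    """Assemble the tour graph from per-image tag detections.

    `detections` maps an image id to the set of tag numbers recognized
    in that image (format assumed for illustration). Each tag is a node;
    tags visible from the same image are joined by an edge, so the
    per-image sub-graphs merge into one connected walkthrough.
    """
    nodes, edges = set(), set()
    for image_id, tags in detections.items():
        nodes.update(tags)
        for a in tags:
            for b in tags:
                if a < b:
                    edges.add((a, b))
    return nodes, edges

def tour_order(detections):
    """Order the images by the smallest tag number each one sees,
    following the numbered sequence the photographer laid down."""
    return sorted(detections, key=lambda img: min(detections[img]))
```

Because the photographer places tags in numeric order, sorting images by their tag numbers recovers the intended walking sequence through the property.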

Figure 8: Mapping tags to original equirectangular image. (Source: Image by the author)

We collected data by placing tags and capturing equirectangular images with a Ricoh-Theta camera in several residential properties in Gurugram, India (a Tier 1 city). While collecting images, we made sure certain conditions were met: all doors were open, lights were turned on, β€˜unwanted’ objects were removed, and the tags were placed so as to cover each area of the property. Following these instructions, the average number of equirectangular images captured per residential property was 7 or 8. Finally, we validated our approach on the following datasets (distinguished by the background color of the tags).

  1. Green Colored Tags: The background color of these tags (numbered 1 to 20) is green. We collected 1572 equirectangular images from 212 properties. Converting these equirectangular images to cubemap projections yields 9432 images (corresponding to cube faces). Since not all cube faces contain tags (e.g., the top face), we obtained 1503 images with at least one tag.
  2. Proposed Bi-colored Tags (see Figure 4): For these tags, we collected 2654 equirectangular images from 350 properties, yielding 2896 cube-face images with at least one tag.

Finally, we labeled the tags present in the cubemap projection images using LabelImg, an open-source tool for labeling images in several formats such as Pascal VOC and YOLO. For all experiments, we reserved 20% of the data for testing and used the rest for training.

For any input image, we first detect the tags and then recognize the digits written on them. From this we can identify the true positives (tags detected and read correctly), false positives (tags detected but read incorrectly) and false negatives (tags not detected). The obtained mAP, precision, recall and F1-score at a 0.5 IoU threshold are 88.12, 93.83, 97.89 and 95.81 respectively. Please note that all metrics are weighted averages over the 20 classes. If all tags across all equirectangular images of a property are detected and read correctly, we obtain a 100% accurate virtual tour, since all nodes of the graph are detected and connected by their appropriate edges. In our experiments, we generated a 100% accurate virtual tour for 94.55% of the properties. The inaccuracies were due to colorful artifacts falsely detected as tags, and bad lighting conditions.
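These aggregate metrics are related in the usual way, and a quick sketch shows that the reported F1-score is consistent, up to rounding, with the reported precision and recall.

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall and F1 from true/false positive and
    false negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Sanity check against the reported aggregates: F1 is the
# harmonic mean of precision and recall.
p, r = 93.83, 97.89
f1 = 2 * p * r / (p + r)   # close to the reported 95.81
```

Note that because the reported numbers are weighted averages over 20 classes, the harmonic-mean identity holds only approximately here, which is why the recomputed value matches to rounding rather than exactly.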

Figure 9 shows the performance of the YOLOv5 model for tag detection on green-colored and bi-colored tags. Further experiments and a comparison of models on the digit recognition task are shown in Figure 10.

Figure 9: Tag detection performance (%). (Source: Image by the author)
Figure 10: Comparison of different state-of-the-art models on bi-colored tags dataset for digit recognition task. (Source: Image by the author)

We proposed an end-to-end pipeline (DIGITOUR) for automatically generating digital tours for real-estate properties. For any such property, we first place the proposed bi-colored paper tags covering each area of the property. Then, we capture equirectangular images and map them to less-distorted cubemap images. Once we have the six images corresponding to the cube faces, we detect the locations of the tags using the YOLOv5 model, followed by digit recognition using the MobileNet model. The next step is to map the detected coordinates, along with the recognized digits, back to the original equirectangular images. Finally, we stitch together all the equirectangular images to build a virtual tour. We validated our pipeline on a real-world dataset and showed that its end-to-end performance is 88.12 mAP and 95.81 F1-score at a 0.5 IoU threshold, weighted-averaged over all classes.

If you find our work beneficial and utilize it in your projects, we kindly request that you cite it. 😊

```bibtex
@inproceedings{chhikara2023digitour,
  title={Digitour: Automatic digital tours for real-estate properties},
  author={Chhikara, Prateek and Kuhar, Harshul and Goyal, Anil and Sharma, Chirag},
  booktitle={Proceedings of the 6th Joint International Conference on Data Science \& Management of Data (10th ACM IKDD CODS and 28th COMAD)},
  year={2023}
}
```

