Indoor navigation is a pressing issue for robots and unmanned aerial vehicles (UAVs) in a smart city, especially where satellite navigation signals are weak or absent. A promising solution is visual simultaneous localization and mapping (vSLAM) combined with one or more remote data-processing servers that perform localization and map construction for the robots and UAVs. This paper analyzes the current state of remote vSLAM and formulates requirements for a prospective framework of this kind for UAVs in a smart city. In general, two approaches to remote vSLAM for robots/UAVs can be distinguished. The first sends raw sensor data (a video stream) over a wireless connection directly to a remote server used for cloud/fog/edge computing. The second pre-processes the data on board the robot/UAV and then transmits the result to the remote cloud/fog/edge server for further processing, keyframe extraction, and map updates; a sketch contrasting the two approaches follows below. The surveyed approaches, algorithms, and solutions are classified by the type and number of servers used, the maximum number of robotic agents/UAVs, the presence of navigation data fusion, and the type of video sensor employed. Based on this analysis of existing solutions and the resulting vSLAM classification, a set of requirements can be formed for prospective vSLAM systems for indoor navigation of robots/UAVs, covering camera input, resolution, frame rate (FPS), bandwidth, supported wireless standards, channels, and protocols, multi-robot/multi-server support, and processing time.
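To make the contrast between the two architectures concrete, the sketch below shows a hypothetical client-side offload loop in Python with OpenCV: the first approach ships compressed raw frames to the remote server, while the second extracts ORB features on board and transmits only the descriptors. This is a minimal illustration under stated assumptions, not an implementation from any surveyed system; the server address, message framing, and all function names are invented for this example.

```python
# Minimal sketch of the two remote-vSLAM offloading approaches.
# All names, the server endpoint, and the wire format are hypothetical.
import socket
import struct

import cv2  # OpenCV: capture, JPEG encoding, ORB feature extraction

SERVER_ADDR = ("192.168.1.10", 5000)  # assumed edge/fog/cloud server endpoint

def send_message(sock, tag, payload):
    """Frame each message as: 1-byte tag, 4-byte big-endian length, payload."""
    sock.sendall(struct.pack(">BI", tag, len(payload)) + payload)

def offload_raw(sock, frame):
    """Approach 1: compress the raw frame and ship it; the server runs the
    full vSLAM pipeline (tracking, mapping, loop closure)."""
    ok, jpeg = cv2.imencode(".jpg", frame)
    if ok:
        send_message(sock, tag=1, payload=jpeg.tobytes())

orb = cv2.ORB_create(nfeatures=1000)

def offload_features(sock, frame):
    """Approach 2: extract ORB keypoints/descriptors on board and send only
    those; the server performs mapping and returns map updates."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    _keypoints, descriptors = orb.detectAndCompute(gray, None)
    if descriptors is not None:
        send_message(sock, tag=2, payload=descriptors.tobytes())

def main(use_onboard_preprocessing=True):
    cap = cv2.VideoCapture(0)  # monocular camera; stereo/RGB-D is analogous
    with socket.create_connection(SERVER_ADDR) as sock:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if use_onboard_preprocessing:
                offload_features(sock, frame)  # low bandwidth, more onboard CPU
            else:
                offload_raw(sock, frame)       # high bandwidth, minimal onboard CPU
    cap.release()

if __name__ == "__main__":
    main()
```

The trade-off the sketch exposes is the one the classification in this paper turns on: raw streaming minimizes onboard computation but stresses wireless bandwidth, whereas onboard pre-processing reduces traffic at the cost of compute and energy on the robot/UAV.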