Most existing multi-view multi-person 3D human pose estimation methods follow a top-down paradigm: they first localize each target person's region and then predict the location of every joint within it. However, these works neglect the interference from other people's joints inside that region. When the scene is crowded and the target person is surrounded by others, the target's joint information tends to be disturbed, which leads to significant errors in the 3D results. To overcome this problem, this paper takes advantage of a bottom-up strategy from 2D pose estimation. We incorporate the Associative Embedding method into 3D pose estimation and propose a Voxel Hourglass Network that predicts 3D heatmaps along with 3D tag-maps. The adverse effects of surrounding persons can then be eliminated by comparing tag differences, since joints of the same person are trained to share similar tag values. Moreover, we design a three-stage coarse-to-fine framework that effectively reduces quantization error: the size of the search space shrinks at each stage while the resolution increases. We evaluate our method on the CMU Panoptic dataset, where it outperforms related top-down methods.
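To make the tag-based grouping concrete, the following is a minimal sketch, not the paper's actual implementation, of how joint candidates extracted from per-joint 3D heatmaps could be grouped into people by tag similarity, in the spirit of Associative Embedding. All names here (`group_joints_by_tag`, the score threshold `thresh`, the tag-distance threshold of 1.0) are illustrative assumptions, and scalar one-dimensional tags are assumed for simplicity.

```python
import numpy as np

def group_joints_by_tag(heatmaps, tagmaps, num_people, thresh=0.3):
    """Hypothetical sketch of associative-embedding-style grouping in 3D.

    heatmaps: (J, X, Y, Z) per-joint confidence volumes
    tagmaps:  (J, X, Y, Z) per-joint scalar embedding (tag) volumes
    Returns a list of people, each holding a joint -> voxel-index map.
    """
    J = heatmaps.shape[0]
    # 1. Take the top `num_people` responses of each joint's 3D heatmap
    #    (no non-maximum suppression, for brevity) and read the tag there.
    candidates = []  # per joint: list of (voxel_index, tag, score)
    for j in range(J):
        flat = heatmaps[j].ravel()
        top = np.argsort(flat)[-num_people:]
        cands = []
        for idx in top:
            if flat[idx] > thresh:
                vox = np.unravel_index(idx, heatmaps[j].shape)
                cands.append((vox, float(tagmaps[j][vox]), float(flat[idx])))
        candidates.append(cands)

    # 2. Greedy grouping: a detection joins the person whose mean tag is
    #    closest; a large tag difference spawns a new person instead, which
    #    is what suppresses joints belonging to surrounding people.
    people = []  # each: {"joints": {j: vox}, "tags": [...]}
    for j, cands in enumerate(candidates):
        for vox, tag, score in cands:
            dists = [abs(tag - np.mean(p["tags"])) for p in people]
            if dists and min(dists) < 1.0:  # tag-distance threshold (assumed)
                p = people[int(np.argmin(dists))]
                if j not in p["joints"]:
                    p["joints"][j] = vox
                    p["tags"].append(tag)
            else:
                people.append({"joints": {j: vox}, "tags": [tag]})
    return people
```

The coarse-to-fine idea can likewise be sketched as a repeated voxel search: each stage re-centers a smaller, finer grid on the previous stage's peak, so the quantization step shrinks geometrically over the three stages. The shrink factor of 4, the fixed per-side resolution `res`, and the generic `score_fn` callback below are assumptions for illustration, not the paper's reported settings.

```python
def coarse_to_fine_peak(score_fn, center, extent, stages=3, res=16):
    """Hypothetical three-stage coarse-to-fine voxel search.

    score_fn(points): returns a confidence score for each queried 3D point.
    At every stage the cube side `extent` shrinks while the voxel count per
    side stays fixed, so resolution rises and quantization error drops.
    """
    center = np.asarray(center, dtype=float)
    for _ in range(stages):
        axes = [np.linspace(c - extent / 2, c + extent / 2, res) for c in center]
        grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1)  # (res, res, res, 3)
        scores = score_fn(grid.reshape(-1, 3)).reshape(res, res, res)
        peak = np.unravel_index(np.argmax(scores), scores.shape)
        center = grid[peak]   # re-center the next grid on the coarse peak
        extent /= 4.0         # shrink the search space at each stage
    return center
```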