Massive multiple-input multiple-output (MMIMO) is a key technology for 5G mobile communication systems. It enables a base station to form and transmit multiple directional signal beams simultaneously to multiple mobile terminals (MTs) on the same frequency channel, with high array beamforming gain and throughput. One of the challenges in MMIMO beamforming is how to allocate transmit power among the beams sent from an MMIMO base station to multiple MTs and how to schedule data transmissions, given the heterogeneous traffic and channel conditions of the MTs. Furthermore, the statistics of users’ packet arrivals and channel states may not be known a priori and may vary over time. In this paper, we propose a framework to optimize MMIMO beam power allocation and transmission scheduling in millimeter-wave networks with time-varying traffic and channel conditions. The optimization problem is formulated as a Markov decision process (MDP) whose objective is to minimize the overall queueing delay of the MTs, taking their heterogeneous and dynamic traffic and channel states into account. An online reinforcement learning scheme is designed that achieves long-term optimal system performance without requiring a priori knowledge of user traffic statistics or wireless network states. Evaluation results show that the proposed scheme outperforms the state-of-the-art baselines.
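To make the MDP and online-learning idea concrete, the following is a minimal toy sketch in Python. It is not the scheme proposed in this paper. It assumes two MTs, a small set of discrete power splits, a two-level channel model, Bernoulli packet arrivals, and plain tabular Q-learning with a discounted cost. All names and values (`POWER_SPLITS`, `ARRIVAL_PROB`, `served_packets`, `step`, `train`) are hypothetical and chosen only for illustration. The state is the pair of queue lengths and channel states, the action is a power split, and the per-slot cost is the total backlog, used here as a simple proxy for queueing delay.

```python
import random
from collections import defaultdict

# Hypothetical toy setup (illustrative assumptions, not taken from the paper).
NUM_MTS = 2
MAX_QUEUE = 5                            # cap queue lengths to keep the table small
POWER_SPLITS = [(0, 2), (1, 1), (2, 0)]  # power units per MT; total budget is 2
CHANNEL_STATES = (0, 1)                  # 0 = poor channel, 1 = good channel
ARRIVAL_PROB = (0.6, 0.3)                # heterogeneous per-slot arrival probabilities


def served_packets(power, channel):
    """Assumed rate model: packets served grow with power and channel quality."""
    return power * (1 + channel)


def step(queues, channels, action, rng):
    """Advance one slot: serve each queue, add arrivals, draw new channels."""
    split = POWER_SPLITS[action]
    next_queues = []
    for i in range(NUM_MTS):
        remaining = max(queues[i] - served_packets(split[i], channels[i]), 0)
        arrivals = 1 if rng.random() < ARRIVAL_PROB[i] else 0
        next_queues.append(min(remaining + arrivals, MAX_QUEUE))
    next_channels = tuple(rng.choice(CHANNEL_STATES) for _ in range(NUM_MTS))
    cost = sum(next_queues)              # total backlog as a delay proxy
    return tuple(next_queues), next_channels, cost


def train(num_slots=20000, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning that minimizes discounted backlog cost online."""
    rng = random.Random(seed)
    q_table = defaultdict(lambda: [0.0] * len(POWER_SPLITS))
    queues, channels = (0,) * NUM_MTS, (1,) * NUM_MTS
    for _ in range(num_slots):
        state = (queues, channels)
        if rng.random() < epsilon:       # explore a random power split
            action = rng.randrange(len(POWER_SPLITS))
        else:                            # exploit: lowest estimated cost
            action = min(range(len(POWER_SPLITS)),
                         key=lambda a: q_table[state][a])
        queues, channels, cost = step(queues, channels, action, rng)
        next_state = (queues, channels)
        # Cost-minimizing Q-learning update (min instead of the usual max).
        target = cost + gamma * min(q_table[next_state])
        q_table[state][action] += alpha * (target - q_table[state][action])
    return q_table
```

Calling `train()` returns a table that maps each visited state to the estimated cost of each power split. The learner only observes queue lengths, channel states, and incurred costs, so it needs no prior knowledge of `ARRIVAL_PROB` or of the channel distribution, which mirrors the model-free property described above. A realistic MMIMO setting would have a much larger state and action space and would likely need function approximation rather than a table.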