Scene context is known to facilitate object recognition in both machines and humans, suggesting that the underlying representations may be similar. Alternatively, they may be qualitatively different since the training experience of machines and humans are strikingly different. Machines are typically trained on images containing objects and their context, whereas humans frequently experience scenes without objects (such as highways without cars). If these context representations are indeed different, machine vision algorithms will be improved on augmenting them with human context representations, provided these expectations can be measured and are systematic. Here, we developed a paradigm to measure human contextual expectations. We asked human subjects to indicate the scale, location and likelihood at which cars or people might occur in scenes without these objects. This yielded highly systematic expectations that we could then accurately predict using scene features. This allowed us to predict human expectations on novel scenes without requiring explicit measurements. Next we augmented decisions made by deep neural networks with these predicted human expectations and obtained substantial gains in accuracy for detecting cars and people (1-3%) as well as on detecting associated objects (3-20%). In contrast, augmenting deep network decisions with other conventional computer vision features yielded far smaller gains. Taken together, our results show that augmenting deep neural networks with human-derived contextual expectations improves their performance, suggesting that contextual representations are qualitatively different in humans and deep neural networks.