At the current rate of technological advancement, and of its social acceptance, wearable devices that continuously record the user's field of view will soon be common. We introduce a new database of image sequences of realistic, everyday scenes, taken with a first-person-view camera. As a distinguishing feature, we manually transcribed the scene text in each image. These transcriptions make it possible to simulate the output of a sophisticated OCR algorithm, which we hypothesize can help in recognizing the location and the activity. To test this hypothesis, we performed a set of experiments using visual features, textual features, and a combination of both. We demonstrate that, although the textual information is not very powerful when considered alone, it improves the overall recognition rates.