Introduction

First domesticated between 4000 and 2500 B.C.E. (Ekesbo 2011), horses have long contributed to human civilization. It is thus useful for humans to understand equine cognitive abilities in order to optimize human–horse interactions. While methods for developing a positive relationship between humans and horses have been a research focus for the last 50 years (e.g., Hausberger et al. 2008, for a review; Henry et al. 2005, 2006; Søndergaard and Jago 2010; Birke et al. 2011), equine social cognitive abilities with regard to humans have only recently attracted increased interest. Domestic horses comprehend human pointing gestures (Proops et al. 2010; Maros et al. 2008; McKinley and Sambrook 2000) and are able to discriminate between an attentive and an inattentive person (Krueger et al. 2011; Sankey et al. 2011; Proops and McComb 2010). They not only show evidence of long-term categorical and conceptual memory (Hanggi and Ingersoll 2009), but can also form lasting, experience-based negative or positive memories of humans that shape future horse–human interactions (Fureix et al. 2009; Sankey et al. 2010a) and ensure consistency in the horses’ reactions to different persons; apparently, horses are able to recognize social counterparts (Sankey et al. 2010b). They can differentiate between familiar and unfamiliar humans when hearing (Sankey et al. 2011) or seeing them (Krueger et al. 2011) and can discriminate among human faces in photographs, even in novel settings (Stone 2010). Similarly, other common livestock animals such as pigs (Tanida and Nagano 1998), cows (Rybarczyk et al. 2001) and sheep (Boivin et al. 1997; Peirce et al. 2001) have been shown to discriminate between familiar and unfamiliar humans, which is not surprising given that they were domesticated even earlier than horses (Ekesbo 2011). None of these studies, however, demonstrated interspecific cross-modal recognition abilities in horses.

Cross-modal recognition is the brain’s ability to identify a person (or object) on the basis of interacting senses, that is, to integrate identity cues from disparate sensory modalities into a cognitive representation that allows the brain to substitute the information of one sensory mode with that of another (e.g., Calvert 2001). Stored in long-term memory, such a multimodal representation enables the brain, when deprived of one or more senses, to maintain person recognition by matching, for example, a played-back voice recording with the remembered face or smell of an individual.

Adachi et al. (2007) showed that domestic dogs are capable of cross-modal recognition of a familiar human, using congruent and incongruent auditory and visual cues. Upon hearing a playback of their owner’s voice, dogs generated an internal representation of the owner’s face, so that they acted surprised when confronted with the photograph of a stranger’s face shortly after. Likewise, rhesus and squirrel monkeys were found to form cross-modal representations of familiar humans (Sliwa et al. 2011; Adachi and Fujita 2007; Adachi 2009).

While humans rely heavily on the sense of sight when distinguishing other persons, horses place additional emphasis on olfaction and audition when discriminating among other horses. Studies by Krueger and Flauger (2011, olfaction) and Lemasson et al. (2009, audition) demonstrated that each of these senses, taken on its own, reveals to subjects the social category of another horse, and possibly also its individual identity. Proops et al. (2009) demonstrated equine cross-modal recognition of individual conspecifics. The presentation of visual/olfactory identity cues from a herd member passing by and disappearing behind a wall activated a preexisting multimodal representation of this stimulus horse: when subsequently hearing a recorded equine call from the direction where the stimulus horse had just disappeared, the subject either matched this auditory signal with the internal representation or showed heightened interest (“surprise”) when the recording came from a different herd member.

Sankey et al. (2011) attempted to show that horses are also capable of cross-modally recognizing humans. However, they may only have documented that horses can discriminate between (a) a familiar female and an unfamiliar male human voice and (b) different attentional states of humans (the latter being recognition of a category, not of individuality). After having been trained to remain immobile for 1 min upon a vocal command given by a single female trainer, subjects were tested by being exposed to the familiar trainer versus a male stranger. In separate trials, each stimulus person gave the familiar command while being visible at the same time and displaying cues of different levels of attention. In trials in which the stimulus persons displayed visual cues of low attention, horses complied less with the strange man’s command and monitored him more. When both stimulus persons were attentive, the subjects also monitored the stranger more, but obeyed the command equally well. These findings, however, can be interpreted without invoking cross-modal recognition. When subjects heard the strange male voice give the familiar command usually spoken by their female trainer, they may have been puzzled as to whether the command was really meant for them (Engh et al. 2006). If so, they would have searched for additional (visual) cues to resolve the puzzle, a well-documented behavior (Basile et al. 2009; Waring 2003), and decided that they did not have to comply if the source of the strange voice standing in their stall displayed visual cues of low attention. The consistent obedience to the trainer, on the other hand, was independent of visual attentional cues and can be explained by the familiar acoustic stimulus alone.

The present study addressed the question of interspecific cross-modal recognition in horses from a different angle by investigating how horses react to human identity cues when exposed to congruent and incongruent combinations of acoustic and visual (olfactory included) information. The subject horses were first exposed to visual/olfactory identity cues, which then vanished and were followed by the playback of voice cues. Their responses to congruent and incongruent cue combinations were recorded during a standardized interval following the onset of the auditory cue. The hypothesis was that horses would be able to distinguish persons cross-modally and thus would show signs of heightened interest by looking more quickly (“response latency”), more often (“number of looks”) and longer (“duration of first look” and “total looking time”) in the direction of an incongruent auditory cue. The experiment thus tested the everyday experience of horse enthusiasts who claim that horses recognize their caretakers even when only hearing their voices without seeing or smelling them, and vice versa.

Materials and methods

Subjects

The 12 subject horses (age, 8–15 years; 8 geldings, 4 mares) were hunter/jumpers who regularly interacted with humans. All had known the familiar stimulus person for at least 6 months (range, 6 months–13 years) and had interacted with him on a daily basis. The horses were new to research studies, under regular veterinary supervision, and had no observable hearing or vision problems. All were part of one herd, which normally grazed on the meadows by the stable. Farm stalls were used only for feeding, grooming and tacking up the horses.

Stimulus persons/equipment

All subject horses were tested with the same pair of male human stimuli consisting of one familiar and one completely unfamiliar person, both of whom were trained in good horsemanship. The familiar person, not in charge of feeding, was the farm manager who regularly patted and rode the horses and taught daily riding lessons and whose voice, face and posture were well known to the horses.

Digital voice recordings of the stimulus persons (mono, 44.1 kHz, 16 bit) were obtained using a Fostex MR-8 Digital Multitracker and a Shure PG58 microphone and played back through an Apple iPod and a Sony SS-CBX20 speaker system (at approximately 60 dB, measured from the subject’s position). For each trial, the acoustic cue consisted of a standardized text [“Hey, (name of the horse), what are you doing in there? Are you having a good day today? We have many riding lessons this week, don’t we? The semester has started at JMU. You be a good girl/boy today!”]. A single voice recording of each stimulus person was used, and individual horse names and either “girl” or “boy” were digitally inserted into the voice sample using WavePad Sound Editor, with identical pause durations between the inserted words and the rest of the standard text. This individualized direct address was intended to make the subject horses focus on the acoustic identity cues and to minimize distraction.
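For illustration, the splicing described above can be reproduced with open-source tools. The following is a minimal sketch using the pydub library rather than WavePad (which the authors actually used); all file names and the pause length are hypothetical placeholders.

```python
# Illustrative sketch only: the authors used WavePad Sound Editor; this shows
# equivalent splicing with the open-source pydub library. File names and the
# pause length (PAUSE_MS) are hypothetical placeholders.
from pydub import AudioSegment

PAUSE_MS = 250  # hypothetical standardized pause around inserted words

def build_stimulus(segment_files, name_file, sex_word_file, out_file):
    """Splice an individual horse's name and 'girl'/'boy' into the standard
    text, keeping identical pauses around the insertions.

    segment_files: the standard text split at the two insertion points,
    e.g. ["hey.wav", "middle_text.wav", "today.wav"].
    """
    pause = AudioSegment.silent(duration=PAUSE_MS, frame_rate=44100)
    before, middle, after = (AudioSegment.from_wav(f) for f in segment_files)
    name = AudioSegment.from_wav(name_file)
    sex_word = AudioSegment.from_wav(sex_word_file)
    clip = before + pause + name + pause + middle + pause + sex_word + pause + after
    clip.export(out_file, format="wav")

build_stimulus(["hey.wav", "middle_text.wav", "today.wav"],
               name_file="name_bella.wav", sex_word_file="girl.wav",
               out_file="stimulus_bella.wav")
```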

A Panasonic SDR-S26-K SD Camcorder with an optical image stabilizer was used for videotaping. Video clips were imported with iMovie onto a computer, where Final Cut Express was employed for video analysis.

Design

A 2 × 2 within-subject design was used. The independent variables were auditory cue (familiar or unfamiliar) and congruency (congruent or incongruent trial). The dependent variables were (1) latency to initial response, (2) duration of the first look, (3) total number of looks and (4) total looking time. The order of the auditory cue/congruency combinations was counterbalanced across horses using an incomplete Latin square design (each combination occurred in each ordinal position three times).
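As an illustration of this counterbalancing, the following sketch uses a simple cyclic Latin square, one possible instantiation of the design described above (the paper does not give its actual square), to assign the four cue/congruency combinations to 12 horses so that each combination occupies each ordinal position exactly three times.

```python
# Minimal sketch of the counterbalancing scheme: a 4-condition Latin square
# repeated across 12 horses so that each cue/congruency combination occupies
# each ordinal position exactly three times, as stated above.
from itertools import product

conditions = [f"{cue}/{cong}" for cue, cong in
              product(("familiar", "unfamiliar"), ("congruent", "incongruent"))]

def latin_square(n):
    """Cyclic Latin square: row i is the base row 0..n-1 shifted by i."""
    return [[(i + j) % n for j in range(n)] for i in range(n)]

square = latin_square(4)
for horse in range(12):                  # 12 subjects -> the square repeats 3 times
    order = [conditions[k] for k in square[horse % 4]]
    print(f"horse {horse + 1:2d}: {order}")

# Sanity check: each condition appears in each ordinal position 3 times.
for pos in range(4):
    counts = {c: 0 for c in conditions}
    for horse in range(12):
        counts[conditions[square[horse % 4][pos]]] += 1
    assert all(v == 3 for v in counts.values())
```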

Procedure

Horses were treated in accordance with APA ethical guidelines at all times. To avoid the subjects’ eyes following a stimulus person out of feeding expectations, experimental trials were conducted shortly after the second regular feeding of the day (11 AM); neither of the stimulus persons provided the daily feedings. For each subject horse, at least 4 days elapsed between trials in order to prevent habituation. Each experimental trial consisted of the following steps:

1. The subject horse was placed in its stall and loosely tied in a normal manner so that it could stick its head out of the open stall door. The small video camera was positioned 3 m directly in front of the horse. The stable building’s doors were closed after all persons had exited.

2. The stimulus person walked toward the horse from the side, passed by the horse at a distance of 1.5 m and approached the opposite stall wall, then turned around and moved close to the subject to pat it on the neck, face and shoulder. In this way, the horse saw the person from different angles, while the person-unspecific patting enabled it to smell the person up close. After 57 s of visual/olfactory exposure, the stimulus person, who remained silent throughout the interaction, started to walk out of sight.

3. Following a 12-s delay, the congruent or incongruent voice playback was started. The sound came from behind a wooden wall where the stimulus person had exited. The time intervals (12 s, and a total of 60 s of visual/olfactory exposure) replicated the method of Proops et al. (2009) to ensure comparability.

During the three steps, the subjects experienced nothing unusual. On the busy horse farm, they frequently heard familiar or novel voices behind wooden walls while seeing persons other than those speaking. In the same way, subjects frequently saw new persons without receiving auditory identity cues. Thus, no structural element of the trials, not even the perception of incongruent identity cues, was a novelty that could have interfered with experimental results.

To keep each subject calm and to prevent it from concentrating on distant sounds from the herd outside, two companions of the subject were left in their own stalls during each trial. The risk of habituating these companions, who also served as subjects, was minimal. Closed stall doors prevented them from sticking their heads out and interfering.

Analysis of videotapes

Videos were blind-analyzed in random order frame by frame (frame duration, 0.04 s). A “look” was defined as the horse orienting its nostrils ≤45° to the right or to the left of the hidden loudspeaker with at least one moment (≥120 ms) of fixed gazing. The “beginning” and “end” of a look were defined as the moments when the horse’s head started to move into or out of the ≤45° zone, respectively. The 45° angle was considered reached when (a) the horse’s eyeball facing the loudspeaker disappeared with only the curve of the eye socket remaining visible and (b) the nostril on that same side was out of sight. The ≥120 ms criterion for a moment of fixed gazing was informed by the finding that dogs needed an average of 95 ms to check a blank monitor for novelties (Somppi et al. 2010).
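To make the coding scheme concrete, here is a minimal sketch, not the authors’ actual workflow (they coded frame by frame in Final Cut Express), of how the four dependent measures defined under Design can be derived from a per-frame “looking” annotation at 25 fps (0.04 s per frame). Approximating the ≥120 ms fixed-gaze criterion as a minimum bout length is an assumption of this sketch.

```python
# Sketch: deriving the four dependent measures from a per-frame "looking"
# annotation starting at auditory cue onset (25 fps, 0.04 s per frame).
FRAME_S = 0.04
MIN_LOOK_FRAMES = 3  # 120 ms / 40 ms; minimum-length approximation (assumption)

def look_bouts(frames):
    """Group consecutive 'looking' frames into bouts, keeping bouts of at
    least MIN_LOOK_FRAMES; returns (start_frame, end_frame) pairs."""
    bouts, start = [], None
    for i, looking in enumerate(frames + [False]):  # trailing False closes any open bout
        if looking and start is None:
            start = i
        elif not looking and start is not None:
            if i - start >= MIN_LOOK_FRAMES:
                bouts.append((start, i))
            start = None
    return bouts

def measures(frames):
    bouts = look_bouts(frames)
    if not bouts:
        return None
    return {
        "latency_s": bouts[0][0] * FRAME_S,                       # latency to initial response
        "first_look_s": (bouts[0][1] - bouts[0][0]) * FRAME_S,    # duration of first look
        "n_looks": len(bouts),                                    # total number of looks
        "total_look_s": sum((b - a) * FRAME_S for a, b in bouts), # total looking time
    }

# Example: 1 s away, a 2 s look, 0.5 s away, a 1 s look
frames = [False] * 25 + [True] * 50 + [False] * 12 + [True] * 25
print(measures(frames))  # latency 1.0 s, first look 2.0 s, 2 looks, 3.0 s total
```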

In some trials, subjects were already holding their heads at a ≤45° angle to the speaker when the auditory cue started to play. In these cases, the onset of the voice sample could not be counted as the beginning of a look triggered by the auditory cue, because the look was actually motivated by the visual/olfactory cue. However, if in these trials a horse kept looking in the direction of the speaker after the onset of the voice tape and started (a) to narrow the angle to the speaker and/or (b) to blink, this was counted as the beginning of a “look.” This decision was based on the observation that all subjects, when starting to move their heads from a >45° into a ≤45° position in order to gaze in the direction of the acoustic cue, almost always (95.7 %) blinked at the beginning of this movement, suggesting that blinking can be interpreted as a refocusing of attention, a phenomenon that has not previously been reported for horses. It has been documented, however, that in mammals a saccadic eye movement from fixation on one point to another is usually accompanied by a lid lowering (“blink”), which must be distinguished from the even faster spontaneous blink of equal amplitude (Evinger 1995; Evinger et al. 1984, 1991). Thus, blinking is more than keeping the cornea moist; it can be related to cognitive processing (Evinger 1995; Bacher and Smotherman 2004), often marking the end of a cognitive task (Evinger 1995). This matches the physiological finding that, with each blink, the eyeball, even in small mammals, is retracted into the socket and the eye rotated into a centered position (Evinger 1995; Evinger et al. 1984), from where it has to be repositioned in the direction required by the visual task. The decision to interpret the first blink in six videos as a refocusing of attention parallels previous documentation that, when an auditory cue triggers their interest, horses search for additional, visual cues by gazing in the direction of the source of the sound (Basile et al. 2009; Waring 2003).

Analysis of all videotapes by a second rater yielded interobserver reliabilities (Pearson’s r) of 0.992 (p < 0.0001) for response latency, 0.998 (p < 0.0001) for duration of first look, 0.960 (p < 0.0001) for number of looks and 0.987 (p < 0.0001) for total looking time.
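For reference, a sketch of this reliability computation with SciPy; the rater scores shown are invented placeholders, one pair of arrays per dependent variable.

```python
# Sketch of the interobserver reliability computation: Pearson's r between
# the two raters' scores across trials, computed per dependent variable.
from scipy.stats import pearsonr

rater1 = [6.2, 13.8, 4.1, 9.9, 2.0, 7.5]   # e.g., response latencies (s); placeholders
rater2 = [6.1, 14.0, 4.0, 10.1, 2.2, 7.4]
r, p = pearsonr(rater1, rater2)
print(f"r = {r:.3f}, p = {p:.4g}")
```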

Statistical analysis

To normalize the distributions of scores, data were transformed using log10(x + 1) for “duration of first look” and square root for “latency” and “number of looks” values. A two-way repeated-measures ANOVA was run for each of the four dependent variables, with congruency and auditory cue type as within-subjects factors (α = 0.05).
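The paper does not name its statistics software; the following sketch reproduces the transform-then-ANOVA pipeline with pandas and statsmodels on simulated data. The back-transformation noted in the comments is simply the standard inverse of the stated transforms.

```python
# Minimal sketch of the analysis pipeline (simulated data, not the authors' own).
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

rng = np.random.default_rng(0)
rows = []
for horse in range(12):
    for cue in ("familiar", "unfamiliar"):
        for cong in ("congruent", "incongruent"):
            rows.append({"horse": horse, "cue": cue, "congruency": cong,
                         "latency": rng.gamma(2.0, 4.0)})  # simulated raw latency (s)
df = pd.DataFrame(rows)

# Normalizing transforms stated above: sqrt for latency and number of looks,
# log10(x + 1) for duration of first look. Means reported in the Results are
# back-transformed, i.e. (mean)**2 for sqrt and 10**mean - 1 for log10(x + 1).
df["latency_t"] = np.sqrt(df["latency"])

# Two-way repeated-measures ANOVA, both factors within subjects.
res = AnovaRM(df, depvar="latency_t", subject="horse",
              within=["cue", "congruency"]).fit()
print(res)
```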

Unpaired t tests were used to test whether the horses’ gender affected overall recognition abilities. As in Proops et al. (2009), each subject’s overall recognition ability was calculated for each dependent variable by summing each horse’s incongruent trial measurements and subtracting its congruent trial measurements. Pearson’s correlation coefficient was used to examine a possible correlation between overall recognition ability and age.
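Continuing from the previous sketch, the recognition-ability score and the gender and age tests could be computed as follows; the gender assignments and ages below are hypothetical placeholders consistent with the Subjects section.

```python
# Sketch of the 'overall recognition ability' score and the gender/age tests
# described above, reusing df from the previous sketch.
from scipy.stats import pearsonr, ttest_ind

# Per horse: sum of incongruent-trial scores minus sum of congruent-trial
# scores, computed per dependent variable (here: transformed latency).
pivot = df.pivot_table(index="horse", columns="congruency",
                       values="latency_t", aggfunc="sum")
ability = pivot["incongruent"] - pivot["congruent"]

genders = ["gelding"] * 8 + ["mare"] * 4                # 8 geldings, 4 mares (placeholder order)
ages = [8, 9, 10, 10, 11, 12, 12, 13, 13, 14, 15, 15]   # placeholders within the 8-15 range

t, p_t = ttest_ind(ability[[g == "gelding" for g in genders]],
                   ability[[g == "mare" for g in genders]])
r, p_r = pearsonr(ages, ability)
print(f"gender: t = {t:.3f}, p = {p_t:.3f}; age: r = {r:.3f}, p = {p_r:.3f}")
```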

Results

Recognition ability versus age and gender

A significant correlation between age and overall recognition ability was found only for “duration of first look” (Pearson’s r = –0.693, p = 0.012). No significant effects of gender were found.

Results for “response latency”

As predicted, horses showed significantly shorter response latencies to auditory cues in incongruent trials, F(1, 11) = 7.357, p = 0.020, partial η² = 0.401 (means 6.101 vs. 13.935 s, back-transformed values; see Fig. 1). Neither the main effect of auditory cue (F(1, 11) = 1.179, p = 0.301) nor the interaction between auditory cue and congruency (F(1, 11) = 0.666, p = 0.432) was significant.

Fig. 1 Estimated marginal means of dependent variables

Results for “duration of first look”

The duration of horses’ first look in the direction of the auditory cue was significantly longer in incongruent trials than in congruent trials, F(1, 11) = 11.053, p = 0.007, partial η² = 0.501 (means 9.280 vs. 4.559 s, back-transformed values). As illustrated in Fig. 2, there was also a significant interaction between auditory cue and congruency (F(1, 11) = 8.088, p = 0.016, partial η² = 0.424): when exposed to unfamiliar auditory cues, horses looked significantly longer in incongruent trials. The interaction indicates that after having seen and smelled the familiar person, horses on average looked longer in the direction of the loudspeaker than after visual/olfactory exposure to the stranger (means 8.885 vs. 4.794 s, back-transformed values). There was no significant main effect of auditory cue (F(1, 11) = 0.185, p = 0.676).

Fig. 2 Dependent variables as functions of auditory cue and congruency

Results for “number of looks”

As predicted, horses had a significantly higher number of looks in the direction of the auditory cue in incongruent trials than in congruent trials, F(1, 11) = 6.162, p = 0.030, partial η² = 0.359 (means 2.55 vs. 1.76, back-transformed values). No other effects were significant (auditory cue, F(1, 11) = 0.311, p = 0.588; interaction of auditory cue and congruency, F(1, 11) = 0.136, p = 0.719).

Results for “total looking time”

Horses spent significantly more time looking in the direction of the auditory cue in incongruent trials than in congruent trials, F(1, 11) = 5.352, p = 0.041, partial η² = 0.327 (means 23.993 vs. 17.242 s). Again, neither the main effect of auditory cue (F(1, 11) = 1.142, p = 0.308) nor the interaction between auditory cue and congruency (F(1, 11) = 2.866, p = 0.119) was significant.

Discussion

Effect of congruency

The present study investigated whether domestic horses are capable of cross-modal recognition of familiar humans. Subject horses responded to incongruent visual (and olfactory) and auditory cues with more curiosity than to congruent ones, looking sooner, more often and longer in the direction of the incongruent auditory cue. When the voice cue belonged to the stranger, it violated the expectation subjects had formed after just having seen (and smelled) the familiar stimulus person. Conversely, subjects showed more interest when they heard the familiar voice after just having seen (and smelled) the stranger, who had disappeared in the direction from which the voice cue originated. The findings suggest that horses are capable of integrating multisensory identity cues of a familiar human into a cognitive representation that is independent of sensory modality. In this way, they recognize familiar humans when they hear their voices without seeing or smelling them, and vice versa.

While the findings indicate that the subject horses had formed a cross-modal representation of the familiar person prior to the trials, they would not have created such a representation of the unfamiliar person during the trials. The subject horses were never exposed to this person’s visual/olfactory and auditory identity cues at the same time, and research has shown that primates (Pascalis and Bachevalier 1998; Adachi and Fujita 2007) and sheep (Peirce et al. 2001) need training before they can recognize specific human faces. In monkeys, the neuronal network responsible for individual face recognition most likely can be established only by prolonged practice, as this network is distributed over several temporal cortical areas (Desimone 1991; Pascalis and Bachevalier 1998).

Effect of interaction and effect of auditory cue

The significant interaction between congruency and auditory cue for “duration of first look” indicates that the subjects’ first look lasted longer after they had seen the familiar person, regardless of the congruency condition. Subjects may have expected more interaction with him than just 1 min of visual exposure, since they were used to interacting with him on a daily basis. A similar effect can be observed in human infants, who continue to look toward where their caretaker disappeared, exhibiting what Cohen (2004) called a “preference for familiarity prior to a preference for novelty.” In the present study, however, this familiarity effect was absent later in time after the stimulus person disappeared (i.e., for number of looks and total looking time); significant interaction effects were not found for the other three dependent variables, suggesting that horses did not respond primarily on the basis of differing levels of familiarity.

This is further indicated by the fact that the main effect of auditory cue was not significant in any of the analyses. In within-species social contexts, horses are more responsive (increased vigilance, larger angle of head rotation) to the calls of unfamiliar horses than to those of familiar ones when preparing for potential dyadic encounters with an unfamiliar conspecific (Lemasson et al. 2009); the present interspecific study did not reproduce this phenomenon, suggesting that an unfamiliar human was perceived as having less potential for competition or conflict than a strange conspecific.

Age and recognition ability

The significant correlation between age and overall recognition ability, found only for “duration of first look,” suggests higher reactivity in younger animals, who expressed their heightened interest in incongruent trials more intensely in their first response than older animals did. Because no significant correlations between age and recognition ability were found for the other three dependent variables, there was not enough evidence to conclude that the ability to recognize humans cross-modally is more developed in younger than in older horses. Proops et al. (2009) likewise found no significant correlation between age and equine within-species recognition ability.

Conclusion

The present study reports evidence that domestic horses are able to recognize humans cross-modally. Cross-modal integration of identity cues would be evolutionarily advantageous for a prey and flight animal such as the horse. Moreover, it has allowed the domestic horse, in its long co-evolution with humans, to easily recognize the humans with whom it interacts and on whom it depends on a regular basis. Follow-up studies should test equine cross-modal recognition capacity by pairing equally familiar stimulus persons. While negative results would be predicted for an unfamiliar–unfamiliar condition, the present findings make it likely that subject horses would seize on incongruency between familiar persons. In addition, further investigation is needed to establish which modes of sensory stimuli are most influential for equine cross-modal recognition of humans, comparing, for example, vision with olfaction. It would also be desirable to study the neuronal networks involved in equine cross-modal recognition for comparison with investigations of human brains (e.g., Calvert 2001).