DEVELOPING EFFECTIVE TEST SETS AND METRICS FOR EVALUATING AUTOMATED MEDIA ANALYSIS SYSTEMS

John W. Mateer
University of York, UK

ABSTRACT
This paper first looks at current methods of evaluating automated content-based media analysis systems. Several key deficiencies are identified, particularly with regard to test set creation and metric design. A new framework is proposed that better reflects real-world conditions and end-user requirements. This is based on the author's experience as a professional filmmaker and researcher in this domain. Specific approaches for data set selection, including the importance of understanding the physical, production and aesthetic attributes of footage, are presented. A discussion of related evaluation methods and means of effective assessment follows. It is hoped the suggestions proposed will facilitate more effective analysis of these systems.

1. INTRODUCTION

As research into automated media analysis has matured, claims have emerged that low-level attributes, such as cut location and basic camera movement, are obtainable with consistently high degrees of accuracy (for example [1-3]). This has led to the impression that these problems are essentially solved and thus do not warrant further investigation. The emphasis of research now appears to have shifted toward techniques for extracting higher-level semantic information, a seemingly more challenging task. But how can we be certain of the true effectiveness of any of these techniques in a real-world context? How far has the state of the art actually advanced? Are any claims in this area validated? It has been suggested that end-user requirements must be fully considered if content-based media analysis systems are to be truly viable [4], yet few have seemed to heed this call. As these systems will be used for archiving and professional post-production – both highly precise disciplines – there is a clear need for a common set of metrics based on a formal understanding of these domains. Cinematic production techniques, physical media properties and the history of the usage and application of media should formally be considered if proper evaluation is to take place.

2. THE EVALUATION FALLACY

To date, evaluation of automated media systems has typically consisted of trials conducted with whatever footage is at hand: easily accessed broadcast television programs, promotional videos produced by the organization or feature films rented from a video store. On the face of it, such test sets would seem to be good indicators of system performance. In actual fact, the quality of the analysis depends critically on the specific footage chosen, how well the researcher understands the characteristics of that footage and what he or she expects to learn from the trial. The physical, technical and aesthetic makeup of the test set must be thoroughly understood if accurate conclusions are to be drawn. Many factors can affect system performance and it is vital that these be identified (specific attributes and their impact are described in section 3). The majority of studies that have been conducted use different test sets, rendering comparison between competing approaches virtually impossible. Even trials conducted using different films with similar basic characteristics (i.e., genre, date of production, director, etc.) may not yield comparable results, for reasons such as the cinematographic or editing techniques employed, to name but two. To properly compare systems, common footage in a common format must be used. Fortunately this deficiency has not gone completely unnoticed. The Text Retrieval Conference Video Retrieval Evaluation (TRECVid [5]) was established specifically to enable direct comparison of competing techniques. With a clear task structure, analysis criteria and a consistent marking scheme, it is generally well conceived. However, as carefully designed as TRECVid has been, its test sets have not been chosen with a full appreciation of the range of real-world footage nor of true end-user needs. For example, in 2002 one task was to test techniques for shot boundary detection – hard cuts as well as gradual transitions such as fades and dissolves. The test set was composed principally of industrial documentaries, old promotional films and home movies – a seemingly good mix of footage. Upon closer examination, though, it becomes clear that many real-world conditions are not represented.
Attributes and techniques such as fast-paced montage (where several consecutive shots have very short duration), jump cuts and scenes with heavy occlusion or strong relative subject-camera movement are all lacking. Indeed, even more basic conditions such as drop frames, match transitions and lighting changes are significantly underrepresented. As a result, the findings from TRECVid are skewed and do not adequately reflect system performance on the vast range of conditions present either in production footage or in archives spanning over 100 years. Given that shot boundary detection forms the backbone of a vast number of content analysis tasks, this is a major shortcoming in an otherwise highly laudable initiative.

3. CONSIDERATIONS FOR TEST SETS

The creation of a challenging yet fair test set for evaluating automated media analysis tools requires a recognition and understanding of numerous footage characteristics. This is not to say that all features will be relevant to a specific area being tested. However, physical, production and aesthetic attributes are closely interrelated – any one can have a profound impact on the interpretation of another – therefore it is important to consider them together.

3.1. Physical Media Attributes

Media footage varies greatly in quality. Attributes such as substrate density, tears, marks, flicker and the use of splice tape can drastically affect the parsing of film. In the same way, tape stock, format, encoding standard (i.e., NTSC, PAL, etc.) and generational loss can affect video. In both media, frame rates and aspect ratios must equally be considered, particularly with regard to films transferred to video, where fundamental changes can occur depending on the type of transfer (i.e., direct, letterboxed or pan-and-scan). This is particularly pronounced with early film, where frame rates are non-standard. Color characteristics are a vital consideration. Some techniques, such as color histogram analysis, are often ineffective on black-and-white or faded footage. Likewise, certain types of tints, including the hand tinting prevalent in the early days of filmmaking, and even cel-based animation, must be understood and accommodated if a system is to be tested on all types of footage. In the creation of test sets, subtle issues may arise that are not immediately apparent. For example, a modern feature film is typically shot on 35mm film at 24 frames per second and then transferred to NTSC video (with a frame rate of 29.97 fps) using a field insertion process known as 3:2 pull-down. If this footage is then converted to an AVI to facilitate analysis, it can contain regular occurrences of duplicate frames, potentially yielding incorrect detection of a series of freeze sequences or other anomalies. Likewise, if a set of JPEG stills is created directly from the source film, it will contain fewer frames than the AVI, thus possibly negating the validity of direct comparison between systems using the two sets. Attention must be given to the original format of test footage to ensure the test set is valid.
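To make the pull-down issue concrete, the following is a minimal Python sketch (assuming OpenCV and NumPy are available; the function name and the difference threshold are illustrative choices of this sketch, not drawn from the paper) that flags the near-duplicate frames a 3:2 transfer can introduce, so that they can be accounted for before boundary detection is evaluated.

```python
# Illustrative sketch (not from the paper): flags likely duplicate frames
# introduced by 3:2 pull-down when 24 fps film is transferred to NTSC video.
# Assumes OpenCV (cv2) and NumPy are available; the threshold is an assumed
# value that would need tuning for real footage.
import cv2
import numpy as np

def find_pulldown_duplicates(video_path, diff_threshold=1.0):
    """Return indices of frames that are near-identical to their predecessor."""
    cap = cv2.VideoCapture(video_path)
    duplicates = []
    prev_gray = None
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev_gray is not None:
            # Mean absolute pixel difference; a value near zero suggests a repeated frame.
            mean_diff = float(np.mean(cv2.absdiff(gray, prev_gray)))
            if mean_diff < diff_threshold:
                duplicates.append(index)
        prev_gray = gray
        index += 1
    cap.release()
    return duplicates

# In 3:2 pull-down material, duplicates tend to recur roughly every fifth frame;
# a boundary detector unaware of this may report spurious freezes at those points.
```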
3.2. Production Techniques

Directors employ a vast range of cinematographic, aural and editorial techniques to convey information in a style appropriate to their audiences. The genre and intended aim of the piece help to guide the director's approach. Specific methods can be examined individually; however, for the purposes of selecting test footage it is worthwhile to examine the cinematic language being used by the director in order to understand the use of these methods in context. Cinematic language in this instance does not refer to critical constructs but rather to specific styles of filmmaking. Richards defines a number of cinematic languages as used by directors for production [6]. The most common of these is Master Scene Cinema Language, whereby an initial wide shot establishes the scene and subsequent closer shots (e.g., medium and close-up) present the salient information. Camera movement is minimal, the pace of editing is relatively regular and the overall presentation is highly controlled (a good example is Wyler's The Big Country). As a result, source footage using these techniques is less challenging for the identification of basic attributes (e.g., cut location, camera movement, etc.) than footage employing other types of cinematic language. It may, however, be well suited to higher-level analysis (e.g., scene identification, location detection, etc.). Approaches such as Constructive and Collision Cinema Languages, where shots are presented in a consistent pattern and pace so that juxtaposition imparts meaning (as in Eisenstein's Battleship Potemkin), are also likely better suited to evaluating the effectiveness of extracting higher-level information. More modern languages, such as Vorkapich Cinema Language, where an action is broken down and shown using several component shots rather than one longer shot (as exemplified in Katzin's Le Mans), and Cinema Verité, where events are shown with as little intervention as possible (i.e., in terms of camera angle change, editing, lighting, etc.), are best suited to testing robustness under more extreme, real-world conditions. Footage using these languages can contain challenging camera work, including shaky handheld shots, swish pans, snap zooms, selective and rack focus, dynamic moving point-of-view shots and/or shots with high levels of occlusion, making it a good choice for testing the classification of camera work.
The related editorial methods include fast montage, jump cuts, match transitions, freezes and/or fast or slow motion, thus creating significant temporal discontinuity. It is this type of footage that can fully test the effectiveness of boundary detection strategies. By recognizing and understanding how cinematic languages are used and the techniques behind them, test set selection can be made much more accurate and efficient.
3.3. Aesthetics and Historical Context

In order for a test set to best reflect the breadth of conditions present in real-world archives it is important to have an appreciation of the historical context and purpose of test footage. The numerous vaults of unclassified footage span a wide range of eras and genres. To categorize and index them effectively requires an understanding of the context in which they were made and their intended purpose. Genre detection is a key component and an active area of research, yet present test sets fail to reflect an appreciation of the complexities involved. For example, documentaries of the 1920s (such as Flaherty's staged Nanook of the North) differ significantly in style from those of the 1960s (Wiseman's Cinema Verité Titicut Follies, for instance) even though they are within the same genre. Likewise, propaganda films often employ the same presentation style as documentaries, yet their purpose is decidedly different. To reliably classify footage according to genre requires a deep understanding of that genre and a test set that reflects the variety within the domain. It is also important to recognize latent grammars that have evolved with visual media over time, particularly that of "continuity." Continuity and its components – consistency of motion, time and space, and most notably "the line" [7,8] – have been used in the vast majority of programs irrespective of genre. Identifying how continuity is created (or destroyed) by the director – through manipulation of composition, focus, eye line or editing, for example – can provide insight into the extraction of higher-level semantic information. Film theorists have studied this extensively and it is valuable to be clear on concepts such as the effect of deep focus in imparting meaning (as used by Renoir in Grand Illusion) when selecting test footage for semantic analysis. Mast and Cohen provide a relevant and well-conceived compilation of several key film theories [9]. By understanding the specific objectives of a particular trial and selecting test footage based on a considered understanding of the three areas above, evaluation can become much more efficient and effective.

4. EVALUATION STRATEGIES

Once a representative test set has been created, metrics must be used that accurately assess the performance of the system or technique in question in an equally targeted manner. Many researchers have fallen back on the common measures of "precision" and "recall" as indicators. However, these do not fully take into account the complexities of the media domain nor the ultimate needs of end-users. New metrics, based on the requirements of archival and post-production professionals, are necessary if these systems are to be useful.

4.1. "Hard" Versus "Soft" Measures

Many attributes of films or videos are immediately quantifiable. For example, the accuracy of cut detection is easily measured – the location of the cut is either correct, or it is not. To media professionals this must be absolute, as frame accuracy is vital to the editorial process. If a reported cut is actually one frame off, it should be counted as two mistakes – one for the missed cut, the other for the false detection. There are numerous other fundamental characteristics that require such precision – drop frame detection, camera movement classification and location identification, to name but three – and these should also be scored in an equally rigorous manner. Some attributes do not require such precision. For example, locating the exact beginning and end frames of a camera move is desirable, although in practice edits are rarely made using these precise points of the movement. Traditionally, there is a small pause (or 'beat' [6]) where the camera holds before gradually starting the motion, with another beat at the end of the move. Indeed, 'feathered' moves (where the camera starts and stops in a very smooth, graduated motion) make precise start and stop frame detection difficult even for human experts. It is reasonable, therefore, that this type of information be judged with a relative accuracy, typically ±5 frames (though there is no clear consensus for this number). For many types of semantic analysis this approach should be equally valid. The key to designing effective metrics lies in understanding the ultimate use of the system being examined.
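As an illustration of how such 'hard' and 'soft' scoring might be implemented, the following minimal Python sketch (an assumed illustration, not code from any published system; the function name and example values are this sketch's own) scores reported cut frames against a ground-truth list, counting a one-frame miss as both a missed cut and a false detection when the tolerance is zero.

```python
# Illustrative sketch (not from the paper): scoring reported cut frames against
# ground truth under a "hard" (exact-frame) or "soft" (tolerance window) regime.
def score_cuts(reported, ground_truth, tolerance=0):
    """Return (correct, missed, false) counts.

    reported, ground_truth: sorted lists of frame numbers.
    tolerance=0 gives the hard measure: a cut one frame off counts as one
    missed cut plus one false detection. tolerance=5 approximates the soft
    measure suggested for attributes such as camera-move boundaries.
    """
    unmatched = list(ground_truth)
    correct = 0
    false = 0
    for frame in reported:
        # Find an unmatched ground-truth cut within the tolerance window.
        match = next((g for g in unmatched if abs(g - frame) <= tolerance), None)
        if match is not None:
            unmatched.remove(match)
            correct += 1
        else:
            false += 1
    missed = len(unmatched)
    return correct, missed, false

# Example: a cut reported at frame 101 against a true cut at frame 100
# scores (0, 1, 1) under the hard measure but (1, 0, 0) with tolerance=5.
print(score_cuts([101], [100], tolerance=0))  # -> (0, 1, 1)
print(score_cuts([101], [100], tolerance=5))  # -> (1, 0, 0)
```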
4.2. The Importance of Standardized Nomenclature

Media professionals use particular terminology to characterize all aspects of pre-production, production, post-production and archiving. Systems designed to extract content from media should use the same common nomenclature. Camera work and shot types are best described using ubiquitous Hollywood terms (e.g., "pan right to a medium two-shot," "zoom in to an extreme close-up," etc. [7]). Likewise, editing attributes should be categorized in a similar way (e.g., "25 frame dissolve," "30 frame wipe right," etc. [10]). The use of consistent terminology enables the direct comparison of different systems. Asset management should also be performed in a manner consistent with standard industry practice. To date, management of test sets has been done using a variety of methods. Some studies, such as TRECVid, utilize Gregorian day time coding (ISO 8601) as a means of indexing footage. While this standard is gaining acceptance in a number of different communities, it is by no means universal in this domain; the vast majority of post-production and stock footage archives use a reel number/time code metaphor (with SMPTE time code or related variations). Until other standards are firmly established, the latter method should be adopted to simplify direct comparison and thus standardize evaluation. It also has the added benefit of making system integration with existing post-production equipment more efficient.

4.3. Measures of Attributes

Attributes of media footage range greatly and must be classified in ways appropriate to their context. Evaluating the effectiveness of classification techniques requires the use of a number of different measures to provide an overall picture. In our work on ASAP, an automated shot analysis program for post-production [11], Robinson and I developed specific metrics to test the characterization of camera movement. Our approach uses generic techniques that can be applied in a number of contexts. The ground truth log should be prepared by an expert with a full understanding of the test footage (as described in section 3). The accuracy of this log is vital if the evaluation is to be effective. Metrics should be developed based on the significance and relative value of the extracted information. Typically this means first a measure of whether a classification was correct. We check that the attribute identified by the system has extents that overlap with an identical attribute in the ground truth. If it does not, the classification is reported as false, and any different attributes listed in the ground truth are counted as missed. The classification rate per shot (or scene) is calculated as the number of correct attribute classifications divided by the total of correct, false and missed instances. Classification accuracy is calculated as the proportion of time the attribute is listed by both the system and the ground truth as occurring, divided by the total length of time either lists the attribute as being present. Average category classification accuracy simply expands the analysis to gauge the performance of classifying related attributes as a group. Duration-weighted accuracy, defined as the proportion of frames within a shot or scene for which the system and the ground truth report the same attribute present, divided by the total number of frames in the section, emphasizes the overall amount of time the system is correct, so attributes present over a longer period carry more weight than shorter ones.
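A minimal Python sketch of these definitions follows (an illustrative reading only, not the ASAP implementation; the interval-based data model and the function names are assumptions of this sketch).

```python
# Illustrative sketch (assumed data model, not the ASAP implementation):
# attributes are given as (label, start_frame, end_frame) intervals, inclusive.

def overlaps(a, b):
    """True if two (label, start, end) intervals share a label and any frames."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]

def classification_rate(system, truth):
    """correct / (correct + false + missed) for one shot or scene."""
    correct = sum(1 for s in system if any(overlaps(s, t) for t in truth))
    false = len(system) - correct
    missed = sum(1 for t in truth if not any(overlaps(t, s) for s in system))
    total = correct + false + missed
    return correct / total if total else 0.0

def classification_accuracy(system, truth, label):
    """Time the attribute appears in both lists divided by time it appears in either."""
    sys_frames = {f for l, a, b in system if l == label for f in range(a, b + 1)}
    gt_frames = {f for l, a, b in truth if l == label for f in range(a, b + 1)}
    union = sys_frames | gt_frames
    return len(sys_frames & gt_frames) / len(union) if union else 0.0

def duration_weighted_accuracy(system, truth, shot_start, shot_end):
    """Frames on which both report the same attribute, over all frames in the shot."""
    agree = 0
    for f in range(shot_start, shot_end + 1):
        sys_labels = {l for l, a, b in system if a <= f <= b}
        gt_labels = {l for l, a, b in truth if a <= f <= b}
        if sys_labels & gt_labels:
            agree += 1
    return agree / (shot_end - shot_start + 1)

# Example: a 100-frame shot where the system reports a pan over frames 10-60
# against a ground-truth pan over frames 15-70.
system = [("pan right", 10, 60)]
truth = [("pan right", 15, 70)]
print(classification_rate(system, truth))                   # 1.0
print(classification_accuracy(system, truth, "pan right"))  # ~0.75
print(duration_weighted_accuracy(system, truth, 0, 99))     # 0.46
```

The small example at the end shows how a camera-move classification that is correct in kind but imprecise in extent scores perfectly on classification rate while being penalized by the time-based measures.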
Using all of these measures it is possible to gauge relative versus overall effectiveness, which is particularly useful in assessing the extraction of multi-level data (i.e., where attributes are interdependent). Through the use of 'hard' or 'soft' measures, different end-user requirements can be represented. Care must be taken when choosing these, as an inappropriate choice can skew or invalidate the results.

5. CONCLUSIONS

This paper has examined current methods of evaluating automated content-based media analysis systems and identified several deficiencies. It is suggested that any system ultimately intended for professional use should be assessed using representative test sets and measures based on end-user requirements, and judged on its effectiveness in meeting real-world needs. A new framework for developing suitable data sets and appropriate metrics, based on standard methods and practice, has been presented. It is hoped that continued research will help to further develop and refine the approaches described.

6. REFERENCES

[1] R. Lienhart, "Comparison of Automatic Shot Boundary Detection Algorithms," Image and Video Processing VII, Proc. SPIE 3656-29, January 1999.
[2] N. V. Patel and I. K. Sethi, "Video shot detection and characterization for video databases," Pattern Recognition, vol. 30, no. 4, pp. 583-592, April 1997.
[3] P. Bouthemy, M. Gelgon and F. Ganansia, "A Unified Approach to Shot Change Detection and Camera Motion Characterization," Technical Report RR-3304, INRIA, 1997.
[4] S.-F. Chang, "The Holy Grail of Content-Based Media Analysis," IEEE Multimedia, vol. 9, no. 2, pp. 6-10, April-June 2002.
[5] Text Retrieval Conference Video Retrieval Evaluation (TRECVid), http://www-nlpir.nist.gov/projects/t01v/ (checked on 15 January 2003).
[6] R. Richards, A Director's Method for Film and Television, Butterworth-Heinemann, Stoneham, 1992.
[7] S. D. Katz, Film Directing Shot by Shot, Michael Wiese Productions/Focal Press, Stoneham, 1991.
[8] J. V. Mascelli, The Five C's of Cinematography, Silman-James Press, Beverly Hills, 1965.
[9] G. Mast and M. Cohen, Film Theory and Criticism, 4th Edition, Oxford University Press, New York, 1992.
[10] K. Reisz and G. Millar, The Technique of Film Editing, Focal Press, New York, 1968.
[11] J. W. Mateer and J. A. Robinson, "Robust Automated Footage Analysis for Professional Media Applications," Visual Information Engineering 2003, Guildford, UK (in press).