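MNIST brings beginners into deep learning, while the ImageNet dataset has been a major driving force behind the deep learning wave. The 2012 paper "ImageNet Classification with Deep Convolutional Neural Networks" by Krizhevsky, Sutskever, and Hinton set off a "revolution" in computer vision, and that work was built on the ImageNet dataset.
ImageNet contains more than 14 million images covering over 20,000 categories; more than one million of the images carry explicit class labels and annotations of object locations within the image. The details are as follows: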
1)Total number of non-empty synsets: 21841
2)Total number of images: 14,197,122
3)Number of images with bounding box annotations: 1,034,908
4)Number of synsets with SIFT features: 1000
5)Number of images with SIFT features: 1.2 million
COCO (Common Objects in Context) is a new image recognition, segmentation, and captioning dataset with the following features:
1)Object segmentation
2)Recognition in Context
3)Multiple objects per image
4)More than 300,000 images
5)More than 2 Million instances
6)80 object categories
7)5 captions per image
8)Keypoints on 100,000 people
Datasets maintained by Xiaorong Li (PhD, Intelligent Systems Lab Amsterdam; research on video and image retrieval):
Flickr-3.5M: A collection of 3.5 million social-tagged images.
Social20: A ground-truth set for tag-based social image retrieval.
Biconcepts2012test: A ground-truth set for retrieving bi-concepts (concept pairs) in unlabeled images.
neg4free: A set of negative examples automatically harvested from social-tagged images for 20 PASCAL VOC concepts.
Wikipedia featured articles: images (with features) and the corresponding wiki text. See the paper "A New Approach to Cross-Modal Multimedia Retrieval" and the follow-up "On the Role of Correlation and Abstraction in Cross-Modal Multimedia Retrieval", although there is no download link yet.
To our knowledge, this is the largest real-world web image dataset comprising over 269,000 images with over 5,000 user-provided tags, and ground-truth of 81 concepts for the entire dataset. The dataset is much larger than the popularly available Corel and Caltech 101 datasets. Though some datasets comprise over 3 million images, they only have ground-truth for a small fraction of images. Our proposed NUS-WIDE dataset has the ground-truth for the entire dataset.
LabelMe is a web-based image annotation tool that allows researchers to label images and share the annotations with the rest of the community. If you use the database, we only ask that you contribute to it, from time to time, by using the labeling tool.
1521 images with human faces, recorded under natural conditions, i.e. varying illumination and complex background. The eye positions have been set manually.
15,560 pedestrian and non-pedestrian samples (image cut-outs) and 6744 additional full images not containing pedestrians for bootstrapping. The test set contains more than 21,790 images with 56,492 pedestrian labels (fully visible or partially occluded), captured from a vehicle in urban traffic.
The dataset FlickrLogos-32 contains photos depicting logos and is meant for the evaluation of multi-class logo detection/recognition as well as logo retrieval methods on real-world images. It consists of 8240 images downloaded from Flickr.
30000+ frames with vehicle rear annotation and classification (car and trucks) on motorway/highway sequences. Annotation semi-automatically generated using laser-scanner data. Distance estimation and consistent target ID over time available.
Phos is a color image database of 15 scenes captured under different illumination conditions. More particularly, every scene of the database contains 15 different images: 9 images captured under various strengths of uniform illumination, and 6 images under different degrees of non-uniform illumination. The images contain objects of different shape, color and texture and can be used for illumination invariant feature detection and selection.
California-ND contains 701 photos taken directly from a real user's personal photo collection, including many challenging non-identical near-duplicate cases, without the use of artificial image transformations. The dataset is annotated by 10 different subjects, including the photographer, regarding near duplicates.
A dataset for testing object class detection algorithms. It contains 255 test images and features five diverse shape-based classes (apple logos, bottles, giraffes, mugs, and swans).
A dataset for Attribute Based Classification. It consists of 30475 images of 50 animals classes with six pre-extracted feature representations for each image.
The PubFig database is a large, real-world face dataset consisting of 58,797 images of 200 people collected from the internet. Unlike most other existing face datasets, these images are taken in completely uncontrolled situations with non-cooperative subjects.
The data set contains 3,425 videos of 1,595 different people. The shortest clip duration is 48 frames, the longest clip is 6,070 frames, and the average length of a video clip is 181.3 frames.
The Microsoft Research Cambridge-12 Kinect gesture data set consists of sequences of human movements, represented as body-part locations, and the associated gesture to be recognized by the system.
This dataset contains 250 pedestrian image pairs + 775 additional images captured in a busy underground station for the research on person re-identification.
Face tracks, features and shot boundaries from our latest CVPR 2013 paper. It is obtained from 6 episodes of Buffy the Vampire Slayer and 6 episodes of Big Bang Theory.
ChokePoint is a video dataset designed for experiments in person identification/verification under real-world surveillance conditions. The dataset consists of 25 subjects (19 male and 6 female) in portal 1 and 29 subjects (23 male and 6 female) in portal 2.
The set was recorded in Zurich, using a pair of cameras mounted on a mobile platform. It contains 12'298 annotated pedestrians in roughly 2'000 frames.
MIT traffic data set is for research on activity analysis and crowded scenes. It includes a traffic video sequence of 90 minutes long. It is recorded by a stationary camera.
This dataset contains videos of crowds and other high density moving objects. The videos are collected mainly from the BBC Motion Gallery and Getty Images websites. The videos are shared for research purposes only. Please consult the terms and conditions of use of these videos from the respective websites.
Contains hand-labelled pixel annotations for 38 groups of images, each group containing a common foreground. Approximately 17 images per group, 643 images total.
Image segmentation and boundary detection. Grayscale and color segmentations for 300 images, the images are divided into a training set of 200 images, and a test set of 100 images.
For the CAVIAR project a number of video clips were recorded acting out the different scenarios of interest. These include people walking alone, meeting with others, window shopping, entering and exiting shops, fighting and passing out and last, but not least, leaving a package in a public place.
24 scenarios recorded with 8 IP video cameras. The first 22 scenarios contain a fall and confounding events; the last 2 contain only confounding events.
This dataset consists of a set of actions collected from various sports which are typically featured on broadcast television channels such as the BBC and ESPN. The video sequences were obtained from a wide range of stock footage websites including BBC Motion gallery, and GettyImages.
This dataset features video sequences that were obtained using an R/C-controlled blimp equipped with an HD camera mounted on a gimbal. The collection represents a diverse pool of actions featured at different heights and aerial viewpoints. Multiple instances of each action were recorded at different flying altitudes, which ranged from 400 to 450 feet, and were performed by different actors.
The dataset was captured by a Kinect device. There are 12 dynamic American Sign Language (ASL) gestures, and 10 people. Each person performs each gesture 2-3 times.
Contains six types of human actions (walking, jogging, running, boxing, hand waving and hand clapping) performed several times by 25 subjects in four different scenarios: outdoors, outdoors with scale variation, outdoors with different clothes and indoors.
The Hollywood-2 dataset contains 12 classes of human actions and 10 classes of scenes distributed over 3669 video clips and approximately 20.1 hours of video in total.
This dataset contains 5 different collective activities : crossing, walking, waiting, talking, and queueing and 44 short video sequences some of which were recorded by consumer hand-held digital camera with varying view point.
The dataset is designed to be more realistic, natural, and challenging for video surveillance domains than existing action recognition datasets in terms of its resolution, background clutter, diversity in scenes, and human activity/event categories.
Collected from various sources, mostly from movies, and a small proportion from public databases, YouTube and Google videos. The dataset contains 6849 clips divided into 51 action categories, each containing a minimum of 101 clips.
Fully annotated dataset of RGB-D video data and data from accelerometers attached to kitchen objects capturing 25 people preparing two mixed salads each (4.5h of annotated data). Annotated activities correspond to steps in the recipe and include phase (pre-/ core-/ post) and the ingredient acted upon.
50 Salads - fully annotated 4.5 hour dataset of RGB-D video + accelerometer data, capturing 25 people preparing two mixed salads each (Dundee University, Sebastian Stein)
PRINTART: Artistic images of prints of well known paintings, including detail annotations. A benchmark for automatic annotation and retrieval tasks with this database was published at ECCV. (Nuno Miguel Pinho da Silva)
The BOSS project aims at developing an innovative and bandwidth efficient communication system to transmit large data rate communications between public transport vehicles and the wayside. In particular, the BOSS concepts will be evaluated and demonstrated in the context of railway transport. As a matter of fact, security issues, traditionally covered in stations by means of video-surveillance are clearly lacking on-board trains, due to the absence of efficient transmission means from the train to a supervising control centre. Similarly, diagnostic or maintenance issues are generally handled when the train arrives in stations or during maintenance stops, which prevents proactive actions to be carried out.
The dataset includes 15 sequences shot by 9 cameras and 8 microphones, all synchronized to allow 3D video/audio reconstruction.
In these datasets, we can find the following events:
- Cell phone theft (in Spanish language).
- Check out - a passenger checking out another man's wife, then fighting (in French language).
- Disease - a series of 3 passengers fainting, alone in the coach (both in French and Spanish).
- Disease in public (both in French and Spanish).
- Harass - 3 sequences in which a man harasses a woman. In "Harass2", there are other passengers in the coach.
- Newspaper - two sequences (one in French, one in Spanish) in which a passenger harasses another passenger for his newspaper, and ends up assaulting him.
- Panic (in French language) - a passenger notices a fire in the next coach, and everybody runs out of the train.
- Two more sequences are provided, containing no incidents whatsoever. They were shot to assess the robustness of incident detection software to false alarms.
- Other sequences are provided, which are not acted incidents but were used for specific incident detection tasks.
Metadata:
Events generated by the BOSS processing are given for some sequences, in a file called "nameofthesequence.xml", in the same directory as the data set of the sequence itself. The format and types of the events are described in a PDF file.
Contextual info:
All the sequences were shot in a Madrid suburban train kindly lent by RENFE who are gratefully acknowledged.
In order to allow as much flexibility as possible, all the video files are uncalibrated, the calibration files are provided along with each sequence and the description of how to use them is given in calibTutorial.pdf . An associated Matlab library is provided in BOSScalibTutorial.zip.
Comments:
Copyrights:
The sequences are provided free of charge for academic research. For any other use, please ask the contact person. Should you care to publish these sequences or results obtained using them, please indicate their origin as "BOSS project", and mention the address of the project: http://www.celtic-boss.org.
You are welcome to provide a link to the location of the sequences, but copying them to another web site is subject to prior consent of the contact person.
The objective of the EMAV 2009 (European Micro Aerial Vehicle Conference and Flight Competition) conference is to provide an effective and established forum for discussion and dissemination of original and recent advances in MAV technology. The conference program will consist of a theoretical part and a flight competition. We aim for submission of papers that address novel, challenging and innovative ideas, concepts or systems. We particularly encourage papers that go beyond MAV hardware, and address issues such as the collaboration of multiple MAVs, applications of computer vision, and non-GPS based navigation.
Dataset:
For computer vision researchers an image set is published. The set consists of photos taken with various MAV platforms at different locations. The photos are always stills from movies made by the platform. For this EMAV, there is no explicit assignment or competition linked to this data set. However, possible tasks with the data set are: segmentation of the images in meaningful entities, specific object recognition (cars / roads), construction of image mosaics on the basis of the films, etc.
The Caltech Pedestrian Dataset consists of approximately 10 hours of 640x480 30Hz video taken from a vehicle driving through regular traffic in an urban environment. About 250,000 frames (in 137 approximately minute long segments) with a total of 350,000 bounding boxes and 2300 unique pedestrians were annotated.
Metadata:
The annotation includes temporal correspondence between bounding boxes and detailed occlusion labels. More information can be found in our CVPR09 paper.
Associated Matlab code is available. The annotations use a custom "video bounding box" (vbb) file format. The code also contains utilities to view seq files with the annotations overlaid, evaluation routines used to generate all the ROC plots in the paper, and the vbb labeling tool used to create the dataset (a slightly outdated video tutorial of the labeler is also available).
Contextual info:
Comments:
Copyrights:
Contact:
pdollar[at]caltech.edu
NGSIM
Website:
Datasets are available here (registration is needed):
This dataset consists of meeting room scenarios, with two people sitting around meeting tables.
Around two-thirds of the data has been elicited using a scenario in which the participants play different roles in a design team, taking a design project from kick-off to completion over the course of a day. The rest consists of naturally occurring meetings in a range of domains.
Metadata:
Annotations are available for many different phenomena (dialog acts, head movement etc. ).
MORYNE aims at contributing to greater transport efficiency, increased transport safety and more environmentally friendly transport by improving traffic management in urban and suburban areas.
Dataset:
There are sequences from both demonstration buses of the MORYNE project.
Filenames explicitly provide the date and time of acquisition.
Metadata:
Ground truth is provided in XML format as follows:
<event>
<time>2008-01-18T10:05:10.747209</time>
<name>ODOINFO
This file gives the distance covered by the bus during the interval starttime - stoptime.
Contextual info:
.idx files
----------
.idx files contain the date and time for each frame in the sequence. The structure of this file is:
- header of 12 bytes
- For each frame, a structure of 24 bytes
The structure contains:
- unsigned 32 bits integer: seconds since Epoch
- unsigned 32 bits integer: microseconds in the second
- unsigned 64 bits integer: offset in bytes in the .avi file
- unsigned 32 bits integer: frame number starting with 0
- unsigned 32 bits integer: frame type as defined by libavcodec (may be useless)
All integers are encoded in little endian.
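The .idx layout described above maps directly onto a fixed-size binary record, so it can be read with Python's standard `struct` module. The following is only a sketch under stated assumptions: the names `FrameEntry` and `parse_idx` are hypothetical, the 12-byte header is skipped because its contents are not described here, and the "Epoch" is assumed to be the Unix epoch in UTC.

```python
import struct
from datetime import datetime, timedelta, timezone
from typing import List, NamedTuple

HEADER_SIZE = 12                   # header bytes; contents not described in the text
RECORD = struct.Struct("<IIQII")   # little endian: u32, u32, u64, u32, u32 = 24 bytes


class FrameEntry(NamedTuple):
    seconds: int        # seconds since Epoch
    microseconds: int   # microseconds within that second
    avi_offset: int     # byte offset of the frame in the .avi file
    frame_number: int   # frame number, starting at 0
    frame_type: int     # frame type as defined by libavcodec

    def timestamp(self) -> datetime:
        # Assumption: "Epoch" means the Unix epoch and times are in UTC.
        base = datetime.fromtimestamp(self.seconds, tz=timezone.utc)
        return base + timedelta(microseconds=self.microseconds)


def parse_idx(data: bytes) -> List[FrameEntry]:
    """Decode the per-frame records that follow the 12-byte header."""
    body = data[HEADER_SIZE:]
    # Ignore a trailing partial record, if any, instead of failing.
    usable = len(body) - len(body) % RECORD.size
    return [FrameEntry(*fields) for fields in RECORD.iter_unpack(body[:usable])]
```

A typical call would be `parse_idx(open(path, "rb").read())`, after which `avi_offset` can be used to seek to a given frame in the matching .avi file.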
Comments:
The material for camera calibration and bus speed/context metadata will be added as soon as possible.
Copyrights:
This folder contains a list of test sequences which have been recorded for the MORYNE project (http://www.fp6-moryne.org).
They can be used for non-commercial purposes only, provided a reference to the MORYNE project accompanies their use (e.g. in publications, video demonstrations...).
These are the smoothed flow sequences for the Waverly train station scene. There are 4 numbered files. File (002) is used for testing; the remaining files are used for training.
Data for the simulated scene
These are the smoothed flow sequences for the train station simulation. There are 30 files divided into the groups below. Use frames 1100 to 4000. The emergency is at frame 2000.
Group 1: Normal - Training
Group 2: Normal - Testing
Group 3: Emergency - Blocked exit at the bottom of the scene.
A number of video clips were recorded acting out the scenario of interest: left objects. 31 sequences of two minutes have been recorded, showing different left-object scenarios (1 or more objects, person staying close to the left object, etc).
The 31 scenarios have been recorded using 2 different cameras (not synchronised), with two different views:
- a Panasonic camera - miniDV, model NV-DS28EG (camera1)
- a Sony camera - miniDV, model DSR-PD170P (camera2)
The videos have the following characteristics:
- A resolution of 720x576 pixels
- 25 frames per second
- A compression using MPEG4
- File sizes of 75 MB for camera1 and 65 MB for camera2.
Metadata:
All the sequences are annotated using XML format. Each sequence is associated with an annotation file of the same name ending in .gt.xml.
For each left object, we can find in the xml:
- the exact time of the detection
- the position of the object in the image
Contextual info:
Comments:
In each sequence, nothing happens before 30 seconds or after 1m45s.
Copyrights:
Free download from website. If you publish results using the data, please acknowledge the data as coming from the CANTATA project, found at URL: http://www.hitech-projects.com/euprojects/cantata/. THE DATASET IS PROVIDED WITHOUT WARRANTY OF ANY KIND
Traffic intersection sequence recorded at the Durlacher-Tor-Platz in Karlsruhe by a stationary camera (512 x 512 grayvalue images (GIF-format))
Traffic intersection sequence recorded at the Ettlinger-Tor in Karlsruhe by a stationary camera (512 x 512 grayvalue images (GIF-format))
Traffic intersection sequence recorded at the Nibelungen-Platz in Frankfurt by a stationary camera (720 x 576 grayvalue images (GIF-format))
Traffic sequence showing the intersection Karl-Wilhelm-/ Berthold-Straße in Karlsruhe, recorded by a stationary camera (740 x 560 grayvalue images (GIF-format))
Another traffic sequence showing the intersection Karl-Wilhelm-/ Berthold-Straße in Karlsruhe, recorded by a stationary camera (702 x 566 grayvalue images (PM-format))
Traffic sequence showing the intersection Karl-Wilhelm-/ Berthold-Straße in Karlsruhe, recorded by a stationary camera (768 x 576 grayvalue images (PGM-format),normal conditions)
Traffic sequence showing the intersection Karl-Wilhelm-/ Berthold-Straße in Karlsruhe, recorded by a stationary camera (768 x 576 grayvalue images (PGM-format),normal conditions)
Traffic sequence showing the intersection Karl-Wilhelm-/ Berthold-Straße in Karlsruhe, recorded by a stationary camera (768 x 576 color images (PPM-format),heavy fog)
Traffic sequence showing the intersection Karl-Wilhelm-/ Berthold-Straße in Karlsruhe, recorded by a stationary camera (768 x 576 color images (PPM-format),heavy snowfall)
Traffic sequence showing the intersection Karl-Wilhelm-/ Berthold-Straße in Karlsruhe, recorded by a stationary camera (768 x 576 color images (PPM-format),snow on lanes)
Traffic sequence showing an intersection at Rheinhafen, Karlsruhe (688 x 565 grayvalue images (PM.GZ-format))
Traffic sequence showing a taxi in Hamburg(256 x 191 grayvalue images (PGM-format))
Metadata:
Camera projection data in the file proj.dat which uses the following format:
tx ty tz # Translation vector Global <---> Camera Coordinates
r11 r12 r13 # \
r21 r22 r23 # > 3x3 Rotation Matrix Global <---> Camera
r31 r32 r33 # /
fx # Focal length x-direction (pixels)
fy # Focal length y-direction (pixels, usually 4/3 * fx)
x0 # Image Center X (pixels)
y0 # Image Center Y (pixels)
1 # Sharp shadows visible (1=true, 0=false)
phi # Azimuth angle for shadow
theta # Polar angle for shadow
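As a rough illustration of the proj.dat layout listed above, the sketch below reads the 19 numeric values in order and ignores everything after a `#`. It assumes the values are whitespace-separated numbers and that `#` starts a comment; the names `Projection` and `parse_proj` are hypothetical, and the units of `phi` and `theta` are left as stored because the format description does not state them.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Projection:
    translation: List[float]        # tx, ty, tz
    rotation: List[List[float]]     # 3x3 rotation matrix, row by row
    fx: float                       # focal length, x-direction (pixels)
    fy: float                       # focal length, y-direction (pixels)
    x0: float                       # image center X (pixels)
    y0: float                       # image center Y (pixels)
    sharp_shadows: bool             # 1 = true, 0 = false in the file
    phi: float                      # azimuth angle for shadow (units not stated)
    theta: float                    # polar angle for shadow (units not stated)


def parse_proj(text: str) -> Projection:
    """Parse the contents of a proj.dat file given as a string."""
    tokens: List[str] = []
    for line in text.splitlines():
        # Assumption: everything after '#' on a line is a comment.
        tokens.extend(line.split("#", 1)[0].split())
    if len(tokens) != 19:
        raise ValueError(f"expected 19 values, found {len(tokens)}")
    v = [float(t) for t in tokens]
    return Projection(
        translation=v[0:3],
        rotation=[v[3:6], v[6:9], v[9:12]],
        fx=v[12], fy=v[13], x0=v[14], y0=v[15],
        sharp_shadows=v[16] != 0.0,
        phi=v[17], theta=v[18],
    )
```

Keeping the rotation as three rows mirrors the file layout, so it can be passed to a matrix library without reordering.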
Two different scenarios were realized during the CANDELA project: "Indoor abandoned object" and "road intersection".
o Scenario 1: Abandoned object. The detection of abandoned objects is more or less the detection of idle (stationary or non-moving) objects that remain stationary over a certain period of time. The period of time is adjustable. In several types of scenes, idle objects should be detected. In a parking lot, for example, an idle object can be a parked car or a left suitcase. For this scenario we are not looking at the object types "person" or "car", but at unidentified objects, called "unknown objects". An unknown object is any object that is not a person or a vehicle. In general, unknown objects cannot move. What should be detected? Whenever an unknown object appears in the scene and remains stationary for some amount of time, an alarm needs to be generated. This alarm must remain active as long as the unknown object remains stationary.
o Scenario 2: Road intersection. Persons are allowed to cross the street at zebra crossings, a crossing controlled with lights. Alarms should be generated when persons are not allowed to be on the crossing, or when dangerous scenarios occur (cars driving while people are crossing). Since the external signal from the traffic light is not available (when the crossing is regulated by traffic lights), detection needs to be done automatically. Detection of persons on the crossing itself is pretty easy, but alarms should only be given when persons are on the crossing and cars are driving.
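The alarm rule for the abandoned-object scenario (raise an alarm once an unknown object has been stationary for an adjustable period, and keep it active while the object stays put) can be sketched as follows. This is an illustration only, not the CANDELA implementation: the function name `stationary_alarm`, the input format (a time-ordered list of `(time_seconds, x, y)` positions for one already-detected unknown object), and both thresholds are assumptions.

```python
from typing import List, Optional, Tuple

Observation = Tuple[float, float, float]   # (time in seconds, x, y)


def stationary_alarm(
    track: List[Observation],
    min_idle_seconds: float = 30.0,   # adjustable idle period (assumed default)
    max_move_pixels: float = 2.0,     # tolerance for tracker jitter (assumed default)
) -> List[bool]:
    """Return, for each observation, whether the alarm is active."""
    alarms: List[bool] = []
    rest: Optional[Observation] = None   # where and when the object came to rest
    for t, x, y in track:
        moved = (
            rest is None
            or abs(x - rest[1]) > max_move_pixels
            or abs(y - rest[2]) > max_move_pixels
        )
        if moved:
            # The object appeared or moved: restart the idle timer here.
            rest = (t, x, y)
        # Alarm is active while the object has been idle long enough.
        alarms.append(t - rest[0] >= min_idle_seconds)
    return alarms
```

Because the timer restarts whenever the object moves beyond the tolerance, the alarm clears on its own as soon as the object is no longer stationary, which matches the requirement that it stays active only while the object remains in place.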
Metadata:
Detailed information about the data and metadata can be found here:
The ObjectVideo Virtual Video provides the ability to generate virtual video sequences. These video sequences can then be used to test VCA algorithms.
Metadata:
The automatically generated ground truth is produced in a proprietary binary format. The format is open, and a conversion program can be created to convert the metadata to any format. A simple bounding box scheme is available; for more powerful validation a "blob" video can be created.
Contextual info:
Virtual environment, the user can make his own environment from the internet. Several camera settings can be changed to simulate real-world cameras more closely.
Comments:
This is not a dataset as such, but these tools can be used to create very powerful, tailored test videos.
Copyrights:
The ObjectVideo Virtual Video Tool is provided free for non-commercial use, for your own research and development purposes. If you publish or distribute images, videos or derivative results based on this software, you must acknowledge ObjectVideo by including "ObjectVideo Virtual Video Tool".
To use the ObjectVideo Virtual Video tool a licence for the commercial game Half-Life 2 is needed (www.steampowered.com).
Contact:
Rick Koeleman, VDG-Security bv. rick@vdg-security.com
This is a dataset for multiple people/faces visual detection and tracking. The dataset is composed of 3 sequences (same scenario); 4 targets repeatedly occlude each other while appearing and disappearing from the field of view of the camera. The sequence motinas_multi_face_frontal shows frontal faces only; in motinas_multi_face_turning the faces are frontal and rotated; in motinas_multi_face_fast the targets move faster than in the previous two sequences. Total number of images: 2769, DivX 6 compression, 640 x 480 pixels, 25 Hz.
Sensor details - video camera: JVC GR-20EK
Metadata:
Contextual info:
Comments:
Copyrights:
Requested citation acknowledgment: E. Maggio, E. Piccardo, C. Regazzoni, A. Cavallaro. "Particle PHD filter for multi-target visual tracking", in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2007), Honolulu (USA), April 15-20, 2007
This is a dataset for single person/face visual detection and tracking. The dataset is composed of five sequences with different illumination conditions and resolutions. Three sequences (motinas_toni, motinas_toni_change_ill and motinas_nikola_dark) are shot with a hand held camera (JVC GR-20EK). In motinas_toni the target moves under a constant bright illumination; in motinas_toni_change_ill the illumination changes from dark to bright; the sequence motinas_nikola_dark is constantly dark. Two sequences (motinas_emilio_webcam and motinas_emilio_webcam_turning) are shot with a webcam (Logitech Quickcam) under a fairly constant illumination. Total number of images: 3018, DivX 6 compression, 640 x 480 pixels and 25 Hz (motinas_toni, motinas_toni_change_ill, motinas_nikola_dark), 320 x 240 pixels and 10 Hz (motinas_emilio_webcam and motinas_emilio_webcam_turning).
Metadata:
The ground truth data is available in the .zip files for the sequences motinas_toni and motinas_emilio_webcam. In the ground truth files each line of text describes the objects' position and size in a frame. The syntax of a line is the following: frame number_of_objects obj_1_name x y half_width half_height angle obj_2_name x y half_width half_height angle ...
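The line syntax above (frame, object count, then six fields per object) lends itself to a small parser. The sketch below assumes fields are separated by whitespace, object names contain no spaces, and the numeric fields can be read as floats; the names `TrackedObject` and `parse_ground_truth_line` are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class TrackedObject(NamedTuple):
    name: str
    x: float
    y: float
    half_width: float
    half_height: float
    angle: float


def parse_ground_truth_line(line: str) -> Tuple[int, List[TrackedObject]]:
    """Parse one line: frame number_of_objects, then 6 fields per object."""
    fields = line.split()
    frame = int(fields[0])
    count = int(fields[1])
    expected = 2 + 6 * count
    if len(fields) != expected:
        raise ValueError(f"expected {expected} fields, found {len(fields)}")
    objects: List[TrackedObject] = []
    for i in range(count):
        # Each object occupies six consecutive fields after the two-field prefix.
        name, *numbers = fields[2 + 6 * i : 8 + 6 * i]
        objects.append(TrackedObject(name, *(float(n) for n in numbers)))
    return frame, objects
```

Since position and size are given as a center plus half extents, the bounding box of an object spans `x - half_width` to `x + half_width` horizontally, and likewise vertically, before the rotation by `angle` is applied.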
Contextual info:
Comments:
Copyrights:
Requested citation acknowledgment E. Maggio, A. Cavallaro, "Hybrid particle filter and mean shift tracker with adaptive transition model", in Proc. of IEEE Int. Conference on Acoustics, Speech and Signal Processing (ICASSP 2005), Philadelphia, 19-23 March 2005, pp. 221 - 224.
This is a dataset for uni-modal and multi-modal (audio and visual) people detection and tracking. The dataset consists of three sequences recorded in different scenarios with a video camera and two microphones. Two sequences (motinas_Room160 and motinas_Room105) are recorded in rooms with reverberations. The third sequence (motinas_Chamber) is recorded in a room with reduced reverberations. The camera is placed in the centre of a bar that supports two microphones. Total number of images: 3271, Format of images: 8-bit color AVI 360 x 288 pixels 25 fps, audio sampling rate: 44.1 kHz.
Sensor details:
- The camera is placed in the centre of a bar that supports two microphones
- Distance between the microphones: 95 cm
- Microphones: Beyerdynamic MCE 530 condenser microphones
- Camera: KOBI KF-31CD analog CCD surveillance camera
Metadata:
The ground truth data are provided together with the sequences in the corresponding .zip file, as a list of XML files representing the positions of the objects in the field of view.
Contextual info:
Comments:
Copyrights:
Requested citation acknowledgment Courtesy of EPSRC funded MOTINAS project (EP/D033772/1)
Contact:
Xavier Desurmont, desurmont@multitel.be
ETISEO - Surveillance
Website:
Datasets are available here: (registration is needed)
86 video clips. These sequences constitute a representative panel of different video surveillance areas.
They merge indoor and outdoor scenes, corridors, streets, building entries, subway station... They also mix different types of sensors and complexity levels.
These datasets comprise 24 hours of real sequences showing a level crossing (LC) where some vehicles stop because of its particular configuration: on the right side of the LC there is an avenue running parallel to it, so a traffic light is located just after the LC. Consequently, vehicles sometimes stop on the LC because of this traffic light. The total amount of data is about 7 GB.
Metadata:
For each video file, there is a corresponding ground truth file in XML that gives the timestamps of "stopped vehicles" events.
The dataset comprises two views of various scenarios of people acting out various interactions. Ten basic scenarios were acted out. These were called InGroup (IG), Approach (A), WalkTogether (WT), Split (S), Ignore (I), Following (FO), Chase (C), Fight (FI), RunTogether (RT), and Meet (M). The data is captured at 25 frames per second. The resolution is 640x480. The videos are available either as AVIs or as a numbered set of JPEG single image files.
Metadata:
Tracking, Event detection.
Contextual info:
3D coordinates of points for calibration purposes provided.
Comments:
The site will be updated when more of the ground truth becomes available.
The datasets are multisensor sequences containing the following 3 scenarios, with increasing scene complexity: 1. loitering, 2. attended luggage removal (theft), 3. unattended luggage.
Metadata:
Event Detection
Contextual info:
Calibration provided
Comments:
Free download from website. The UK Information Commissioner has agreed that the PETS 2007 datasets described here may be made publicly available for the purposes of academic research. The video sequences are copyright UK EPSRC REASON Project consortium and permission is hereby granted for free download for the purposes of the PETS 2007 workshop.
Surveillance of public spaces, detection of left luggage events. Scenarios of increasing complexity, captured using multiple sensors.
Metadata:
All scenarios come with two XML files. The first of these files contains camera calibration parameters, these are given in the sub-directory 'calibration'. See the previous section (Calibration Data) for information on this XML file format. The second XML file (given in the sub-directory 'xml') contains both configuration and ground-truth information.
Contextual info:
Calibration provided.
Comments:
Copyrights:
Free download from website. The UK Information Commissioner has agreed that the PETS 2006 data-sets described here may be made publicly available for the purposes of academic research. The video sequences are copyright ISCAPS consortium and permission is hereby granted for free download for the purposes of the PETS 2006 workshop.
Contact:
Dimitrios Makris, d.makris@kingston.ac.uk
PETS - 2005 - WAMOP
Website:
Datasets are available here: (registration is needed)
A number of video clips were recorded acting out the different scenarios of interest. These include people walking alone, meeting with others, window shopping, fighting and passing out and last, but not least, leaving a package in a public place. All video clips were filmed with a wide angle camera lens. The resolution is half-resolution PAL standard (384 x 288 pixels, 25 frames per second) and compressed using MPEG2. The file sizes are mostly between 6 and 12 MB, a few up to 21 MB.
3D coordinates of points for calibration purposes provided.
Comments:
Copyrights:
Free download from website. If you publish results using the data, please acknowledge the data as coming from the EC Funded CAVIAR project/IST 2001 37540, found at URL:http://www.dai.ed.ac.uk/homes/rbf/CAVIAR/
Indoor people tracking (and counting). Two training and four testing sequences consist of people moving in front of a shop window. Sequences are provided as both MPEG movie format and as individual JPEG images.
Metadata:
People tracking, counting and activity recognition.
Outdoor people and vehicle tracking (two synchronised views; includes omnidirectional and moving camera). PETS'2001 consists of five separate sets of training and test sequences, i.e. each set consists of one training sequence and one test sequence. All the datasets are multi-view (2 cameras) and are significantly more challenging than for PETS'2000 in terms of significant lighting variation, occlusion, scene activity and use of multi-view data.
Metadata:
Tracking information on image plane and ground plane can be found at:
4 scenarios (Parked Vehicle, Abandoned Package, Doorway Surveillance and Sterile Zone) x 2 datasets (training, testing) each. Each dataset contains about 24 hours of footage in a few different scenes.
Metadata:
Event-based Ground truth.
Contextual info:
Images of a pedestrian model in different positions are given for calibration purposes
Comments:
7 free clips for 2 scenarios (Parked Vehicle, Abandoned Package) are available from: http://www.elec.qmul.ac.uk/staffinfo/andrea/avss2007_d.html
Copyrights:
A user agreement and a payment (£500-£650 per dataset) are required to obtain each dataset. Datasets are provided on hard disks.
The Digital Database for Screening Mammography (DDSM) is a resource for use by the mammographic image analysis research community. The database contains approximately 2620 cases available in 43 volumes (healthy and diseased).
Metadata:
Images containing suspicious areas have associated pixel-level "ground truth" information about the locations and types of suspicious regions.
Contextual info:
Each study includes two images of each breast, along with some associated patient information (age at time of study, ACR breast density rating, subtlety rating for abnormalities, ACR keyword description of abnormalities) and image information (scanner, spatial resolution, ...). A case consists of between 6 and 10 files. These are an "ics" file, an overview "16-bit PGM" file, four image files that are compressed with lossless JPEG encoding and zero to four overlay files. Normal cases will not have any overlay files.
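The file composition of a case described above (one "ics" file, one overview file, four image files and zero to four overlays, giving 6 to 10 files) can be checked with a short script. This is only a sketch: the file extensions used below (.ics, .16_PGM, .LJPEG, .OVERLAY) and both function names are assumptions for illustration and should be compared against a real DDSM volume.

```python
def classify_case_files(filenames):
    """Count the files of one case by kind, based on assumed extensions."""
    counts = {"ics": 0, "overview": 0, "image": 0, "overlay": 0, "other": 0}
    for name in filenames:
        lower = name.lower()
        if lower.endswith(".ics"):
            counts["ics"] += 1
        elif lower.endswith(".16_pgm"):
            counts["overview"] += 1
        elif lower.endswith(".ljpeg"):
            counts["image"] += 1
        elif lower.endswith(".overlay"):
            counts["overlay"] += 1
        else:
            counts["other"] += 1
    return counts


def is_plausible_case(filenames):
    """True if the counts match the described layout of a case (6 to 10 files)."""
    c = classify_case_files(filenames)
    return (
        c["ics"] == 1
        and c["overview"] == 1
        and c["image"] == 4
        and 0 <= c["overlay"] <= 4
    )
```

A normal case would be expected to pass with zero overlay files.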
Comments:
Copyrights:
If you use data from DDSM in publications:
Please credit the DDSM project as the source of the data, and reference: "The Digital Database for Screening Mammography, Michael Heath, Kevin Bowyer, Daniel Kopans, Richard Moore and W. Philip Kegelmeyer, in Proceedings of the Fifth International Workshop on Digital Mammography, M.J. Yaffe, ed., 212-218, Medical Physics Publishing, 2001. ISBN 1-930524-00-5". "Current status of the Digital Database for Screening Mammography, Michael Heath, Kevin Bowyer, Daniel Kopans, W. Philip Kegelmeyer, Richard Moore, Kyong Chang, and S. MunishKumaran, in Digital Mammography, 457-460, Kluwer Academic Publishers, 1998; Proceedings of the Fourth International Workshop on Digital Mammography". Also, please send a copy of your publication to Professor Kevin Bowyer / Computer Science and Engineering / University of Notre Dame / Notre Dame, Indiana 46530.
Mainly CT, PET and MRI. Additional comments are available. Not all of the datasets are medical; for example, there is a scan of a bonsai. The raw data can be extracted easily using the PVM tools distributed with the V^3 volume rendering package available at http://www.stereofx.org/
Copyrights:
Commercial use is prohibited and no warranty whatsoever is expressed, credit should be given to the group who created the dataset.
Contact:
Stefan Roettger (roettger@cs.fau.de) or Cedric Marchessoux (cedric.marchessoux@barco.com)
MyPACS.net is still free, and it now has over 16,500 teaching files contributed by 14,000 registered users. With 75,000 key images categorized by anatomy and pathology, you can quickly find examples of any disease. The web-based viewer has been improved with more PACS-like features, and it still works instantly in your browser, requiring nothing to download.
The datasets contain:
1. Cranium and Contents (1205)
2. Face and Neck (398)
3. Spine and Peripheral Nervous System (504)
4. Skeletal System (3433)
5. Heart (160)
6. Chest (894)
7. Gastrointestinal (1271)
8. Genitourinary (800)
9. Vascular/Lymphatic (416)
10. Breast (62)
11. Other (458)
Metadata:
Description of the pathology by medical doctors.
Contextual info:
Environment conditions (calibration, scene...): Medical modality described: Brand and acquisition conditions
Comments:
Copyrights:
MyPACS.net is still free, but you need to register.
Contact:
Cedric Marchessoux (cedric.marchessoux@barco.com)
The NCIA (National Cancer Imaging Archive from National Cancer Institute) data base
The user should ask for a login. You may browse, download, and use the data for non-commercial, scientific and educational purposes. However, you may encounter documents or portions of documents contributed by private institutions or organizations. Other parties may retain all rights to publish or produce these documents. Commercial use of the documents on this site may be protected under United States and foreign copyright laws. In addition, some of the data may be the subject of patent applications or issued patents, and you may need to seek a license for its commercial use. NCI does not warrant or assume any legal liability or responsibility for the accuracy, completeness or usefulness of any information in this archive.
Contact:
Cedric Marchessoux (cedric.marchessoux@barco.com)
Conventional x-ray mammography data base
Website:
No official website, via Elizabeth Krupinski (krupinski@radiology.arizona.edu)
Dataset:
Real masses, micro calcifications, backgrounds, conventional x-ray mammography, bmp images with resolution of 256x256.
Metadata:
None. Signals can be extracted by subtraction between backgrounds alone and background+signals at 100% density.
Contextual info:
Environment conditions (calibration, scene...): X-ray system
Around 5 datasets of 250 images, x-ray chest healthy and diseased with nodules. 2048x2048, white is zero, big endian.
Metadata:
Clinical metadata is provided in a txt file for each image, with patient information (age, sex), along with images in itf giving the position of the nodule, cancer or infection.
Contextual info:
Environment conditions (calibration, scene...): X-ray system
Comments:
The dataset should be ordered by email with a Visa card number. The dataset is delivered by post after one week. The price per dataset is more than reasonable.
Copyrights:
For publication credit should be given by citing in references the following article:
J. Shiraishi et al. Development of a Digital Image Database for Chest Radiographs with and without a Lung Nodule: Receiver Operating Characteristic Analysis of Radiologists, Detection of Pulmonary Nodules. AJR, 174(1):71-74, 2000.
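A raw image with the properties listed for this dataset (2048x2048, big endian, white stored as zero) could be read as sketched below. The 16-bit unsigned sample size and the absence of a file header are assumptions, since the description does not state them, and read_raw_image is a hypothetical helper name.

```python
import struct


def read_raw_image(path, width=2048, height=2048, invert=True):
    """Read a headerless raw image of big-endian unsigned 16-bit pixels.

    Assumption: 2 bytes per pixel and no header. With invert=True the
    values are flipped so that larger numbers mean brighter pixels,
    because the data stores white as zero.
    """
    expected = width * height * 2
    with open(path, "rb") as f:
        raw = f.read()
    if len(raw) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(raw)}")
    # ">" selects big-endian byte order, "H" is an unsigned 16-bit integer.
    pixels = struct.unpack(f">{width * height}H", raw)
    if invert:
        peak = max(pixels)
        pixels = [peak - p for p in pixels]
    # Return the image as a list of rows.
    return [list(pixels[r * width:(r + 1) * width]) for r in range(height)]
```

If the byte count check fails, the sample size or image dimensions assumed here probably do not match the file.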
The datasets consist of sets of images for evaluating optical flow.
Each set contains 2 or 8 images, in color or grayscale format.
Metadata:
GT is not provided for all datasets
Contextual info:
Flow accuracy and interpolation evaluation
We report two measures of flow accuracy (angular and end-point error) and two measures of interpolation quality. For each of the 4 measures we report 8 error metrics, resulting in a total of 32 tables. Links to the 4 measures are included below, but the tables are also linked among each other. At this point we do not identify a "default" measure or metric, and thus we do not provide an overall ranking of methods.
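The two flow accuracy measures named above can be sketched per pixel as follows. The formulas are the commonly used definitions (end-point error as the Euclidean distance between flow vectors, angular error as the angle between the vectors (u, v, 1) and (u_gt, v_gt, 1)); the text here does not spell them out, so treat them as an assumption and check them against the benchmark's own description.

```python
import math


def endpoint_error(u, v, u_gt, v_gt):
    """Euclidean distance between the estimated and ground-truth flow vectors."""
    return math.hypot(u - u_gt, v - v_gt)


def angular_error(u, v, u_gt, v_gt):
    """Angle in degrees between (u, v, 1) and (u_gt, v_gt, 1)."""
    num = u * u_gt + v * v_gt + 1.0
    den = math.sqrt(u * u + v * v + 1.0) * math.sqrt(u_gt * u_gt + v_gt * v_gt + 1.0)
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, num / den))
    return math.degrees(math.acos(cos_angle))
```

Appending 1 as a third component keeps the angle well defined when one of the flow vectors is zero.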
Comments:
The ground-truth flow is provided in a .flo format. Information and C++ code is provided in flow-code.zip, which contains the file README.txt. A Matlab version is also available in flow-code-matlab.zip.
Copyrights:
Thanks to Brad Hiebert-Treuer and Alan Lim, who spent countless hours creating the hidden texture datasets.
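For readers who prefer Python to the provided C++ and Matlab code, a .flo file can be read roughly as below. This sketch assumes the layout documented in the README.txt of flow-code.zip: a little-endian float32 check value of 202021.25, then int32 width and height, then interleaved float32 u and v values in row order. Verify it against that README before relying on it; read_flo is a hypothetical helper name.

```python
import struct

FLO_MAGIC = 202021.25  # assumed float32 check value at the start of the file


def read_flo(path):
    """Read a .flo file and return rows of (u, v) tuples."""
    with open(path, "rb") as f:
        (magic,) = struct.unpack("<f", f.read(4))
        if magic != FLO_MAGIC:
            raise ValueError("bad magic number, not a .flo file")
        width, height = struct.unpack("<ii", f.read(8))
        count = 2 * width * height
        data = struct.unpack(f"<{count}f", f.read(4 * count))
    flow = []
    for y in range(height):
        row = []
        for x in range(width):
            base = 2 * (y * width + x)
            row.append((data[base], data[base + 1]))
        flow.append(row)
    return flow
```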
This page gives access to the first acquisition campaign of basketball data during the APIDIS European project.
Dataset:
The dataset is composed of a basketball game.
Seven 2-Mpixel color cameras around and on top of a basketball court
Note: Due to bandwidth limitations, only a part of the basketball game is available from this web site. Please contact us (bottom of this page) for more data.
Metadata:
Time stamp for each frame (all cameras being captured by a unique server at ~22 fps)
Manually annotated basketball events
Manually annotated object positions
Calibration data
Metadata XML files
Annotated events and salient-objects are recorded into two kinds of XML files.
Users could find the syntax of tags of both kinds of metadata in the two following XML Schema Definition (xsd) files: apidis-annotation-ver23.xsd and apidis-salientobj-ver1.xsd.
A simplified structural diagram of event xml files is: http://www.apidis.org/Public/all/metadata/event-xml-simple.png.
You can also find a full view of all tags defined in apidis-annotation-ver23.xsd and their structures here.
All cameras are Arecont Vision AV2100M IP cameras. The datasheets can be downloaded from the manufacturer's site.
Lenses: The fish-eye lenses used for the top view cameras are Fujinon FE185C086HA-1 lenses.
Comments:
Copyrights:
This dataset is available for non-commercial research in video signal processing only. We kindly ask you to mention the APIDIS project when using this dataset (in publications, video demonstrations...).
Contact:
christophe.devleeschouwer(at)uclouvain.be or Damien.Delannay(at)uclouvain.be
The objective of the International Music Information Retrieval Systems Evaluation Laboratory project (IMIRSEL) is the establishment of the necessary resources for the scientifically valid development and evaluation of emerging Music Information Retrieval (MIR) and Music Digital Library (MDL) techniques and technologies.
The RWC (Real World Computing) Music Database is a copyright-cleared music database (DB) that is available to researchers as a common foundation for research.
Metadata:
MIDI files, genre, lyrics
Contextual info:
Comments:
Copyrights:
Users who have submitted the Pledge and received authorization may freely use the database for research purposes without facing the usual copyright restrictions, but all of the copyrights and neighboring rights connected with this database belong to the National Institute of Advanced Industrial Science and Technology and are managed by the RWC Music Database Administrator. Persons or organizations that have not submitted a Pledge and that have not received authorization may not use the database.
Video data (.avi, DivX compressed). Dataset includes three types of sports: European (team) handball (3 synchronized videos, 10 min, 25 FPS, 384x288, Divx 5 AVI), Squash (2 videos from 2 separate matches, 25 FPS, 384x288, DivX AVI) , Basketball (videos only, 2 synchronized overhead videos in 2 quality modes 368x288, 25FPS, 5 minutes each and 720x576, 25 FPS 2 minutes each).
Metadata:
Annotations (individual player actions, group activity). Suitable for use as a gold standard. Trajectories (player positions in court and camera coordinate systems). These are not intended to be used as a gold standard, since their accuracy is not particularly high.
HD progressive images in JPEG format for a synthetic soccer video sequence.
Metadata:
XML (2D and 3D positions of objects and camera)
Contextual info:
no
Comments:
The dataset is fully described in "TRICTRAC Video Dataset: Public HDTV Synthetic Soccer Video Sequences With Ground Truth", X. Desurmont, J-B. Hayet, J-F. Delaigle, J. Piater, B. Macq, Workshop on Computer Vision Based Analysis in Sport Environments (CVBASE), 2006.
Copyrights:
All data is publicly available and downloadable. If you publish results using the data, please acknowledge the data as coming from the TRICTRAC project, found at URL: http://www.multitel.be/trictrac. THE DATASET IS PROVIDED WITHOUT WARRANTY OF ANY KIND.
PETS 2009: Eleventh IEEE International Workshop on Performance Evaluation of Tracking and Surveillance
One-day workshop organised in association with CVPR 2009, supported by the EU project SUBITO.
The datasets for PETS 2009 consider crowd image analysis and include crowd count and density estimation, tracking of individual(s) within a crowd, and detection of separate flows and specific crowd events.
The dataset is organised as follows:
Calibration Data
S0: Training Data
contains sets background, city center, regular flow
S1: Person Count and Density Estimation
contains sets L1,L2,L3
S2: People Tracking
contains sets L1,L2,L3
S3: Flow Analysis and Event Recognition
contains sets Event Recognition and Multiple Flow
Metadata:
Contextual info:
Comments:
Copyrights:
Please e-mail datasets@pets2009.net if you require assistance obtaining these datasets for the workshop.
GavabDB is a 3D face database. It contains 549 three-dimensional images of facial surfaces. These meshes correspond to 61 different individuals (45 male and 16 female), with 9 images for each person. All of the individuals are Caucasian and are between 18 and 40 years old. Each image is given by a mesh of connected 3D points of the facial surface without texture. The database provides systematic variations with respect to the pose and the facial expression. In particular, the 9 images corresponding to each individual are: 2 frontal views with neutral expression, 2 x-rotated views (±30°, looking up and looking down respectively) with neutral expression, 2 y-rotated views (±90°, left and right profiles respectively) with neutral expression and 3 frontal gesture images (laugh, smile and a random gesture chosen by the user, respectively).
Metadata:
Contextual info:
Comments:
Copyrights:
Publications that use this data must reference the following work: A.B. Moreno and A. Sanchez. GavabDB: A 3D Face Database. Proc. 2nd COST Workshop on Biometrics on the Internet: Fundamentals, Advances and Applications, C. Garcia et al. (eds), Ed. Univ. Vigo, pp. 77-82, 2004.
120 persons were asked to pose twice in front of the system: in November 97 (session 1) and in January 98 (session 2). For each session, 3 shots were recorded with different (but limited) orientations of the head: straight forward / left or right / upward or downward.
Among the 120 people, two thirds consist of students from the same ethnic origins and with nearly the same age. The last third consists of people of the academy, all aged between 20 and 60.
Different problems encountered in the cooperative scenario were taken into account. People sometimes wore their spectacles, sometimes didn't. Beards and moustaches were represented. Some people smiled in some shots. Small up/down and left/right rotations of the head were requested. We regret that only a few (14) women were available.
Human action in video sequences can be seen as silhouettes of a moving torso and protruding limbs undergoing articulated motion. We regard human actions as three-dimensional shapes induced by the silhouettes in the space-time volume. We adopt a recent approach by Gorelick et. al. for analyzing 2D shapes and generalize it to deal with volumetric space-time action shapes. Our method utilizes properties of the solution to the Poisson equation to extract space-time features such as local space-time saliency, action dynamics, shape structure and orientation. We show that these features are useful for action recognition, detection and clustering. The method is fast, does not require video alignment and is applicable in (but not limited to) many scenarios where the background is known. Moreover, we demonstrate the robustness of our method to partial occlusions, non-rigid deformations, significant changes in scale and viewpoint, high irregularities in the performance of an action and low quality video.
The current video database contains six types of human actions (walking, jogging, running, boxing, hand waving and hand clapping) performed several times by 25 subjects in four different scenarios: outdoors s1, outdoors with scale variation s2, outdoors with different clothes s3 and indoors s4. Currently the database contains 2391 sequences. All sequences were taken over homogeneous backgrounds with a static camera at a 25 fps frame rate. The sequences were downsampled to a spatial resolution of 160x120 pixels and have a length of four seconds on average.
The researcher was asked to perform a set of common household activities during the four-hour period using a set of instructions. Activities included the following: preparing a recipe, doing a load of dishes, cleaning the kitchen, doing laundry, making the bed, and light cleaning around the apartment. The volunteer determined the sequence, pace, and concurrency of these activities and also integrated additional household tasks. Our intent was to have a short test dataset of a manageable size that could be easily placed on the web without concerns about anonymity. We wanted this test dataset, however, to show a variety of activity types and activate as many sensors as possible, but in a natural way. In addition to the activities above, the researcher searches for items, uses appliances, talks on the phone, answers email, and performs other everyday tasks. The researcher wore five mobile accelerometers (one on each limb and one on the hip) and a Polar M32 wireless heart rate monitor. The researcher carried an SMT 5600 mobile phone that ran experience sampling software that beeped and presented a set of questions about her activities.
Metadata:
The dataset includes four hours of partially (and soon to be fully) annotated video. The annotation was done using custom annotation software written by Randy Rockinson and Leevar Williams of MIT House_n. This software (called HandLense) is available for researchers to use to study this dataset. [Overview of HandLense and executable]
The annotations include descriptors for body posture, type of activity, location, and social context.
We have collected a large body of human action video (MuHAVi) data using 8 cameras. There are 17 action classes performed by 14 actors. So far we have processed videos corresponding to 7 actors in order to split the actions and provide the JPG image frames. However, we have included some image frames before and after the actual action, for the purpose of background subtraction, tracking, etc. The longest pre-action frames correspond to the actor called Person1. Each actor performs each action several times in the action zone, which is marked with white tape on the scene floor. As the actors were amateurs, the leader had to interrupt them in some cases and ask them to redo the action for consistency. We have used 8 CCTV Schwan cameras located at 4 sides and 4 corners of a rectangular platform. Note that these cameras are not necessarily synchronised. We are working on improving the synchronisation between the images corresponding to different cameras.
Metadata:
Calibration information may be included here in the future. Meanwhile, one can use the patterns on the scene floor to calibrate the cameras of interest.
This dataset provides a large body of synthetic video data generated for the purpose of evaluating different silhouette-based algorithms for human action recognition. The data consist of 20 action classes, 9 actors and up to 40 synchronised perspective camera views. It is well known that for action recognition algorithms which are purely based on human body masks, where other image properties such as colour and intensity are not used, it is important to obtain accurate silhouette data from video frames. This problem is not usually considered as part of action recognition, but as a lower level problem in motion tracking and change detection. Hence for researchers working on the recognition side, access to reliable Virtual Human Action Silhouette (ViHASi) data seems to be both a necessity and a relief. The reason for this is that such data provide a way of comprehensive experimentation and evaluation of the methods under study, which might even lead to their improvement.
The dataset contains a collection of pedestrian and non-pedestrian images. It is made available for download on this site for benchmarking purposes, in order to advance research on pedestrian classification.
The dataset consists of two parts:
a base data set. The base data set contains a total of 4000 pedestrian- and 5000 non-pedestrian samples cut out from video images and scaled to common size of 18x36 pixels. This data set has been used in Section VII-A of the paper referenced above.
Pedestrian images were obtained by manually labeling and extracting the rectangular positions of pedestrians in video images. Video images were recorded at various (day) times and locations with no particular constraints on pedestrian pose or clothing, except that pedestrians are standing in an upright position and are fully visible. As non-pedestrian images, we extracted patterns representative of typical preprocessing steps within a pedestrian classification application, from video images known not to contain any pedestrians. We chose to use a shape-based pedestrian detector that matches a given set of pedestrian shape templates to distance transformed edge images (i.e. comparatively relaxed matching threshold).
additional non-pedestrian images. An additional collection of 1200 video images NOT containing any pedestrians, intended for the extraction of additional negative training examples. Section V of the paper referenced above describes two methods on how to increase the training sample size from these images, and Section VII-B lists experimental results.
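As a rough illustration of how additional negative examples could be cut from the pedestrian-free images, the sketch below draws random 18x36 windows. It is not one of the two methods described in Section V of the referenced paper; it is plain uniform random sampling at a single scale, and the function name and parameters are hypothetical.

```python
import random


def sample_negative_windows(image_width, image_height, count,
                            win_w=18, win_h=36, seed=0):
    """Return `count` random (x, y, w, h) windows that fit inside the image."""
    if image_width < win_w or image_height < win_h:
        raise ValueError("image is smaller than the sampling window")
    rng = random.Random(seed)  # fixed seed so the sampling is repeatable
    windows = []
    for _ in range(count):
        x = rng.randint(0, image_width - win_w)   # inclusive upper bound
        y = rng.randint(0, image_height - win_h)
        windows.append((x, y, win_w, win_h))
    return windows
```

Each returned window can then be cropped from the image with whatever image library is in use.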
Metadata:
Contextual info:
Comments:
Copyrights:
This dataset is made available to the scientific community for non-commercial research purposes such as academic research, teaching, scientific publications, or personal experimentation. Permission is granted to use, copy, and distribute the data given.
The dataset consists of nine different cameras, deployed over several different rooms and a hallway in a "laboratory/office" setting. Several different scenarios were collected from the cameras. A two minute sequence was captured of researchers/staff/visitors going about their daily activities. In addition, three different scenarios were scripted so that particular behaviors were exhibited in the data.
During data collection, all cameras wrote raw (uncompressed) data at a resolution of 640x480. All machine clocks were synchronized via NTP. In addition to each frame, a timestamp was recorded so that frames can be associated with one another across cameras.
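Associating frames across cameras by timestamp can be sketched as a nearest-neighbour search over sorted timestamp lists. The timestamp unit (seconds), the max_gap tolerance and the function name are assumptions for illustration; the dataset's actual timestamp format is not described here.

```python
import bisect


def match_frames(timestamps_a, timestamps_b, max_gap=0.02):
    """Pair each frame of camera A with the nearest frame of camera B.

    Both lists must be sorted in increasing order. Returns (index_a, index_b)
    pairs and skips frames whose nearest neighbour is further than max_gap.
    """
    pairs = []
    for i, t in enumerate(timestamps_a):
        j = bisect.bisect_left(timestamps_b, t)
        # The nearest timestamp is either just before or at the insertion point.
        candidates = [k for k in (j - 1, j) if 0 <= k < len(timestamps_b)]
        if not candidates:
            continue
        best = min(candidates, key=lambda k: abs(timestamps_b[k] - t))
        if abs(timestamps_b[best] - t) <= max_gap:
            pairs.append((i, best))
    return pairs
```

For example, match_frames([0.0, 0.1, 0.2], [0.005, 0.095, 0.5]) pairs the first two frames and drops the third, because no frame of the second camera lies within 0.02 s of it.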
Selected Ground Truth (102 MB) - frames with hand-marked labels of individuals and objects
Unscripted Activities (59.6 GB) - natural behavior and activities
Subject Face/Gait Database (101 MB) - face pictures and video of subjects walking in front of the camera
Metadata:
Extensive ground truth is also provided. Entrance and exit times for individuals in each camera, foreground segmentation, and activity labeling are all part of the dataset.
This is a publicly available benchmark dataset for testing and evaluating novel and state-of-the-art computer vision algorithms. Several researchers and students have requested a benchmark of non-visible (e.g., infrared) images and videos. The benchmark contains videos and images recorded in and beyond the visible spectrum and is available for free to all researchers in the international computer vision communities. Also it will allow a large spectrum of IEEE and SPIE vision conference and workshop participants to explore the benefits of the non-visible spectrum in real-world applications, contribute to the OTCBVS workshop series, and boost this research field significantly.
There are 7 datasets:
1) Dataset 01: OSU Thermal Pedestrian Database
2) Dataset 02: IRIS Thermal/Visible Face Database
3) Dataset 03: OSU Color-Thermal Database
4) Dataset 04: Terravic Facial IR Database
5) Dataset 05: Terravic Motion IR Database
6) Dataset 06: Terravic Weapon IR Database
7) Dataset 07: CBSR NIR Face Dataset
Metadata:
Contextual info:
Comments:
Copyrights:
Register (name, institution, email) to download the datasets.
Eye ground truth in Viper format for the YaleB face database, which contains 5760 single light source images of 10 subjects, each seen under 576 viewing conditions (9 poses x 64 illumination conditions), plus 650 Viper files. The ground truth was developed by BARCO in the context of the CANTATA project.
Metadata:
All the images are annotated with Viper XML files. Each “.bmp” image is associated with a “.xml” annotation file with the same name, containing the iris positions. The position corresponds to crosses. The path of the bmp image should be changed in the viper file.
Contextual info:
For every subject in a particular pose, an image with ambient (background) illumination was also captured. Hence, the total number of images is in fact 5760+90=5850. The total size of the compressed database is about 1GB.
Comments:
The dataset already exists without the ground truth in Viper format. The ground truth was either generated or converted in Viper format in the context of Cantata project. The metadata were generated by Arnaud Joubel.
Copyrights:
Dataset YaleB: You are free to use the Yale Face Database B for research purposes. If experimental results are obtained that use images from within the database, all publications of these results should acknowledge the use of the "Yale Face Database B" and reference Georghiades, A.S., Belhumeur, P.N. and Kriegman, D.J., "From Few to Many: Illumination Cone Models for Face Recognition under Variable Lighting and Pose", IEEE Trans. Pattern Anal. Mach. Intelligence, 2001, 23, 643-660.
Ground truth in Viper: Requested citation acknowledgment about the ground truth:
Courtesy of ITEA2 funded Cantata project
Contact:
Quentin Besnehard, quentin.besnehard@barco.com or Cedric Marchessoux, cedric.marchessoux@barco.com
Set of bitmap images containing anti-aliased text, developed by BARCO in the context of the CANTATA project. 2400 images are available in the archive.
Metadata:
All the images are annotated with Viper XML files. Each “.bmp” image is associated with a “.grid.xml” annotation file with the same name. The annotation takes the form of a grid of 32x32 pixels bounding boxes. The path of the bmp image should be changed in the viper file if you want to open it in viper-gt.
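The geometry of such a grid annotation can be sketched as follows: the image is tiled with 32x32 pixel cells, each represented as a bounding box. This only illustrates the tiling and does not read or write the Viper XML files. Clipping the last row and column at the image border is an assumption about how partial cells are handled, and grid_boxes is a hypothetical helper.

```python
def grid_boxes(width, height, cell=32):
    """Tile an image with cell x cell boxes, returned as (left, top, w, h)."""
    boxes = []
    for top in range(0, height, cell):
        for left in range(0, width, cell):
            # Clip cells that would extend past the right or bottom border.
            w = min(cell, width - left)
            h = min(cell, height - top)
            boxes.append((left, top, w, h))
    return boxes
```

For a 64x32 image this yields two boxes, (0, 0, 32, 32) and (32, 0, 32, 32).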
Contextual info:
The text is represented in different colors: black on white, white on black, random dark color on white, white on random dark color, black on random light color, random light color on white, random dark color on random light color and, finally, random light color on random dark color.
Comments:
The dataset and the ground truth were generated by Quentin Besnehard and Arnaud Joubel. To obtain the complete dataset, send an e-mail to the contact person
Copyrights:
The fonts used are available under the GNU General Public License version 2.0. These fonts are free clones of the original fonts provided by URW typeface foundry.
Requested citation acknowledgment about the dataset and the ground truth : Courtesy of ITEA2 funded Cantata project.
Contact:
Quentin Besnehard, quentin.besnehard@barco.com or Cedric Marchessoux, cedric.marchessoux@barco.com
Set of bitmap images containing aliased text (2 colors), developed by BARCO in the context of the CANTATA project. 1250 images are available in the archive.
Metadata:
All the images are annotated with Viper XML files. Each “.bmp” image is associated with a “.grid.xml” annotation file with the same name. The annotation takes the form of a grid of 32x32 pixels bounding boxes. The path of the bmp image should be changed in the viper file if you want to open it in viper-gt.
Contextual info:
The text is represented in different colors: black on white, white on black, random dark color on white, white on random dark color, black on random light color, random light color on white, random dark color on random light color and, finally, random light color on random dark color. Fonts used (from 7 to 42 points):
Helvetica
Optima
AvantGarde
Times
Palatino
Courier
Century
Comments:
The dataset and the ground truth were generated by Quentin Besnehard and Cédric Marchessoux.
Copyrights:
The fonts used are available under the GNU General Public License version 2.0. These fonts are free clones of the original fonts provided by URW typeface foundry. Requested citation acknowledgment about the data set and the ground truth: Courtesy of ITEA2 funded Cantata project
Smart meeting dataset that includes facial expressions, gaze and gesture/action. The environment consists of three cameras: one mounted on each of two opposing walls, and an omnidirectional camera positioned at the centre of the room. The dataset consists of four scenarios.
Metadata:
a) Eye positions of people in Scenarios A, B and D. (every 10th frame is annotated).
b) Facial expression and gaze estimation for Scenarios A and D, Cameras 1-2.
c) Gesture/action annotations for Scenarios B and D, Cameras 1-2.
This website contains multiple links to medical datasets.
TRECVID
The TRECVID conference series is sponsored by the National Institute of Standards and Technology (NIST) with additional support from other U.S. government agencies. The goal of the conference series is to encourage research in information retrieval by providing a large test collection, uniform scoring procedures, and a forum for organizations interested in comparing their results. In 2001 and 2002 the TREC series sponsored a video "track" devoted to research in automatic segmentation, indexing, and content-based retrieval of digital video. Beginning in 2003, this track became an independent evaluation (TRECVID) with a 2-day workshop taking place just before TREC.
The USC-SIPI image database is a collection of digitized images. It is maintained primarily to support research in image processing, image analysis, and machine vision. The first edition of the USC-SIPI image database was distributed in 1977 and many new images have been added since then.
The database is divided into volumes based on the basic character of the pictures. Images in each volume are of various sizes such as 256x256 pixels, 512x512 pixels, or 1024x1024 pixels. All images are 8 bits/pixel for black and white images, 24 bits/pixel for color images. The following volumes are currently available: