PeopleFinderTech
From Katrina Help Info
What we are doing
More than 50 sites help people find loved ones missing from New Orleans, but none of these sites talk to one another. We are solving this problem by building automated data interchange systems and scraping data sets. We need your help!
Here are the goals of this project
- Implement automated data interchange systems around the PFIF spec
- Scrape and merge data from sets that will not implement PFIF
- Minimize duplicate records
- Make the central database available to be searched
We are in contact with KatrinaSafe and they will be using data we collect.
Coordination & Leadership
PFIF spec leaders:
- Ka-Ping Yee (ping [at] zesty ca)
- Josh Kleinpeter (kleinpeterj [at] corp.earthlink.net)
- Jon Plax (salesforce.com)
Scraping Effort Leader:
- Zack Rosen <zack at civicspacelabs.org>
PFIF implementation coordinator:
- Zack Rosen <zack at civicspacelabs.org>
Discussions
- join the mailing list: katrinadev-subscribe@civicspacelabs.org (mailto:katrinadev-subscribe@civicspacelabs.org)
- if you are going to help scrape data sets, join the katrinascrapers mailing list: mailto:katrinascrapers-subscribe@civicspacelabs.org
- list archives (http://orwant.com/katrina/) are now online and searchable. This covers messages from 5 September 2005 at 18:00 (6 pm) GMT onward.
- IRC: irc://irc.freenode.net/katrinadev (for technical folks, archive here (http://katrina.blueoxen.net/irc/katrinadev/))
- IRC: irc://irc.freenode.net/globalvoices (for non-techie volunteers, archive here (http://katrina.blueoxen.net/irc/globalvoices/))
- Conference calls are held from time to time.
Master Database
http://katrinalist.net - doesn't yet include any scraped or PFIF fed data
Data interchange spec
We have a FINAL PeopleFinder Interchange Format (PFIF) specification available at
- cilibrar has used this as the basis for the [removed SQL] and [removed XML] he made; however, he had to make two small adjustments:
- In both tables the primary key is an integer named id, for easy Ruby on Rails (http://www.rubyonrails.org/) compatibility. The other key (the incoming source database key string) keeps its original name but is no longer the real primary key for our database.
- The entry_date was converted to an integer to be maximally portable, since different databases and SQL flavors handle dates differently. The integer is the number of seconds elapsed since the epoch, so it can easily be converted to human-readable form in any language using standard time functions.
The SQL database schema is included with the SQL dump. Both are compressed using bzip2 (http://www.bzip.org/). If people are willing to dump their records in the same SQL schema I have made above (based on PFIF) then please do so and I can merge all dumps. The ftp site, username, and password are available from the katrinadev list.
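The two schema adjustments can be illustrated with a minimal sketch. The table name, column set, and sample key below are assumptions made for illustration only; the real schema is the one shipped with the SQL dump.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical table layout: an integer "id" primary key, the original
# source key kept as a plain column, and entry_date stored as epoch seconds.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE person (
        id INTEGER PRIMARY KEY,          -- integer key, as Rails expects
        person_record_id TEXT NOT NULL,  -- key string from the source database
        entry_date INTEGER NOT NULL      -- seconds since the Unix epoch (UTC)
    )
    """
)


def to_epoch(dt):
    """Convert a timezone-aware datetime to whole seconds since the epoch."""
    return int(dt.timestamp())


def from_epoch(seconds):
    """Convert epoch seconds back to a human-readable UTC timestamp."""
    moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%dT%H:%M:%SZ")


entered = datetime(2005, 9, 5, 18, 0, 0, tzinfo=timezone.utc)
conn.execute(
    "INSERT INTO person (person_record_id, entry_date) VALUES (?, ?)",
    ("example.org/123", to_epoch(entered)),  # made-up source key
)
row = conn.execute(
    "SELECT id, person_record_id, entry_date FROM person"
).fetchone()
print(row[0], row[1], from_epoch(row[2]))
```

Storing the date as an integer keeps the dump loadable in any SQL flavor; the conversion to a readable form happens only at display time.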
How to get involved
Scrape data sets
- Sign up on the katrina-scrapers mailinglist mailto:katrinascrapers-subscribe@civicspacelabs.org
- Choose a set from the list PeopleFinderTech#Sites_that_need_to_be_scraped
- Move it under the "Sites that are currently being scraped" heading
- Update its status on PeopleFinderTechStructuredDataSets
- Let people know you are scraping the set on the KatrinaScrapers mailinglist
- When you are done scraping, validate the data by:
- uploading a single record of the data to: http://www.katrinalist.net/uploadPFIF/
- running the set through the PFIF validator: http://www.w3.org/2001/03/webdata/xsv
- Link to your data on the password-protected wiki (email zack [at] civicspacelabs.org for access)
- Move your wiki listing under the PeopleFinderTech#Validated section after it is validated
- Let the KatrinaScrapers mailinglist know you have successfully scraped and validated a data set
If you have trouble getting your data to validate feel free to ask PFIF questions on the KatrinaDev mailinglist.
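For the single-record upload test, one way to cut a one-record file out of a full scrape is sketched below. It assumes the PFIF 1.1 namespace is http://zesty.ca/pfif/1.1 and that person elements sit directly under the root; the function name and sample data are made up, so check both assumptions against the spec before relying on it.

```python
import xml.etree.ElementTree as ET

# Assumption: PFIF 1.1 documents use this namespace and a <pfif:pfif> root
# containing <pfif:person> children. Verify against the published schema.
PFIF_NS = "http://zesty.ca/pfif/1.1"
ET.register_namespace("pfif", PFIF_NS)


def first_record_document(pfif_xml):
    """Return a new PFIF document holding only the first person record."""
    root = ET.fromstring(pfif_xml)
    person = root.find(f"{{{PFIF_NS}}}person")
    if person is None:
        raise ValueError("no person record found")
    single = ET.Element(root.tag)  # same root element, namespace included
    single.append(person)          # anything nested in the person comes along
    return ET.tostring(single, encoding="unicode")


# Made-up sample data, for illustration only.
SAMPLE = """<pfif:pfif xmlns:pfif="http://zesty.ca/pfif/1.1">
  <pfif:person>
    <pfif:person_record_id>example.org/1</pfif:person_record_id>
    <pfif:first_name>Ann</pfif:first_name>
    <pfif:last_name>Lee</pfif:last_name>
  </pfif:person>
  <pfif:person>
    <pfif:person_record_id>example.org/2</pfif:person_record_id>
    <pfif:first_name>Roy</pfif:first_name>
    <pfif:last_name>Cole</pfif:last_name>
  </pfif:person>
</pfif:pfif>"""

print(first_record_document(SAMPLE))
```

The printed document can be saved to a file and submitted through the upload form by hand.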
Validate data sets
- Choose a set and move it under the PeopleFinderTech#Being_validated section and add your email address and name below it on the listing
- Notify the KatrinaScrapers mailinglist as to which data set you are validating
- Get access to the file on the password protected wiki (email zack [at] civicspacelabs.org for access)
- Validate the data by:
- uploading a single record of the data to: http://www.katrinalist.net/uploadPFIF/
- running the set through the PFIF validator: http://www.w3.org/2001/03/webdata/xsv
- If the file is too large for this web interface, a *NIX command-line utility has been recommended:
- Get xmllint (comes with most Unix distributions and Cygwin; go to http://xmlsoft.org/downloads.html for source, binaries, etc.)
- Download the XSD file at http://zesty.ca/pfif/1.1/pfif-1.1.xsd
- Invoke xmllint on your XML file (assume we call it pfif.xml):
xmllint --noout --schema pfif-1.1.xsd pfif.xml
- If the feed is valid, move it under the PeopleFinderTech#Validated section. If it is invalid, move it under the PeopleFinderTech#Invalid_sets heading, contact the data set scraper, and help them fix their set
- Notify the KatrinaScrapers of your results
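If you have several files to check, the xmllint invocation above can be wrapped in a small script. This is only a sketch: the function names are made up, it assumes xmllint is on your PATH and that pfif-1.1.xsd has been downloaded to the current directory, and it treats any non-zero exit status as a failed validation.

```python
import shutil
import subprocess


def xmllint_command(xml_path, xsd_path="pfif-1.1.xsd"):
    """Build the same command line shown above for one XML file."""
    return ["xmllint", "--noout", "--schema", xsd_path, xml_path]


def validate(xml_path, xsd_path="pfif-1.1.xsd"):
    """Run xmllint and return (is_valid, messages printed by xmllint)."""
    if shutil.which("xmllint") is None:
        raise RuntimeError("xmllint not found on PATH")
    result = subprocess.run(
        xmllint_command(xml_path, xsd_path),
        capture_output=True,
        text=True,
    )
    # Assumption: a zero exit status means the file validated.
    return result.returncode == 0, result.stderr.strip()


if __name__ == "__main__":
    print(" ".join(xmllint_command("pfif.xml")))
```

Calling validate("pfif.xml") returns a pass/fail flag plus whatever xmllint reported, which can be pasted into a note to the scraper.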
Helping site admins implement PFIF feeds
- Choose a site from this list PeopleFinderTech#Sites_that_need_help_implementing_PFIF_feeds
- Contact the site admin and offer assistance
- Move the listing under the heading "Sites currently implementing PFIF feeds"
- When the site is putting out a validated PFIF feed send a note to the KatrinaDev mailinglist
Also, we have a task list accessible here: Task List
Data Sets
A list of structured data sets and contact information for the owners is up on PeopleFinderTechStructuredDataSets
PFIF Feeds
PFIF/RDF TRANSFORM
Courtesy of Peter Mika pmika at cs.vu.nl (http://prauw.cs.vu.nl:8080/pfif/)
Feedback Welcome
Sites that have PFIF feeds
- http://katrina.earthlink.net/people/list 2,861 records
- PeopleFinderTechStructuredDataSets#Hurricane_Help (more information)
- http://www.hurricanerefugee.com/ 3,500+ records
- I have a real-time PFIF feed available. Please email me at content [at] hurricanerefugee.com for access
- PeopleFinderTechStructuredDataSets#Hurricane_Refugees (more information)
Sites currently implementing PFIF feeds
- http://www.houmashelters.com/ 2,800 records
- PeopleFinderTechStructuredDataSets#Houma_Shelters (more information)
- http://www.familymessages.org/index.php 2,110 records
- http://www.theinfozone.net/NOLAmissing2.html/ 1790 records
- http://www.katrinatracker.com/ 3,031 records
- PeopleFinderTechStructuredDataSets#Katrina_Tracker (more information)
Sites that need help implementing PFIF feeds
- http://www.hurricanesurvivors.org/database.html 595 records
- PeopleFinderTechStructuredDataSets#Hurrican_Survivors.org (more information)
Sites that agreed to implement PFIF but have unknown status
- http://www.katrinadataproject.com/index.aspx 33,743 records
- PeopleFinderTechStructuredDataSets#Katrina_Data_Project (more information)
- http://www.katrinafinder.us/ 4,223 records
- PeopleFinderTechStructuredDataSets#Katrina_Finder (more information)
- "Implementing RSS spec" - not sure if that means PFIF
- http://www.gwid.com/katrina.php 1,654 records
- http://www.theinfozone.net/NOLAmissing2.html/ 1265 records
PFIF Implementation Volunteers
If you are available to help site admins implement PFIF, please add your name and email address to the list below
- Tony Chang: tony [at] ponderer.org - email me if you want help implementing PFIF
- Andy Schmitz: andy.schmitz [at] gmail.com - at school most of the day, but can help in the evening.
- Gordon E. Amond: Gordon [at] amonds.net - I would be proud to help my american neighbors.
- Geoff Webb: geofflwebb [at] yahoo.com - I have time in the evenings and weekends.
Scraping
- Mark sets that have been scraped.
- Mark sets that have been uploaded to the salesforce.com repository with the date/time of the scrape and the date/time of the upload.
- Uploads MUST conform to PFIF.
- Source Name MUST be clear, unique, and explicit, MUST be the same across all records from a single source, and MUST include the time OF THE SCRAPE (for example: Scrape-gulfcoastnews-bycilibrar-9/5/2005-10am).
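To keep Source Name identical across every record of one scrape, it helps to build it once and reuse it. The helper below is a hypothetical sketch that reproduces the pattern of the example above; the exact format is inferred from that single example and is not defined anywhere else.

```python
from datetime import datetime


def source_name(site, scraper, scraped_at):
    """Build a Source Name such as Scrape-<site>-by<scraper>-<m/d/yyyy>-<hour>am."""
    hour12 = scraped_at.hour % 12 or 12           # 0 and 12 both display as 12
    suffix = "am" if scraped_at.hour < 12 else "pm"
    date = f"{scraped_at.month}/{scraped_at.day}/{scraped_at.year}"
    return f"Scrape-{site}-by{scraper}-{date}-{hour12}{suffix}"


# Compute the name once per scrape and stamp it on every record.
name = source_name("gulfcoastnews", "cilibrar", datetime(2005, 9, 5, 10))
print(name)
```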
Sites that have been scraped
Imported
- http://www.msnbc.msn.com/id/9159961/ 143,000+ records
- Data has been scraped, converted to PFIF format and uploaded - Brent (brent [at] bjohnson.net)
- PFIF File URL:
- Uploaded by Andy Schmitz, using data fetched at 17:38, 13 Sep 2005 (EDT).
- PeopleFinderTechStructuredDataSets#MSNBC_.22Looking_for.22_and_.22Safe.22_lists (more information)
- http://www.familylinks.icrc.org/katrina/people 135,222 records
- removed 100% scraped
- Scraped by Brent L Johnson <brent at bjohnson.net>
- PeopleFinderTechStructuredDataSets#Red_Cross_.28ICRC.29 (more information)
- Imported 135,222 records into Katrinalist.net by Steve Fisher 09-11-05 8:17AM PST
- http://wx.gulfcoastnews.com/katrina/status.aspx 42,477 records
- [Removed (Download OLD DATA, see below)]
- Scraped by Rudi Cilibrasi <cilibrar at gmail.com>
- Rescraped by NacreData <devin at nacredata.com> and Wendy <mrscake at gmail.com>
- [Removed DOWNLOAD: (starting data, JavaScript Hack, PFIF XML output, Perl code)]
- PeopleFinderTechStructuredDataSets#Gulf_Coast_News_Survivor_Connector (more information)
- Validated by devin at nacredata . com
- Record One validates successfully at http://www.katrinalist.net/uploadPFIF/
- Entire file validates with libxml2 and Xerces
- Imported into Katrinalist.net by Jon Plax 09-09-05 8:35PM PST
- 61927 records imported, 0 errors
- http://www.publicpeoplelocator.com/ 37,259 records
- [Removed Download] (Be warned: 3 Megabytes over a DSL line)
- Scraped by Andy Schmitz <andy.schmitz at gmail.com>
- Data should be valid (validated by xmllint, uploaded two test records).
- PeopleFinderTechStructuredDataSets#Public_People_Locator (more information)
- Imported by Andy Schmitz on 09-11-05 at 9:10 PM CDT.
- http://www.katrina-survivor.com/ 9,071 records
- scraped by Zach Berke <zktb at twotacos.com>
- PeopleFinderTechStructuredDataSets#Hurricane_Katrina_Survivor_Registry (more information)
- [Removed Download .tgz]
- Notes:
- Ignored 5,499 records that link to gulfcoastnews.com for details, assuming they were scraped from there.
- The set validated with xmllint but was too big to send to w3.org. A single record did validate at katrinalist.net. Please let me know if there are any PFIF validation errors on the set.
- Imported into Katrinalist.net by Jon Plax 09-09-05 6:35 PM PST
- 95 data validation errors due to blank or whitespace-only first_name field
- http://www.lnha.org/katrina/default.asp 4,500 records (roughly)
- scraped by Zach Berke <zktb at twotacos.com>
- PeopleFinderTechStructuredDataSets#LANH_Katrina_Evacuee_Directory (more information)
- [Removed Download .tgz]
- Imported into Katrinalist.net by Jon Plax 09-09-05 5:52PM PST
- 61 data validation errors due to blank or whitespace-only first_name
- Manually removed non-XML header and footer to allow parsing
- http://connect.castpost.com/fulllist.php 2,871 records
- [Removed Download] (not validated)
- Scraped by Andy Schmitz <andy.schmitz at gmail.com>
- PeopleFinderTechStructuredDataSets#Hurricane_Katrina_Persons_DB (more information)
- [Removed Download] (Validated after manual corrections with XMLSpy by Darci Hanning <darci.hanning at gmail.com>)
- Imported into Katrinalist.net by Jon Plax 09-09-05 6:15 PM PST
- 61 data validation errors, details lost due to user error. Some records on the source site had no first name or no last name.
- http://www.findkatrina.com 2,474 records
- PeopleFinderTechStructuredDataSets#Find_Katrina (more information)
- Scraped by David Dwiggins <david at dwiggins.net> approx. 3 AM CST, 09-11-2005.
- Validated using xmllint and single record upload by Dmdwiggi 06:13, 11 Sep 2005 (EDT)
- Current XML at: Removed
- [PHP scraping script (http://felix.dwiggins.net/katrina/findkatrina_php.txt)], [Perl export script (http://felix.dwiggins.net/katrina/findkatrina_xml_pl.txt)].
- Uploaded by Andy Schmitz on 09-12-05 at 6:00 AM CDT. Must have been uploaded previously though, because everything was reported as already imported.
- Update: This was probably me -- I had attempted to upload it, but the connection timed out while processing. I guess it made it after all. Sorry for the duplicated work. -- Dmdwiggi 13:45, 12 Sep 2005 (EDT)
- Not a problem. I needed to test my PFIF file uploader anyway (see the scrapers/utils list at the bottom for a link to the source). --Aschmitz 20:35, 12 Sep 2005 (EDT)
- http://www.katrinasurvivor.net/find.cfm?PageNum_GetAll=1&sort=name 2,400 records
- Scraped by Leonard Lin <lhl at usc.edu> @ 2005-09-08 05:43 PDT
- [Removed/ DOWNLOAD: Perl code and valid PFIF XML output]
- PeopleFinderTechStructuredDataSets#Katrina_Survivor (more information)
- Validation completed <darci.hanning @ gmail.com>
- First record uploaded without error.
- XMLSpy validation completed (no errors / no corrections).
- Imported into Katrinalist.net by Jon Plax 09-09-05 5:34 PM PST
- 35 data validation errors due to blank or whitespace-only last_name.
- theinfozone.net (http://www.theinfozone.net/NOLAmissing2.html) 1,300 records
- The site owner provided a CSV file (http://www.theinfozone.net/NOLAmissing.csv)
- I converted to PFIF which can be found here: [removed PFIF XML]
- I tried a single record and it validates <tony at ponderer.org>
- Conversion source code (http://ponderer.org/cvs/index.pl/python/katrina/src/)
- Imported into katrinalist.net by Jon Plax 09-09-05 5:08 PM PST
- http://www.cnn.com/SPECIALS/2005/hurricanes/list 1,120 records
- [removed Download] (not validated)
- Scraped by Nick Easler <easlern at gmail.com> 09-07-05 5:35am EST
- PeopleFinderTechStructuredDataSets#CNN_Safe_List (more information)
- Beginning validation process <darci.hanning @ gmail.com>
- Data uploader being fixed with relaxed secondary validation; will retry when re-released.
- Validation successful with both XMLSpy and W3 site after minor fix.
- [removed Download] (Validated)
- Imported into katrinalist.net by Jon Plax 09-09-05 5:02 PM PST
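Several of the imports above reported errors caused by blank or whitespace-only first_name or last_name fields. A scraper could look for these before uploading with something like the sketch below. The namespace and element names are assumptions based on PFIF 1.1, and the function name and sample data are invented for illustration.

```python
import xml.etree.ElementTree as ET

PFIF_NS = "http://zesty.ca/pfif/1.1"  # assumed PFIF 1.1 namespace


def blank_name_records(pfif_xml, fields=("first_name", "last_name")):
    """List (record id, field) pairs where a name field is missing or blank."""
    root = ET.fromstring(pfif_xml)
    problems = []
    for person in root.iter(f"{{{PFIF_NS}}}person"):
        record_id = person.findtext(
            f"{{{PFIF_NS}}}person_record_id", default="(no id)"
        )
        for field in fields:
            value = person.findtext(f"{{{PFIF_NS}}}{field}")
            # None means the element is absent; "" or spaces means it is blank.
            if value is None or not value.strip():
                problems.append((record_id, field))
    return problems


# Made-up sample: the second record has a whitespace-only first name,
# the third has no last_name element at all.
SAMPLE = """<pfif:pfif xmlns:pfif="http://zesty.ca/pfif/1.1">
  <pfif:person>
    <pfif:person_record_id>a/1</pfif:person_record_id>
    <pfif:first_name>Ann</pfif:first_name>
    <pfif:last_name>Lee</pfif:last_name>
  </pfif:person>
  <pfif:person>
    <pfif:person_record_id>a/2</pfif:person_record_id>
    <pfif:first_name>   </pfif:first_name>
    <pfif:last_name>Roy</pfif:last_name>
  </pfif:person>
  <pfif:person>
    <pfif:person_record_id>a/3</pfif:person_record_id>
    <pfif:first_name>Bo</pfif:first_name>
  </pfif:person>
</pfif:pfif>"""

for record_id, field in blank_name_records(SAMPLE):
    print(f"{record_id}: {field} is blank or missing")
```

What to do with the flagged records (fix, drop, or upload anyway) is left to the scraper; this only finds them.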
Being Uploaded to SalesForce
- Family messages - 20,000 records
- PeopleFinderTechStructuredDataSets#Family_Messages (more information)
- Being uploaded --Aschmitz 21:13, 14 Sep 2005 (EDT). Status is here (http://lardbucket.org/projects/katrina/status_sul.txt)
- Validation complete <darci.hanning @ gmail.com> (xmllint, XMLSpy and one record uploaded successfully) with the following outstanding questions by Dan <chaney @ dcre-labs.com>:
- Q1: Zip codes. The first unresolved error involves the zipcode field. It demands an integer (which I suspect will change in PFIF 1.2), so for now, is it appropriate to put in 00000 when the zip code is unavailable (and to strip the +4 part of ZIP+4 codes)?
- A1: Yes.
- Q2: Null date fields. Null date fields aren't allowed, nor is the unsightly "unknown". When I have no value for the source or entry date, is the preferred action to not list the tagset at all?
- A2: Source date should be the current date(?), entry date should either be provided or an old date(?). I'm not sure about this, if Ping could take a look and give an authoritative answer, that would help.
- Q3: In general, if I do not have data for a field, should I just not print a tagset for it?
- A3: It's not clear. I would add it with blank data, otherwise SalesForce may choke on it.
- http://www.wecaretexas.com/ >200,000 records
- PeopleFinderTechStructuredDataSets#wecaretexas (more information)
- http://www.scribedesigns.com/tulane/ 1,933 records
- PeopleFinderTechStructuredDataSets#Tulane_Safe_Registry (more information)
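The zip code answer (A1) under the Family Messages entry above can be applied mechanically. The helper below is a minimal sketch of that rule: keep the five-digit part, drop a +4 suffix, and fall back to 00000 when nothing usable is present. The function name is made up, and treating any unrecognized value as unavailable is an assumption.

```python
import re

# Five digits, optionally followed by a ZIP+4 suffix such as "-1234".
ZIP_PATTERN = re.compile(r"\s*(\d{5})(?:-\d{4})?\s*$")


def normalize_zip(raw):
    """Return a five-digit zip string, or "00000" when none is available."""
    if raw is None:
        return "00000"
    match = ZIP_PATTERN.match(raw)
    return match.group(1) if match else "00000"


for value in ["70112-1234", " 70112 ", "", "unknown", None]:
    print(repr(value), "->", normalize_zip(value))
```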
Validated
- None that aren't being/haven't been uploaded
Invalid sets
- None
Being validated
- None
Need to be validated
- None
Sites that are currently being scraped
- http://www.searchformissing.org/ 184 records
- PeopleFinderTechStructuredDataSets#Search_for_Missing_People (more information)
- Currently being scraped by Mmondok 20:48, 8 Sep 2005 (EDT)
- http://www.kare.arkansas.gov/ 23,000+ records
- PeopleFinderTechStructuredDataSets#Operation_Kare (more information)
- Excel spreadsheet being converted to PFIF < darci.hanning @ gmail.com >
- http://www.safe.textamerica.com/ 69 records
- Currently being scraped by --Joe 20:59, 14 Sep 2005 (EDT)
Sites that need to be scraped
- http://co.harrison.ms.us/assistance/missing/ 1132 records
- http://callhome.textamerica.com/ 643 records
- PeopleFinderTechStructuredDataSets#Missing_Katrina (more information)
- http://www.missingkids.com/missingkids/servlet/PageServlet?LanguageCountry=en_US&PageId=2077 333 records (plus a few dozen records on photo pages)
- PeopleFinderTechStructuredDataSets#NCMEC_Hurricane_Katrina_Children (more information)
- http://www.emergency-database.com/guide/ 200(?) records
- PeopleFinderTechStructuredDataSets#emergency-database (more information)
- http://www.survivorregistry.com/cgi-bin/show_all.pl 193 records
- PeopleFinderTechStructuredDataSets#Survivor_Registry (more information)
Sites that can't be scraped
- http://katrina.streetlampsoftware.com/ 456 records asked to take down 9/8/05 and redirect...
- PeopleFinderTechStructuredDataSets#Katrina_Survivor_Database (more information)
- Scraping looked at by Gabe Wachob (gwachob@wachob.com)
- Not very scrapable - there is no field for missing person data, for example -- it appears to get stuck in the freeform notes section
- http://findourfamily.com/ record count unknown
- PeopleFinderTechStructuredDataSets#Find_Our_Family (more information)
- Not scrapable - freeform Invision BB. (log in with user+pass "zlvypjkmku" to view)
Scraping volunteers
Please sign up on the Katrina Scrapers mailinglist: mailto:katrinascrapers-subscribe@civicspacelabs.org and introduce yourself
Tools
PFIF XML Generators
- PFIF XML Generation (http://ponderer.org/cvs/index.pl/python/katrina/src/) (Python) - objects that can easily be serialized into PFIF XML.
- PFIF XML Generation (http://katrina.internet2.edu/~cilibrar/pfifmake.rb) (Ruby) - Takes an array of Person objects in a single function call. Based on Josh's script sent to the list.
- Perl XML::PFIF module (http://erislabs.net/ianb/projects/pfif/) (Perl) - report problems to <ianb [at] erislabs.net>.
- PFIF XML exporter (http://www.hurricanerefugee.com/pfif_asp_code/) (ASP) - sample code for generating PFIF from SQL Server <egvandell at hotmail.com>
Scrapers
- ICRC scraper (http://www.billglover.com/software/katrina/scrape_ICRC) (Perl) - Deprecated; Brent has a new Java version with fixes for some problems.
- CNN scraping code (http://www.summertime-software.com/CNNScrape.090705.0526.zip) (Python)
- Gulf Coast Scraper (http://homepages.cwi.nl/~cilibrar/projects/a/gulfcoast/process.rb) (Ruby)
- Gulf Coast Scraper (http://nacredata.org/katrina/perl_gulf_format.pl) (Perl)
- connect.castpost.com Scraper (http://lardbucket.org/projects/katrina/scrape_ccc.phps) (PHP) (andy.schmitz [at] gmail.com)
- publicpeoplelocator.com Scraper (http://lardbucket.org/projects/katrina/scrape_ppl.phps) (PHP) (andy.schmitz [at] gmail.com)
- OO PHP Scraping Tool hacked together by Jonathan Lambert (PHP): the main scraper script (http://workhabit.com/framework/scraper.phps), which in this case was used to scrape http://www.publicpeoplelocator.com, and the http class that does the work (http://workhabit.com/framework/class_http.phps). This should be really easy to adjust to scrape virtually any site. It automatically rips tables to arrays, generates the header and footer, cleans up, etc., in a couple of lines of code. Does not appear to support notes.
Misc
- PFIF Uploader (http://lardbucket.org/projects/katrina/split_upload.phps) (PHP) Splits a large XML file into smaller (30 people) chunks and uploads them. Requires libCurl for PHP and write access to the current directory. Edit the second line to refer to your PFIF file. Andy Schmitz <andy.schmitz at gmail.com>
- [removed Combined ICRC and Gulf Coast aggregated SQL]
- Makefile for conversion (http://www.katrinahelp.info/~cilibrar/pfifproj/Makefile)
- Ruby script to convert from PFIF to simple SQL (http://www.katrinahelp.info/~cilibrar/pfifproj/conv.rb)
- Database schema (http://www.katrinahelp.info/~cilibrar/pfifproj/create.sql)
- FindAPlace Application (Drupal)
- Screenshot (image missing): http://civicspacelabs.org/home/files/images/FindAPlace.jpeg
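The chunking idea behind the PFIF Uploader listed under Misc (splitting a large file into 30-person pieces) can be sketched in a few lines. This is not the PHP uploader itself and it does not upload anything; the namespace is assumed to be that of PFIF 1.1 and the function name is made up.

```python
import xml.etree.ElementTree as ET

PFIF_NS = "http://zesty.ca/pfif/1.1"  # assumed PFIF 1.1 namespace
ET.register_namespace("pfif", PFIF_NS)


def split_pfif(pfif_xml, chunk_size=30):
    """Split one PFIF document into documents of at most chunk_size people."""
    root = ET.fromstring(pfif_xml)
    people = root.findall(f"{{{PFIF_NS}}}person")
    chunks = []
    for start in range(0, len(people), chunk_size):
        chunk_root = ET.Element(root.tag)  # fresh root for each chunk
        chunk_root.extend(people[start:start + chunk_size])
        chunks.append(ET.tostring(chunk_root, encoding="unicode"))
    return chunks
```

Each returned string is a complete document with its own root element, so every chunk can be validated or uploaded independently of the others.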

