MakeAssign
From Katrina Help Info
Status update
Nick Branstator and Ethan Zuckerman have been working on assigning posts from bulletin boards over the past 24 hours or so. We've assigned, or are assigning, twenty so far, including the huge Craigslist. For boards we've assigned automatically, we can generally re-run the process daily to pick up new entries. For the next couple of days, we'll take responsibility for doing this.
We're working from a not-definitive, but pretty good, list (http://mirror.voxelsoft.com/list.txt) that's been posted online. Of the sites on it that have more than a handful or a few dozen entries, we've NOT done the following:
To-Do and currently claimed
http://forums.chron.com/n/pfx/forum.aspx?webtag=hc-katrina
http://www.katrina-messages.org/ [recommend this be handled directly by a data entry volunteer; few pages, not worth scraping, but contains a fair number of posts - NickB, 12:13, 5 Sep (EDT) ]
http://www.fullcircle.net/Blah.pl
Done
Moved to structured data
http://www.familylinks.icrc.org/katrina/locate
- Seems like the scraper team has covered this one.
http://wx.gulfcoastnews.com/katrina/status.aspx [claimed]
- I believe this is being scraped directly into the DB. If someone knows otherwise, please correct this. [ Skoczen 18:15, 6 Sep 2005 (EDT)]
Parsed and Wiki Pages made
Props to MattH and Skoczen.
http://forums.sunherald.com/n/mb/listsf.asp?webtag=kr-biloxkatrina [ Scraped to SunHerald, at ~17:00-19:00, 6 Sep 2005 (EDT)]
- Big. 757 chunks, including ~50 meta-chunks of 100-400 posts each
http://208.137.136.144/phpbb2/index.php [clarionledger.com] [scraped to Clarion-Ledger at 12:00, 6 Sep (EDT) ]
http://boards.news.yahoo.com/boards/ - YahooKatrina --KCIvey 02:03, 6 Sep 2005 (EDT)
http://www.wdsu.com/news/4918877/detail.html - WDSU --KCIvey 00:44, 6 Sep 2005 (EDT)
http://picayunekatrina.blogspot.com/ check blog comments - PearlRiver broken up into reasonable chunks --KCIvey 22:46, 5 Sep 2005 (EDT)
- Was also listed above as http://www.blogger.com/comment.g?blogID=16027876&postID=112544983260581601 (you probably don't want to follow that link -- it's a 7 MB page that may take your browser some time to render)
http://directv.com/see/pdf/DirecTV_Crawl_219-262_log.pdf (source link is at http://directv.com/see/landing/reconnect.html in case PDF link goes stale) Jaysauls 10:20, 5 Sep 2005 (EDT) - DirecTV now HTMLified and broken up --KCIvey 21:22, 5 Sep 2005 (EDT)
http://www.websitetoolbox.com/tool/mb/katrinacom?forum=29374 [Scraped to WebSiteToolbox at 12:16, 5 Sep (EDT) ]
http://www.hurricanekatrinasurvivors.com/home/ [scraped to HurricaneKatrinaSurvivors at 11:58, 5 Sep (EDT) ]
http://callhome.textamerica.com/ same as http://missingkatrina.com/ [with photos] [Scraped to MissingKatrina at 11:09, 5 Sep]
http://connect.castpost.com/listing.php?l=a [scraped to CastPost at 23:58 4 Sept]
http://eyesonkatrina.blogspot.com/2005/08/im-ok-line.html [scraped to EyesOnKatrina at 23:29 4 Sept ]
http://findkatrina.com/profiles/ [parsed to Find Katrina at ~22:00 4 Sept ]
http://pub11.bravenet.com/forum/875918109 [scraped to Bravenet at 01:28, 5 Sep ]
http://www.blogherald.com/2005/08/31/hurricane-katrina-missing-persons-appeal/ [scraped to The Blog Herald at 00:15, 5 Sep ]
http://www.hurricanerefugee.com/
http://www.katrinasurvivor.net/find.cfm?PageNum_GetAll=1&sort=name [Scraped at 02:00, 5 Sep KatrinaSurvivor ]
http://www.nowpublic.com/node/17228?from=0&comments_per_page=1000 [Scraped at 00:28, 5 Sep NowPublic ]
http://messageboards.weather.com/3/OpenTopic?a=frm&s=253291764&f=457105398 [Scraped 4 Sep NickB ]
http://www.wwltv.com/forums/viewforum.php?f=15 [Scraped 4 Sep NickB ]
http://www.cnn.com/SPECIALS/2005/hurricanes/list/ [ 01:37, 5 Sep CNN ]
- Linked directly to their pages for this one, since they're about 20 each, and CNN doesn't have single entries. This might cause problems if new records are added to their list, however.
http://www.gwid.com/katrina.php
- This is machine-parsable from two pages, and should be added directly:
- Missing: http://www.gwid.com/missinglist.php
- Found: http://www.gwid.com/foundlist2.php
- Right - but currently we've got a lot of volunteers and Matt did a successful scrape, so we're going to bring it live. EthanZ 09:28, 5 Sep 2005 (EDT)
- Brought it up, thanks to Matt's code EthanZ 10:36, 5 Sep 2005 (EDT)
http://nokr.org/phpBB/index.php done EthanZ 10:36, 5 Sep 2005 (EDT)
http://msnbc.msn.com/apps/connect/search.aspx?n=A&i=1/
http://www.faunaclassifieds.com/forums/forumdisplay.php?f=468 [ Ignored, this looks mostly to be people looking for random people in the reptile industry/hobby. There are no recent posts, or Katrina info. Skoczen 18:02, 6 Sep 2005 (EDT) ]
Tools and How to Index
Some of these sites will be extremely easy to index using the chunker.pl (http://www.ethanzuckerman.com/chunker.pl) program. If you're inclined to use chunker and build wiki pages for these sources, please, please do, and update this wiki page to record what you've done. Others - ones without sequential post IDs - will require more work.
Of the databases we haven't entered, GulfCoastNews is probably the biggest target. That said, it's probably a problem for the structured data guys, not for us. If you feel like making assignment pages for any of these, please do - model them on the existing pages...
EthanZ 20:20, 4 Sep 2005 (EDT)
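To illustrate why sequential post IDs make a site easy to index, here is a minimal sketch in Python. It is not chunker.pl and does not reflect its actual options or output; the function names, the example.org URL template, and the chunk size are all hypothetical. The idea is only that when posts are numbered consecutively, the ID range can be split into fixed-size chunks, and each chunk becomes one assignment page of links.

```python
def chunk_ranges(first_id, last_id, chunk_size):
    """Split the inclusive ID range [first_id, last_id] into consecutive chunks."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    ranges = []
    start = first_id
    while start <= last_id:
        # The last chunk may be shorter than chunk_size.
        end = min(start + chunk_size - 1, last_id)
        ranges.append((start, end))
        start = end + 1
    return ranges


def chunk_to_wikitext(url_template, start, end):
    """Render one chunk as wiki external links, one bullet per post."""
    lines = []
    for post_id in range(start, end + 1):
        # url_template is a hypothetical pattern with an {id} placeholder.
        url = url_template.format(id=post_id)
        lines.append("* [{} Post {}]".format(url, post_id))
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical board with posts numbered 1..25, ten posts per page.
    template = "http://example.org/forum/post?id={id}"
    for start, end in chunk_ranges(1, 25, 10):
        print("== Posts {}-{} ==".format(start, end))
        print(chunk_to_wikitext(template, start, end))
```

Each printed section could be pasted into its own wiki page for a volunteer to work through. Sites without sequential IDs don't fit this pattern, because there is no range to split.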
For non-sequential sites, you might be able to use the scraper.php (http://www.quantumimagery.com/downloads/scraper.php.txt) program. If every link on a site can be found with a string search for the text immediately before and after it (the pre and post text), it will work. Output and config are very similar to chunker.pl. Since this is a scraper, pulling directly from the site, it can be re-run to grab new links. Scraper can handle sites with sequences of pages, in addition to full lists. Skoczen 02:23, 5 Sep 2005 (EDT)
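The pre/post string search can be sketched roughly as follows, in Python. This is not scraper.php's code or config format; the function names and the sample HTML are made up for illustration. It shows the two ideas from the note above: pulling out whatever sits between a fixed pre string and a fixed post string, and keeping only the links not seen on an earlier run so that a re-run picks up just the new posts.

```python
def extract_between(page, pre, post):
    """Return every substring of page found between a pre and a post marker."""
    found = []
    pos = 0
    while True:
        start = page.find(pre, pos)
        if start == -1:
            break
        start += len(pre)
        end = page.find(post, start)
        if end == -1:
            # Unterminated match: stop rather than guess where it ends.
            break
        found.append(page[start:end])
        pos = end + len(post)
    return found


def new_links(found, already_seen):
    """Keep only links not seen on an earlier run, preserving page order."""
    seen = set(already_seen)
    fresh = []
    for link in found:
        if link not in seen:
            seen.add(link)
            fresh.append(link)
    return fresh


if __name__ == "__main__":
    # Made-up page fragment; a real run would fetch the page text first.
    page = '<a href="/post/1">A</a> <a href="/post/7">B</a>'
    links = extract_between(page, '<a href="', '"')
    print(new_links(links, already_seen=["/post/1"]))
```

Plain string search is brittle if the site changes its markup, but it needs no knowledge of post numbering, which is why it suits the non-sequential boards.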

