How to share a dataset
I have built some genealogy websites to help genealogists locate specific facts when building and documenting their family trees. Some of the data come from public record requests, while other data come from website downloads.
One frustration I encounter is when developers take a perfectly good dataset and make it “available” on their website only as a .pdf, which forces me to run OCR or text extraction before the data are usable in developing my site. A .pdf is perfectly good for humans to read (although a .txt file is simpler because it does not require a special viewer or plug-in to open), but to share data with those who would help others find what they need, the best format is a tab-delimited text file, often saved with a .csv or .tsv extension. (Separating fields with commas means putting quotation marks around any field with an embedded comma, and that gets more complicated when a field contains both quotation marks and commas. Really, a tab-delimited file is the easiest.)
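To make the quoting point concrete, here is a minimal Python sketch that writes the same record both ways. The record, the field order, and the helper names (`to_comma_line`, `to_tab_line`, `from_tab_line`) are made up for illustration, and the tab version assumes no field contains a tab or line break.

```python
import csv
import io

# Hypothetical record: the name holds both a comma and quotation marks.
record = ['Smith, John "Jack"', "B-12", "1902"]

def to_comma_line(fields):
    """Comma-separated: csv.writer must quote the field and double its quotes."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerow(fields)
    return buf.getvalue()

def to_tab_line(fields):
    """Tab-separated: a plain join, valid as long as no field holds a tab or line break."""
    for field in fields:
        if "\t" in field or "\n" in field or "\r" in field:
            raise ValueError(f"field contains a tab or line break: {field!r}")
    return "\t".join(fields) + "\n"

def from_tab_line(line):
    """Reading back is a plain split; there are no quoting rules to undo."""
    return line.rstrip("\n").split("\t")

comma_line = to_comma_line(record)
tab_line = to_tab_line(record)
print(comma_line, end="")  # "Smith, John ""Jack""",B-12,1902
print(tab_line, end="")    # the three fields unchanged, separated by tabs
```

The comma version needs a real CSV parser to read back correctly, while the tab version can be split by almost any tool.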
Let’s say you run a cemetery and you put your burial list on the website. I find it somewhat frustrating when a cemetery provides only a query-based search box. Unless the data are proprietary, you should also provide a link to download the whole dataset as a .csv file and another link to a file that is easier for humans to read. If your burial list is not meant to be public, beware that I have learned a few tricks for downloading the whole file anyway. Once I have the data, I can add the contents to a website where search engines will slurp it down and where end-users can learn where their great-uncle was buried. This benefits the cemetery, because the end-user then visits the cemetery website and perhaps even the cemetery itself. Search engines will not enter query terms to acquire data that are trapped behind a search box. You worry about running your cemetery and its website, and let other webmasters download the data for bigger compilations in which relatives can find it more quickly.
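For a site owner wondering what the two download files might look like, here is a rough Python sketch that produces both from one list of records. The burial records, the field names, and the function names are invented for illustration; a real list would come from your own database or spreadsheet.

```python
# Hypothetical burial records; every name and value here is invented.
FIELDS = ["surname", "given", "died", "plot"]
BURIALS = [
    {"surname": "Smith", "given": "John", "died": "1902-03-14", "plot": "B-12"},
    {"surname": "O'Neil", "given": "Mary Ann", "died": "1918-11-02", "plot": "C-4"},
]

def machine_readable(records):
    """One header line, then one tab-separated line per record."""
    lines = ["\t".join(FIELDS)]
    for rec in records:
        lines.append("\t".join(rec[f] for f in FIELDS))
    return "\n".join(lines) + "\n"

def human_readable(records):
    """Columns padded with spaces to the widest value in each column."""
    widths = {
        f: max([len(f)] + [len(rec[f]) for rec in records]) for f in FIELDS
    }

    def fmt(values):
        cells = [v.ljust(widths[f]) for f, v in zip(FIELDS, values)]
        return "  ".join(cells).rstrip()

    lines = [fmt(FIELDS)]
    lines += [fmt([rec[f] for f in FIELDS]) for rec in records]
    return "\n".join(lines) + "\n"

print(machine_readable(BURIALS))
print(human_readable(BURIALS))
```

Saving each string to its own file and linking both from the search page gives visitors the whole list in either form.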
The same is true for websites about anything else. Don’t just provide a query-based search box; also provide a download link, because somebody might want the whole dataset, both in human-friendly form and in tab-delimited .csv form. If it is proprietary, make sure it is secure. Some search boxes will return the whole dataset when the end-user simply hits SEARCH without entering any search terms. Others require more work. Some of them I cannot download with my limited knowledge, but others can.