Converting a word processed or pdf coding key into a set of tDAR CSV coding sheets

The goal is to get from the text of a formatted document that describes a coding system to a set of tDAR coding sheets. In a coding sheet, each line has a code, followed by a comma, followed by its value, optionally followed by another comma and a description (e.g. how this value is distinguished). For example,

1,Male
2,Female
9,Indeterminate

is a perfectly good coding sheet consisting of three lines for the column SEX. So is the following, which uses string codes instead of integers:

M,Male
F,Female
?,Indeterminate

The codes, the first entry on each line of a coding sheet, must be either integers or strings. Leading zeros on integer codes are ignored, so 001, 01, and 1 are interpreted the same (and appear as 1). Codes can also be standardized string values, including strings of numbers that contain decimal points. Thus, Flr and Fil could be context codes, and 01.1, 1.1, and 1.2.1 could all be codes that would be expressed as strings. However, string codes must match exactly: as string codes, flr and Flr are different, and 01.1 and 1.1 are different.

The values and optional descriptions can have embedded spaces and can include most special characters (this is not true for ontologies, but never mind that for now). They cannot contain a lone double quote; if you must use one, write it as two double quotes in a row (""). If a value or description includes a comma, the whole value or description needs to be enclosed in double quotes. The order of the lines (and hence the values) within the coding sheet doesn't matter; the codes need not be in numerical or alphabetical order.
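If you would rather generate a coding sheet with a script than type the quotes by hand, the quoting rules above are the standard CSV ones, so a CSV library can apply them for you. The following is a minimal sketch in Python; the rows and the function name coding_sheet_text are made up for illustration and are not part of tDAR.

```python
import csv
import io

# Hypothetical rows for illustration: (code, value, optional description).
rows = [
    ("1", "Male", ""),
    ("2", "Female", ""),
    ("9", "Indeterminate", 'Recorded as "?" on field forms, or not assessed'),
]

def coding_sheet_text(rows):
    """Return the rows as CSV text, adding quotes only where they are needed."""
    buffer = io.StringIO()
    # QUOTE_MINIMAL encloses a field in double quotes only if it contains a
    # comma or a double quote; embedded double quotes are doubled.
    writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    for code, value, description in rows:
        # Leave the description column off when there is none.
        fields = [code, value] + ([description] if description else [])
        writer.writerow(fields)
    return buffer.getvalue()

sheet = coding_sheet_text(rows)
print(sheet)
```

The third line comes out as 9,Indeterminate,"Recorded as ""?"" on field forms, or not assessed", with the comma protected by the enclosing quotes and the embedded quotes doubled.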

Coding Sheets in tDAR

The set of lines that decodes all of the values for a given column of your spreadsheet or database represents a "Coding Sheet" for that column. Each column that is a coded integer, real, or string needs a separate coding sheet that you will need to create in tDAR (these coding sheets would usually reside within the same project that contains your database or spreadsheet). When you create a coding sheet in tDAR, you need to give it a title. Use something that identifies your project and the variable being coded, because when you need the coding sheet later you will see only its title. For example:

Heshotauthla Fauna Coding Sheet - Taxon
Heshotauthla Ceramic Coding Sheet - Paste

There are two ways to get the coding sheet content into tDAR's coding sheet resource. As you create the coding sheet resources in tDAR, the Submit As box allows you to choose whether to upload a file in csv format from your computer (as described above) or to cut and paste the lines that represent the coding sheet into a text box on the coding sheet entry. Depending on which you choose, tDAR will either let you browse to locate the file or give you a text box to type (or cut and paste) into. Unless it is a very simple coding sheet, you will want to maintain it on your computer rather than just type it in, in case you need to change it later.

If you have a coding key already typed up, you will want to convert that into csv format to avoid tedious retyping.

Coding Key in a Spreadsheet

If your coding key was created in Excel or another spreadsheet, you need to arrange it so that the codes are in the first column (A), their values in the second (B), and, optionally, more elaborate descriptions in the third (C). There should be nothing to the right of the third column. If you have multiple coding sheets in one spreadsheet, you can select the rows that correspond to one coding sheet, choose "Save As", select "CSV" format, and name and save the file in an appropriate folder on your computer. If there is just one coding sheet in the spreadsheet, be sure to eliminate any blank lines, titles, or text other than the coding sheet information, and Save As in CSV format without selecting rows. (When you Save As into csv in Excel you will get a message about losing features; just say OK.)

If your coding key is an Adobe Acrobat (pdf) or word processed document, conversion is a bit more work, but for a long coding key it will probably be worth it. You can't save either a word processed (e.g. Word) or pdf file directly to CSV.

Coding Key in a Word Processor Document

What you want to do with a word processed file is to open the document and use the word processor's search and replace function to help edit the document into a series of sets of lines, each of which represents a coding sheet in csv format. For example, you will want to convert tabs to spaces, convert multiple spaces to single spaces, eliminate leading spaces, and eliminate text within the lines of a specific coding key that should not be part of the key. Note that Word and most word processors have a means to replace formatting characters such as tabs (in Word 2007, at the bottom of the find and replace box click on "More" and then, with the Find tab open, click on Format). You then use Save As to save the modified coding key document in a text format. When you Save As, you must first select Plain text (.txt) in the dropdown box of formats in Word (or perhaps ASCII format in other word processors). If Word asks, "Windows Default" is OK.
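If you are comfortable running a small script, the same clean-up can be done on the saved plain text file instead of in the word processor. This is only a sketch in Python of the steps listed above (tabs to spaces, multiple spaces to one, no leading spaces); the function names are made up, and dropping blank lines is an extra assumption on my part.

```python
import re

def clean_line(line):
    """Apply the clean-up steps described above to one line of text."""
    line = line.replace("\t", " ")      # convert tabs to spaces
    line = re.sub(r" {2,}", " ", line)  # collapse runs of spaces to one space
    return line.strip()                 # remove leading and trailing spaces

def clean_text(text):
    """Clean every line and drop lines that end up empty (an assumption)."""
    cleaned = (clean_line(line) for line in text.splitlines())
    return [line for line in cleaned if line]

sample = "  1\tMale\n\n 2    Female  \n"
result = clean_text(sample)
print(result)
```

The result still has a space rather than a comma between each code and its value, and any text that does not belong in the key still has to be removed by hand.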

Next you will want to open the plain text file in a text editor and fix any problems. Things may show up in the text editor that you don't see in the word processor's view. For Windows machines the supplied text editor is Notepad, which is found in the Accessories folder of All Programs. (There are other, much better ones free on the web. One that I have used is http://download.jgsoft.com/editpad/SetupEditPadLite.exe.) Once the document is in the proper shape, be sure to save it (as a txt file is fine).

If your document includes multiple coding sheets (i.e. for multiple columns), then it is easiest to cut and paste each subset into the appropriate tDAR coding sheet. Alternatively, you can save each set of lines as a separate file. This will probably require deleting all the lines above and below the set and using Save As to save it under a new name (so you don't lose your work) with a csv extension, such as "species codes.csv". In this case you use the upload option rather than the text box option in tDAR.

To convert a pdf document (that has been character recognized), the process is similar. You will want to select all the text in the document, then open a text editor such as Notepad (see above) and paste the text into it. You can then use search and replace and edit the document into the proper format as you would have done, above, with your word processor.

Once You have Submitted the Coding Sheet

Be sure to look at the result. If you had a comma inside a value, you will see that the comma is gone and the text after the comma now appears as the description. You will need to go back and enclose the value in double quotes or eliminate the comma.
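You can also look for this problem before you submit. The sketch below, in Python, parses each line the way a standard CSV reader would and sorts lines into definite problems (fewer than two or more than three fields) and lines worth a second look (exactly three fields, where the third may be an intended description or the tail of a value that lost its quotes). The function name and sample lines are made up, and tDAR's own parser may differ in details.

```python
import csv

def check_coding_sheet(lines):
    """Sort lines into definite problems and lines worth a second look."""
    problems, review = [], []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        fields = next(csv.reader([line]))
        if len(fields) < 2 or len(fields) > 3:
            # Cannot be read as code,value or code,value,description.
            problems.append((number, line))
        elif len(fields) == 3:
            # The third field may be a description or part of a split value.
            review.append((number, line))
    return problems, review

# Hypothetical coding sheet lines for illustration.
sample = [
    "1,Male",
    "2,Bone, burned",
    '3,"Bone, calcined"',
    "4,Shell, marine, worked",
]
problems, review = check_coding_sheet(sample)
print("Problems:", problems)
print("Review:", review)
```

Here line 4 is reported as a problem, line 2 is listed for review, and line 3 passes because its value is enclosed in double quotes.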

Note for more Geeky Types

If you have a lot of consistently formatted lines that need to be converted, you may want to find a fancier text editor that supports regular expressions, which allow MUCH more complex search and replace operations. I use TextPad ($16.50, http://www.textpad.com), which I like a lot. For example, you can do a single search and replace operation that would convert lines of the form:

Bunny (sylvilagus)      010
Jackrabbit (lepus)       020

to a proper coding sheet form:

10,sylvilagus sp. (bunny)
20,lepus sp. (jackrabbit)

However, there is a learning curve here. 
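To give a feel for what such an operation looks like, here is the same conversion sketched in Python rather than in a text editor. The pattern and the function name convert are my own and assume every line has exactly the form shown above: a common name, a genus in parentheses, then a code.

```python
import re

# Matches lines of the form: Common name (genus)   code
PATTERN = re.compile(
    r"^\s*(?P<common>.+?)\s*\((?P<genus>[^)]+)\)\s+(?P<code>\S+)\s*$"
)

def convert(line):
    """Turn 'Bunny (sylvilagus)      010' into '10,sylvilagus sp. (bunny)'."""
    match = PATTERN.match(line)
    if match is None:
        return None  # the line does not have the expected form
    code = match["code"]
    if code.isdigit():
        code = str(int(code))  # drop leading zeros on integer codes
    common = match["common"].lower()
    return f"{code},{match['genus']} sp. ({common})"

for line in ["Bunny (sylvilagus)      010", "Jackrabbit (lepus)       020"]:
    print(convert(line))
```

Lines that do not fit the pattern come back as None, so they can be found and fixed by hand rather than silently mangled.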
