Friday, July 21, 2017

How to import and export Marcedit files into Open Refine

 Updated 8/22/17 - I noticed older versions of Refine imported the 005  and the 008 as a number instead of string. (You can tell that it's imported as a number if it's in green) There's a tweak for that, too.

Updated 8/14/17 : to take care of the non-Roman characters issue and added a couple of if statements if you use my column transform.

I was looking for Open Refine articles and I found a super-informative post on how to import and export Marcedit files into Open Refine.



Import and adjust Marcedit format:
http://www.meanboyfriend.com/overdue_ideas/2015/07/worked-example-fixing-marc-data-4/

Exporting:
http://www.meanboyfriend.com/overdue_ideas/2015/07/worked-example-fixing-marc-data-5/

However, I have a few tweaks - the initial import screenshot took me a while to figure out, so I took a new one. In order to import a Marcedit file, you would do the usual create project, but right before you hit the "Create Project" button, you need to make a couple of adjustments.

First, under "parse data", select the "fixed width field text files" line. You will then set the column widths yourself. I've found what works for Marcedit files is to set the column widths to 4,2,2, and the remaining width of the file (I extended the line for my screenshot, but Rachel found that it's better to use .)
(click to get a closer view)










This still has the problem of some of the data in the LDR and the 008 winding up in the Indicators column, but you can fix by faceting them out, concatenating the Indicators column with the Contents column, and then blanking down the Indicators column.

Or, if you can just transform and use two if statements:
On the Contents column, transform with this statement to append the stuff that accidentally wound up in the Indicators column :if(isNull(value.match(/(^\$.*)/)),cells.Indicators.value+value,value)

Then transform on the Indicators column to clean out the values that don't belong there:
if(isNull(match(cells.Contents.value, /(^\$.*)/)),"",value)

Tweak number two - if you have non-Roman characters, make sure to select "UTF-8" for the character encoding. Just click on the "character encoding" box and select. Unfortunately, this is only an option under Open Refine and not LODrefine.

Tweak three - Older versions of Refine are imported the 005 and 008 fields as numbers (you can tell it's a number because it's green) You can either upgrade to the latest version of Refine, 2.7, or just do this transform:  toString(floor(value))

The fourth tweak is with the export - if you haven't removed the "=" before each tag, use this export template:
{{if(isBlank(cells["Tag"].value),"",cells["Tag"].value+"  ")}}{{if(isBlank(cells["Indicators"].value),"",cells["Indicators"].value)}}{{if(isBlank(cells["Content"].value),"",cells["Content"].value)}}

1 comment:

Rachel Jaffe said...

In testing this out, I made a couple modifications to how Lisa parsed her MARC data in her example. Using the "Parse data as ... Fixed-width field text files" form:

1) I set my column widths to 1,3,1,2,. [1 (for the pesky =), 3 (MARC tag), 1 (pesky space), 2 (indicator values), . (everything else)] My test set of records contained some rights statements that blew past the 68 character limit Lisa defined in her example; using "." my rights statements can be paragraphs long! I also wanted to keep my pesky non-data (equal sign and spaces) so that I could avoid having to potentially add them back in later.

2) I went ahead and added my column names Equal,Tag,Space,Indicators,Content.

3) I un-ticked the "Parse cell text into numbers, dates, ..." because when I left it ticked, OpenRefine converted my MARC tags, e.g. 001 to 1.

Update preview, create project, done :)