Thursday, July 27, 2017

How to do a MarcEdit Find/Replace with capturing groups in OpenRefine

One of the things that's really frustrated me about OpenRefine is that there didn't seem to be a way to mimic the capturing-group rearranging you can do with the Find/Replace function in MarcEdit.

For example, in this import of MarcEdit records, some of the subfield a's have an indicator in front of them:
[Screenshot: imported records in OpenRefine, some with an indicator in front of subfield a]

In MarcEdit, I'd normally do this:
[Screenshot: the MarcEdit Find/Replace dialog]

But I couldn't figure out how to do the equivalent in OpenRefine. value.match seemed like a contender, but I couldn't figure out how to access the array elements. I finally figured it out today: the array is not stored in a variable that you have to name in order to access it. Instead, you put an index directly on the value.match expression to get hold of the elements. (You'll see what I'm talking about below.)

Step 1 - Facet your data first, if you can. If you can't, use an if statement (which I'll cover in another post).

Step 2 - This will be pretty familiar if you're used to MarcEdit: do a transform on the "Contents" column and use the same regular expression you would use in MarcEdit inside value.match. Note that the syntax is value.match(/<reg ex>/), not value.match(<reg ex>).

[Screenshot: the transform dialog with the value.match expression and its preview]

Now, if you look at lines 23-24, you'll notice that instead of capturing groups, the data has been pushed into an array. This array is the equivalent of the MarcEdit capturing groups. The difference between MarcEdit and OpenRefine is that instead of 1, 2, and 3 for the group labels, you use an array index that is one less: 0, 1, and 2. And instead of using $ to denote the group, you use value.match(/<capturing group reg ex>/) followed by the index in square brackets.

Or:
$1 = value.match(/<capturing group reg ex>/)[0]
$2 = value.match(/<capturing group reg ex>/)[1]
$3 = value.match(/<capturing group reg ex>/)[2] 
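
To see the index shift outside OpenRefine, here is a minimal Python sketch. It is only an analogy, not GREL: Python's re.fullmatch(...).groups() also hands back the captured groups as a zero-based sequence. The sample value and the regex are made up for illustration; they are not the ones from the screenshots.

```python
import re

# Hypothetical cell value: an indicator ("10") sitting in front of subfield a.
value = "10$aOpen data :$ba primer"

# Three capturing groups: indicator, subfield a, subfield b.
pattern = r"(\d+)(\$a[^$]*)(\$b.*)"

# fullmatch requires the whole value to match the pattern.
groups = re.fullmatch(pattern, value).groups()

# MarcEdit's $1, $2, $3 correspond to indexes 0, 1, 2.
print(groups[0])  # 10
print(groups[1])  # $aOpen data :
print(groups[2])  # $ba primer
```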

Step 3 - Now that you have a way of accessing the array elements, all you have to do to leave out the first group is concatenate the other two in OpenRefine:
value.match(/<capturing group reg ex>/)[1] + value.match(/<capturing group reg ex>/)[2]
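
The same "skip the first group, join the rest" idea can be sketched in Python. This is an illustration only, not what OpenRefine runs: drop_indicator, the sample value, and the regex are all hypothetical. The sketch returns the value unchanged when the pattern doesn't match, which is the same concern that faceting in Step 1 takes care of.

```python
import re


def drop_indicator(value):
    """Return the value without its leading indicator (hypothetical helper)."""
    match = re.fullmatch(r"(\d+)(\$a[^$]*)(\$b.*)", value)
    if match is None:
        # The pattern did not match the whole value: leave it unchanged.
        return value
    groups = match.groups()
    # Indexes 1 and 2 play the role of MarcEdit's $2 and $3.
    return groups[1] + groups[2]


print(drop_indicator("10$aOpen data :$ba primer"))
print(drop_indicator("$aNo indicator here"))
```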


[Screenshot: the concatenated value.match expression and its preview]

Now that there's a way to mimic MarcEdit's capturing groups, you can do the same kind of rearranging. Say that, in addition to removing the indicator, I wanted to move the $b in lines 22-24 to the end. I would first write my regex and check that the groups are divided up correctly:

[Screenshot: preview of the value.match array for the new regex]

Now that I know that everything is grouped correctly, it's just a matter of putting together the array elements in the order I want: 

[Screenshot: the array elements concatenated in the new order]
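
As a rough Python analogy of the two stages (check the grouping, then reassemble in a new order), assuming a made-up value with subfields a, b, and c and a four-group regex that is not the one from the screenshots:

```python
import re

# Hypothetical value: indicator, then subfields a, b, and c.
value = "10$aOpen data :$ba primer /$cJane Doe."
pattern = r"(\d+)(\$a[^$]*)(\$b[^$]*)(\$c.*)"

groups = re.fullmatch(pattern, value).groups()

# First, check that each piece landed in the group you expected.
for index, group in enumerate(groups):
    print(index, group)

# Then drop the indicator (index 0) and move subfield b (index 2) to the end.
rearranged = groups[1] + groups[3] + groups[2]
print(rearranged)
```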

And if you wanted to add some text between subfields, that's easy enough to do, too.
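
In Python terms, adding text is just one more string in the concatenation. Again, this is a sketch under assumptions: the value, the regex, and the "[added text]" literal are placeholders, not the data from the post.

```python
import re

# Hypothetical value and regex, as in a simple three-group case.
value = "10$aOpen data :$ba primer"
groups = re.fullmatch(r"(\d+)(\$a[^$]*)(\$b.*)", value).groups()

# A literal string can be placed between any two array elements.
with_text = groups[1] + "[added text]" + groups[2]
print(with_text)
```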

[Screenshot: an expression with literal text added between the array elements]