Thursday, August 24, 2017

Anatomy of an if function

An if function is pretty simple - you write a test condition that will produce a true or false value when evaluated. If true, it will execute one set of functions that you designate, if false, another.

Or, in pseudocode:
if (<test condition>,
<execute these functions if true>,
<execute these functions if false>)

An if function is great if you want to only edit certain cells in a column, but not others. The hardest part, though, is writing the test condition. Basically, all you're really doing is trying to stack functions together to make GREL spit out a boolean. Here are a few hints to help write the test conditions.



Hint 1 - Get comfortable with value.match() and value.contains()
The first thing to do when you're building your test condition is try to say in English what data you only want to work on. If you can verbalize some sort of pattern, you can turn that into a test condition.

Most string test conditions can be covered by using value.contains() or a combination of  isNull(), isNotNull(), and value.match(). I would advise at the start to just sticking to these combinations until you get more comfortable with GREL.

I wrote in detail about how value.match with capturing groups and turning it into a single boolean here.


Hint 2 - If you're looking to edit on multiple patterns, you can wrap the and() and or() functions around the test conditions.

Say you only want to edit if the cell data contains <string1> or <string 2>. What you can do to turn that into a boolean is to insert it into an or() statement: or(value.contains("<string1>"), value.contains("<string2>"))

Or statements can take more than two arguments:
or(value.contains("<string1>"), value.contains("<string2>"), value.contains("<string3>"), value.contains("<string4>"))

(tests to see if the cell data contains either <string1>, <string2>, <string3>, or <string4>.)

If you only want to edit the cell data if it contains both <string1> and <string2>, then you can write an and test condition:
and(value.contains("<string1>"), value.contains("<string2>"))

Like or, and can take any number of string matches.

You could also drop value.match() into and() and or(). Just make sure to wrap isNull or isNotNull around it, first.

So, for example, say I had a series of  MARC records in Open Refine, which are for online resources. I only want to prepend "1 online resource" to the 300 $a. However, a number of records are incomplete and some of them start with subfield $c. Some of the records also already have "1 online resource", and I want to not prepend on those.

First thing is to facet the 300 fields. My criteria for the test conditions are:
1) must start with subfield $a
and
2) must not already start with 1 online resource.

So, I just write the test conditions one by one:
1) isNotNull(value.match(/(^\$a.*)/))
and
2) must not already start with "1 online resource".

Again, one more value.match:
1) isNotNull(value.match(/(^\$a.*)/))
and
2) isNull(value.match(/(^1 online resource.*)/))

Now, put it in the and() function:
and(isNotNull(value.match(/(^\$a.*)/)), isNull(value.match(/(^1 online resource.*)/)))


Hint 3 - Any valid test condition can also be used as a facet
Or, conversely, any true/false custom facet can also be used as a test condition. If you use your test condition as a custom facet and it doesn't generate true/false facets, you need to rewrite your condition.

Hint 4 - Always test each layer of your test condition
Before you start using isNull() or isNotNull(), take advantage of the fact that Open Refine previews your GREL. Preview your value.match() function and make sure that the elements in the array correspond to the capturing groups you've written.

Then, when you wrap isNull or isNotNull around your value.match(), make sure that the preview screen shows either "true" or "false".

Hints for the "do something" phase

Hint 1 - If you only want to edit data for one condition, use "value" if you want to preserve the other data.

Computers are very literal. If you tell it to edit only if your test condition is true, you need to tell it what to do with the data if your test condition is false.

If you want to just leave other data as is, your if statement would be:
if(<test condition>, <edit if true>, value)

If you only want to edit if your test condition is false, but otherwise want to preserve the data if your test condition is true, you would write:
if(<test condition>, value, <edit if false>)

Hint 2 - If your test condition gave you true/false values and yet your if statement still isn't working, test everything bit of your <do something> step by step.
Sometimes the if statements can get kind of complex, especially if you're using match() with capturing groups and the values from other columns. It's too easy to forget a parentheses or an array index. Or it might even be something as simple as a typo.

Even if you have something as simple as a concatenation, it's worth going step by step, because you might have overlooked something.

So, what I will often do is drop the test condition into an if statement, and just have placeholders for the <do something if true> and <do something if false> portions.

Example: A side effect of importing Marcedit files is that the first two characters of any field without a Marc tag (005, 008) wind up in the indicator field. So, I wrote an if statement to take care of it. Basically, I looked for any contents fields beginning with a subfield indicator. If there was one, that field didn't need to be edited. If there was no subfield at the beginning, concatenate whatever was in the indicator column (named Column 3, because I forgot to give it a more meaningful name) with the contents (Column 4) field.

However, as you can see, it didn't work.













I had checked my test condition out as a facet before I dropped it into the if, so I knew that it worked. Everything  looked good, so I needed to go step by step and see what I did wrong.

The first thing to do when debugging is in the <do something if true> and <do something if false> areas, just put a simple placeholder string to output. That will tell you if you have the right number of parentheses in your if statement.

Example of a placeholder string:













 Both of my placeholders outputted to the preview pane, so I knew that I had the right number of parentheses. Next was to check the first part of my concatenation:













I thought for sure that I had written that piece of the concatenation command correctly, so it was useful to see that I had it wrong. I looked at it further and realized that it was cells["Column 3"].value, not cells.["Column 3"].value

So, a quick correction, and the preview pane verifies that I have it correct:













Now on to the second part of my concatenation statement:














As you can see, I have the second part of my concatenation statement wrong. Just to help with my debugging, I cut and pasted the first part of my concatenation to notepad, and decided to isolate out the second part of the concatenation:














I realized at that point that I forgot the array index, but more importantly, I was over complicating things. If I wanted to concatenate all of what was in the contents column, why not just use plain old value?

So, dropping that in:












 And then re-insert the first part of the concatenation:














And then re-insert the statement to do nothing if there's a subfield present:

The if statement is now complete and correct.

Hint 3- If something weird happened to your data, it may not be your if statement. Try faceting away the data and just doing the bare operation. There may be an issue with the data itself that you're not aware of.

After I did the edit above, I noticed this oddity:




On the surface, it looks like the if statement is responsible. However, I starred a couple of rows with the bad transform, and then tried the edit again without an if. It did the same thing.

So, I took a look at the data again:







After trying to trim for excessive whitespace and other manipulations, I took a look at the preview again. That's when I realized that the data was a number (which is why it was green) instead of a string.

Hint 4- Start out doing very simple transforms when you write your first if statement 
You need to walk before you can run. If you can, try writing some simple concatenations using an if. Or replaces.







1 comment:

scottdave said...

I did not realize how useful that OpenRefine can be. Thanks for sharing these tips.