Friday, July 28, 2017

Custom faceting using booleans



So, here's some sample data:









For the editing I want to do, I would like only the lines in the format v.<number>(<year>)-v.<number>(<year>)

Since all of the notations vary, I tried the custom facet:
value.startsWith("v")

The problem is that lines containing data like this weren't excluded:
v.34(2008)-;v.1(1974)-v.26(2000)
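
To see why this first facet falls short, here is a minimal Python sketch of the same test. Python's str.startswith stands in for GREL's startsWith; it is only a model of the facet, not what Refine runs. The first value is the problem line above, and the second is a made-up value in the format I want.

```python
# Python stand-in for the GREL facet value.startsWith("v").
values = [
    "v.34(2008)-;v.1(1974)-v.26(2000)",  # problem line from the post
    "v.1(1974)-v.26(2000)",              # made-up example of the wanted format
]


def starts_with_v(value):
    """Model of value.startsWith("v")."""
    return value.startswith("v")


# Map each value to the facet bucket (True/False) it would land in.
facet = {value: starts_with_v(value) for value in values}

for value, result in facet.items():
    print(result, value)
```

Both values start with "v", so both land in the same bucket, which is exactly the problem.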

NOTE: This post covers faceting for Open Refine 2.8 and below. It's changed in Open Refine 3.1 (and possibly 3.0), so if you have one of those versions, be sure to read the update at the bottom.

I could technically do a second custom facet with value.contains(";") and then work on whatever facets as false, but fortunately, you can work with booleans in custom facets.

To write my boolean, it's easiest if I write down my two criteria in English first:
lines that start with v
and
lines that don't have a semicolon

We've already done the first part:
value.startsWith("v")
and
lines that don't have a semicolon

Finding lines that do contain a semicolon is just a simple value.contains(";"). Fortunately, to do the opposite, all I need to do is wrap GREL's not() function around my value.contains().

In full, it's: not(value.contains(";"))

Just to be sure I have it right, I tested my not function against my data:

 









It's working - the lines without the semicolon are faceting on true. I recommend testing if you're ever unsure about whether or not anything you're writing will work.

Now that I know that it works, I drop the not into my criteria:

value.startsWith("v")
and
not(value.contains(";"))

To put it all together, I can use GREL's and() function. Instead of the classic and statement format, which is:
<boolean function 1>
and
<boolean function 2>
(The whole statement is evaluated and returns true only if all are true)

GREL's format is: and(<boolean function 1>, <boolean function 2>).

So, after dropping in my boolean functions from my criteria, my custom facet is:
and(value.startsWith("v"), not(value.contains(";")))












And now, when I click on the "true" facet, all I have are the lines with the v.<number>(<year>)-v.<number>(<year>) format.
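
As a rough cross-check outside of Refine, the combined facet can be modeled in Python. This is only a sketch of the logic: the function name wanted is my own, and every sample value except the problem line from earlier in the post is made up.

```python
def wanted(value):
    """Model of and(value.startsWith("v"), not(value.contains(";")))."""
    return value.startswith("v") and not (";" in value)


values = [
    "v.34(2008)-;v.1(1974)-v.26(2000)",  # problem line from the post
    "v.1(1974)-v.26(2000)",              # made-up: the wanted format
    "no.1(1990)-no.5(1994)",             # made-up: does not start with v
]

# Map each value to the facet bucket (True/False) it would land in.
combined = {value: wanted(value) for value in values}

for value, result in combined.items():
    print(result, value)
```

Only the second value lands in the true bucket; the semicolon line now facets as false.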

By the way, GREL's and is not limited to two boolean functions*. You can put in as many functions as you want, as long as you separate each with a comma.

GREL also has an or() function. It's the same format as and(),
 i.e. or(<boolean function 1>, <boolean function 2>) 
but it returns true if one or more of your boolean functions are true. This also can be dropped into the custom text facet.
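
If it helps to think of it in another language, and() and or() with any number of arguments behave much like Python's all() and any(). The helper names grel_and and grel_or below are made up for this sketch, and the sample value is invented.

```python
def grel_and(*booleans):
    """Model of and(b1, b2, ...): true only if every argument is true."""
    return all(booleans)


def grel_or(*booleans):
    """Model of or(b1, b2, ...): true if one or more arguments are true."""
    return any(booleans)


value = "v.1(1974)-v.26(2000)"  # made-up sample value

# Three criteria chained in one and(): starts with v, no semicolon, has a year.
all_three = grel_and(value.startswith("v"), ";" not in value, "(" in value)

# Three criteria chained in one or(): only the last one is true here.
any_of_three = grel_or("no." in value, "series" in value, "v." in value)

print(all_three, any_of_three)
```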

And yes, you can drop isNotNull(value.match()) and isNull(value.match()) into the or() and and() functions.

I also discovered that you can use if statements in custom text facets, but that'll be covered in a future post.

*Note: Your mileage may vary depending on which version of Refine you're using. I've found that LODRefine and Open Refine 2.7-2.8 allow you to chain as many booleans as you want in the or() and and() functions, but my version of Open Refine (2.6.1) limits you to two.

However, for transforms and add columns, you are not limited.

I can get around some of the limitations of or() using value.match() and a regex. For example, say you had this or() custom facet:
or(value.contains("v."), value.contains("no."), value.contains("series"))

If you get an error such as "expecting two booleans", you can rewrite it as:
isNotNull(value.match(/(.*v\..*|.*no\..*|.*series.*)/))
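
To convince myself the two versions agree, here is a small Python sketch that compares them. It assumes GREL's match() has to match the entire string (which is why each alternative is wrapped in .* on both sides), so re.fullmatch is the closest Python equivalent. All four sample values are made up.

```python
import re

# Same pattern as in the GREL facet.
PATTERN = re.compile(r"(.*v\..*|.*no\..*|.*series.*)")


def or_facet(value):
    """Model of or(value.contains("v."), value.contains("no."), value.contains("series"))."""
    return "v." in value or "no." in value or "series" in value


def regex_facet(value):
    """Model of isNotNull(value.match(/.../)), using a whole-string match."""
    return PATTERN.fullmatch(value) is not None


values = ["v.1(1974)-v.26(2000)", "no.3(1999)", "new series 2", "index 1985"]

# True only if both versions put every value in the same bucket.
agree = all(or_facet(v) == regex_facet(v) for v in values)

for v in values:
    print(or_facet(v), regex_facet(v), v)
```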

Update for Open Refine 3.1:
It appears that Open Refine 3.1 (and possibly 3.0) no longer facets true/false as a word. If you try the above and the facet comes out as <boolean> only, you need to convert the facet to a string using the toString() function.

So, this custom facet in 2.8:
or(value.contains("v."), value.contains("no."), value.contains("series"))

Needs to be this in 3.1 in order to work the same way:
toString(or(value.contains("v."), value.contains("no."), value.contains("series")))

Same for isNotNull:
toString(isNotNull(value.match(/(.*v\..*|.*no\..*|.*series.*)/)))
