Open Refine for Librarians: July 2017

Monday, July 31, 2017

How to calculate the number of words in a sentence and how to use that in facets/the if function

I had a set of data where I needed to only process sentences that were more than a certain number of words. Fortunately, doing so is relatively easy.

First of all, the words in a sentence can be split up into an array using value.split(" ").

Secondly, the length() function in Open Refine returns the length, or total number of elements in an array. Since each element is a word, put these two together and the number of words will be calculated:

Using splitByCharType to extract years of coverage in an 856 field

GREL's splitByCharType() is actually quite a useful function. It takes a string, splits it up, and then groups consecutive like characters together in an array. Some of the more common like characters are all lowercase letters, all uppercase letters, all numbers, and all spaces. Special characters and punctuation are only grouped together with the exact same character (e.g. $$ would be grouped together, but $! would not.)

For example, for the string: $a300 pages ;$c20 cm, it'll fill the array this way:
Index                Element
0                        $
1                         a
2                        300
3                        <space>
4                        pages
5                         <space>
6                         ;
7                        $
8                         c
9                        20
10                     <space>
11                      cm

Custom faceting using booleans

So, here's some sample data:

For the editing I want to do, I would like only the lines with v.<number>(<year>)-v<number>(<year>)

Since all of the notations vary, I tried the custom facet:
value.startsWith("v")

The problem is that lines containing data like this weren't excluded:
v.34(2008)-;v.1(1974)-v.26(2000)

Using the Marcedit capturing group Find/Replace in Open Refine, pt.2 : putting it into an if function.

This is a continuation from this post about how to mimic Marcedit capturing group Find/Replace in Open Refine.

The one problem with mimicking the Marcedit Find/Replace is that it will wipe out the data you don't want to alter unless you facet it out first. This is kind of cumbersome and still doesn't have the convenience of Marcedit's Find/Replace.

However, the if function takes care of that problem. I'm going to post the recipe here at the top and then go into detail further down. So, for your capturing group find of value.match(/<capturing group regex>/), do a transform cells, and drop it into this statement:

if(isNotNull(value.match(/<your find>/)), <your replace>, value)

Unpacking value.match with reg ex and how to use it in boolean statements

The value.match documentation on the Open Refine github wiki is a bit daunting, so I'm going to try to break it down in this post.

Some programming speak first: GREL functions usually yield some sort of data after they're done doing their thing. That data is called a return value. So, that's what I'm referring to whenever I make statements like, "match returns <whatever>".

Anyway, the biggest hurdle to my understanding was figuring out how match and regular expressions worked together. Here's what I've determined:

Unlike Marcedit, you have to account for the whole line if you're using regex in match.

How to do a Marcedit Find/Replace with capturing groups in Open Refine

One of the things that's really frustrated me with Open Refine is that there didn't seem to be a way to mimic the capturing group rearranging from the Find/Replace function in Marcedit.

For example, in this import of Marcedit records, some of the subfield a's have an indicator in front of them:

Click on image to enlarge

In Marcedit, I'd normally do this:


click on image to enlarge

But I couldn't figure out how to do the equivalent in Open Refine. value.match seemed like a contender, but I couldn't figure out how to access the array elements. I finally figured it out today --

Comments are now enabled and a label widget has been added.

I didn't realize that comments were disabled for non-Google users. I've changed that, so comment away if you need clarification on my posts or if you would like me to cover something.

I also didn't realize that a label widget wasn't displaying. So I fixed that.

Wednesday, July 26, 2017

Intro to Open Refine

In the spirit of keeping everything all in one place, here's a pointer to a workshop about getting started with Open Refine:
https://www.youtube.com/watch?v=wGVtycv3SS0

Friday, July 21, 2017

How to import and export Marcedit files into Open Refine

Updated 8/22/17 - I noticed older versions of Refine imported the 005 and the 008 as a number instead of string. (You can tell that it's imported as a number if it's in green) There's a tweak for that, too.

Updated 8/14/17 : to take care of the non-Roman characters issue and added a couple of if statements if you use my column transform.

I was looking for Open Refine articles and I found a super-informative post on how to import and export Marcedit files into Open Refine.

How to remove duplicate rows

Here's a pointer to a page that gives detailed instructions on how to remove duplicate data: http://kb.refinepro.com/2011/08/remove-duplicate.html

Thursday, July 20, 2017

Objects, pt. 2

If you haven't read part 1 of this tutorial here, do so before you proceed further.

When you open a project in Open Refine, the entire spreadsheet is broken up into rows. Each cell in the row is mapped into a cell object. What comes next is my best guess because I couldn't find documentation -- I believe that the cell objects are initialized into an array and that becomes part of object CellTuple.

Or, in pseudo-code:

object CellTuple
array list = {array of cell objects for the current row}

Using facets to debug search differences for large files

I was using Millennium's Create List search, but then I needed to tweak it a bit. The problem I had is that the file was rather huge (15K worth of records) and it was going to be a bear to figure out which records had fallen out of the old search and which new records had been picked up.

Usually, I would dump both lists of record numbers into Excel, highlight the duplicate values, and scroll through, but that wasn't feasible for quickly checking 15K records.

Instead, I was able to use Open Refine's faceting ability to help me filter through the data. I dumped the record numbers from the old search into one column in Excel (labeled "Column 1"), dumped the record numbers from the new search into a second column (labeled "Column 2"), saved, and then sent the sheet over to Open Refine.

I hit the drop down menu on any column and chose Facet->Custom Text Facet.

Then I entered this GREL:

cells["Column 1"].value == cells["Column 2"].value

Then I clicked on "false" in the facet to see what wasn't matching. Note that you have to use the object cells and not cell. I should also note that this expression isn't too useful if the two columns are wildly out of sync. It's mostly useful for needle in a haystack situations.

Do <operation> only if test condition is false	if (test condition, value, <operation>)
Do <operation> only if test condition is true	if (test condition, <operation>, value)
Do <operation1> if test condition is true, otherwise do <operation 2>	if (test condition, <operation 1>, <operation 2>)
Do a Marcedit Find/Replace on string1 string2 string3 to transform it into string2; string3; string1	if (isNotNull(value.match(/(regex for string1)(regex for string2)(regex for string3)/)), value.match(/(regex for string1)(regex for string2)(regex for string3)/)[1] + “;” + value.match(/(regex for string1)(regex for string2)(regex for string3)/)[2] + “;” + value.match(/(regex for string1)(regex for string2)(regex for string3)/)[0], value)

Test Condition Cheat Sheet

Example data to be used is
auto
autos
auto mechanics
automechanics
auto mechanics and aviation
planes
trains
automobiles
aviation
aviation and automechanics

NOTE : Test conditions can also be used for custom text faceting

Test Condition	Code
String must exactly match auto	value == “auto”
String can be auto, autos, auto mechanics, automechanics, or auto mechanics and aviation	value.contains("auto")
String can be auto mechanics, automechanics, auto mechanics, or auto mechanics and aviation.	isNotNull(value.match(/(.auto.mechanics.*)/))
String can be auto, autos, planes, trains, or automobiles.	isNull(value.match(/(.auto.mechanics.*)/))
String must match auto, autos, or planes.	or(isNotNull(value.match(/(^auto[s]/))), value == "planes")
String must start with some variation of auto.	value.startsWith("auto")
Find all of the variations of auto mechanics and aviation	and(isNotNull(value.match(/(.auto\smechanics.*)/)), value.contains("aviation"))
All strings that begin with "a"	value.startsWith("a")
All strings that end with "s"	value.endsWith("s")
String must be less than 4 words	length (split(value, “ “)) < 4
String must contain any substring with variable spacing of “auto mechanics” and must be less than 4 words	and(isNotNull(value.match(/.auto.mechanics.*/)), length(split(value, " ")) < 4)
Cell is completely blank	isBlank(value)
Cell is not blank	isNonBlank(value)

Test conditions for dates

Test string is a single range of years, such as 1938-1968. No spaces surrounding the hyphen.

First year must be greater than 1938	toNumber(substring(value, 0,4)) > 1938
First year must equal 1938	toNumber(substring(value, 0,4)) == 1938
First year must be less than 1938	toNumber(substring(value, 0,4)) < 1938
Second year must be less than 1968	toNumber(substring(value,5,9)) < 1968
Second year must be equal 1968	toNumber(substring(value,5,9)) == 1968
Second year must be less greater than 1968	toNumber(substring(value,5,9)) > 1968
Second year is greater than the first	toNumber(substring(value,5,9)) > toNumber(substring(value,0,4))

Open Refine for Librarians

Monday, July 31, 2017

How to calculate the number of words in a sentence and how to use that in facets/the if function

Using splitByCharType to extract years of coverage in an 856 field

Friday, July 28, 2017

Custom faceting using booleans

Using the Marcedit capturing group Find/Replace in Open Refine, pt.2 : putting it into an if function.

Unpacking value.match with reg ex and how to use it in boolean statements

Thursday, July 27, 2017

How to do a Marcedit Find/Replace with capturing groups in Open Refine

Comments are now enabled and a label widget has been added.

Wednesday, July 26, 2017

Intro to Open Refine

Friday, July 21, 2017

How to import and export Marcedit files into Open Refine

How to remove duplicate rows

Thursday, July 20, 2017

Objects, pt. 2

Tuesday, July 18, 2017

Using facets to debug search differences for large files

I hit the drop down menu on any column and chose Facet->Custom Text Facet.

Then I entered this GREL:

cells["Column 1"].value == cells["Column 2"].value

Cleaning out old Open Refine projects

forRange cheat sheet

If structure cheat sheet

Favorite Sci-Fi Shows

Contributors

Labels

Blog Archive