Monday, July 31, 2017

How to calculate the number of words in a sentence and how to use that in facets/the if function

I had a set of data where I needed to only process sentences that were more than a certain number of words. Fortunately, doing so is relatively easy.

First of all, the words in a sentence can be split up into an array using value.split(" ").

Secondly, the length() function in Open Refine returns the length, or total number of elements in an array. Since each element is a word, put these two together and the number of words will be calculated:

Using splitByCharType to extract years of coverage in an 856 field

 GREL's splitByCharType() is actually quite a useful function. It takes a string, splits it up, and then groups consecutive like characters together in an array. Some of the more common like characters are all lowercase letters, all uppercase letters, all numbers, and all spaces. Special characters and punctuation are only grouped together with the exact same character (e.g. $$ would be grouped together, but $! would not.)

For example, for the string: $a300 pages ;$c20 cm, it'll fill the array this way:
Index                Element
0                        $
1                         a
2                        300
3                        <space>
4                        pages
5                         <space>
6                         ;
7                        $
8                         c
9                        20
10                     <space>
11                      cm

Friday, July 28, 2017

Custom faceting using booleans



So, here's some sample data:









For the editing I want to do, I would like only the lines with v.<number>(<year>)-v<number>(<year>)

Since all of the notations vary, I tried the custom facet:
value.startsWith("v")

The problem is that lines containing data like this weren't excluded:
v.34(2008)-;v.1(1974)-v.26(2000)

Using the Marcedit capturing group Find/Replace in Open Refine, pt.2 : putting it into an if function.

This is a continuation from this post about how to mimic Marcedit capturing group Find/Replace in Open Refine.

The one problem with mimicking the Marcedit Find/Replace is that it will wipe out the data you don't want to alter unless you facet it out first. This is kind of cumbersome and still doesn't have the convenience of Marcedit's Find/Replace.

However, the if function takes care of that problem. I'm going to post the recipe here at the top and then go into detail further down. So, for your capturing group find of value.match(/<capturing group regex>/), do a transform cells, and drop it into this statement:

if(isNotNull(value.match(/<your find>/)), <your replace>, value)

Unpacking value.match with reg ex and how to use it in boolean statements

The value.match documentation on the Open Refine github wiki is a bit daunting, so I'm going to try to break it down in this post.
Some programming speak first: GREL functions usually yield some sort of data after they're done doing their thing. That data is called a return value. So, that's what I'm referring to whenever I make statements like, "match returns <whatever>".
Anyway, the biggest hurdle to my understanding was figuring out how match and regular expressions worked together. Here's what I've determined:
Unlike Marcedit, you have to account for the whole line if you're using regex in match.  

Thursday, July 27, 2017

How to do a Marcedit Find/Replace with capturing groups in Open Refine

One of the things that's really frustrated me with Open Refine is that there didn't seem to be a way to mimic the capturing group rearranging from the Find/Replace function in Marcedit.

For example, in this import of Marcedit records, some of the subfield a's have an indicator in front of them:
Click on image to enlarge









In Marcedit, I'd normally do this:
click on image to enlarge
















But I couldn't figure out how to do the equivalent in Open Refine. value.match seemed like a contender, but I couldn't figure out how to access the array elements. I finally figured it out today -- 

Comments are now enabled and a label widget has been added.

I didn't realize that comments were disabled for non-Google users. I've changed that, so comment away if you need clarification on my posts or if you would like me to cover something.

I also didn't realize that a label widget wasn't displaying. So I fixed that.

Wednesday, July 26, 2017

Intro to Open Refine

In the spirit of keeping everything all in one place, here's a pointer to a workshop about getting started with Open Refine:
https://www.youtube.com/watch?v=wGVtycv3SS0

Friday, July 21, 2017

How to import and export Marcedit files into Open Refine

 Updated 8/22/17 - I noticed older versions of Refine imported the 005  and the 008 as a number instead of string. (You can tell that it's imported as a number if it's in green) There's a tweak for that, too.

Updated 8/14/17 : to take care of the non-Roman characters issue and added a couple of if statements if you use my column transform.

I was looking for Open Refine articles and I found a super-informative post on how to import and export Marcedit files into Open Refine.

How to remove duplicate rows

Here's a pointer to a page that gives detailed instructions on how to remove duplicate data: http://kb.refinepro.com/2011/08/remove-duplicate.html

Thursday, July 20, 2017

Objects, pt. 2



If you haven't read part 1 of this tutorial here, do so before you proceed further.

When you open a project in Open Refine, the entire spreadsheet is broken up into rows. Each cell in the row is mapped into a cell object. What comes next is my best guess because I couldn't find documentation -- I believe that the cell objects are initialized into an array and that becomes part of object CellTuple.

Or, in pseudo-code:

object CellTuple
         array list = {array of cell objects for the current row}

Tuesday, July 18, 2017

Using facets to debug search differences for large files

I was using Millennium's Create List search, but then I needed to tweak it a bit. The problem I had is that the file was rather huge (15K worth of records) and it was going to be a bear to figure out which records had fallen out of the old search and which new records had been picked up.

Usually, I would dump both lists of record numbers into Excel, highlight the duplicate values, and scroll through, but that wasn't feasible for quickly checking 15K records.

Instead, I was able to use Open Refine's faceting ability to help me filter through the data. I dumped the record numbers from the old search into one column in Excel (labeled "Column 1"), dumped the record numbers from the new search into a second column (labeled "Column 2"), saved, and then sent the sheet over to Open Refine.

I hit the drop down menu on any column and chose Facet->Custom Text Facet.

Then I entered this GREL:

cells["Column 1"].value == cells["Column 2"].value


Then I clicked on "false" in the facet to see what wasn't matching. Note that you have to use the object cells and not cell. I should also note that this expression isn't too useful if the two columns are wildly out of sync. It's mostly useful for needle in a haystack situations.

Cleaning out old Open Refine projects

This page has a nice guide for cleaning out old Open Refine projects: http://kb.refinepro.com/2012/01/delete-multiple-project-at-once.html

forRange cheat sheet

Click to enlarge images:


If structure cheat sheet


Updated 8/24/17

If structure cheat sheet
<operation> can be any function – reprinting a cell value, concatenating cells values, any string processing function listed in the wiki, and even a foreach, with, or another if function.


Do <operation> only if test condition is false
if (test condition, value, <operation>)
Do <operation> only if test condition is true
if (test condition, <operation>, value)
Do <operation1> if test condition is true, otherwise do <operation 2>
if (test condition, <operation 1>, <operation 2>)
Do a Marcedit Find/Replace on string1 string2 string3 to transform it into string2; string3; string1
if (isNotNull(value.match(/(regex for string1)(regex for string2)(regex for string3)/)), value.match(/(regex for string1)(regex for string2)(regex for string3)/)[1] + “;” + value.match(/(regex for string1)(regex for string2)(regex for string3)/)[2] + “;” + value.match(/(regex for string1)(regex for string2)(regex for string3)/)[0], value)

Test Condition Cheat Sheet

Example data to be used is
auto
autos
auto mechanics
automechanics
auto mechanics and aviation
planes
trains
automobiles
aviation
aviation and automechanics

NOTE : Test conditions can also be used for custom text faceting

Test Condition
Code
String must exactly match auto
value == “auto”
String can be auto, autos, auto mechanics, automechanics, or auto mechanics and aviation
value.contains("auto")
String can be auto mechanics, automechanics, auto  mechanics, or auto mechanics and aviation.
isNotNull(value.match(/(.*auto.*mechanics.*)/))

String can be auto, autos, planes, trains, or automobiles.
isNull(value.match(/(.*auto.*mechanics.*)/))
String must match auto, autos, or planes. or(isNotNull(value.match(/(^auto[s]/))), value == "planes")
String must start with some variation of auto.
value.startsWith("auto")
Find all of the variations of auto mechanics and aviation
and(isNotNull(value.match(/(.*auto\s*mechanics.*)/)), value.contains("aviation"))
All strings that begin with "a"
value.startsWith("a")
All strings that end with "s"
value.endsWith("s")
String must be less than 4 words
length (split(value, “ “)) < 4
String must contain any substring with variable spacing of “auto mechanics” and must be less than 4 words
and(isNotNull(value.match(/.*auto.*mechanics.*/)), length(split(value, " ")) < 4)
Cell is completely blank
isBlank(value)
Cell is not blank
isNonBlank(value)

Test conditions for dates
Test string is a single range of years, such as 1938-1968. No spaces surrounding the hyphen.

First year must be greater than 1938
toNumber(substring(value, 0,4)) > 1938
First year must equal 1938
toNumber(substring(value, 0,4)) == 1938
First year must be less than 1938
toNumber(substring(value, 0,4)) < 1938
Second year must be less than 1968
toNumber(substring(value,5,9)) < 1968
Second year must be equal 1968
toNumber(substring(value,5,9)) == 1968
Second year must be less greater than 1968
toNumber(substring(value,5,9)) > 1968
Second year is greater than the first
toNumber(substring(value,5,9)) > toNumber(substring(value,0,4))