Open Refine for Librarians: recipes

Showing posts with label recipes. Show all posts

Thursday, April 12, 2018

Recipe: Computing the Millennium checkdigit

More detail will be posted later. I just wanted to note this down.

We needed to extract our course reserves in Millennium into .csv format and the check digit for the item records associated with the instructors was cut off. To further complicate matters, one of our classes had about 106 associated items.

In order to calculate the check digit, you had to do the following calculation:

for item record i1234567, you had to multiply the 7 by 2, the 6 by 3, the 5 by 4, the 4 by 5, the 3 by 6, the 2 by 7, and the 1 by 8. Then, you had to add all those values together, divide by 11, and the remainder was the check digit. Anything that was remainder 10 became check digit x. So, for this item record, the summation is 112. Divide that by 11, and the remainder is 2 - so the full item record number is i12345672.

1. First, I changed the , delimiter right before the item record numbers to } in Notepad++. That way, when I imported to Open Refine, item records i3070797,i3054061,i3054062,i3054064,i3070931,,,,,,,,,,,,,

would all be in the same cell.

2. I split the column of item records into multi-valued cells using the , as a delimiter. My column was named "Column2" for this example.
3. I used add column based on this column and created column "remainder" with this GREL: value.replace ("i","")
This is so I could do the calculation for the check digit in a separate column to make it easier to concatenate later.
3. To do the multiplication and summation calculation, I did a transform on the remainder column and input this GREL:
sum(forEachIndex(reverse(splitByLengths(value,1,1,1,1,1,1,1)),i,v, (i+2)*toNumber(v)))
4. Unfortunately, this inserted a lot of zeros into the fields I wanted to leave blank, so I did another transform on remainder and did this to blank out the zeros: if(value == 0, "", value)
(You could probably just facet on the 0 and blank it out, too). This step has to be done, otherwise the calculation of the remainder will fill in the cells with the zeros with unwanted values.
5. Do another transform on column remainder to calculate the remainder:
mod(value,11)
6. Now concatenate Column2 and remainder together (can be either a transform or a new column):
if(isBlank(value), "",if (cells.remainder.value < 10, value + cells.remainder.value, value + "x"))
(Leaving out the check for isBlank gave me funky values again.)
7. Now remove any extra columns and join multi-valued cells on the concatenated column.

Friday, August 18, 2017

Recipe and walkthrough: counting up repeats of a word in a sentence

How do you count up the occurrences of <word> in a sentence? (or, in this case, find how many times "Nintendo" is in a 538?

sum(forEach(value.split(" "), temp, if(temp.contains("Nintendo"), 1, 0)))

Explanation:

Recipe: Text filtering with regular expressions

An alternative to writing a custom facet, especially if you want to filter based on a single word, is to use Open Refine's text filtering. It has regex capability, which is extremely useful.

For example, I have a file of videogame MARC records. The platforms under system requirements vary in verbiage and spelling. If I wanted to isolate out two platforms, such as all Xbox, Playstation, and Nintendo games, I could just use the text filtering function to find them instead of building a custom facet.

Caveats with the text filtering function - if you have a large file, it may hang. My file unfortunately came out to be about 69,000 rows, which Open Refine didn't like at all when I tried a text filter. I stopped the hanging by switching the view to 10 rows at a time and faceting on the 538 first.

So, to activate the text filter, all I have to do is pull down the menu on the Contents column and select "Text Filter"

A box will pop up in the left hand pane. All I have to do is enter my regular expression for finding all the Xbox and Playstation games: (.*[Nn]intendo.*|.*[Xx]box.*|.*[Pp]laystation.*), check the "regular expressions" box underneath where I entered my regex, and I'm good to go:

Friday, August 4, 2017

Two recipes for extracting the last n words in a string

Method 1: rpartition

Again, I'm going to post the recipe up here, and go into depth below. If you want the last <n> words in any sentence, you can do a transform with:
rpartition(value, /(\s\S+){<n>}$/)[1]

Be sure that all your data is more than <n> words. If you have sentences that are exactly n words, you'll get a null. You can either custom facet out that data with a :
length(value.split(" ")) ><n>

(what the length GREL does is described in this post.)

or use an if function:
if (length(value.split(" ")) ><n>, rpartition(value, /(\s\S+){<n>}$/)[1], value)

Method 2: Array arithmetic with split
Since you can access any element in split() by using the expression split()[<index>], you can use length(value.split(" ")) to calculate the indexes you need.

To get the last <n> words in any sentence, do a transform with:
value.split(" ")[length(value.split(" "))-<n>] + " " +
value.split(" ")[length(value.split(" "))-<n-1>] + " " +
value.split(" ")[length(value.split(" "))-<n-2>] + " " +
value.split(" ")[length(value.split(" "))-<n-3>] +
etc. until <n-whatever> = 1

ex. If I want the last 4 words in any sentence, my transform GREL would be:
value.split(" ")[length(value.split(" "))-4] + " " +
value.split(" ")[length(value.split(" "))-3] + " " +
value.split(" ")[length(value.split(" "))-2] + " " +
value.split(" ")[length(value.split(" "))-1]

Again, you will get unexpected results if your sentence is less than n words because negative indexes will wrap around. Either facet the short sentences away or use an if function:

if (length(value.split(" ")) ><n>, value.split(" ")[length(value.split(" "))-<n>] + " " + value.split(" ")[length(value.split(" "))-<n-1>] + " " + value.split(" ")[length(value.split(" "))-<n-2>] + " " + value.split(" ")[length(value.split(" "))-<n-3>] + <etc.>, value)

How to calculate the number of words in a sentence and how to use that in facets/the if function

I had a set of data where I needed to only process sentences that were more than a certain number of words. Fortunately, doing so is relatively easy.

First of all, the words in a sentence can be split up into an array using value.split(" ").

Secondly, the length() function in Open Refine returns the length, or total number of elements in an array. Since each element is a word, put these two together and the number of words will be calculated:

Using splitByCharType to extract years of coverage in an 856 field

GREL's splitByCharType() is actually quite a useful function. It takes a string, splits it up, and then groups consecutive like characters together in an array. Some of the more common like characters are all lowercase letters, all uppercase letters, all numbers, and all spaces. Special characters and punctuation are only grouped together with the exact same character (e.g. $$ would be grouped together, but $! would not.)

For example, for the string: $a300 pages ;$c20 cm, it'll fill the array this way:
Index                Element
0                        $
1                         a
2                        300
3                        <space>
4                        pages
5                         <space>
6                         ;
7                        $
8                         c
9                        20
10                     <space>
11                      cm

Using the Marcedit capturing group Find/Replace in Open Refine, pt.2 : putting it into an if function.

This is a continuation from this post about how to mimic Marcedit capturing group Find/Replace in Open Refine.

The one problem with mimicking the Marcedit Find/Replace is that it will wipe out the data you don't want to alter unless you facet it out first. This is kind of cumbersome and still doesn't have the convenience of Marcedit's Find/Replace.

However, the if function takes care of that problem. I'm going to post the recipe here at the top and then go into detail further down. So, for your capturing group find of value.match(/<capturing group regex>/), do a transform cells, and drop it into this statement:

if(isNotNull(value.match(/<your find>/)), <your replace>, value)

How to do a Marcedit Find/Replace with capturing groups in Open Refine

One of the things that's really frustrated me with Open Refine is that there didn't seem to be a way to mimic the capturing group rearranging from the Find/Replace function in Marcedit.

For example, in this import of Marcedit records, some of the subfield a's have an indicator in front of them:

Click on image to enlarge

In Marcedit, I'd normally do this:


click on image to enlarge

But I couldn't figure out how to do the equivalent in Open Refine. value.match seemed like a contender, but I couldn't figure out how to access the array elements. I finally figured it out today --

How to remove duplicate rows

Here's a pointer to a page that gives detailed instructions on how to remove duplicate data: http://kb.refinepro.com/2011/08/remove-duplicate.html

Tuesday, July 18, 2017

Using facets to debug search differences for large files

I was using Millennium's Create List search, but then I needed to tweak it a bit. The problem I had is that the file was rather huge (15K worth of records) and it was going to be a bear to figure out which records had fallen out of the old search and which new records had been picked up.

Usually, I would dump both lists of record numbers into Excel, highlight the duplicate values, and scroll through, but that wasn't feasible for quickly checking 15K records.

Instead, I was able to use Open Refine's faceting ability to help me filter through the data. I dumped the record numbers from the old search into one column in Excel (labeled "Column 1"), dumped the record numbers from the new search into a second column (labeled "Column 2"), saved, and then sent the sheet over to Open Refine.

I hit the drop down menu on any column and chose Facet->Custom Text Facet.

Then I entered this GREL:

cells["Column 1"].value == cells["Column 2"].value

Then I clicked on "false" in the facet to see what wasn't matching. Note that you have to use the object cells and not cell. I should also note that this expression isn't too useful if the two columns are wildly out of sync. It's mostly useful for needle in a haystack situations.

forRange cheat sheet

Click to enlarge images:

If structure cheat sheet

Updated 8/24/17

If structure cheat sheet

<operation> can be any function – reprinting a cell value, concatenating cells values, any string processing function listed in the wiki, and even a foreach, with, or another if function.

Do <operation> only if test condition is false	if (test condition, value, <operation>)
Do <operation> only if test condition is true	if (test condition, <operation>, value)
Do <operation1> if test condition is true, otherwise do <operation 2>	if (test condition, <operation 1>, <operation 2>)
Do a Marcedit Find/Replace on string1 string2 string3 to transform it into string2; string3; string1	if (isNotNull(value.match(/(regex for string1)(regex for string2)(regex for string3)/)), value.match(/(regex for string1)(regex for string2)(regex for string3)/)[1] + “;” + value.match(/(regex for string1)(regex for string2)(regex for string3)/)[2] + “;” + value.match(/(regex for string1)(regex for string2)(regex for string3)/)[0], value)

Test Condition Cheat Sheet

Example data to be used is
auto
autos
auto mechanics
automechanics
auto mechanics and aviation
planes
trains
automobiles
aviation
aviation and automechanics

NOTE : Test conditions can also be used for custom text faceting

Test Condition	Code
String must exactly match auto	value == “auto”
String can be auto, autos, auto mechanics, automechanics, or auto mechanics and aviation	value.contains("auto")
String can be auto mechanics, automechanics, auto mechanics, or auto mechanics and aviation.	isNotNull(value.match(/(.auto.mechanics.*)/))
String can be auto, autos, planes, trains, or automobiles.	isNull(value.match(/(.auto.mechanics.*)/))
String must match auto, autos, or planes.	or(isNotNull(value.match(/(^auto[s]/))), value == "planes")
String must start with some variation of auto.	value.startsWith("auto")
Find all of the variations of auto mechanics and aviation	and(isNotNull(value.match(/(.auto\smechanics.*)/)), value.contains("aviation"))
All strings that begin with "a"	value.startsWith("a")
All strings that end with "s"	value.endsWith("s")
String must be less than 4 words	length (split(value, “ “)) < 4
String must contain any substring with variable spacing of “auto mechanics” and must be less than 4 words	and(isNotNull(value.match(/.auto.mechanics.*/)), length(split(value, " ")) < 4)
Cell is completely blank	isBlank(value)
Cell is not blank	isNonBlank(value)

Test conditions for dates

Test string is a single range of years, such as 1938-1968. No spaces surrounding the hyphen.

First year must be greater than 1938	toNumber(substring(value, 0,4)) > 1938
First year must equal 1938	toNumber(substring(value, 0,4)) == 1938
First year must be less than 1938	toNumber(substring(value, 0,4)) < 1938
Second year must be less than 1968	toNumber(substring(value,5,9)) < 1968
Second year must be equal 1968	toNumber(substring(value,5,9)) == 1968
Second year must be less greater than 1968	toNumber(substring(value,5,9)) > 1968
Second year is greater than the first	toNumber(substring(value,5,9)) > toNumber(substring(value,0,4))

Open Refine for Librarians

Thursday, April 12, 2018

Recipe: Computing the Millennium checkdigit

Friday, August 18, 2017

Recipe and walkthrough: counting up repeats of a word in a sentence

Thursday, August 17, 2017

Recipe: Text filtering with regular expressions

Friday, August 4, 2017

Two recipes for extracting the last n words in a string

Monday, July 31, 2017

How to calculate the number of words in a sentence and how to use that in facets/the if function

Using splitByCharType to extract years of coverage in an 856 field

Friday, July 28, 2017

Using the Marcedit capturing group Find/Replace in Open Refine, pt.2 : putting it into an if function.

Thursday, July 27, 2017

How to do a Marcedit Find/Replace with capturing groups in Open Refine

Friday, July 21, 2017

How to remove duplicate rows

Tuesday, July 18, 2017

Using facets to debug search differences for large files

I hit the drop down menu on any column and chose Facet->Custom Text Facet.

Then I entered this GREL:

cells["Column 1"].value == cells["Column 2"].value

forRange cheat sheet

If structure cheat sheet

Favorite Sci-Fi Shows

Contributors

Labels

Blog Archive