I had a set of data where I needed to only process sentences that were more than a certain number of words. Fortunately, doing so is relatively easy.
First of all, the words in a sentence can be split up into an array using value.split(" ").
Secondly, the length() function in Open Refine returns the length, or total number of elements in an array. Since each element is a word, put these two together and the number of words will be calculated:
Monday, July 31, 2017
Using splitByCharType to extract years of coverage in an 856 field
GREL's splitByCharType() is actually quite a useful function. It takes a string, splits it up, and then groups consecutive like characters together in an array. Some of the more common like characters are all lowercase letters, all uppercase letters, all numbers, and all spaces. Special characters and punctuation are only grouped together with the exact same character (e.g. $$ would be grouped together, but $! would not.)
For example, for the string: $a300 pages ;$c20 cm, it'll fill the array this way:
Index Element
0 $
1 a
2 300
3 <space>
4 pages
5 <space>
6 ;
7 $
8 c
9 20
10 <space>
11 cm
For example, for the string: $a300 pages ;$c20 cm, it'll fill the array this way:
Index Element
0 $
1 a
2 300
3 <space>
4 pages
5 <space>
6 ;
7 $
8 c
9 20
10 <space>
11 cm
Friday, July 28, 2017
Custom faceting using booleans
So, here's some sample data:
For the editing I want to do, I would like only the lines with v.<number>(<year>)-v<number>(<year>)
Since all of the notations vary, I tried the custom facet:
value.startsWith("v")
The problem is that lines containing data like this weren't excluded:
v.34(2008)-;v.1(1974)-v.26(2000)
Using the Marcedit capturing group Find/Replace in Open Refine, pt.2 : putting it into an if function.
This is a continuation from this post about how to mimic Marcedit capturing group Find/Replace in Open Refine.
The one problem with mimicking the Marcedit Find/Replace is that it will wipe out the data you don't want to alter unless you facet it out first. This is kind of cumbersome and still doesn't have the convenience of Marcedit's Find/Replace.
However, the if function takes care of that problem. I'm going to post the recipe here at the top and then go into detail further down. So, for your capturing group find of value.match(/<capturing group regex>/), do a transform cells, and drop it into this statement:
if(isNotNull(value.match(/<your find>/)), <your replace>, value)
The one problem with mimicking the Marcedit Find/Replace is that it will wipe out the data you don't want to alter unless you facet it out first. This is kind of cumbersome and still doesn't have the convenience of Marcedit's Find/Replace.
However, the if function takes care of that problem. I'm going to post the recipe here at the top and then go into detail further down. So, for your capturing group find of value.match(/<capturing group regex>/), do a transform cells, and drop it into this statement:
if(isNotNull(value.match(/<your find>/)), <your replace>, value)
Unpacking value.match with reg ex and how to use it in boolean statements
The value.match documentation on the Open Refine github wiki is a bit daunting, so I'm going to try to break it down in this post.
Some programming speak first: GREL functions usually yield some sort of data after they're done doing their thing. That data is called a return value. So, that's what I'm referring to whenever I make statements like, "match returns <whatever>".
Anyway, the biggest hurdle to my understanding was figuring out how match and regular expressions worked together. Here's what I've determined:
Unlike Marcedit, you have to account for the whole line if you're using regex in match. Thursday, July 27, 2017
How to do a Marcedit Find/Replace with capturing groups in Open Refine
One of the things that's really frustrated me with Open Refine is that there didn't seem to be a way to mimic the capturing group rearranging from the Find/Replace function in Marcedit.
For example, in this import of Marcedit records, some of the subfield a's have an indicator in front of them:
In Marcedit, I'd normally do this:
But I couldn't figure out how to do the equivalent in Open Refine. value.match seemed like a contender, but I couldn't figure out how to access the array elements. I finally figured it out today --
For example, in this import of Marcedit records, some of the subfield a's have an indicator in front of them:
Click on image to enlarge |
In Marcedit, I'd normally do this:
click on image to enlarge |
But I couldn't figure out how to do the equivalent in Open Refine. value.match seemed like a contender, but I couldn't figure out how to access the array elements. I finally figured it out today --
Comments are now enabled and a label widget has been added.
I didn't realize that comments were disabled for non-Google users. I've changed that, so comment away if you need clarification on my posts or if you would like me to cover something.
I also didn't realize that a label widget wasn't displaying. So I fixed that.
I also didn't realize that a label widget wasn't displaying. So I fixed that.
Wednesday, July 26, 2017
Intro to Open Refine
In the spirit of keeping everything all in one place, here's a pointer to a workshop about getting started with Open Refine:
https://www.youtube.com/watch?v=wGVtycv3SS0
https://www.youtube.com/watch?v=wGVtycv3SS0
Friday, July 21, 2017
How to import and export Marcedit files into Open Refine
Updated 8/22/17 - I noticed older versions of Refine imported the 005 and the 008 as a number instead of string. (You can tell that it's imported as a number if it's in green) There's a tweak for that, too.
Updated 8/14/17 : to take care of the non-Roman characters issue and added a couple of if statements if you use my column transform.
I was looking for Open Refine articles and I found a super-informative post on how to import and export Marcedit files into Open Refine.
Updated 8/14/17 : to take care of the non-Roman characters issue and added a couple of if statements if you use my column transform.
I was looking for Open Refine articles and I found a super-informative post on how to import and export Marcedit files into Open Refine.
How to remove duplicate rows
Here's a pointer to a page that gives detailed instructions on how to remove duplicate data: http://kb.refinepro.com/2011/08/remove-duplicate.html
Thursday, July 20, 2017
Objects, pt. 2
If you haven't read part 1 of this tutorial here, do so before you proceed further.
When you open a project in Open Refine, the entire spreadsheet is broken up into rows. Each cell in the row is mapped into a cell object. What comes next is my best guess because I couldn't find documentation -- I believe that the cell objects are initialized into an array and that becomes part of object CellTuple.
Or, in pseudo-code:
object CellTuple
array list = {array of cell objects for the current row}
Tuesday, July 18, 2017
Using facets to debug search differences for large files
I was using Millennium's Create List search, but then I needed to tweak it a bit. The problem I had is that the file was rather huge (15K worth of records) and it was going to be a bear to figure out which records had fallen out of the old search and which new records had been picked up.
Usually, I would dump both lists of record numbers into Excel, highlight the duplicate values, and scroll through, but that wasn't feasible for quickly checking 15K records.
Instead, I was able to use Open Refine's faceting ability to help me filter through the data. I dumped the record numbers from the old search into one column in Excel (labeled "Column 1"), dumped the record numbers from the new search into a second column (labeled "Column 2"), saved, and then sent the sheet over to Open Refine.
Then I clicked on "false" in the facet to see what wasn't matching. Note that you have to use the object cells and not cell. I should also note that this expression isn't too useful if the two columns are wildly out of sync. It's mostly useful for needle in a haystack situations.
Usually, I would dump both lists of record numbers into Excel, highlight the duplicate values, and scroll through, but that wasn't feasible for quickly checking 15K records.
Instead, I was able to use Open Refine's faceting ability to help me filter through the data. I dumped the record numbers from the old search into one column in Excel (labeled "Column 1"), dumped the record numbers from the new search into a second column (labeled "Column 2"), saved, and then sent the sheet over to Open Refine.
I hit the drop down menu on any column and chose Facet->Custom Text Facet.
Then I entered this GREL:
cells["Column 1"].value == cells["Column 2"].value
Cleaning out old Open Refine projects
This page has a nice guide for cleaning out old Open Refine projects: http://kb.refinepro.com/2012/01/delete-multiple-project-at-once.html
If structure cheat sheet
Updated 8/24/17
If structure cheat sheet
If structure cheat sheet
<operation> can be any function – reprinting a cell
value, concatenating cells values, any string processing function listed in the
wiki, and even a foreach, with, or another if function.
Do <operation> only if test
condition is false
|
if (test condition, value,
<operation>)
|
Do <operation> only if test
condition is true
|
if (test condition,
<operation>, value)
|
Do <operation1> if test
condition is true, otherwise do <operation 2>
|
if (test condition, <operation
1>, <operation 2>)
|
Do a Marcedit Find/Replace on string1 string2 string3 to transform
it into string2; string3; string1
|
if (isNotNull(value.match(/(regex for string1)(regex for string2)(regex
for string3)/)), value.match(/(regex
for string1)(regex for string2)(regex for string3)/)[1] + “;” + value.match(/(regex for string1)(regex for string2)(regex
for string3)/)[2] + “;” + value.match(/(regex for string1)(regex for string2)(regex for string3)/)[0],
value)
|
Test Condition Cheat Sheet
Example data to be used is
auto
autos
auto mechanics
automechanics
auto mechanics and aviation
planes
trains
automobiles
aviation
aviation and automechanics
auto
autos
auto mechanics
automechanics
auto mechanics and aviation
planes
trains
automobiles
aviation
aviation and automechanics
NOTE : Test
conditions can also be used for custom text faceting
Test Condition
|
Code
|
String must exactly match auto
|
value == “auto”
|
String can be auto, autos, auto mechanics, automechanics, or auto mechanics and aviation
|
value.contains("auto")
|
String can be auto mechanics, automechanics, auto mechanics, or auto mechanics and aviation.
|
isNotNull(value.match(/(.*auto.*mechanics.*)/))
|
String can be auto, autos, planes, trains, or automobiles.
|
isNull(value.match(/(.*auto.*mechanics.*)/))
|
String must match auto, autos, or planes. | or(isNotNull(value.match(/(^auto[s]/))), value == "planes") |
String must start with some variation of auto.
|
value.startsWith("auto")
|
Find all of the variations of auto mechanics and aviation
|
and(isNotNull(value.match(/(.*auto\s*mechanics.*)/)), value.contains("aviation"))
|
All strings that begin with "a" |
value.startsWith("a")
|
All strings that end with "s"
|
value.endsWith("s")
|
String must be less than 4 words
|
length (split(value, “ “)) < 4
|
String must contain any substring with variable spacing of “auto mechanics”
and must be less than 4 words
|
and(isNotNull(value.match(/.*auto.*mechanics.*/)),
length(split(value, " ")) < 4)
|
Cell is completely blank
|
isBlank(value)
|
Cell is not blank
|
isNonBlank(value)
|
Test conditions for dates
Test string is a single range of years, such as 1938-1968.
No spaces surrounding the hyphen.
First year must be greater than 1938
|
toNumber(substring(value, 0,4)) > 1938
|
First year must equal 1938
|
toNumber(substring(value, 0,4)) == 1938
|
First year must be less than 1938
|
toNumber(substring(value, 0,4)) < 1938
|
Second year must be less than 1968
|
toNumber(substring(value,5,9)) < 1968
|
Second year must be equal 1968
|
toNumber(substring(value,5,9)) == 1968
|
Second year must be less greater than 1968
|
toNumber(substring(value,5,9)) > 1968
|
Second year is greater than the first
|
toNumber(substring(value,5,9)) > toNumber(substring(value,0,4))
|
Labels:
cheat sheet,
if,
recipes,
test conditions for if
Subscribe to:
Posts (Atom)