1907-1913
1910-1911
1908-1911
etc.
The cells have to be converted to:
1907; 1908; 1909; 1910; 1911; 1912; 1913
1910; 1911
1908; 1909; 1910; 1911
etc.
Open Refine can be used to convert this. Using add column based on this column, you would use the following GREL command:
forRange(toNumber(substring(value, 0,4)), toNumber(substring(value, 5,9)) + 1, 1, currentYear, currentYear).join("; ")
Yes, it looks daunting. But let's break it down, starting with the forRange command:
forRange (<starting value>,
<ending value>, <integer to increment by>, <variable
name>, <do something to variable>)
forRange is an example of a function that loops. In English, a loop has a defined starting condition, a test, and an exit condition. It will execute commands until the exit condition is reached. How it does so is by testing the starting condition after each iteration. Once the starting condition agrees with the exit condition, the loop stops. Most tests are mathematical, and forRange is one of them.
In short, forRange takes the starting integer value and increments it by one as long as it's less than the ending value. For each value, a function is executed.
In short, forRange takes the starting integer value and increments it by one as long as it's less than the ending value. For each value, a function is executed.
Here's a detailed walkthrough of forRange:
1. forRange will take <starting value> and copy it into <variable name>. A variable is a way for the computer to store and manipulate data. A variable name is basically a human readable label for a variable. It can be anything, but best practice is to not name it "variable" and to make it meaningful (so other people can figure out what you're manipulating)
And why does Open Refine need to copy it in the first place? This is a programming best practice. When you want to manipulate data, you always make a copy of the thing you want to manipulate, in case you need the original data for something else.
And why does Open Refine need to copy it in the first place? This is a programming best practice. When you want to manipulate data, you always make a copy of the thing you want to manipulate, in case you need the original data for something else.
2. <variable name> is evaluated to see if it's less than <ending value>.
3a. If it is, forRange exits and returns the temporary array.
3b. If
not, <do something to variable> is executed.
"do something" can be any GREL command. You could simply output the value, you could do some text manipulation, you could do array manipulation, and you could even drop in a conditional function.
"do something" can be any GREL command. You could simply output the value, you could do some text manipulation, you could do array manipulation, and you could even drop in a conditional function.
4. Whatever's generated by that expression is then stuffed into the first empty
element in a temporary array.
5.<variable name> is then incremented by <integer to increment by>.
6. Repeat steps 2-5 until <variable name> equals <ending value>.
7. When <variable name> equals <ending value> forRange exits
immediately and doesn't execute for <ending value> because the test is
less than, not less than or equals to. Therefore, make sure that ending value =
<the last value you want to process> + 1
Okay, that's theoretically great. How does this apply to actual data?
Example:
I want to generate the
string: 1903; 1904; 1905
To do this, I first tried:
forRange(1903, 1906, 1, currentYear, currentYear
+ ";").join(" ")
1. Notice that the end date is 1906 and not 1905. You will see the reason for this in steps 10-15.
2. Open Refine copies
1903 into currentYear, then it tests to see if currentYear < 1906. currentYear
= 1903, so forRange starts to execute.
3. Expression currentYear
+ ";" is evaluated - that creates the string 1903;
currentYear remains 1903
because the <do something to currentYear> expression doesn't overwrite
the variable.
4. 1903; is stuffed into
a temporary array. Array is now ["1903;"]
5. currentYear is still
1903
6. forRange increments currentYear
by 1, so currentYear becomes 1904.
7. currentYear < 1906
is tested. It's true, so forRange continues to execute.
8. Expression currentYear
+ ";" is evaluated - that creates the string 1904;
9. 1904; is stuffed into
the temporary array. Element 0 is full, so it's copied into element 1, instead.
Array is now ["1903;", "1904;"]
10. forRange increments currentYear by 1, so currentYear becomes 1905.
11. currentYear < 1906 is tested. It's true, so forRange continues to execute.
12. Expression currentYear
+ ";" is evaluated - that creates the string 1905;
13. 1905; is stuffed
into the temporary array. Element 0-1 are full, so it's copied into element 2,
instead. Array is now ["1903;", "1904;", "1905;"]
14. forRange increments currentYear by 1, so currentYear becomes 1906.
15. Expression currentYear
< 1906 is evaluated. 1906 < 1906 is false, so the temporary array
["1903;", "1904;", "1905;"] is returned.
16. Open Refine will
just hang onto the array until you execute join(" ").
17. As you noticed, when you do forRange(1903, 1906,
1, currentYear, currentYear + ";").join(" "), 1903; 1904;
1905; is printed to the cell. Not really desirable. What would work better is
to just stuff currentYear into the array and then insert "; " as a
delimiter in the join. Or:
forRange(1903, 1906, 1, currentYear, currentYear).join("; ")
But I have variable data. How does forRange help?
Easy. You use a GREL function on value to generate the
starting and ending values.
Since all of my strings are <4 character year>-<4 character year>, I'm going to use the substring function to extract the starting and ending years to use in forRange.
Specifically:
<starting value> = substring(value,0,4)
<ending value> = substring(value,5,9) +1
(remember that the ending value has to be one year later because otherwise the loop will exit before the last year in the range is processed)
Plugging everything in, I come up with this GREL:
forRange(substring(value, 0,4)), substring(value, 5,9) + 1, 1, currentYear, currentYear).join("; ")
It should work, right?
Wrong. It generates an
error because the substring of something is counted as a string. In order for the looping to work, forRange has to be able to increment an integer.
Fortunately, GREL can
convert a string to an integer using toNumber(<integer or number string or
something that generates an integer or number string>)
Therefore:
<starting value> = toNumber((substring(value, 0,4))
<ending value> = toNumber(substring(value, 5,9)) + 1
Now, drop this into forRange:
forRange(toNumber(substring(value, 0,4)), toNumber(substring(value, 5,9)) + 1, 1, currentYear, currentYear).join("; ")
this one will work, too:
forRange(toNumber(value.substring(0,4)), toNumber(value.substring(5,9)) + 1, 1, currentYear, currentYear).join("; ")
No comments:
Post a Comment