Monday, July 3, 2017

Introduction to Arrays

Update 7/28/17: I've fixed up this post because when I first wrote it, I didn't realize what Open Refine's convention was for naming arrays.

You've probably heard a lot about arrays when working with Open Refine, but what exactly is an array?

First of all, I need to explain how software processes data before I can explain the array functionality. When any sort of software needs to process data, it has to break it down into chunks that it can understand. The chunks live in the computer's equivalent of headspace -- much like when you're trying to figure things out in your head.

However, a computer can only process data using certain structures. There are several of these data structures, but I'm just going to cover the array.




An array is basically a set of boxes, with each box containing some sort of data. Each box has a numeric label, which in programmer speak is called an index. The data inside a box is called an element. The unusual thing about arrays in most programming languages is that the first index is 0, instead of 1.

For example, here's a sample array:

Index:   0      | 1       | 2      | 3     | 4
Element: apples | oranges | apples | kiwis | grapefruit

Arrays have to have names, so let's call this one fruit_box.

Now here's something I have to emphasize -- computers are stupid. If you saw the same fruit arranged in boxes, and I asked which ones have the apples, you'd say, "The first and the third box."

However, here's a fundamental difference between humans and computers -- we would put the fruit in open boxes so we can see everything at a glance. Computers won't do so.

Instead, the computer stuck the first bunch of apples in a box labelled 0 and closed the lid, stuck the oranges in a box labelled 1 and closed the lid, stuck the second set of apples in a box labelled 2 and closed the lid, put the kiwis in box 3 and closed the lid, and put the grapefruit in box 4 and closed that lid too.

In the computer's headspace, it has no clue which boxes have the apples. All it knows is that it has a bunch of boxes (or indexes) labelled 0 through 4. The only way it can find out what's inside a box is to open the lid and look. It closes the lid right after it looks, and unless you tell it otherwise, it will also forget whatever was in the box it just looked inside.

So, to do the same one-glance human operation of finding out which boxes have apples, this is what the computer has to do: it opens box 0 and checks for apples. If it sees apples and you didn't tell it to pull them out, it'll just close the lid. So, you have to tell it to pull out the apples (or pull out the element).

The computer does so and closes the lid. Then it will check box 1 for apples and then close the lid. Then it will go on to box 2, find the apples and again, unless you told it to do something with the apples, it'll just close the lid. So again, you tell it to pull out the apples and it does so and closes the lid.

However, since the computer doesn't have open boxes, it won't stop after it finds the apples in box 2. It wants to be thorough, so it'll go on to check box 3 and box 4 for apples, closing the lid after each check. You can't tell it to stop after finding all the apples, because it has no idea what's in box 3 or 4.

In the programming world, "checking the box" is called "getting (or returning) the element from the array".

Most documentation out there doesn't draw the boxes as I did above. Instead, in standard documentation, this is what the fruit_box array would be written as:
[apples, oranges, apples, kiwis, grapefruit]

The command to tell it to check the first box (a.k.a. getting the first element) is written as:
fruit_box[0]

The syntax is <name of array>[<index number>]. Invoking this will return the value of the element in fruit_box[0], which is apples.

fruit_box[1] will return the value oranges, etc.
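Here's a rough sketch of the same boxes in Python rather than GREL. Python is only my choice for illustration (it happens to use the same zero-based, square-bracket indexing), and the fruit names are written as quoted strings because Python needs them that way.

```python
# The fruit_box array from above, written as a Python list.
fruit_box = ["apples", "oranges", "apples", "kiwis", "grapefruit"]

first = fruit_box[0]   # open box 0 and get its element
second = fruit_box[1]  # open box 1 and get its element

print(first)           # apples
print(second)          # oranges
print(len(fruit_box))  # 5 boxes, labelled 0 through 4
```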

If you're having problems manipulating arrays, I recommend going back and drawing the boxes with the indexes.
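The box-by-box apple hunt described earlier can be sketched like this. Again, this is Python for illustration only, not GREL, and apple_boxes is a name I made up for the place where the computer keeps what you told it to pull out.

```python
fruit_box = ["apples", "oranges", "apples", "kiwis", "grapefruit"]

# Hypothetical list for remembering which boxes had apples.
# Without it, the computer forgets each box as soon as the lid closes.
apple_boxes = []

# Open every box in order, 0 through 4. The computer can't peek ahead,
# so it keeps going even after the last apples have been found.
for index in range(len(fruit_box)):
    element = fruit_box[index]     # open the lid and look
    if element == "apples":
        apple_boxes.append(index)  # pull out what we were told to keep

print(apple_boxes)  # [0, 2]
```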

How does this apply to GREL?
Say you have a cell with these delimited subject headings:
Santa Cruz; California; Santa Cruz Beach Boardwalk; Surfing

If you choose Edit column -> add column based on this column and enter
value.split("; "), you create this array:
value.split("; ") =
[Santa Cruz, California, Santa Cruz Beach Boardwalk, Surfing]

or:
Index:   0          | 1          | 2                          | 3
Element: Santa Cruz | California | Santa Cruz Beach Boardwalk | Surfing

We'll get back to the naming convention later.

I know that the Open Refine preview shows the array with double quotes around each element. As far as I can tell, that's a cosmetic thing.
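For comparison, here is a small Python sketch of the same split. It is not GREL, but Python's split method behaves the same way for this example. When Python prints the resulting list it also shows quotes around each element; in Python those quotes just mark each element as a string and are not part of the data.

```python
value = "Santa Cruz; California; Santa Cruz Beach Boardwalk; Surfing"

# Python's str.split stands in for GREL's value.split("; ") here.
headings = value.split("; ")

# Print each index next to the element stored in that box.
for index, element in enumerate(headings):
    print(index, element)

print(headings)  # the whole array, shown with quotes around each string
```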

Now, the slice function is listed on the Open Refine wiki as "slice (array a, number from, optional number to). Returns the sub-array of a from its index from up to but not including index to. If to is omitted, it is understood to be the end of the array a. "

Let's unpack this -- the slice function is basically creating a smaller chunk of the value.split array. It can be any chunk, but it must be a contiguous one. I can't do a slice and end up with [Santa Cruz, Santa Cruz Beach Boardwalk], for example, because that would skip over California.

array a just means the name of the array you want to manipulate. In this case, it's the one you just created, value.split("; ").

Number from - This is the index number of the first element you want to keep. For example, if I wanted to get rid of Santa Cruz, I would put 1 as the number from, since I want to keep "California". If I wanted to keep "Santa Cruz Beach Boardwalk" and "Surfing", but ditch "Santa Cruz" and "California", I would use 2 as the number from. If I just wanted to keep "Surfing", I would use 3 as the number from.

optional number to - What they're saying is, if you want to just slice off the first part of the array and keep the rest, you don't need to use this number. So, if I wanted to just get rid of "Santa Cruz" from my array, I would use slice(value.split("; "), 1).

On the other hand, if I wanted just "California" and "Santa Cruz Beach Boardwalk", I would fill in the number to field. What they want you to use is the index number where you want to stop, plus 1. E.g. I want the last element of my slice to be "Santa Cruz Beach Boardwalk". That's index 2, but the number to requires index + 1, so the slice command to generate [California, Santa Cruz Beach Boardwalk] is slice(value.split("; "), 1, 3).
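To check the from and to numbers, here's a minimal Python sketch. slice_array is a hypothetical helper I wrote to mirror the wiki's description; it is not OpenRefine's own code, and it leans on Python's built-in slicing, which also stops just before the to index.

```python
def slice_array(a, number_from, number_to=None):
    """Hypothetical stand-in for slice(array a, number from, optional number to)."""
    # Leaving number_to as None means "keep going to the end of the array".
    return a[number_from:number_to]

headings = ["Santa Cruz", "California", "Santa Cruz Beach Boardwalk", "Surfing"]

without_first = slice_array(headings, 1)  # drop "Santa Cruz", keep the rest
middle_two = slice_array(headings, 1, 3)  # indexes 1 and 2; index 3 is excluded
only_last = slice_array(headings, 3)      # just "Surfing"

print(without_first)
print(middle_two)
print(only_last)
```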

I've generated an array using split and I can see it in the preview. How come all I see is a blank column once I click okay? 
It's because the array is all in the computer headspace.

A good analogy would be when you try to calculate a tip in your head. You could figure out the amount, but if you wanted to add more money to the tip, you wouldn't doodle out the amounts on your receipt and do the math; you'd still do it all in your head.

It's the same with OpenRefine - once you do the split, until you actually access the contents, it never makes it onto the spreadsheet. One way to access the contents is to chain another command onto your split.

So, with the split command I detailed above, you can't use add another column to create a new column with value.split("; ") and then go to new column->edit column->add another column to do a slice(1). The new column would already be blank, so there would be nothing left to slice. Instead, you'd have to do:

<column you want to process>->edit column->add another column
value.split("; ").slice(1)

I did the above and I'm still getting a blank column once I hit okay.

You're not done yet. Going back to the tip analogy: you could calculate as much of a tip as you want, but until you write it on the receipt, it's purely theoretical. If you don't write it out, and just leave, the tip never becomes a reality.

Same thing in Open Refine. Until you add a join command to output the contents of the array, all that manipulation is purely theoretical and it never makes it onto the spreadsheet.

So, if I wanted to remove the first subject heading from my delimited subject list, here's the GREL command I would use:
<column you want to process>->edit column->add another column
value.split("; ").slice(1).join("; ")
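The same split, slice, join chain can be sketched in Python. This is an analogy rather than GREL: Python writes the slice as [1:] and puts the delimiter in front of join, but the three steps are the same.

```python
value = "Santa Cruz; California; Santa Cruz Beach Boardwalk; Surfing"

parts = value.split("; ")  # step 1: break the cell into an array
kept = parts[1:]           # step 2: slice off the first element
result = "; ".join(kept)   # step 3: write the array back out as text

print(result)  # California; Santa Cruz Beach Boardwalk; Surfing
```

Only the last step produces a plain string, which is the kind of thing that can actually land in a cell.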

You could also directly access the individual elements of the array, and they'll output to the spreadsheet.

How do you do that? I don't see an array name.

Unlike other software, you don't access array elements in Open Refine by using <array name>[index]. Instead, you access an element by writing out the entire GREL expression that generated the array, followed by the index in square brackets.

Ex. value.split("; ")[0] will output Santa Cruz to the new column. value.split("; ")[0] + "--" + value.split("; ")[1] will output Santa Cruz--California to the spreadsheet.
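As a quick check of those two expressions, here is how they look in Python (for illustration only; the syntax happens to be identical for this case). Each result is a single string rather than an array, which is why it can be written straight into a cell.

```python
value = "Santa Cruz; California; Santa Cruz Beach Boardwalk; Surfing"

# Index straight into the expression that produces the array.
first = value.split("; ")[0]

# Glue two elements together with "--" between them.
combined = value.split("; ")[0] + "--" + value.split("; ")[1]

print(first)     # Santa Cruz
print(combined)  # Santa Cruz--California
```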

You can do the same for value.match, value.splitByCharType, value.splitByLengths, value.smartSplit, value.partition, and value.rpartition.

What happens if I do value.split("; ")[-1]? 
A negative index counts backward from the end of the array. In other words, value.split("; ")[-1] will access the last element in the array, value.split("; ")[-2] will access the second-to-last element, etc. The count keeps going down until it reaches index -n (the first element of an array with n elements), and then it stops there.

If my value.split("; ") array was:
[Santa Cruz, California, Santa Cruz Beach Boardwalk, Surfing]

value.split("; ")[-1] = Surfing
value.split("; ")[-2] = Santa Cruz Beach Boardwalk
value.split("; ")[-3] = California
value.split("; ")[-4] = Santa Cruz
value.split("; ")[-5] = Santa Cruz
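Here's a small Python sketch that models the negative-index behavior listed above. get_element is a hypothetical helper based only on the results shown in this post, not on OpenRefine's source code. I wrote it out by hand because a plain Python list raises an IndexError for [-5] on a four-element array instead of stopping at the first element.

```python
def get_element(array, index):
    """Hypothetical model of the negative-index results listed in this post."""
    if index < 0:
        index = len(array) + index  # -1 becomes the last index, -2 the one before it
    index = max(0, index)           # anything past -n stops at the first box
    return array[index]

headings = ["Santa Cruz", "California", "Santa Cruz Beach Boardwalk", "Surfing"]

for i in (-1, -2, -3, -4, -5):
    print(i, get_element(headings, i))
```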
