Thursday, July 20, 2017

Objects, pt. 2



If you haven't read part 1 of this tutorial, do so before you proceed further.

When you open a project in Open Refine, the entire spreadsheet is broken up into rows. Each cell in the row is mapped into a cell object. What comes next is my best guess because I couldn't find documentation -- I believe that the cell objects are initialized into an array and that becomes part of object CellTuple.

Or, in pseudo-code:

object CellTuple
         array list = {array of cell objects for the current row}



Row object
Next, while the project is opening, a row object is created for each row in the spreadsheet. When you access a field on the row object, you get the value for one row only; by default that is the first row. (I'll cover why you can still see the values for a whole column later on.)

In addition, a cell object is created for each cell in the spreadsheet.

What follows is a breakdown of the contents of the row object.

row.index
The field row.index contains the row number of the current row being accessed.

row.cells
row.cells is an object within the row object. Typing just row.cells into a GREL window will not yield any values that can be manipulated. Despite the similar name, row.cells is not the same thing as the cell object covered in my previous blog post about objects; it is a container whose fields point to cell objects, as described next.

When the project is created, the names of all of the columns are stored in the array row.columnNames. From that array, Open Refine creates one field per column in the cells object, and each field is named after its column. Then Open Refine creates a reference to the cell at (row.index, columnName1) and maps it to the field <columnName1>. If there's a cell at (row.index, columnName2), Open Refine creates a reference to that cell and maps it to the field <columnName2>, and so on.

Since a reference to the cell object is mapped to the appropriate <columnName> field, you can access the fields within the cell object by using row.cells.<columnName>.<field>

Example 1: for columns column1, column2 and column3, to access the value field for each column, you'd write:
row.cells.column1.value
row.cells.column2.value
row.cells.column3.value

This notation can be shortened to cell.value or just value if you are currently working with the column. For example, if you did an add column on column2 and wanted to concatenate the values from column1, column2 and column3 in the new column, the GREL expression would be:

row.cells.column1.value + " " + value + " " + row.cells.column3.value

Note: If your column name has a space in it, such as Column 1, the syntax is row.cells["Column 1"].value (the square brackets and quotes are mandatory, and there is no dot before the opening bracket)
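
To make the mapping from column names to cells more concrete, here is a toy model of it in Python. This is only my sketch of the idea described above, not Open Refine's actual code: the classes Cell and Cells are names I made up, and only the value field is modeled.

```python
class Cell:
    """Hypothetical stand-in for a cell object; only the value field is modeled."""

    def __init__(self, value):
        self.value = value


class Cells:
    """Hypothetical stand-in for row.cells: one entry per column name."""

    def __init__(self, column_names, row_cells):
        # Map each column name to a reference to this row's cell in that column.
        self._by_name = dict(zip(column_names, row_cells))

    def __getattr__(self, name):
        # Dot access, like row.cells.column1; only possible for names without spaces.
        by_name = self.__dict__.get("_by_name", {})
        if name in by_name:
            return by_name[name]
        raise AttributeError(name)

    def __getitem__(self, name):
        # Bracket access, like row.cells["Column 1"]; works for any column name.
        return self._by_name[name]


cells = Cells(["column1", "Column 1"], [Cell("apple"), Cell("pear")])

# The same cell object is reached either way when the name has no spaces.
same_cell = cells.column1 is cells["column1"]

# A name with a space can only be reached with brackets and quotes.
joined = cells.column1.value + " " + cells["Column 1"].value
print(joined)  # prints "apple pear"
```

The point of the sketch is that the cells object holds references, so both notations land on the same cell, and the bracket form is simply the one that survives a space in the name.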

Example 2: If I wanted to concatenate the recon.best.id from my current column and from column LCSH, with a semicolon in between, the GREL is:
cell.recon.best.id + "; " + row.cells.LCSH.recon.best.id

row.cells can be shortened to cells.

row.columnNames
An array consisting of all of the column names. When the project is created, the names of all of the columns are stored in this array. If a column is added, its name is added to the array and another field with the new column name is created in the cells object.

row.starred
This field is a boolean. It is either true, if the row is starred, or false if it isn't.

row.flagged
This field is a boolean. It is either true, if the row is flagged, or false if it isn't.



Here's the pseudo-code:


object row
      field index = {row number}
      object cells
           field <columnName1> = {reference to the cell at row.index, columnName1, taken from object CellTuple}
           {the following fields are only created if the columns in question exist}
           field <columnName2> = {reference to the cell at row.index, columnName2, taken from object CellTuple}
           field <columnName3> = {reference to the cell at row.index, columnName3, taken from object CellTuple}
           {more fields will be created if more columns are added}
      array columnNames = {an array of all of the column names; if a new column is added, its name is added to this array}
      field starred = {true, false}
      field flagged = {true, false}
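
For anyone who finds runnable code easier to follow than pseudo-code, the same layout can be sketched in Python. Again, this is a simplified model under my own assumptions, not how Open Refine is implemented: Row, build_rows and add_column are hypothetical names, cells hold bare values instead of cell objects, and rows are counted from 0.

```python
from dataclasses import dataclass, field


@dataclass
class Row:
    """Hypothetical stand-in for the row object described above."""

    index: int                                        # row number (counted from 0 in this sketch)
    cells: dict = field(default_factory=dict)         # column name -> cell value (simplified)
    column_names: list = field(default_factory=list)  # every column name, in order
    starred: bool = False
    flagged: bool = False


def build_rows(column_names, table):
    """Create one Row per spreadsheet row, with one cells entry per column."""
    rows = []
    for i, values in enumerate(table):
        rows.append(Row(index=i,
                        cells=dict(zip(column_names, values)),
                        column_names=list(column_names)))
    return rows


def add_column(rows, name, values):
    """Adding a column appends its name and creates a matching cells entry."""
    for row, value in zip(rows, values):
        row.column_names.append(name)
        row.cells[name] = value


rows = build_rows(["column1", "column2"], [["a", "b"], ["c", "d"]])
add_column(rows, "column3", ["e", "f"])
rows[1].starred = True  # starring a row just flips its boolean field
```

After add_column runs, every row's column_names ends with "column3" and its cells gain a matching entry, which mirrors the "more fields will be created if more columns are added" note in the pseudo-code.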



If row.cells.<columnName1>.value only accesses the current row, why do I see the whole column when I do an add column?

Simple: behind-the-scenes magic. When you add a column, you write the expression once, for the cell in the current row, and Open Refine iterates down the rows and evaluates it for each one. E.g., first it displays row.cells.<columnName1>.value for row 1. Then it goes to row 2 and displays row.cells.<columnName1>.value for that row; then it goes to row 3 and displays row.cells.<columnName1>.value for that row, and so on.
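
That row-by-row loop can be sketched in a few lines of Python. This is only an illustration of the idea, not Open Refine's code: evaluate_for_every_row is a name I made up, and each row is reduced to a plain dictionary of column name to value.

```python
def evaluate_for_every_row(rows, expression):
    """Apply a single-row expression to each row in turn and collect the results."""
    results = []
    for row in rows:  # the iteration that is done for you behind the scenes
        results.append(expression(row))
    return results


# Each row is a plain dict here: column name -> value (a simplification).
rows = [
    {"column1": "a", "column2": "b", "column3": "c"},
    {"column1": "d", "column2": "e", "column3": "f"},
    {"column1": "g", "column2": "h", "column3": "i"},
]

# The expression only ever looks at one row, like row.cells.column1.value.
new_column = evaluate_for_every_row(rows, lambda row: row["column1"])
print(new_column)  # prints ['a', 'd', 'g']
```

The expression itself never knows about the other rows; the whole column appears only because the same expression is run once per row.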
 
Argh. cell and cells are too similar, how do I keep them straight? 
My suggestion is to use cell for the column you're working with, and row.cells for any other column until you get used to the notation.



       
