Thursday, July 6, 2017

Tutorial on objects and the cell.recon object


What's an object?
You've probably seen this wiki page when trying to access the deeper reconciliation fields in Open Refine:

https://github.com/OpenRefine/OpenRefine/wiki/Variables

This is a pretty confusing page, so let me give a little background about objects. An object is a programming language structure that's basically a container for a bunch of other things, including other objects. The object itself cannot be manipulated, but everything contained in that object can.
Open Refine has a master object and then either objects or fields contained within it.

A field will either yield a string, a boolean, a null, a number, or it will have deeper fields.



So, let's say I have a dresser with three drawers, the top drawer, the middle drawer and the bottom drawer. In the top drawer I have blouses, the middle has t-shirts, and my bottom drawer has socks. The dresser isn't mine - I'm just borrowing it.

I also have stashed my jewelry in the top drawer, some spending money in the middle, and the bottom has an envelope full of photos.

The dresser would become an object, because it encapsulates all three drawers. I can manipulate the drawers, note their location, and access their contents, but since this is a borrowed piece of furniture, I can't do anything to the outside at all. The location of each drawer, whether it's pulled out or not, and the contents become fields within the dresser object.

However, fields can only yield one value. Since I have two things inside my drawers (clothes and miscellaneous items) and I want to have two separate categories, I'm going to make drawer_contents an object. This way, I can have separate fields for clothes and items.

In pseudocode, you would declare your dresser like this:
object dresser
          field drawer_location = (can be "top", "middle", or "bottom")
          field drawer_Pulled_out = (can be "true" or "false")
          object drawer_contents
                    field clothes = (can be "socks", "blouses", "t-shirts")
                    field items = (can be "jewelry", "money", "photos")

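To access the value in each field, you would use the syntax <object name>.<field name>, or in the case of encapsulated objects with fields: <object name>.<object name>.<field name>. For example, dresser.drawer_location gets you the drawer's location, and dresser.drawer_contents.clothes gets you the clothes in it.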

If you tried to output the values in all fields, it would look like this:
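Here's a minimal sketch of the dresser in Python, just to show the dotted access in something you can run. Python isn't what Open Refine uses, the class names are my own, and I'm assuming every drawer is closed (the story above doesn't say). Each drawer gets its own set of field values:

```python
from dataclasses import dataclass


@dataclass
class DrawerContents:
    clothes: str
    items: str


@dataclass
class Dresser:
    drawer_location: str
    drawer_pulled_out: bool  # assumption: all drawers start closed
    drawer_contents: DrawerContents


# One set of field values per drawer, taken from the dresser story.
drawers = [
    Dresser("top", False, DrawerContents("blouses", "jewelry")),
    Dresser("middle", False, DrawerContents("t-shirts", "money")),
    Dresser("bottom", False, DrawerContents("socks", "photos")),
]


def describe(dresser):
    # <object name>.<field name> and <object name>.<object name>.<field name>
    contents = dresser.drawer_contents
    return f"{dresser.drawer_location}: {contents.clothes}, {contents.items}"


lines = [describe(d) for d in drawers]
for line in lines:
    print(line)
```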

References to objects
Before I continue, I also want to cover references to objects, because Open Refine uses them. A reference to an object is another way to access the object. When you create an object, it exists somewhere in the computer's memory. A reference is essentially the address of that spot in memory, usually written as a hexadecimal number, and it can be used to indirectly access the fields and objects contained in your master object. The programmer can then map the reference to a human-readable variable.

Variables in programming are just ways to store variable incoming data so that you can work with it. For example, if an online vendor has a coupon that you can use at checkout time, the shopping cart program stores your subtotal in a variable because the subtotal could be anything. Then the discount from the coupon is applied to that variable.

Variables can be of any type. So a programmer can map a reference to an object to a generic variable and use it to get access to the fields they need. You can then use the mapped variable just like the object itself to access the data. For example, if I assign a reference to the dresser object to variable bureau, I can get the drawer location by using bureau.drawer_location, and access the clothes by using bureau.drawer_contents.clothes.

Why would you use a reference? Say you want to manipulate a few fields within the object but you don't want to alter the data. You could do this by making a copy of the object. However, if you have a large program, it's very memory and computing intensive to make entire copies of the object. What's more efficient is to use a reference to access the few fields you need, make copies of those, and then work with them. Because you want to conserve as many resources as possible, it's a programming best practice to use references.
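To make the difference between a reference and a copy concrete, here's a small Python sketch. It's only an illustration: the dresser is a throwaway stand-in built with SimpleNamespace, and the trick of reading the address with id() is specific to the standard CPython interpreter.

```python
import copy
from types import SimpleNamespace

# A stand-in dresser object (hypothetical, for illustration only).
dresser = SimpleNamespace(
    drawer_location="top",
    drawer_contents=SimpleNamespace(clothes="blouses", items="jewelry"),
)

# bureau is a second name for the SAME object, not a copy.
bureau = dresser
same_object = bureau is dresser

# In CPython, id() happens to be the object's memory address,
# which is the "hexadecimal number" behind the reference.
address = hex(id(bureau))

# Copy only the one field we want to work with, then change the copy.
# The original field is left untouched.
clothes = bureau.drawer_contents.clothes
clothes = clothes.upper()

# The expensive alternative: duplicate the whole object and everything in it.
full_copy = copy.deepcopy(dresser)

print(same_object, address, clothes, dresser.drawer_contents.clothes)
```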

The cell object in Open Refine

Now, Open Refine has basically divided the spreadsheet up into objects. Because of the complexity, I'm going to start from the smallest object and work my way out.

Open Refine maps each cell in a column to object cell, with field value. The data inside the cell, as you know, is copied into field value. Technically, in order to access value, you would have to type cell.value, but Open Refine allows you to just use value.

The recon and ReconCandidate objects
Two things happen, objectwise, once reconciliation is performed. First, a ReconCandidate object is created for each potential match. Then, the fields of each ReconCandidate are initialized with this data:

object ReconCandidate
          field id = {URL in the reconciliation database}
          field name = {name of the subject field}
          array type = {what types of terms were reconciled against}
          field score = {reconciliation match percentage}

The reconciliation function then figures out which subject terms match, and if there isn't a match, what the best recon candidate is, and also figures out the default 3 recon candidates. All types of terms are initialized to array type.
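As a rough sketch of what that looks like, here's a ReconCandidate modeled in Python, with a helper that keeps the three highest-scoring candidates. This is my own illustration, not Open Refine's code: the helper name, the example.org ids, the names and the scores are all made up, and ranking purely by score is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReconCandidate:
    id: str          # URL in the reconciliation database
    name: str        # name of the subject term
    type: List[str]  # types the term was reconciled against
    score: float     # reconciliation match score


def top_candidates(candidates, limit=3):
    """Return the highest-scoring candidates, best first (assumed ranking)."""
    return sorted(candidates, key=lambda c: c.score, reverse=True)[:limit]


# Made-up sample data for illustration.
all_candidates = [
    ReconCandidate("http://example.org/id/3", "Twain, Marc", ["person"], 62.0),
    ReconCandidate("http://example.org/id/1", "Twain, Mark", ["person", "author"], 88.5),
    ReconCandidate("http://example.org/id/4", "Twain Harte", ["place"], 40.0),
    ReconCandidate("http://example.org/id/2", "Twain, Mark, 1835-1910", ["person"], 75.0),
]

default_three = top_candidates(all_candidates)
best = default_three[0]
print(best.name, [c.score for c in default_three])
```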

Then, Open Refine initializes the following values in cell:
cell.recon.judgment : This is the value that gets output to the reconciliation facet, judgment. Can be "matched", "new", or "none".

cell.recon.matched : If the judgment field is "matched", this field's value is "true". Otherwise, it's "false".



cell.recon.match : Not to be confused with cell.recon.matched. It's null if there's no match. If there is a match, a reference to the matching ReconCandidate is mapped to the match field. That means that fields id, name, type, and score can be accessed by cell.recon.match.<ReconCandidate field>
      e.g. cell.recon.match.id = matching subject ReconCandidate.id
             cell.recon.match.name = matching subject ReconCandidate.name
             cell.recon.match.type = gives you the array containing the types. If you want to manipulate it, you need to use array functions. If you want to output the entire array, you need to use cell.recon.match.type.join(<delimiter>). If you want to access one element of the array, it's cell.recon.match.type[<0, 1, or 2>]
             cell.recon.match.score = matching subject ReconCandidate.score (should be 100)


cell.recon.best : Null if cell.recon.match has been mapped to a ReconCandidate. If cell.recon.match is null, then a reference to the best ReconCandidate is mapped to the best field. The fields can be accessed by cell.recon.best.<ReconCandidate field>
             cell.recon.best.id = best recon candidate ReconCandidate.id
             cell.recon.best.name = best recon candidate ReconCandidate.name
             cell.recon.best.type = gives you the array containing the types for the best recon candidate. If you want to manipulate it, you need to use array functions. If you want to output the entire array, you need to use cell.recon.best.type.join(<delimiter>). If you want to access one element of the array, it's cell.recon.best.type[<0, 1, or 2>]
             cell.recon.best.score = best recon candidate ReconCandidate.score


cell.recon.features : This is an object with the following fields:
                     cell.recon.features.typeMatch = {"true" if the best candidate matches the intended reconciliation type, otherwise "false"}
                     cell.recon.features.nameMatch = {"true" if the best candidate's name matches the cell's text exactly, otherwise "false"}
                     cell.recon.features.nameLevenshtein = {number computed by comparing the best candidate's name with the cell's text; larger numbers mean bigger difference}
                     cell.recon.features.nameWordDistance = {number computed by comparing the best candidate's name with the cell's text; larger numbers mean bigger difference}
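If you're wondering what kind of number nameLevenshtein is: Levenshtein distance counts how many single-character insertions, deletions, or substitutions it takes to turn one string into another. Here's the textbook version in Python as a sketch. I'm assuming the plain, unnormalized algorithm; Open Refine's exact computation may differ, and the sample strings are made up.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: insertions, deletions, substitutions."""
    # previous[j] holds the distance between the first i-1 chars of a
    # and the first j chars of b.
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


cell_text = "Twain, Marc"        # made-up cell value
candidate_name = "Twain, Mark"   # made-up best candidate name

name_match = cell_text == candidate_name
name_levenshtein = levenshtein(cell_text, candidate_name)
print(name_match, name_levenshtein)
```

An exact match gives a distance of 0, which lines up with "larger numbers mean bigger difference".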


cell.recon.candidates : An array of object references to the default three ReconCandidates. Because it's an array of object references, the reference is mapped to the elements of array candidates. Therefore, to access the fields .id, .name, .type, .score, for the individual ReconCandidates, you would need to use this syntax for the first default ReconCandidate:
              cell.recon.candidates[0].id
              cell.recon.candidates[0].name
              cell.recon.candidates[0].type (for the whole type array) or cell.recon.candidates[0].type[<0-2>] for the individual elements of the type array.
              cell.recon.candidates[0].score
For the second and third candidates, use the syntax above, but substitute [1] and [2] for the index. (For cell.recon.candidates[0].type[<0-2>], only substitute the first index; the second index picks an element of that candidate's type array.)
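To tie the fields above together, here's a Python model of a cell whose reconciliation found no match, so match is null and best points at the first candidate. It's a sketch under my own assumptions: the classes mimic the fields described in this post, the sample data is invented, and Python writes the join as ", ".join(array) where the Open Refine expression is array.join(", ").

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ReconCandidate:
    id: str
    name: str
    type: List[str]
    score: float


@dataclass
class Recon:
    judgment: str = "none"
    match: Optional[ReconCandidate] = None
    best: Optional[ReconCandidate] = None
    candidates: List[ReconCandidate] = field(default_factory=list)

    @property
    def matched(self) -> bool:
        # true only when the judgment is "matched"
        return self.judgment == "matched"


@dataclass
class Cell:
    value: str
    recon: Optional[Recon] = None


# Invented sample data.
three = [
    ReconCandidate("http://example.org/id/1", "Twain, Mark", ["person", "author"], 88.5),
    ReconCandidate("http://example.org/id/2", "Twain, Mark, 1835-1910", ["person"], 75.0),
    ReconCandidate("http://example.org/id/3", "Twain, Marc", ["person"], 62.0),
]
cell = Cell("Twain, Mark", Recon(judgment="none", best=three[0], candidates=three))

first_id = cell.recon.candidates[0].id
second_name = cell.recon.candidates[1].name
first_types = ", ".join(cell.recon.candidates[0].type)  # whole type array
first_type = cell.recon.candidates[0].type[0]           # one element
# best and candidates[0] are references to the same object, not copies.
shares_object = cell.recon.best is cell.recon.candidates[0]
print(first_id, second_name, first_types, first_type, shares_object)
```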



In pseudocode, the complete cell object looks like this:

object cell
          field value = {whatever value is in the cell}
          object recon (assuming reconciliation was performed, otherwise everything in object recon is null)
                    field judgment = {"matched", "new", or "none"}
                    field matched = {"true" or "false"}
                    field match = {if there's a match, map reference to the ReconCandidate object that matches, null if not}
                    field best = {if field match is null, map reference to best recon candidate of the ReconCandidate objects, null otherwise}
                    object features
                              field typeMatch = {"true" or "false"}
                              field nameMatch = {"true" or "false"}
                              field nameLevenshtein = {a computed number}
                              field nameWordDistance = {a computed number}
                    array candidates = {array of references to the first 3 default ReconCandidate objects}


I'll cover part of the rows object in another blog post.
