Identifying Unique Book Interests across NYPL’s Branches

Written by Amritansh Kwatra, Ph.D. Student in Information Science, Cornell University

How can NYPL identify and serve unique category interests across its diverse branches beyond bestsellers and individual popular titles?

The New York Public Library’s (NYPL) branches span three boroughs, serving patrons with diverse reading interests. To better serve those interests, the NYPL wants to learn which of its branches are uniquely interested in which kinds of books. This excludes bestsellers and widely popular books, as those are popular across most NYPL branches. It also excludes individual books that are only popular at specific branches, since those books already reside at the branch and are available for patrons to check out. Instead, the NYPL is interested in deriving a categorization of books that can be optimized based on patterns of aggregate use across its branches. If a disproportionate interest in a specific category, such as historical biographies, international cuisine, or urban gardening, is detected at a certain branch, the NYPL could strategically send more books in that category to the branch, enriching options for patrons.

This summer, as a Siegel PiTech PhD Impact Fellow, my project with the NYPL focused on three parts of this broader problem:

  1. Investigating methods for clustering books into categories

  2. Determining a metric for disproportionate interest

  3. Surveying external data sources to improve the quality of clusters we could generate

A Method for Clustering Books

First, we examined how we could establish a categorization for books. The categories had to be more specific than the genres associated with a book, but not so specific that a book could no longer be grouped with enough other books. As a result, we looked for methods in which the specificity, or tightness, of a group could be varied based on input parameters.

Figure 1: (a) A sketched graph where nodes represent books and edges represent the similarity between books. (b) A sketched representation of the desired output, where the graph relationships can generate clusters and detect singletons (clusters with a single book in them).

Figure 2: A sampling of the clusters we could generate with our method. The orange cluster is about Italian migration to the USA, the blue cluster is about American authors writing about depression, and the green cluster is about glassware design.

We also could not rely on methods typically used by online content recommendation systems such as Netflix or Amazon. These providers can estimate which books are similar by leveraging their users’ viewing or purchase history. This method, called collaborative filtering, allows a platform to learn a categorization driven by customer usage. However, the NYPL and many other public libraries are historically committed to preserving patron privacy and do not store individual patron consumption data. This can complicate using off-the-shelf methods.

Instead, we chose to cluster the books into categories using a series of semantic tags attached to each book, and to use these tags to compose a graph that represents the NYPL’s collection. Each node in this graph represents a book, and each edge represents how similar the two books it connects are. Initially, we use tags attached to books in MARC (MAchine-Readable Cataloging) bibliographic data sourced from the Library of Congress.

Each book is represented by a vector of the tags attached to it, with a different vector for each MARC data source. We then compute the similarity of a pair of books by taking the cosine similarity of their vectors, which acts as a similarity score between the two books. We can repeat this process for every data source that provides tags for the books. For this summary, we assume a single data source, i.e., one edge per pair of nodes.
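As a rough sketch of how such a similarity score could be computed (the tags and variable names here are hypothetical, not the NYPL’s actual schema), each book’s tag set can be turned into a binary vector and compared pairwise with cosine similarity:

```python
# Sketch: binary tag vectors and pairwise cosine similarity (illustrative only).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical MARC-derived tags per book.
books = {
    "b1": ["italy", "emigration", "new york", "history"],
    "b2": ["italian americans", "emigration", "history", "biography"],
    "b3": ["glassware", "design", "decorative arts"],
}

# One binary vector per book: 1 if the tag is attached, 0 otherwise.
vectorizer = CountVectorizer(analyzer=lambda tags: tags, binary=True)
X = vectorizer.fit_transform(books.values())

# similarity[i, j] is the cosine similarity between books i and j,
# i.e., the weight of the edge connecting them in the graph.
similarity = cosine_similarity(X)
print(similarity.round(2))
```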

To turn this graph into a series of disconnected clusters, we can use a class of community detection algorithms from the literature on graph data structures. While we did not evaluate many options at this step, our initial choice, based on prior experience, worked well. We used a connected components algorithm that prunes edges under a certain similarity threshold and returns the sub-graphs that remain connected after the pruning. The resulting clusters are transitive closures of similarity: two books in the same cluster may themselves have a similarity below the threshold, as long as they are connected through intermediate books whose pairwise similarities exceed it. Moreover, this satisfied our desire to make the tightness of each cluster variable, as the similarity threshold could be tuned to achieve a desired level of intra-cluster similarity. Figure 2 shows examples of clusters we could generate with this method. See Appendix Item A for a detailed explanation of our method.
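A minimal sketch of this pruning-and-components step is shown below; the similarity matrix and threshold are illustrative placeholders, not the values used in the actual pipeline:

```python
# Sketch: threshold-pruned similarity graph and connected components.
import networkx as nx
import numpy as np

book_ids = ["b1", "b2", "b3", "b4"]
similarity = np.array([
    [1.0, 0.6, 0.1, 0.0],
    [0.6, 1.0, 0.5, 0.1],
    [0.1, 0.5, 1.0, 0.0],
    [0.0, 0.1, 0.0, 1.0],
])
THRESHOLD = 0.4  # tune to control how tight the clusters are

G = nx.Graph()
G.add_nodes_from(book_ids)
for i in range(len(book_ids)):
    for j in range(i + 1, len(book_ids)):
        if similarity[i, j] >= THRESHOLD:  # prune edges below the threshold
            G.add_edge(book_ids[i], book_ids[j], weight=similarity[i, j])

# Each surviving connected component is a candidate cluster. Note the transitivity:
# b1 and b3 land in the same cluster only because both connect to b2, while b4,
# with no edge above the threshold, remains a singleton (Figure 1b).
clusters = list(nx.connected_components(G))
print(clusters)  # e.g., [{'b1', 'b2', 'b3'}, {'b4'}]
```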

A Metric for Disproportionate Interest

A challenging part of this work was determining an estimate of disproportionate use of a cluster at a branch. This was complicated because branches were of different sizes, and most books had very few checkouts during any given year. Ultimately, we resolved these issues by normalizing the number of checkouts a cluster received at a branch by that branch’s total number of checkouts and taking the log of that value. This allowed us to compare the volume of checkouts at smaller branches to larger ones.
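A minimal sketch of this normalization, under the assumption that the share is a cluster’s checkouts at a branch divided by that branch’s total checkouts (the column names here are hypothetical):

```python
# Sketch: log-normalized checkout share per (branch, cluster) pair.
import numpy as np
import pandas as pd

# Hypothetical aggregate checkout counts; no patron-level data is involved.
checkouts = pd.DataFrame({
    "branch": ["Huguenot Park", "Huguenot Park", "Midtown", "Midtown"],
    "cluster": ["italian_migration", "glassware_design",
                "italian_migration", "glassware_design"],
    "checkouts": [42, 3, 5, 60],
})

# Divide each cluster's checkouts by its branch's total, then take the log so
# that small and large branches land on a comparable scale.
branch_totals = checkouts.groupby("branch")["checkouts"].transform("sum")
checkouts["log_share"] = np.log(checkouts["checkouts"] / branch_totals)
print(checkouts)
```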

Figure 3: An example of detected disproportionate interest. This cluster shows the distribution of interest in the topic of Italian Migration to the USA, which is not very popular across most branches except for the Huguenot Park branch in Staten Island.

To isolate disproportionate interest, we used a modified z-score that compares a value to the median of a distribution instead of the mean. Applying a cut-off to this modified z-score surfaced examples where most branches had few or no checkouts of books in a given cluster, while others had far more. Figure 3 shows one example of the distributions we could isolate.
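One common formulation of a median-based score uses the median absolute deviation (MAD); the sketch below assumes that variant, and the 3.5 cut-off is a widely used convention rather than necessarily the value we chose:

```python
# Sketch: modified z-score based on the median and MAD, flagging branches whose
# log-normalized share for a cluster is unusually high.
import numpy as np

def modified_z_scores(values):
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))  # median absolute deviation
    if mad == 0:
        return np.zeros_like(values)
    return 0.6745 * (values - median) / mad

# Hypothetical log-normalized shares for one cluster across five branches;
# the last branch is the outlier with disproportionate interest.
log_shares = np.array([-9.2, -9.0, -8.8, -9.1, -4.5])
scores = modified_z_scores(log_shares)
print(scores.round(2))   # [-0.67  0.    0.67 -0.34 15.18]
print(scores > 3.5)      # [False False False False  True]
```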

Surveying External Data Sources

The data we initially used for clustering came from Library of Congress records. While this allowed us to validate the clustering method, it captured only one way to classify and categorize books. To add dimensions of separability to our clustering approach, we wanted to investigate joining more data sources to the NYPL data. I investigated two additional data sources and, using entity resolution, implemented a system for enriching book records with external data without requiring the two datasets to share unique identifiers.

We looked at adding BISAC categories available with e-books in the NYPL’s data and data from Goodreads, an online platform for readers. We found that we could extract high-quality categories from both sources. However, both shared a common issue: neither contained a unique identifier (e.g., ISBN) we could use to join it to the NYPL’s corpus of books. To remedy this, I implemented an entity resolution system that could estimate which records across datasets referred to the same books. We could then inherit data from one dataset (in this case, categories) into the other. See Appendix Item B for a detailed explanation of this process.
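As a simplified illustration of the matching idea (not the system we deployed; the record fields, weights, and threshold here are hypothetical), records can be compared by fuzzy title and author similarity, and the best match above a threshold inherits the external category:

```python
# Sketch: naive entity resolution via fuzzy title/author matching.
from difflib import SequenceMatcher

nypl_records = [
    {"id": "nypl-1", "title": "The Italian Americans", "author": "Maria Laurino"},
]
external_records = [
    {"title": "Italian Americans, The", "author": "Laurino, Maria",
     "category": "History / Immigration"},
]

def normalize(text):
    # Lowercase, drop commas, and sort words so word order matters less.
    return " ".join(sorted(text.lower().replace(",", "").split()))

def match_score(a, b):
    # Weighted blend of title and author similarity (weights are arbitrary).
    title_sim = SequenceMatcher(None, normalize(a["title"]), normalize(b["title"])).ratio()
    author_sim = SequenceMatcher(None, normalize(a["author"]), normalize(b["author"])).ratio()
    return 0.7 * title_sim + 0.3 * author_sim

for record in nypl_records:
    best = max(external_records, key=lambda ext: match_score(record, ext))
    if match_score(record, best) >= 0.9:
        # The matched NYPL record inherits the external category.
        record["category"] = best["category"]

print(nypl_records)
```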

Impact and Path Forward

At the culmination of the summer, we had identified a process that could generate clusters of books, detect disproportionate interest in those clusters at specific branches, and expand the dimensions used to create the clusters with external data. These were all housed in computational notebooks in the NYPL’s infrastructure and shared with team members. These tools enable teams within the NYPL to probe unique interests at specific branches, which can be used to validate hypotheses relating to branch programming and purchasing choices. Unfortunately, our work over the summer could not begin to investigate these deployments. However, future work could extend the methods and practices we demonstrated over the summer to identify opportunities for action across other library departments.

Acknowledgments

This work was done under the mentorship of Sarah Rankin at the New York Public Library with valuable feedback from the rest of the STRAD group. I am grateful to them for welcoming me over the summer and making me feel included in the broader functions of the team.

Appendix

  A. A presentation explaining the process of clustering books into coherent clusters.

  B. Deploying the clustering process and resolving external datasets to the NYPL’s data using Entity Resolution.
