LISTSERV 16.5 - CODE4LIB Archives

I was involved in some work on calculating and visualising this kind of 
word/phrase frequency on a project a few years ago. It was based on 
queries to a corpus indexed in Elasticsearch, with word/phrase 
frequencies charted against each other using JavaScript - see 
https://ukmhl.historicaltexts.jisc.ac.uk/ngram for an example. Although 
it looks superficially similar, it isn't anywhere close to as 
sophisticated as the Google n-gram viewer.

You probably already know, but I think it's worth stating, that the 
Google n-gram viewer (https://books.google.com/ngrams) is not simply 
visualising the raw number of word/phrase occurrences, but the frequency 
as a percentage of all n-grams of the same size in the corpus (see 
https://books.google.com/ngrams/info). The Google n-gram viewer goes 
well beyond this too, supporting ways of being more specific (e.g. you 
can limit by part of speech, and by language of the texts analysed). 
This suggests sophisticated linguistic parsing of a large corpus, with 
the ability to answer complex questions quickly at scale - something we 
weren't able to do in our project.
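
To make the difference from raw counts concrete, here is a minimal 
Python sketch of that kind of normalisation. It is not Google's 
implementation, just an illustration of dividing each count by the total 
number of same-sized n-grams for the year. The function name and the 
yearly totals are made up for the example; the counts are the first two 
years from the table in the original question.

```python
def relative_frequency(counts_by_year, totals_by_year, term):
    """Return {year: term count as a percentage of all same-size n-grams}."""
    result = {}
    for year, counts in counts_by_year.items():
        total = totals_by_year.get(year, 0)
        # Skip the division when a year has no n-grams at all
        result[year] = 100.0 * counts.get(term, 0) / total if total else 0.0
    return result


# Counts for 2009 and 2010 taken from the table in the original question
counts_by_year = {
    2009: {"library": 22, "technology": 3},
    2010: {"library": 3, "technology": 4},
}
# Hypothetical totals of all 1-grams per year (invented for illustration)
totals_by_year = {2009: 1000, 2010: 500}

library_trend = relative_frequency(counts_by_year, totals_by_year, "library")
print(library_trend)
```

With only the raw counts from the table, you can chart counts but not 
this percentage - for that you would also need the per-year totals.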

In the project I was involved in, we simply show the percentage of texts 
in the corpus in which a word appears, not the frequency as a percentage 
of all same-sized n-grams - which means our viewer is more about 
reflecting general book topics than it is about linguistic analysis. You 
can also see issues with the measurement at either end of the graph, 
where there seem to be spikes in usage. These actually reflect the fact 
that the collection has very few texts from those years, which means a 
term only has to appear in a few books to get a high percentage. Despite 
this, the tool is still useful (IMO) within those constraints - in the 
example search it is possible to see that the term "tuberculosis" rises 
in frequency, while the term "phthisis" (for the same condition) drops 
off - a trend also shown by the Google n-gram viewer: 
https://books.google.com/ngrams/graph?content=tuberculosis%2Cphthisis&year_start=1800&year_end=1930&corpus=15&smoothing=3&share=&direct_url=t1%3B%2Ctuberculosis%3B%2Cc0%3B.t1%3B%2Cphthisis%3B%2Cc0
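
As a rough illustration of the "percentage of texts" measure and why 
sparse years produce spikes, here is a small Python sketch. It is not 
the code from our project (that ran as queries against the search 
index); the function name and the toy data are invented for the example. 
Returning the number of texts alongside the percentage is one simple way 
to flag years where the figure rests on very few books.

```python
def document_percentage(texts_by_year, term):
    """Return {year: (percentage of texts containing term, number of texts)}."""
    result = {}
    for year, texts in texts_by_year.items():
        if not texts:
            continue  # no texts for this year, so no percentage to report
        hits = sum(1 for words in texts if term in words)
        result[year] = (100.0 * hits / len(texts), len(texts))
    return result


# Invented toy corpus: each text is represented as the set of words it contains
texts_by_year = {
    1800: [{"phthisis", "fever"}, {"surgery"}],
    1850: [{"phthisis"}, {"fever"}, {"surgery"}, {"cholera"}],
}

trend = document_percentage(texts_by_year, "phthisis")
for year, (percentage, n_texts) in sorted(trend.items()):
    print(year, f"{percentage:.1f}% of {n_texts} texts")
```

In the toy data one matching book out of two gives 50% for 1800, while 
one out of four gives 25% for 1850, even though the term appears in the 
same number of books in both years.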

Finally, if you want to do more sophisticated analysis, it may be worth 
looking at specialist tools - e.g. AntConc 
http://www.laurenceanthony.net/software/antconc/

Hope some of that is helpful

Owen


Fitchett, Deborah wrote on 08/11/2019 01:10:
> You might be interested in Chart.js (https://www.chartjs.org/) - it does the visualisation part, if you could do the search part.
>
> Deborah
>
> -----Original Message-----
> From: Code for Libraries <[log in to unmask]> On Behalf Of Vinit Kumar
> Sent: Thursday, 7 November 2019 6:17 PM
> To: [log in to unmask]
> Subject: [CODE4LIB] n-gram visualisation
>
> Dear Code4Libers,
>
> I have data with the following structure:
> ngrams      2009  2010  2011  2012  2013  2014  2015
> library       22     3    32    32    35    21    21
> technology     3     4    43    32    30    43    32
> and so on
>
> Is it possible to visualise this data in a similar manner as Google N-gram viewer displays? Wherein one can put the keyword in a search bar and the visualisation displays the year wise trend of that keyword in the corpus based on the above structured data.
> Any pointers or tools would be of help.
> Thanking you in anticipation.
>
>
> --
> Regards
> Vinit Kumar, Ph.D.
> Assistant Professor,
> Department of Library and Information Science Babasaheb Bhimrao Ambedkar University, Rae Bareilly Road, Lucknow, India 226025
> +919454120174
