I was involved in some work on calculating and visualising this kind
of word/phrase frequency on a project a few years ago. It was based
on queries to a corpus indexed in Elasticsearch, with word/phrase
frequencies charted against each other in JavaScript - see
https://ukmhl.historicaltexts.jisc.ac.uk/ngram for an example.
Although it looks superficially similar, it isn't anywhere close to
as sophisticated as the Google n-gram viewer.
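For anyone wanting to build something similar: a terms aggregation
over a year field is one straightforward way to get per-year counts
out of Elasticsearch. A rough sketch in JavaScript (the index name
and the "text"/"year" field names are invented for illustration, not
our actual schema):

    // Count, per year, the texts matching a phrase, using a terms
    // aggregation. Returns buckets like [{ key: 1850, doc_count: 12 }].
    async function countsPerYear(phrase) {
      const response = await fetch('http://localhost:9200/corpus/_search', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({
          size: 0, // only the aggregation is needed, not the hits
          query: { match_phrase: { text: phrase } },
          aggs: { per_year: { terms: { field: 'year', size: 500 } } }
        })
      });
      const result = await response.json();
      return result.aggregations.per_year.buckets;
    }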
You probably already know this, but I think it's worth stating: the
Google n-gram viewer (https://books.google.com/ngrams) is not simply
visualising the raw frequency of word/phrase occurrences, but each
n-gram's frequency as a percentage of the total frequency of all
n-grams of the same size in the corpus (see
https://books.google.com/ngrams/info). The Google n-gram viewer goes
well beyond this too, supporting more specific queries (e.g. you can
limit by part of speech, and by the language of the texts analysed).
This suggests sophisticated linguistic parsing of a large corpus,
with the ability to answer complex questions quickly at scale -
something we weren't able to do in our project.
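To make that normalisation concrete, it is conceptually just the
following (a minimal illustration, obviously not Google's actual
code, and the example figures are made up):

    // An n-gram's count in a year, divided by the total count of all
    // n-grams of the same size in that year, as a percentage.
    function relativeFrequency(ngramCount, totalSameSizeNgrams) {
      return (ngramCount / totalSameSizeNgrams) * 100;
    }

    // e.g. 1,200 occurrences of a bigram in a year whose texts
    // contain 40 million bigrams in total:
    relativeFrequency(1200, 40000000); // => 0.003 (per cent)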
In the project I was involved in, we were simply showing the
percentage of texts in the corpus in which a word appears, not its
frequency as a percentage of all same-sized n-grams - which means our
viewer says more about general book topics than it does about
linguistic analysis. You can also see issues with the measurement at
either end of the graph, where there appear to be spikes in usage.
These actually reflect the fact that the collection has relatively
few texts from those years, so a term only has to appear in a few
books to score a high percentage. Despite this, the tool is still
useful (IMO) within those constraints - in the example search you can
see that the term "tuberculosis" rises in frequency while the term
"phthisis" (an older name for the same condition) drops off, a trend
also shown by the Google n-gram viewer:
https://books.google.com/ngrams/graph?content=tuberculosis%2Cphthisis&year_start=1800&year_end=1930&corpus=15&smoothing=3&share=&direct_url=t1%3B%2Ctuberculosis%3B%2Cc0%3B.t1%3B%2Cphthisis%3B%2Cc0
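To make the percentage-of-texts measure (and those end-of-graph
spikes) concrete, it is conceptually just the sketch below - the
figures are illustrative, not real numbers from the collection:

    // Percentage of texts published in a year that contain the term
    // at all. A year with only a handful of texts lets one or two
    // matches produce a large spike.
    function percentOfTexts(textsContainingTerm, textsInYear) {
      return (textsContainingTerm / textsInYear) * 100;
    }

    percentOfTexts(30, 1500); // => 2   (a well-covered year)
    percentOfTexts(1, 4);     // => 25  (a thin year: "spike")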
Finally, if you want to do more sophisticated analysis it may be
worth looking at specialist tools - e.g. AntConc:
http://www.laurenceanthony.net/software/antconc/
Hope some of that is helpful
Owen
Fitchett, Deborah wrote on 08/11/2019 01:10:
> You might be interested in Chart.js (https://www.chartjs.org/) - it does the visualisation part, if you could do the search part.
>
> Deborah
>
> -----Original Message-----
> From: Code for Libraries <[log in to unmask]> On Behalf Of Vinit Kumar
> Sent: Thursday, 7 November 2019 6:17 PM
> To: [log in to unmask]
> Subject: [CODE4LIB] n-gram visualisation
>
> Dear Code4Libers,
>
> I have data with the following structure:
> ngrams     2009  2010  2011  2012  2013  2014  2015
> library      22     3    32    32    35    21    21
> technology    3     4    43    32    30    43    32
> and so on
>
> Is it possible to visualise this data in a similar manner to the Google N-gram viewer, where one can put a keyword in a search bar and the visualisation displays the year-wise trend of that keyword in the corpus, based on the above structured data?
> Any pointers or tools would be of help.
> Thanking you in anticipation.
>
>
> --
> Regards
> Vinit Kumar, Ph.D.
> Assistant Professor,
> Department of Library and Information Science Babasaheb Bhimrao Ambedkar University, Rae Bareilly Road, Lucknow, India 226025
> +919454120174
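Coming back to the original question: with counts already in that
shape, Deborah's Chart.js suggestion covers the visualisation side.
A minimal sketch, assuming Chart.js is loaded on a page with a
<canvas id="trend"> element (the element id and names are mine, and
the counts are hard-coded from the sample above):

    // Year-wise counts per n-gram, hard-coded from the sample data.
    const years = [2009, 2010, 2011, 2012, 2013, 2014, 2015];
    const counts = {
      library:    [22, 3, 32, 32, 35, 21, 21],
      technology: [ 3, 4, 43, 32, 30, 43, 32]
    };

    let chart; // keep a handle so repeated searches can redraw
    function showTrend(term) {
      if (chart) chart.destroy(); // Chart.js needs the canvas freed first
      chart = new Chart(document.getElementById('trend'), {
        type: 'line',
        data: {
          labels: years,
          datasets: [{ label: term, data: counts[term] }]
        }
      });
    }

    showTrend('library');

Wiring that up to a search box is then just an input listener that
looks the typed term up in the data and calls showTrend with it.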