LISTSERV 16.5 - CODE4LIB Archives

Hi Eric,

A few random thoughts -- basically all in agreement with what you've
already heard.

Agreed that subshells and I/O are what cost (I suspect disk I/O in your
case) and that playing with the number of parallel processes may help.
Rewriting will only help if the dependencies and whatnot you're
loading/invoking are less expensive than what you're doing.
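
For what it's worth, here is a minimal sketch of one way to cap the number of
parallel processes, assuming an xargs that supports -P (GNU and BSD both do).
The temporary directory, the sample files, the JOBS value, and the "tail"
step are all hypothetical stand-ins for the real per-file work, not what your
script actually does:

```shell
#!/usr/bin/env bash
# Hypothetical sample data: six small files, each with a header line.
workdir=$(mktemp -d)
for i in 1 2 3 4 5 6; do
  printf 'header\nrow %s\n' "$i" > "$workdir/file$i.txt"
done

# Assumed starting point only; tune it up or down and time each run.
JOBS=4

# -print0 / -0 keep unusual file names intact.
# -n 1 hands one file to each worker; -P caps how many workers run at once.
# "tail -n +2 file" reads the file directly, with no "cat file |" in front.
find "$workdir" -name '*.txt' -print0 |
  xargs -0 -n 1 -P "$JOBS" sh -c 'tail -n +2 "$1" > "$1.out"' _

# xargs returns only after every worker has finished, so counting is safe here.
processed=$(find "$workdir" -name '*.out' | wc -l | tr -d ' ')
echo "processed $processed files with at most $JOBS at a time"
# Remove "$workdir" when you are done looking at the results.
```

Timing the same run with a few different JOBS values (say 2, 4, 8, 16) is
probably the quickest way to see where the file system starts to choke.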

How many entries is it processing? My gut reaction was to look at disk I/O,
but if you have a lot of entries, the subshells can also really add up. My
experience is that NFS is awful with anything that's I/O intensive. Among
other things, if the find command has a lot to traverse, you may get better
results from the locate command filtered with a regex. locate queries a
prebuilt index instead of walking the file system, so it typically returns in
a fraction of a second, provided the index is current and covers that
directory.

Echoing and sedding your variables, as is done in the script, is surprisingly
expensive, as are the piped sed commands, each of which spawns a separate
process. You can roll them all into one very readable command, which will run
a lot faster if you have a lot of processing to do:

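sed "
s|##ID##|$DOCUMENTID|
s|##SID##|$SID|
s|##TID##|$TID|
s|##TOKEN##|$TOKEN|
s|##LEMMA##|$LEMMA|
s|##POS##|$POS|
"

(Note the double quotes: inside single quotes the shell would pass $SID and
friends to sed literally instead of expanding them.)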
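
To make the comparison concrete, here is a small sketch that runs the same
substitutions both ways on one line. The template and the variable values are
made up for illustration; only the ##...## placeholders come from the command
above:

```shell
#!/usr/bin/env bash
# Hypothetical values standing in for whatever the real script computes.
DOCUMENTID='doc-001'
SID=3
TID=7
TOKEN='libraries'
LEMMA='library'
POS='NNS'

# Hypothetical one-line template containing every placeholder.
template='##ID##,##SID##,##TID##,##TOKEN##,##LEMMA##,##POS##'

# Piped version: six separate sed processes for one line of input.
piped=$(printf '%s\n' "$template" |
  sed "s|##ID##|$DOCUMENTID|" |
  sed "s|##SID##|$SID|" |
  sed "s|##TID##|$TID|" |
  sed "s|##TOKEN##|$TOKEN|" |
  sed "s|##LEMMA##|$LEMMA|" |
  sed "s|##POS##|$POS|")

# Single version: one sed process applies all six substitutions in order.
single=$(printf '%s\n' "$template" | sed "
s|##ID##|$DOCUMENTID|
s|##SID##|$SID|
s|##TID##|$TID|
s|##TOKEN##|$TOKEN|
s|##LEMMA##|$LEMMA|
s|##POS##|$POS|
")

echo "$piped"
echo "$single"
```

Both produce the same line; the second just does it with one process instead
of six. One caveat: a value containing |, & or a backslash would need escaping
before it goes into the replacement side of an s command.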

If you rewrite, what you're doing seems particularly well suited for perl.

kyle

On Mon, Jun 22, 2020 at 9:16 AM Eric Lease Morgan wrote:

> One of our colleagues wrote:
>
> > Instead of “cat file | tail -n +2”, do “tail -n +2 file”.
> >
> > Every one of those “$(…)” creates a subshell, with all the attendant
> > overhead.  Many of those, run in parallel, may be causing a traffic jam
> > for resources.  Have you tried reducing the number of processes launched
> > in parallel, to see if overall performance improves?  If that find command
> > returns hundreds of files, you may be overwhelming your system.  8 to 10
> > parallel processes seems to be the optimum on most VMs I’ve worked on in
> > the recent past with normal amounts of memory and 2 CPUs.
> >
> > It may be disk i/o is your enemy here – that is a kernel process that
> > will put the CPU in a wait state, while it waits for the disk to deliver up
> > the data.  Again, especially with disk i/o, less sometimes gives you more.
>
>
> Using tail -n +x file removes a subshell. I'll give that a whirl.
>
> Yes, the find command not only finds hundreds of files, it finds 10s of
> thousands of files.
>
> My shared file system is NFS, and I hear-tell NFS is not very good for
> parallel processing.
>
> Another colleague of ours suggested re-writing it in Perl, Python, Ruby,
> etc.
>
> Thanks for the input.
>
> --
> Eric Morgan
>