RLP's Computing Blog

ched: The Chunk Editor


PLEASE NOTE: This is currently ONLY AN IDEA, there is no code yet.

Can We Do Better Than sed?

(If you aren’t familiar with “sed” or “awk” (UNIX command line tools), this probably won’t interest you.)

I’ve been thinking for a while (writing this in August 2017) that sed is very nearly a great tool, but there are many cases it doesn’t handle, and for many of them there is no obvious way to handle them in awk or with perl -pe tricks without becoming a real expert in those tools.

I love that sed is so compact. I love that the base case is trivial. I just feel like we can make the non-base-cases not require, you know, a full programming language.

Desires

Here are things I’d like ched to be able to handle:

  • Operate easily with NUL separated data (i.e. perl’s -0)
  • Find me every line with foo in it, print from 2 lines before that (or beginning of file); continue printing until you reach the 3rd line after a line with bar
  • Do this as many times as there are pairs of lines with foo and bar
  • Do this exactly once
  • Do this the first/last N times you see such pairs
  • Replace a regex which includes multiple newlines without loading the entire file into RAM (assuming the regex has no “.*” in it; if it does you’re on your own)
  • Support both greedy and non-greedy pattern matching
  • Support both greedy and non-greedy address ranges
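
To make the first few items concrete, here is a rough Python sketch of the “2 lines before foo through the 3rd line after bar” request. ched has no code yet, so everything in it is an assumption made for illustration: the `context_ranges` name, its parameters, and the choice to silently drop a foo that never finds its bar. It reads one line at a time and only buffers the lookbehind lines plus the range currently being collected.

```python
import re
from collections import deque


def context_ranges(lines, start_re, end_re, before=2, after=3, limit=None):
    """Yield one list of lines per start/end pair (hypothetical helper).

    Each range begins `before` lines ahead of a line matching start_re
    (or at the beginning of input) and runs through the `after`-th line
    following the next line that matches end_re.
    """
    start, end = re.compile(start_re), re.compile(end_re)
    history = deque(maxlen=before)  # lookbehind: last lines seen outside a range
    current = None                  # lines of the open range, or None
    trailing = None                 # lines still owed after the end match
    found = 0
    for line in lines:
        if current is None:
            if start.search(line):
                current = [*history, line]
                history.clear()
            else:
                history.append(line)
            continue
        current.append(line)
        if trailing is None:
            if end.search(line):
                trailing = after
        else:
            trailing -= 1
        if trailing == 0:
            yield current
            current, trailing = None, None
            found += 1
            if limit is not None and found >= limit:
                return
    # Input ended inside the trailing context: emit what we have.
    # A range whose end pattern never matched is dropped (assumption).
    if current is not None and trailing is not None:
        yield current


sample = ["a", "b", "c", "foo", "d", "bar", "e", "f", "g", "h", "foo", "bar"]
for chunk in context_ranges(sample, r"foo", r"bar"):
    print(chunk)
```

With this shape, `limit=1` covers “do this exactly once” and `limit=N` covers “the first N times”. “The last N times” needs either a full pass or a buffer of N finished ranges, which this sketch does not attempt.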

TODO: write a section that is “here’s the things a competent ched user should be able to handle easily without explicit variables or conditionals”.

TODO: read the docs for rq, see if it has any good ideas

Ideas

  • read http://doc.cat-v.org/bell_labs/structural_regexps/se.pdf ; some good stuff in there
    • A thing I might add to that is the ability to bounce back and forth between various views of the structure; the sam system seems to only flow one way
    • see also https://9fans.github.io/plan9port/man/man1/ssam.html and http://sam.cat-v.org/ and https://github.com/deadpixi/sam which is a port

      jcowan> The concept is that you start by specifying a regex that can either specify what is included in a chunk or what is the boundary of a chunk. Then you can filter on a regular expression that either matches or doesn’t match the chunk. Repeat as many times as you want to process subchunks, sub-subchunks, etc.
      jcowan> The final step is either append after each bottom-level chunk, insert before it, replace it with something, or do a regex/substitute command on it.
      jcowan> on the way out all the chunks and delimiters are reassembled, like sed without -n, or just the modified bottom-level chunks, like sed with -n.
      jcowan> For instance, ssam x/Emacs/ c/vi/ will match all chunks consisting solely of the string “Emacs” and replace them with the string “vi”. Everything else is output unchanged.
      jcowan> for another example, sed y// s/Google/Doodle/ will break the input into chunks delimited by blank lines and change the first “Google” in each chunk to “Doodle”.
      jcowan> sed y// g/Furgle/ s/Google/Doodle/ will make the change only in chunks that contain “Furgle”
      jcowan> s/sed/ssam/, of course
      jcowan> ssam -n x/[0-9]+/ a// will find all digit-strings in the input and output them one per line
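
The chunk/filter/act pipeline described in that chat can be mimicked in a few lines of Python. This is only a sketch of the idea, not how ssam is implemented: `split_x`, `split_y` and `apply` are made-up names, and the blank-line delimiter `\n\n+` is my guess, since the quoted `y//` shows an empty pattern.

```python
import re


def split_x(text, pattern):
    """Like x/pattern/: each match is a chunk, the rest is pass-through.

    Returns a list of (is_chunk, piece) pairs that join back to `text`.
    """
    pieces, pos = [], 0
    for match in re.finditer(pattern, text):
        if match.start() == match.end():
            continue  # ignore empty matches
        pieces.append((False, text[pos:match.start()]))
        pieces.append((True, match.group()))
        pos = match.end()
    pieces.append((False, text[pos:]))
    return pieces


def split_y(text, pattern):
    """Like y/pattern/: matches are delimiters, chunks are what lies between."""
    return [(not is_chunk, piece) for is_chunk, piece in split_x(text, pattern)]


def apply(pieces, action, guard=None, keep_rest=True):
    """Run `action` on every chunk that passes the optional g/guard/ filter.

    keep_rest=True reassembles everything (sed without -n);
    keep_rest=False emits only the chunks that were acted on (sed with -n).
    """
    out = []
    for is_chunk, piece in pieces:
        if is_chunk and (guard is None or re.search(guard, piece)):
            out.append(action(piece))
        elif keep_rest:
            out.append(piece)
    return "".join(out)


# x/Emacs/ c/vi/
print(apply(split_x("Emacs rules, Emacs forever", r"Emacs"), lambda chunk: "vi"))

# y/<blank line>/ g/Furgle/ s/Google/Doodle/
text = "Google a\nGoogle b\n\nFurgle Google\n"
first_google = lambda chunk: re.sub(r"Google", "Doodle", chunk, count=1)
print(apply(split_y(text, r"\n\n+"), first_google, guard=r"Furgle"))

# -n x/[0-9]+/ with a newline appended after each chunk
print(apply(split_x("a1b22c", r"[0-9]+"), lambda chunk: chunk + "\n", keep_rest=False))
```

Sub-chunks fall out naturally: the `action` passed to `apply` can itself call `split_x` or `split_y` on the chunk it receives.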

  • ched operates on chunks; these might be lines, they might be NUL-separated strings, they might be successive matches of a regex, they might be a number of bytes, they might be the whole file if that’s what you really want
  • after processing each chunk, ched advances; the advancement need not be an entire chunk; it could for example be operating on 5 line chunks, but advancing 1 line at a time, or advancing 10 lines at a time for that matter
  • addressing should be much more robust than sed, and should be nestable; so for example “replace foo with bar if it occurs within 3 lines of a Ruby if statement, but not inside the statement itself” might, assuming various syntax details like “,? means non-greedy ranging”, be something like:

    /^\s+if\s/-3,?/^\s+end\s*$/+3 {
      1,/^\s+if\s/-1 s/foo/bar/
      /^\s+end\s*$/+1,$ s/foo/bar/
    }

(good luck doing that in a compact way in any language)

For bonus points, have automatic variable type thingies that store the next-outer range’s targets, so this becomes:

    /^\s+if\s/-3,?/^\s+end\s*$/+3 {
      1,$START-1 s/foo/bar/
      $END+1,$ s/foo/bar/
    }
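
For comparison, here is roughly what that nested range costs in a general-purpose language today. It is a sketch under some loud assumptions: `replace_near_blocks` and `margin` are invented names, `if` blocks are not nested, neighbouring blocks are more than `margin` lines apart, and “non-greedy” is taken to mean “stop at the first `end` line after the `if`”.

```python
import re

IF_RE = re.compile(r"^\s+if\s")      # same opener pattern as the ched example
END_RE = re.compile(r"^\s+end\s*$")  # same closer pattern as the ched example


def replace_near_blocks(lines, margin=3):
    """Replace the first foo with bar on lines within `margin` lines of an
    if...end block, leaving the block itself untouched."""
    out = list(lines)
    floor = 0  # first line not already consumed by an earlier range
    i = 0
    while i < len(out):
        if not IF_RE.search(out[i]):
            i += 1
            continue
        # Non-greedy range: the first closing line after the opener.
        end = next(
            (j for j in range(i + 1, len(out)) if END_RE.search(out[j])), None
        )
        if end is None:
            break  # unterminated block: leave the remaining lines alone
        stop = min(len(out), end + 1 + margin)
        before = range(max(floor, i - margin), i)  # like 1,$START-1
        after = range(end + 1, stop)               # like $END+1,$
        for k in (*before, *after):
            out[k] = out[k].replace("foo", "bar", 1)
        floor = i = stop
    return out


sample = [
    "foo 0", "foo 1", "x", "foo 3",
    "  if foo", "    foo", "  end",
    "foo 7", "y", "foo 9", "foo 10",
]
for line in replace_near_blocks(sample):
    print(line)
```

About twenty lines of bookkeeping against four lines of ched, which is more or less the argument.
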
  • more on addressing:
    • so chunks can be addressed, obviously, and it should be easy to user-define what a chunk is, but it should be possible to specify and then easily address sub-chunks and sub-sub-chunks and so on. So, like, if you define a chunk as paragraphs, and sub-chunks as lines, and those as broken up by commas, then /foo/.3.-2 is the second previous comma-separated chunk before the third line of the first paragraph starting with “foo”
    • addresses always cover an entire unit; operations specify (either by nature or as specified by the user) whether they operate on the entire unit, or occur before or after the unit
    • there should be an addressing debug mode that shows the results of addressing commands; it should be possible to specify whether this debug mode highlights the whole unit, the point before, or the point after
  • should be able to have two lines that operate on each chunk in succession, or two lines that first go through the whole file for the first line and then go through it again for the second; maybe even be able to say how many chunks to process before continuing? The specific example is wanting to remove the last line and then operate on the new last line
    • similarly, it’d be nice to be able to turn:
        a
        b
        c
        a
        b
        c
        a
        b
        c

    into

        a b c
        a b c
        a b c

    or

        a a a
        b b b
        c c c
    and vice versa: arbitrary interleaving. This implies being able to explicitly loop over chunks, sub-chunks, files, etc.
    • One way to achieve this would be to allow tagging of chunks, so you could loop over the chunks tagging them, and then loop over the tags pulling the next chunk of that tag
    • On a related note: vertical chunks? i.e. in the last case, can each of those “abc” columns be a chunk?
  • Tene says: It might be nice if whatever ched ends up being could also support generic data formats like json, xml, html, etc. Maybe even plugins for all kinds of data formats, like image metadata and crap like that. Eh, that might be getting too far out of scope. But, handling the same kind of functionality as jq seems plausibly in-scope, for doing substitutions and transformations in structured data.
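
Going back to the a/b/c interleaving example a few bullets up: the tagging idea can be sketched in Python as below. The function names and the “tag is position modulo the group size” rule are assumptions chosen for the example; ched might well tag chunks some other way.

```python
from collections import defaultdict


def tag_chunks(chunks, period):
    """Tag each chunk with its position inside a repeating group of `period`."""
    tagged = defaultdict(list)
    for index, chunk in enumerate(chunks):
        tagged[index % period].append(chunk)
    return tagged


def rows_by_group(chunks, period):
    """a b c / a b c: one output row per consecutive group of chunks."""
    return [" ".join(chunks[i:i + period]) for i in range(0, len(chunks), period)]


def rows_by_tag(chunks, period):
    """a a a / b b b: one output row per tag, pulling chunks in order."""
    tagged = tag_chunks(chunks, period)
    return [" ".join(tagged[tag]) for tag in sorted(tagged)]


def chunks_from_tag_rows(rows):
    """The 'vice versa' direction: turn per-tag rows back into the flat list."""
    columns = zip(*(row.split() for row in rows))
    return [cell for column in columns for cell in column]


chunks = ["a1", "b1", "c1", "a2", "b2", "c2"]
print(rows_by_group(chunks, 3))
print(rows_by_tag(chunks, 3))
print(chunks_from_tag_rows(rows_by_tag(chunks, 3)))
```

The round trip in the last line is also a small hint that “vertical chunks” and tags may be the same feature seen from two directions.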

  • chunk boundaries can be specified flexibly: “this is what the separator looks like”, flag for whether separator can come before or after chunks start, whether separators can repeat, etc., OR: “here’s what chunk start looks like”, but you can also separately specify what chunk end looks like, meaning a file can have non-chunk areas
  • mode where all replacements and transformations are only applied to chunks and all non-chunk data is preserved unchanged
  • example: chunk is everything on each line not counting leading and trailing whitespace, so in the mode just described swapping line 2 and 3 gets you line 3’s content with line 2’s whitespace
  • should be able to do the same thing without the special mode: layer 1 is each line, layer 2 is the inner bits as above, swap layer 2 bits on layer 1 chunks 2 and 3
  • Separators can be not part of the chunks, but they can also explicitly be part of the chunks, either “separators glom to the previous chunk” or “separators glom to the next chunk” (are there other options here? split in half seems too weird/hard)
  • Columnar regexes need to have a newline-like character. Example: in linewise mode, “any number of lines of just whitespace” is “^(+)*$” or something (maybe we can make that cleaner? I don’t like needing the parens there); need to be able to do “any number of columns of just whitespace”; something like “^(\s+\c)*$”
  • Example that drove the above two lines: it should be possible to swap columns 2 and 3 in a whitespace-separated columnar file. The idea I have there is layer 1 is columns separated by columnar chunks of whitespace, layer 2 is lines; for each line in column 2 swap with line in column 3, but during the swap strip and sprintf the data to the width of the destination so that you preserve the columnar position
  • Implied by the above is something sprintf-ish; a very important aspect of that would be that it should error out if you use fill specifiers and it’s going to overfill; i.e. however you say “pad this out to 20 characters with whitespace”, if you give it a 21 character string there should be a clear error
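
The column swap and the strict padding rule from the last two items can be tried out together. The following Python is a sketch only: `strict_pad`, `column_spans` and `swap_columns` are hypothetical names, columns are numbered from 0, and a “columnar chunk of whitespace” is assumed to mean a run of character positions that are blank on every line.

```python
def strict_pad(text, width):
    """Left-justify text to exactly `width` characters; refuse to overfill."""
    if len(text) > width:
        raise ValueError(f"{text!r} is {len(text)} characters wide, limit is {width}")
    return text.ljust(width)


def column_spans(lines):
    """Find columns as (start, stop) character spans.

    A position belongs to a column if any line has a non-whitespace
    character there; runs of all-blank positions are the separators.
    """
    width = max((len(line) for line in lines), default=0)
    spans, start = [], None
    for pos in range(width):
        used = any(pos < len(line) and not line[pos].isspace() for line in lines)
        if used and start is None:
            start = pos
        elif not used and start is not None:
            spans.append((start, pos))
            start = None
    if start is not None:
        spans.append((start, width))
    return spans


def swap_columns(lines, first, second):
    """Swap two columns, padding each cell to the width of its destination."""
    first, second = sorted((first, second))
    spans = column_spans(lines)
    (a0, a1), (b0, b1) = spans[first], spans[second]
    swapped = []
    for line in lines:
        line = line.ljust(b1)  # make short lines sliceable
        cell_a, cell_b = line[a0:a1].strip(), line[b0:b1].strip()
        swapped.append((
            line[:a0] + strict_pad(cell_b, a1 - a0)
            + line[a1:b0] + strict_pad(cell_a, b1 - b0)
            + line[b1:]
        ).rstrip())  # note: this also drops any original trailing whitespace
    return swapped


table = [
    "id  name   code",
    "1   apple  x9",
    "22  kiwi   abcde",
]
for line in swap_columns(table, 1, 2):
    print(line)
```

If the two columns have different widths and a cell does not fit its destination, `strict_pad` raises instead of quietly pushing the rest of the line out of alignment, which is the behaviour the last item asks for.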