Kusto: Creating an IfExists function to check if a table exists

https://youtu.be/__HrV3ckOSk

One of the things I find lacking in Kusto is an explicit way to test for the existence of a table: in Azure SQL and Azure Data Lake, the ifexists function and the exists compiler directive, respectively, serve this purpose.

Kusto doesn’t seem to have an explicit statement supporting this, but you can roll your own using the isfuzzy union argument. The isfuzzy argument basically says that a union should run as long as at least one table exists.

So here’s how to create your own “ifexists” function for kusto.

Write a function that takes a table name as a string input. Within the function itself, create a datatable with a single column named Exists containing a single row with a value of 0.

Then union that datatable with the function input using the table operator, like so, and do a count of rows aliased as Exists. I’d limit this to a single row so you minimize execution.

Finally, sum up the Exists column result of your union so that you have only a single row. If the tablename from the input exists and has at least one row, your function will return 1. If it doesn’t exist or is empty, your function will return 0.

When using this, you convert the table output of the function to a scalar value using the toscalar function, like so. That’s it.
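Here’s a minimal sketch of what that function might look like, following the steps above; the function name IfExists and the take 1 limit are my own choices, so adjust to taste:

.create-or-alter function with (folder = @'') IfExists(tableName:string)
{
    // The one-row datatable guarantees the fuzzy union always has at least one table.
    union isfuzzy=true
        (datatable(Exists:long)[0]),
        // table() resolves the string input; take 1 keeps execution minimal.
        (table(tableName) | take 1 | summarize Exists = count())
    // Sum down to a single row: 1 if the table exists and has rows, otherwise 0.
    | summarize Exists = sum(Exists)
}

And a usage sketch, with a hypothetical table name:

print TableExists = toscalar(IfExists("MyPossiblyMissingTable"))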

Of course, you don’t have to use this as a distinct function: you could simply have a fuzzy union in your code.

There is another way to do this if you’re writing an ETL function that acts differently depending on whether the table exists. We do this with very large telemetry sets when we just want to pull new values rather than pulling months’ worth of data and overwriting.

When we’re actively developing such a function, we may need to change the schema or do a complete reload. So if the table doesn’t exist, we want to pull a much larger set of data rather than just the past day or so.

Here’s the trick: once a function is compiled, it will run regardless of whether the tables upon which it relies exist. When you aggregate a non-existent table to a single row and convert the output to a scalar—say, the most recent datetime—Kusto returns a null value without any error. We check if that value isempty, and if it is, we grab several months’ worth of data, rather than just the new rows.
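A rough sketch of that pattern, with hypothetical names (NewTelemetry as the target table, RawTelemetry as the source, arbitrary lookback windows, and isnull() as the null check for the datetime scalar):

// If NewTelemetry is missing or empty, max(Timestamp) comes back as a null scalar instead of erroring.
let _mostRecent = toscalar(NewTelemetry | summarize max(Timestamp));
// Null scalar: pull several months for a full reload; otherwise just the newest rows.
let _cutoff = iif(isnull(_mostRecent), ago(120d), _mostRecent);
RawTelemetry
| where Timestamp > _cutoff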

Kusto: Creating Dimensions with the datatable operator

I’m Mike O’Neill and I’m a data nerd who uses Azure Data Explorer, or Kusto, every day to glean insights into Azure’s developer and code-to-customers operations.

In my last video, I talked about how to assess a new data source to identify potential simple dimensions.

Kusto presents a singular problem for creating simple dimensions, especially when you’re used to storing dimensions as a table. You can’t create a little table as you would in Azure SQL and just insert and update values as you please. You also can’t repeatedly overwrite a stream as I used to do in Azure Data Lake.

In fact, you can’t really update anything in Kusto, nor can you overwrite a table that easily either. But Kusto does have an interesting workaround: you can write a list of comma-delimited values, use the datatable operator to make it into a table, and embed that into a function. Need to make changes? Just edit the text.

Let me give you a real example I wrote last week. One of the things I need to track for Azure’s engineering pipeline work is the name of the datacenter to which we’re deploying. Now, there’s nothing secret about this dimension: I pulled the info from these two public webpages.

Instead of creating a table like I would in Azure SQL, I just format all the values as a CSV, and then wrap it with the datatable operator. That operator requires three things: the operator itself, the schema immediately after it, and then a pair of square brackets around the CSV values.

Create that as a function, et voila, you’ve got a dimension. Anytime I need to update the values, I just update the function.
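For illustration, here’s a sketch of what such a dimension function might look like; the function name and schema are mine, and the handful of regions shown is just a sample:

.create-or-alter function with (folder = @'') DimAzureRegion()
{
    // The CSV values live right in the function body; edit the text to update the dimension.
    datatable(RegionName:string, Geography:string)
    [
        "eastus", "United States",
        "westus2", "United States",
        "japaneast", "Japan",
        "japanwest", "Japan",
        "northeurope", "Europe",
        "westeurope", "Europe",
    ]
}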

It’s not a perfect solution, however: because you’re just working with text, you’ve got none of the referential integrity capabilities of Azure SQL. Nor can you easily rebuild the dimension programmatically so that you don’t make silly mistakes.

I made that mistake last week, though it was with a mapping function with about 300 values rather than a pure, simple dimension. My colleague left a comment on my pull request, asking me to “find Waldo.” It was a friendly tease because I hadn’t bothered to do a simple check for duplicates.

If you use this method, you’ll need to be extra careful to maintain and run regular unit tests every time you alter the function.

And since early March, all of us in the State of Washington have been living through social distancing for the novel coronavirus, so my colleague teased me a little bit more with this new version of “Where’s Waldo.”

Stay healthy.

Kusto & Data quality: identifying potential dimensions

I’m Mike O’Neill and I’m a data nerd who uses Azure Data Explorer, or Kusto, every day to glean insights into Azure’s developer and code-to-customers operations.

A data engineering lead I work with recently asked me what his team should do to deliver high quality data. I’ve been lucky enough to spend most of my career in organizations with a data-driven culture, so it’s the first time I’ve been in a position to teach rather than learn.

And so I didn’t have an easy answer. Data quality isn’t something I’ve classified in any way, it’s just something I’ve learned how to do through trial and many, many, many errors. I discussed it with my boss and she tossed down the metaphorical gauntlet: “Michael, list out the different data quality tests you think are necessary.”

My first reaction was that the most important thing for exceptional data quality is to have as many people look at the data as possible. It’s a bit of a dodge, I’ll admit. Linus’ Law really does apply here: “given enough eyeballs, all bugs are shallow.” But there are never enough eyeballs.

So here goes. There are a lot of things to look for, and I’ll create a video for each one. If you’ve been in the data engineering space for a while, you should probably skip this series.

Identifying potential dimensions is one of my top tasks, and for this video, I’m just looking at simple dimensions with a handful of values, maybe a few hundred at the extreme. I’m not looking at complex dimensions such as product catalogs which could have tens of thousands of values.

All I do is examine every column that might be a dimension of some sort. Generally, I ignore datetimes and real numbers, but short strings and bytes are good candidates. Then I aggregate, getting a count of rows for each potential value.
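A minimal sketch of that kind of profiling query, assuming a hypothetical DeploymentEvents table with a RegionName column:

DeploymentEvents
| summarize RowCount = count() by RegionName   // one row per candidate dimension value
| order by RowCount desc
| render columnchart                           // histogram of value frequencies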

At this point, it becomes art rather than science, but a histogram can help push the scales a bit more into science.

Take a look at this histogram: we’re looking at how different teams at Microsoft name Azure’s data center regions. I get 168 values from this query, but I know that Azure has 56 regions. Different service teams are naming the regions differently.

Now look at the shape of that data: while I have a long tail of values, the bulk of my data looks like it falls within a small set. In fact, just 23 of 168 account for 80% of the values. To me, that feels like a decent candidate for a dimension.

In contrast, look at deployment actions—these values represent all the complexity of delivering new features to Azure. And I’m not using the word complexity lightly. There are over 9,000 different actions here. I can’t even generate a histogram in Kusto: it’s got a max of 5,000 values.

Excel doesn’t have that 5k limitation, but… even then, you can’t even see the histogram there: the long tail of values is so long, it’s basically meaningless. This is not a good candidate for a simple dimension.

Of course, this is art, not science. Azure Regions feels like a good candidate for a simple dimension, but deployment actions doesn’t.

Still, with 168 values for 56 regions, I also have to make a call about how to handle that. Short-term, it’s easy to manually go through the list and normalize those values down to 56 regions. For example, japaneast, jpe and eastjp are clearly all the same thing. That solves my problem in the short-term, but what do I do if a team decides to add a new value such as ejp or japane?
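That short-term, manual normalization might look something like the sketch below; the table and column names are hypothetical, and only the three Japan East spellings come from the example above:

DeploymentEvents
| extend RegionNormalized = case(
    RegionName in ("japaneast", "jpe", "eastjp"), "japaneast",
    RegionName)   // anything unmapped, like a future "ejp" or "japane", falls through unchanged
| summarize RowCount = count() by RegionNormalized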

Again, we come to art. I really only have two choices about how to handle this long term. My first option is to create a job that monitors values in this dimension and alerts me whenever a new value appears so I can normalize it. My second option is to go to the team that owns this tool and insist that their users not have the choice to enter in whatever value they want.

The second option, in this particular case, is the right choice. As of March 2020, we have 56 data centers, and there’s no legitimate reason for one service team to have a different set of characters to represent the japan east region than all the other Azure service teams.

To put it another way, there’s a moral hazard here. If I go about cleansing that data once, the team owning that telemetry won’t have to. And I’ll keep having to clean it again and again and again.

But that’s not always the right answer. There’s not always a moral hazard.

Setting up an alert to manually handle the new value may be the right choice. For example, I was once responsible for providing data to Surface marketers. And every year, two or three months before Black Friday, I saw a brand-new Surface model name pop up in Windows telemetry. Now, I wasn’t supposed to know about that new device until October, but I considered it my job to make sure that the data engineers, analysts, marketers and data scientists who used the data my team produced didn’t learn about those new Surface models from that data.

We made the investment to monitor for new Surface model names, and when we found them, not only did we alert the people who were in the know, but we made sure those new model names didn’t appear in our data until it was time to execute marketing campaigns promoting them.

Kickstarting a data-driven culture: Same meeting, many conversations

I’m Mike O’Neill and I’m a data nerd who uses Azure Data Explorer, or Kusto, every day to glean insights into Azure’s developer and code-to-customers operations.

I’ve worked in data-driven organizations for most of the last decade plus, so it’s been a bit of a culture shock to work in an organization that doesn’t have data and analytics in its DNA.

I knew coming in that my team in Azure didn’t have the foundation of a data-driven culture: a single source of truth for their business. That was my job, but I naively expected this to be purely a technical challenge, of bringing together dozens of different data sources to form a new data lake.

I learned very quickly that people are the bigger challenge. About a year into the role, my engineering and program management Directors hired two principal resources to lead the data team. I was excited, but the three of us were constantly talking past each other, and now I think I know why.

I was working to build a single source of truth so that many people, including our team, could deliver data insights, but those new resources focused on us delivering data insights.

The difference is subtle, I realize, but it’s a big deal: if we only deliver data insights, we’ll end up as consultants, delivering reports to various teams on what they need. That’s work for hire, and it doesn’t scale.

If we build a single source of truth, those various teams will be able to self-serve and build the reports that matter to them. Democratizing data like that is a key attribute of a data-driven culture.

So why were the three of us talking past each other when we were talking about the same business problems? I think it was a matter of perspective and experience.

To oversimplify the data space, I think there are four main people functions, and the experience learned from each function guides how people view the space. Our v-team had people from all of those spaces, and the assumptions we brought to the conversation were why we were speaking past each other.

First is this orange box, which is about doing, well, useful stuff with data. This is what everybody wants. This is where data scientists and analysts make their money.

The risk with this box is an overfocus on single-purpose business value. It’s great to have a useful report, but people who live only in this box don’t focus on re-usable assets. Worst case scenario, it’s work-for-hire, where you shop your services from team to team to justify your existence.

Second is this yellow box, which represents telemetry. I’m using that word far more broadly than the dictionary defines it: to me, it’s the data exhaust of any computer system. But that exhaust is specific to its function and consumption is typically designed only for the engineers that built it.

The risk here is around resistance to data democratization. If you’re accustomed to no one accessing the data from your system, you won’t invest the time to make it understandable by others. When you do share the data, those new users can drown you in questions. Do that a few times, and you learn to tightly control sharing, perhaps building narrowly scoped, single-purpose API calls for each scenario.

Third, you need tools to make sense of the data: this is the blue box in my diagram. There’s a bajillion dollars in this space, ranging from visualization tools such as PowerBI and Tableau, to data platforms such as Azure Data Lake, AWS Redshift and Oracle Database. The people in this space market products to the people in my orange box, whether it’s data scientists or UI designers.

The risk in the blue box is in the difference between providing a feature relevant to data and delivering data as a feature. It’s easy to approach gap analysis by talking to teams about what data their services emit and taking their assessment of what their data describes at face value.

If you are in the data warehouse space, however, a gap analysis is about the data itself, not the tools used to analyze that data. Data gaps tend to be much more ephemeral than gaps in data product features, especially when you are evaluating a brand-new data source. In my experience, a conversation is just a starting point. You can trust, but you need to verify by digging deeply into the data.

And that brings me to the green box. In current terminology, that’s a data lake, and it’s where I’ve lived for the past decade. The green box is all about using tools from the blue box to normalize and democratize data from a plethora of telemetry sources from the yellow box, such that people in the orange box can do their jobs more easily. It’s about having a single source of truth that just about anyone can access, and that’s one of the foundations of a data-driven culture.

So what’s the risk in the green box? I like to say that data engineers spend 90% of their time on 1% of the data. Picking that 1% is not easy, and perfecting data can be a huge waste of money. Data engineering teams are very, very good at cleaning up data, and they are also very, very good at ignoring the law of diminishing returns. But they are not very good at identifying moral hazards, at forcing upstream telemetry sources to clean up their act.

Kusto: outliers and Tukey fences

One of the challenges I face is handling outliers in the data emitted by the engineering pipeline tools that thousands of Azure developers use. For example, our tools all have queues, and a top priority is that queue times are brief.

This is the shape of data you’d expect to see for a queue. A quick peak after a few seconds, and then this long tail.

What queue times should look like

But that’s not what our actual distribution of queue times looks like. The long tail is so long, it looks like a right angle.

What they actually look like

So what’s happening? Well, the mean queue time is 197 seconds, but if I remove the outliers, it’s just 18 seconds. Why the huge difference in averages? My max queue time is almost ten days, but when I exclude outliers, the max is just 81 seconds.

Six percent of my queue time values are outliers, ranging from minutes to days. I asked the team that manages this tool, and several things could be happening, from teams pausing a job while it’s in the queue, to misconfigurations by users. In short, none of these values represent valid queue times.

So how do I exclude those extreme values? I could pick an arbitrary line: everyone here in Azure loves the 95th percentile, because everyone remembers that’s two standard deviations from the mean in a Normal distribution. The problem is that the 95th percentile isn’t relevant for this type of distribution: it’s just luck that in this case, the 95th percentile is 106 seconds. It could just as easily be thirty minutes.

The better way to do this is to identify outliers based on the data. In fact, that’s the definition of an outlier: a data point that differs significantly from other observations. One common method of doing this is called Tukey’s fences. I don’t have a Ph.D. in statistics, so I’m not going to explain how it works. In fact, that’s not what this channel is about. It’s about showing how to do powerful things in Kusto without a lot of effort, and Kusto’s series_outliers operator is just that.

Here’s how you do it.

Step 1: pack all the QueueTime values into a list using the make_list operator.

Step 2: feed that list into the series_outliers operator.

Step 3: unpack it all with mv-expand.

Step 4: is the one bit that Kusto doesn’t do automatically for you. The values in the Outliers column aren’t self-explanatory, but they represent how far the measurement is from the bulk of your data. Anything greater than 1.5 or less than -1.5 is an outlier, and beyond plus-minus three, the values are really, really out there.

You can’t have negative queue time, so I just look at values below 1.5 and I’m set.
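Putting the four steps together, here’s a minimal sketch; QueueTelemetry and QueueTimeSeconds are hypothetical names:

QueueTelemetry
| summarize QueueTimes = make_list(QueueTimeSeconds)                // Step 1: pack values into a list
| extend Outliers = series_outliers(QueueTimes)                     // Step 2: Tukey-fence scores
| mv-expand QueueTimes to typeof(real), Outliers to typeof(real)    // Step 3: unpack, one row per value
| where Outliers < 1.5                                              // Step 4: keep the non-outliers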

Kusto: Seasonality and Holidays

I’m Mike O’Neill and I’m a data nerd who uses Azure Data Explorer, or Kusto, every day to glean insights into Azure’s developer and code-to-customers operations.

One of the challenges I face is handling seasonality and outliers. For example, large numbers of Microsoft employees take vacation during the same three weeks every year: Thanksgiving week, Christmas and New Year’s.

I analyze what thousands of developers do and those weeks always have low activity, so I have to figure out how to gracefully handle that seasonality.

In this video, I’m going to show you how I used two built-in features of Kusto, startofweek and range, to develop a little function that finds those holiday weeks no matter what year we’re looking at, and to make it easier for engineering managers to do the same.

Here’s a visualization of a certain type of developer activity related to bringing new features to production. Those dips are the weekends. Yeah, sometimes we work weekends, but Microsoft prides itself on work/life balance, and so the bulk of activity happens Monday through Friday.

Step one is to group the data to match the activity I’m measuring: by week, not by day. The startofweek function does this nicely, and… while it is a simple function, it’s also really powerful because of that simplicity. There’s no need to decide whether Sunday or Monday is the first day of the week, and those engineering managers can pick it up in seconds and use it without being reminded. It puts us all on the same page.
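A minimal sketch of that grouping, assuming a hypothetical DeveloperActivity table with a Timestamp column:

DeveloperActivity
| summarize ActivityCount = count() by Week = startofweek(Timestamp)
| render timechart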

Startofweek smooths things out quite nicely, doesn’t it? But now you can see my problem: those six sharp downward spikes: that’s U.S. Thanksgiving, Christmas and New Year’s. Whether I exclude them or replace the values with an average, I need to identify them dynamically.

But I can’t just exclude the 24th and 25th of December, for example: I have to exclude the entire work week, and each year, those holidays either fall on different days of the week, or on a particular Thursday in November.

Let’s tackle Christmas and New Year’s first:

Step 1: Create a little data table with numbers for month and day.

Step 2: we use the range operator, which lets you create a list of numbers or dates in series. It’s created as a blob of structured text in a new column.

Step 3: we need to explode that blob of text out so that I have a row for each date, like it was a cross join. For that, we use the mv-expand operator.

Step 4: is the easy part. I restrict the rowset to only dates where the month and day match my list of holidays, and then…

Step 5: use startofweek again.

This pretty much works. Except when Christmas and New Year’s fall on a weekend or on Monday. Look at 2016 for example: both these holidays fall on the first day of the week, meaning that New Year’s week ends on January 7th. That was a regular work and school week. To fix that, all I need to do is switch my holidays to Christmas Eve and New Year’s Eve.

What about Thanksgiving? In the U.S., that’s the fourth Thursday of November.

Step 1: Again, I start with a little datatable, but in this case, instead of the numbered day of the month, I need the week of the month and the day of the week. Kusto uses a timespan of 4 days (4d) to represent Thursday, rather than an integer.

Step 2: Again, use the range operator to generate a set of dates and…

Step 3: Use the mv-expand operator to explode this out in a cross join.

Step 4: is where things change from the previous example. I grab only Thursdays from the month in question, November.

Step 5: I order the data. This is critical: Kusto won’t order things for you. You might think the range operator would land things in order, but it may not.

Step 6: use the row_number operator so that you know which is the fourth Thursday in November.

Step 7: Use startofweek to find the Sunday before Thanksgiving.

That’s it. Now, all I need to do is train my engineering execs to use startofweek, and then do a left anti-join to remove data from those weeks.
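A hedged usage sketch, with the same hypothetical DeveloperActivity table; SeasonalityWeeks and HolidayWeekStart come from the function below:

DeveloperActivity
| summarize ActivityCount = count() by Week = startofweek(Timestamp)
| join kind=leftanti (
    SeasonalityWeeks()
    | project Week = HolidayWeekStart
) on Week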

Here’s the full code for the function

.create-or-alter function with (folder = @'') SeasonalityWeeks
(
    rangeStart:datetime = datetime("2016-01-01")
    ,rangeEnd:datetime = datetime("2022-01-01")
)
{
    // Guard against reversed inputs by swapping them if necessary.
    let _rangeStart = iif(rangeStart > rangeEnd, rangeEnd, rangeStart);
    let _rangeEnd = iif(rangeEnd < rangeStart, rangeStart, rangeEnd);
    // Holidays that fall on a fixed month and day.
    let _majorFixedHolidays =
        datatable(Month:int, Day:int, Name:string)
        [
            12, 24, "Christmas",
            12, 31, "New Year",
        ];
    // Holidays defined by the nth weekday of a month (dayofweek returns a timespan).
    let _majorVariableHolidays =
        datatable(Month:int, DayOfWeek:timespan, WeekOfMonth:int, Name:string)
        [
            11, timespan(4d), 4, "US Thanksgiving",
        ];
    _majorVariableHolidays
    | extend Date = range(_rangeStart, _rangeEnd, 1d)
    | mv-expand Date to typeof(datetime)
    | where Month == datetime_part('Month', Date)
    | extend Weekday = dayofweek(Date)
        , Year = datetime_part('Year', Date)
    | where Weekday == DayOfWeek
    | order by Year asc, Date asc
    // Number the matching weekdays within each year to find the nth occurrence.
    | extend RowNum = row_number(1, prev(Year) != Year)
    | where RowNum == WeekOfMonth
    | project HolidayWeek = startofweek(Date)
        , Name
    | union kind = outer
    (
        _majorFixedHolidays
        | extend Date = range(_rangeStart, _rangeEnd, 1d)
        | mv-expand Date to typeof(datetime)
        | where Month == datetime_part('Month', Date)
            and Day == datetime_part('Day', Date)
        | project HolidayWeek = startofweek(Date)
            , Name
    )
    | project Name
        , HolidayWeekStart = HolidayWeek
        , HolidayWeekEnd = datetime_add('Day', 6, HolidayWeek)
}

Improving Ancestry.com’s MyTreeTags feature

I’ve been so focused on DNA ThruLines and the hints system that I didn’t notice Ancestry.com’s new tagging feature. Tags have been around a long time, and it’s nice that Ancestry.com added this capability.

But… it seems a half-baked effort.

  1. There’s no obvious warning to other researchers when I flag something as unverified or a hypothesis.
  2. Ancestry isn’t helping me ignore the “old” method of using icons.
  3. Tags have no visual impact in tree view: for example, the “no children” tag doesn’t replace my “no children” gender neutral child on my tree.
  4. The “direct line” tag isn’t well thought out. Why can’t I just click myself and activate this along my direct line?

My biggest beef is about the research tags.

On a personal level, I’d love the research tags to appear in more places. One example: I complain a lot about ancestry.com’s poor hint quality, and I will go to the “All Hints” page and ignore hundreds of hints for people I don’t care about. But I don’t always remember which profiles I stopped looking at because they’re brick walls. The research tags should appear next to the profile name here to remind me.

But the biggest miss for research tags is communication to other genealogists. One of my biggest fears is that someone else will take a wild guess of mine and copy it. I have one hypothesis in my tree by the name of “Wild Speculation Chew.” And then after I discover my wild guess was wrong and remove it, someone else will copy the copy of my wild guess. And ten years later, there are dozens of trees with my random guess.

Someone actually contacted me and made a joke about the crazy names they gave people back then.


My second complaint centers on how this feature feels like an attempt by Ancestry.com to standardize all the crazy little hacks we all make to help track our research. But Ancestry isn’t making an effort to help us ignore those little hacks now that there’s a better option.

For example, I put question marks as the suffix of a person’s name when I’m not convinced I have the relationship right, and create a gender-unknown child named “No Children” when a person didn’t have any kids.

Other people add little icons of angels and immigrant ships. I hate little icons. Well, no, I hate that ancestry serves up those icons as hints. I hate that so much I have a whole video about how easy it would be for ancestry.com to use artificial intelligence to categorize images, and give me the choice of suppressing hints for the image categories I don’t want to see.

My third suggested area of improvements is visualizations for the tags. The central experience in ancestry.com, for me at least, is the tree view. Sometimes I start by searching for an individual, but at some point, I traverse my tree in tree view.

Take a look at Thomas Kirk Plummer, here. I put some tags, but at a glance, all I can see is my ?. To see those tags, I have to click on his name and then expand the tags section.

OK, that’s not too bad. But for standard, un-customizable tags, why not create an additional visualization that is immediately visible for a handful of tags?

For example, research status tags could appear on the right side of the profile pic, with unverified as a question mark, verified as a check mark, hypothesis as a light bulb, actively researching as a magnifying glass, etc.

No children could have a small stop sign at the bottom. A brick wall could have… well, a little brick wall across the top.

My fourth area for improvement is about the “relationships” bucket of tags, specifically the “Direct Line” tag. That is just screaming out to me as a place for improvement.

On my wife’s tree, for example, I can trace back to fifth-great grandparents on almost every branch. That’s 254 people to tag as “Direct Line” and at five clicks per person, that’s 1,270 clicks, just for my wife. And I have several other lines of ancestry, including my own. Really, I’m never going to use that tag. Too much work, too little value.

But what if I could click my wife’s profile and choose an outline color for her direct line ancestry? Two to three clicks, and this could turn into this.

Oh, and why colors? Because there’s a point in my wife’s family tree, ten generations back, where she intersects with my sister-in-law’s family tree. In that case, the square around their common tenth-great-grandparents could show both colors. And I did not realize they were distant cousins for months.

Why your Scotch-Irish ancestors moved so frequently

Do you have ancestors who move frequently but not far? Say, showing up in 1790s Shelby County, Kentucky, then Bullitt County in 1800, then Grayson County in 1810? Or perhaps Hamilton County, Ohio in the late 1790s then Montgomery County in 1803 and finally Darke County in 1810?

There are two factual scenarios at play here:

First, your ancestors stayed in place but the map changed: that’s what happened in my Ohio example. I covered this in a previous video; check it out.

Second, your ancestors really did move a lot. But why did that family move so frequently when another family in your tree stayed put for decades?

I want to thank Karla York for suggesting this as a topic for a video. She was responding to a comment where I noted that ethnic German immigrants to the United States practiced a crop rotation strategy which kept their land productive and fertile, while Scotch-Irish backcountry pioneers would farm a patch of land for a few years until it was depleted of nitrogen, and then move on to the next.

To be honest, that story is something my mother has told me for years, not something I had researched. Turns out it’s true, but it was just one factor in why some of your ancestors made lots of little moves.

What really drove this, I think, was culture, specifically Scotch-Irish culture, and specifically in the geographical region dubbed Greater Appalachia where the Scotch-Irish settled.

By culture I mean how people lived their lives, from marriage and sex, to how you built your house, to what you cooked. It’s the stuff you learn from your parents and your community about how to survive.

My favorite author on colonial culture, David Hackett Fischer, summarized Scotch-Irish culture in my favorite book on colonial culture, Albion’s Seed this way:

The [Scotch-Irish] were a restless people who carried their migratory ways from Britain to America… The history of these people was a long series of removals—from England to Scotland, from Scotland to Ireland, from Ireland to Pennsylvania, from Pennsylvania to Carolina…

Fischer cites the example of the village of Fintray: between 1696 and 1701, three-quarters of the population turned over. The same pattern showed up in Appalachian Virginia, where 80% of the people living in Lunenburg County in 1750 were gone by 1769, with half of that movement occurring between 1764 and 1769. Fischer asserts that “these rates of movement were exceptional by eighteenth-century standards.”

Those migrations, in both the borderlands between England and Scotland, and in the colonial backcountry, were short-distance, “as families search for slightly better living conditions. Frequent removals were encouraged by low levels of property-owning.”

A folk-saying from the southern highlands gives you a better idea of how people felt. “When I get ready to move, I just shut the door, call the dogs and start.”

That feels pretty extraordinary. What will you eat? How could you just walk away from your labor investment in crops? What about your tools, your plow?

The answer is culture once again. The Scotch-Irish weren’t farmers the way we might think of colonial farming, with acre after acre of corn and wheat. They combined livestock herding with vegetable gardens and some grain. And they didn’t have a lot of tools: Fischer cites an early 1700s primary source that colonial backcountry Scotch-Irish had “one axe, one broad hoe and one narrow hoe.”

When you picked up and moved, you packed up some produce, a few tools, and then herded your livestock a few miles to a new spot. In Scotland, it was sheep, in the colonial backcountry, pigs or cattle.

Of course, it wasn’t quite so unplanned as it sounds. In The Monongalia Story, a history of one region of West Virginia, Earl Core wrote:

“A small group of men might come in winter or early spring, build their first cabins, clear and fence their little fields, plant potatoes, corn, beans and pumpkins. After the crops were well started… the men would ride their horses back [to their family’s current residence], again load them with [the rest of their possessions], and return with their family.”

The collaborative nature of this migration shouldn’t be discounted. American culture lionizes the rugged individualist pushing back the frontier, but that was a myth. Frontier migrations were a community affair, and the greater the distance, or the deeper into the territory of another culture that would try to repel what to them was an invasion, the more critical it was to band together.

Fischer notes that the first settlements in Tennessee and Kentucky were centered around military-like forts and stations, where settlers living nearby could retreat for mutual defense.

Core noted that the forts were also the center of the community, where “young couples danced and courted, where marriages were performed and funerals held, where land claims were recorded and justice meted out.”

As the native populations were pushed out & settler control secured, the Scotch-Irish spread out. As one North Carolina congressman put it, “no man ought to live so near another as to hear his neighbor’s dog bark.”

There’s only so much I can pack into a video of less than five minutes. If you want to learn more, get a copy of Albion’s Seed. It’s dense and long, but I think it’s worth it.

So… what of the bit about the Scotch-Irish moving because they wore out the land? It’s a bit of a chicken-and-egg scenario, isn’t it? If your culture is to move frequently, you didn’t need to maintain the fertility of your land.

The Scotch-Irish did have a way to re-fertilize land, however. Fischer quoted a traveler to the southern backcountry who noted, “A fresh piece of ground… will not bear tobacco past two or three years unless cow-penned; for they manure their ground by keeping their cattle… within hurdles, which they remove when they have sufficiently dunged one spot.”